Skip to content

Carefully consider the metadata for the epi_df resulting from sliding a forecaster over an epi_archive #213

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
rachlobay opened this issue Aug 18, 2022 · 5 comments
Assignees
Labels
bug Something isn't working P1 medium priority

Comments

@rachlobay
Copy link
Collaborator

As originally stated in Issue #208, after making the change to allow epix_slide to epi_df so that we can slide a forecaster using arx_forecaster() over an epi_archive, it is good to consider whether the resulting metadata (namely, as_of) is what we want.

In the below example, as_of is related to the first fc_time_values 2020-08-01… but do we really want that? Does it make the most sense to use here?

library(epipredict)
library(epiprocess)
library(covidcast)
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)

y <- covidcast_signals(
  c("doctor-visits", "jhu-csse"),
  c("smoothed_adj_cli", "confirmed_7dav_incidence_prop"),
  start_day = "2020-06-01",
  end_day = "2021-12-01",
  issues = c("2020-06-01", "2021-12-01"),
  geo_type = "state",
  geo_values = c("ca", "fl"))

z <- y[[1]] %>%
  select(geo_value, time_value, version = issue, percent_cli = value) %>%
  as_epi_archive()

z <- epix_merge(
  z, y[[2]] %>%
    select(geo_value, time_value, version = issue, case_rate = value) %>%
    as_epi_archive(), sync = "locf")

fc_time_values <- seq(as.Date("2020-08-01"), as.Date("2021-12-01"),
                      by = "1 month")
ahead = 7

# Slide arx_forecaster over archive z
z %>% 
  epix_slide(function(x, ...) 
    arx_forecaster(x, outcome = "case_rate", 
                   predictors = "case_rate",
                   args_list = arx_args_list(ahead = ahead))$predictions %>% 
      select(-c(geo_value, time_value)),
    n = 120, ref_time_values = fc_time_values, new_col_name = "fc") 
@brookslogan
Copy link
Contributor

brookslogan commented Aug 18, 2022

as_of [of the final result, if that's what we are talking about here] should probably be set to z$versions_end [or we shouldn't output an epi_df at all! "right" class is probably a generalized archive object that doesn't require a geo_value or time_value column]. (This is the last version we "observed", and basically acts like the archive's "version". In most cases this is equivalent to max(z$DT$version); when it's different, it will be greater than this max value and express that we observed additional versions that contained no updates.)

[On the other hand, maybe it should be max(ref_time_values), which would be automatic if we were to make rbind, bind_rows, etc. on epi_dfs use the max as_of rather than the first one. Current ?dplyr::dplyr_extending may not make this easy. While we do want this fix as well, the output for this function should probably be based on versions_end, because if, say versions_end overwrote/clobbered some version data from max(ref_time_values), it'd be more appropriate to label the output as_of using the former rather than the latter. --- Counterpoint: user may use a single ref_time_value at a time, and expect as_of of the output to be that ref_time_value, not the versions_end. And default slide range should go through versions_end, so there's no conflict between the two for the default, and the user might understand that things will change if they change the ref_time_values. But we also need to consider all_versions = TRUE, where things might change. These questions hint that an epi_df might not be the right output class for a multi-ref_time_value epix_slide.]

@brookslogan
Copy link
Contributor

brookslogan commented Aug 18, 2022

If we are talking about the input to the group computations, then yes, it should be what we already get out of $as_of; the corresponding ref time value / fc time value / version. [This is in line with simulating production forecasting at previous dates, assuming that on forecast time value fctv, we were able to observe version fctv for all data sources involved, and then made a forecast based on the set of targets defined for forecast time value fctv. There could be some more special situations, but thinking about how to handle them seems like a separate issue.]

@brookslogan
Copy link
Contributor

brookslogan commented Mar 19, 2023

This is related to the discussion of what the output class and column names of epix_slide operations should be (#163, #49, #259, ...). Before we consider those larger changes, I plan to make an incremental improvement / temporary decision by assigning the output (if it's an epi_df) an as_of equal to the input archive's versions_end.

  • Individual calculations' versions are already available in the output time_value column; this just keeps around a different type of information (a proxy for the archive's as_of --- we might consider adding a timestamp as_of to archives and this might make even more sense, though it gets more into the details of version-clobbering which we want to hide (Version bound metadata: don't distract from basics, improve naming #176))
  • It's weird for the output as_of to change when expanding/changing the ref_time_values set, unless the user is working with one ref_time_value at a time.
  • This may avoid issues down the road if all_versions=TRUE later allows epi_df output; one would expect the ref_time_value for a computation, which will be the versions_end of the computation's input archive, to be the as_of of its output, not some shifting around based on the ref_time_values of whatever nested epix_slide-ing it does.

@brookslogan brookslogan self-assigned this Mar 19, 2023
@brookslogan
Copy link
Contributor

(We might also consider a special value of as_of indicating that it mixes data from multiple versions. Again, relates to #242.)

brookslogan pushed a commit to brookslogan/epiprocess that referenced this issue Mar 20, 2023
brookslogan pushed a commit to brookslogan/epiprocess that referenced this issue Mar 21, 2023
@brookslogan
Copy link
Contributor

--- CURRENT STATE: ---

  • We currently don't output epi_dfs from epix_slide(), to avoid the clearly bad behavior described in the first comment.
  • The bugs underlying this bad behavior with rbinding epi_dfs etc. still exist; however, even if they were fixed, we might still need some special logic in epix_slide() (bugfixes might get us max of the ref_time_values; we'd need to decide if that's right).
  • We may not want to ever return an epi_df even with the bugfixes. Related: time_value vs. version output column naming discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P1 medium priority
Projects
None yet
Development

No branches or pull requests

2 participants