Add the 7dav we talked about along with the std #76


Merged · 20 commits · Dec 23, 2023
Conversation

@dsweber2 (Contributor) commented Dec 14, 2023

Closes #74. This adds a new forecaster that creates a 7-day-average column and a 28-day moving-standard-deviation column, and uses those as predictors. @brookslogan, I'd like to hear your feedback on the implementation if you've got time.

@dsweber2 dsweber2 requested a review from brookslogan December 14, 2023 18:23
@dsweber2 dsweber2 self-assigned this Dec 14, 2023
@brookslogan (Contributor) left a comment

This is a partial review. Sorry, I've probably gone too in the weeds about random things. The main things:

  • update_predictors may be unnecessary, and its use may produce undesired results. E.g., mimicking dev10 might need a workaround or extra postprocessing if update_predictors is used (since it uses lags of the original). [Nevermind, I missed a function and keep_mean. dev10 does seem easily representable.] [Still] an alternative to consider [taking advantage of tidymodels roles rather than externally tracking & assuming]:
    • Implement a step_rolling_mean (or some sort of step-like thing) to add grouped rolling mean columns & assign a requested role, defaulting to "predictor". (Might include a convenience feature to lag these as well.)
    • Implement a step_rolling_sd to add grouped rolling-sds-around-rolling-mean columns & assign a requested role, defaulting to "predictor".
  • If update_predictors is necessary, it needs clearer documentation & implementation.
  • cache_metadata can probably be avoided through use of dplyr_reconstruct or epi_slide.
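
For concreteness, here is a rough sketch of what the proposed steps would compute. Neither step_rolling_mean() nor step_rolling_sd() exists yet, and the column names are illustrative; this just spells out the computations via epiprocess::epi_slide (with multiple geos you'd group_by(geo_value) first):

```r
# Sketch only: the step_* wrappers are proposed, not real. This shows the
# underlying computations with epi_slide on a toy single-geo epi_df.
library(dplyr)
library(epiprocess)

edf <- tibble(
  geo_value = "ca",
  time_value = as.Date("2023-01-01") + 0:59,
  value = rnorm(60, mean = 10)
) %>%
  as_epi_df()

edf %>%
  # 7-day trailing average: before = 6 gives the window [t-6, t]
  epi_slide(value_7dav = mean(value), before = 6L) %>%
  # 28-day rolling sd of deviations around the 7dav
  epi_slide(value_sd28 = sd(value - value_7dav), before = 27L)
```

The step versions would additionally assign a requested tidymodels role (defaulting to "predictor") to the new columns.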

Do you think this refactor would be sensible, and do you have time to try it, or should I try to look for more concrete issues?

A more concrete thing: it looks like this is missing a tidyr::complete step (and maybe an arrange?).

(I'm not sure if there are some recipes invariants about the number/ordering of rows to worry about when combining the refactor above with tidyr::complete. If those are complicating things, there's mutate_scattered_ts. In contrast to the complete + current / epi_slide approach, it doesn't change the number of rows, though it may re-order them. It also supports across.

tibble::tibble(g = 1, time_value = c(1, 3:10), value = time_value) %>%
    bind_rows(tibble::tibble(g = 2, time_value = 11, value = 100)) %>%
    group_by(g) %>%
    mutate_scattered_ts(time_value, 1L, across(c("value"), list(mean = ~ data.table::frollmean(.x, 7L)))) %>%
    ungroup()

#> # A tibble: 10 × 4
#>        g time_value value value_mean
#>    <dbl>      <dbl> <dbl>      <dbl>
#>  1     1          1     1         NA
#>  2     1          3     3         NA
#>  3     1          4     4         NA
#>  4     1          5     5         NA
#>  5     1          6     6         NA
#>  6     1          7     7         NA
#>  7     1          8     8         NA
#>  8     1          9     9          6
#>  9     1         10    10          7
#> 10     2         11   100         NA

)

)
}
# and need to make sure we exclude the original variables as predictors
predictors <- update_predictors(epi_data, c(smooth_cols, sd_cols), predictors)
@brookslogan (Contributor), Dec 19, 2023

issue: Do we always want to do this? E.g., dev10 pairs lags of separate rolling means of the original with the sd estimate. [nvm, this is what seems to be happening here as well, though I need to check where the lags are specified] And how are the requested lags derived for these predictors? May make more sense to just keep everything and have the model specify what it wants. Or, in epipredict framework, have a smooth mean step that adds (lags of) smoothed means & assigns predictor roles, and sd step that works in a similar way, so dev10 could be a step_epi_lag + step_rolling_sd, while the implied behavior above could be a step_rolling_mean + step_rolling_sd. (thought: I'm wishing for a step_epi_slide and a step_mutate that allow .by, but these concrete steps would be good too I think. And same thing in epiprocess, probably want specific rolling mean and rolling sd helpers..)

@dsweber2 (Author)

Lags are applied to all predictors, and are specified per-model in the ..., which can be any of the arx args.

@brookslogan (Contributor)

Here we may want to apply different lagsets to the 7dav (7*(0:4) ish) and the sd covariate (maybe just 0) & ignore the original non-7dav'd predictors. Is that possible via the list-of-vectors form of lags?

@dsweber2 (Author)

ignore the original non-7dav'd predictors

this is what the update_predictors function is for. It drops the non-7dav'd versions from the list of predictors.

Is that possible via the list-of-vectors form of lags?

Vector or List. Positive integers enumerating lags to use in autoregressive-type models (in days). By default, an unnamed list of lags will be set to correspond to the order of the predictors.

Apparently, though I haven't actually tried it.
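
If it works as the quoted docstring suggests, the per-predictor lag sets from the comment above might look something like this (untested, as noted; arx_args_list is from epipredict, and the order of the list entries is assumed to match the order of the predictors):

```r
# Untested sketch of the list-of-vectors form of lags, one lag set per
# predictor in predictor order, per the docstring quoted above.
library(epipredict)

args <- arx_args_list(
  lags = list(
    7 * (0:4), # for the 7dav predictor: lags 0, 7, 14, 21, 28
    0L         # for the sd predictor: contemporaneous value only
  )
)
```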

@brookslogan (Contributor)

Looks like the current validation will probably complain if we try it.

issue: c(smooth_cols, sd_cols) is probably wrong when exactly one of smooth_cols and sd_cols is NULL. These variables have not been overwritten with get_trainable_names in this function.

@dsweber2 (Author)
First, thanks for the detailed commentary!

Implement a step_rolling_mean (or some sort of step-like thing) to add grouped rolling mean columns & assign a requested role, defaulting to "predictor". (Might include a convenience feature to lag these as well.)
Implement a step_rolling_sd to add grouped rolling-sds-around-rolling-mean columns & assign a requested role, defaulting to "predictor".

I do think this is an eventual goal. As with the latency adjustment, the turnaround time for adding things to epipredict is longer than writing them like this. If they perform well, we should do as you're saying.

cache_metadata can probably be avoided through use of dplyr_reconstruct or epi_slide.

The only reason I defined this function was that select wasn't actually supported for epi_dfs and epi_archives. Glad to see there are some built-in alternatives.

@dsweber2 (Author)

Re-implemented with epi_slide and added the stuff I marked as resolved.

Now I'm wondering: for days that are missing/NA, there are 3 main options I can see.

  1. what I was doing initially, which is ignore them completely and take 7 points regardless of date
  2. what epi_slide does by default, which is use them to determine the points used in the average
  3. in addition, impute the values for missing days based on the 7dav around them. This may be NA for particularly bad days.
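
To make the difference between options 1 and 2 concrete on gappy data (a toy series missing time_value 2, and a window of 3 rather than 7 for brevity): frollmean ignores dates entirely, while slider::slide_index_dbl is used here just to mimic epi_slide's date-indexed window.

```r
# Option 1 vs. option 2 on a series with a gap at time_value 2.
library(dplyr)

x <- tibble(time_value = c(1, 3, 4, 5), value = c(1, 3, 4, 5))

x %>% mutate(
  # Option 1: always average the last 3 observed points, dates ignored
  opt1 = data.table::frollmean(value, 3L),
  # Option 2: the window is [t-2, t] by date, so the gap shrinks the
  # number of points actually averaged
  opt2 = slider::slide_index_dbl(value, time_value, mean, .before = 2)
)
```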

@brookslogan commented Dec 20, 2023

  • For compatibility with dev10, I'd think the choice would be to complete with NAs, or to check for an nrow of 7 and otherwise output NA. The latter approach wouldn't impact the number of input & output rows and would automatically produce NAs for at least the first 6 times, similar to mutate_scattered_ts, so that might be nice. NA seems like the safe-ish-est easy option, and gaps probably aren't that common(?), so this might be preferred?
    • If latency varies by geo, then an imputation might let us do something less bad, but maybe this is the response of a latency-handling step/approach rather than a 7dav-er.
    • (Perhaps we're also missing a step_gap_fill or something like that if someone wants e.g. 0s?)
  • Would it be any easier/quicker to develop the steps if they don't have to immediately be put into epipredict? Maybe gradually port over there after testing here? [Re-reading above --- testing the model performance first also sounds like a plan.]

[Seems like a lot of this we may want to proceed & test before trying to refactor/tweak. I still need to finish reading some files & try to actually sanity check more concrete things.]

@brookslogan brookslogan self-requested a review December 20, 2023 19:43
@dsweber2 (Author)

For compatibility with dev10...

Ok, so I made the imputation thing a standalone issue in the icebox. For now, I'm just using default epi_slide behavior, which fills with NA and then removes the NA fills in the returned result.

Would it be any easier/quicker to develop the steps if they don't have to immediately be put into epipredict? Maybe gradually port over there after testing here? [Re-reading above --- testing the model performance first also sounds like a plan.]

Definitely. This is basically ready to be tested on real data, modulo anything you catch that isn't currently enumerated in the tests, whereas I don't think we'd have this operation for another month in epipredict, between the coordination problems and dealing with the complications that come from extra framework requirements.

@brookslogan (Contributor)

For now, I'm just using default epi_slide behavior, which fills with NA and then removes the NA fills in the returned result.

Warning: we want that to be the epi_slide default, but it's not yet.

library(dplyr)
library(epiprocess)
tibble(geo_value = 1, time_value = c(1, 4,5,6), value = time_value) %>%
  as_epi_df() %>%
  epi_slide(mv = mean(value), before = 1)
#> An `epi_df` object, 4 x 4 with metadata:
#> * geo_type  = hhs
#> * time_type = custom
#> * as_of     = 2023-12-22 07:58:58
#> 
#> # A tibble: 4 × 4
#>   geo_value time_value value    mv
#>       <dbl>      <dbl> <dbl> <dbl>
#> 1         1          1     1   1  
#> 2         1          4     4   4  
#> 3         1          5     5   4.5
#> 4         1          6     6   5.5

@brookslogan (Contributor) left a comment

Noted a few issues in *_width treatment and in some particular cases, + various minor suggestions + brainstorming. Lingering thoughts:

  • Things get really hard when you start trying to break them into finer parts. I had the same difficulty when thinking of trying this on make_test_forecaster. I think we're still not at an ideal form yet.
  • mean_cols and sd_cols should maybe sometime be tidyselections instead of char/NULL?
  • update_predictors could potentially remove some variables we don't want it to when things have similar names. I haven't thought about it enough to say it definitely happens, since it will probably go away if/when moving to a step-based approach.

...) {
# perform any preprocessing not supported by epipredict
# this is a temp fix until a real fix gets put into epipredict
epi_data <- clear_lastminute_nas(epi_data)
@brookslogan (Contributor)

question: why do we want to clear NAs? (Probably should add an explanatory comment. Note this is sort of an "anti-complete()" so even if we want complete()d stuff anywhere below then those things need to be careful. E.g., slide computation passed to epi_slide() should probably test nrow(.x) to decide whether it should output NA.)
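
The nrow guard suggested here might look like the following sketch, where edf stands in for the epi_df being slid over, and assuming the tidy-evaluation interface of epi_slide exposes the window data frame as .x:

```r
# Sketch: output NA unless the trailing 7-day window is actually complete,
# so complete()d-in gaps don't silently shrink the average.
# Assumes epi_slide's tidy-evaluation form exposes the window as .x.
edf %>%
  epi_slide(
    value_7dav = if (nrow(.x) == 7L) mean(.x$value) else NA_real_,
    before = 6L
  )
```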

@dsweber2 (Author)

lastminute here means NA in the final days. epipredict just chokes on that, when it should be treating the NAs as a latency adjustment.

@brookslogan (Contributor), Jan 3, 2024

Okay, that makes sense, though the current clear_lastminute_nas clears all NAs, not just lastminute ones. This relates to the lastminute-imputation & gap-treatment stuff. I think we discussed this elsewhere; the current approach probably works okay on hhs hosp data (and hopefully any chng data used?? idk there), but when adding to epipredict we should probably be a bit more careful, since it could be used on more problematic data sets.



# postprocessing supported by epipredict
postproc <- frosting()
postproc %<>% arx_postprocess(trainer, args_list)
@brookslogan (Contributor)

issue (with arx_postprocess...): ignores forecast_date passed through ...; doesn't respect target_date passed through ....

@dsweber2 (Author)

issue (with arx_postprocess...): ignores forecast_date passed through ...; doesn't respect target_date passed through ....

That's intended behavior. Both of those are set elsewhere: target_date by adjusting ahead, and forecast_date from the as_of of epi_data.

issue: c(smooth_cols, sd_cols) is probably wrong when exactly one of smooth_cols and sd_cols is NULL. These variables have not been overwritten with get_trainable_names in this function.

Yeah, I should probably just completely refactor the way I wrote update_predictors. I think it should work for now though.

@brookslogan (Contributor)

I think a quick fix here would be to just make sure forecast_date and target_date are NULL in args_list, or to move away from ... forwarding to manually handling/forwarding/forbidding each parameter. I assume you're not passing either of these parameters downstream, so maybe this could also be deferred to a separate issue.

@dsweber2 dsweber2 merged commit 993b571 into main Dec 23, 2023
@dshemetov dshemetov deleted the smoothedScaled branch December 23, 2023 22:38
Linked issue: implement smoothed+sd-aware forecaster