Add the 7dav we talked about along with the std #76


Merged · 20 commits · Dec 23, 2023
Conversation

@dsweber2 (Contributor) commented Dec 14, 2023

Closes #74. This adds a new forecaster that creates a 7-day-average column and a 28-day moving-standard-deviation column, and uses those as predictors. @brookslogan, I'd like to hear your feedback on the implementation if you've got time.

@dsweber2 dsweber2 requested a review from brookslogan December 14, 2023 18:23
@dsweber2 dsweber2 self-assigned this Dec 14, 2023
@brookslogan (Contributor) left a comment

This is a partial review. Sorry, I've probably gone too in the weeds about random things. The main things:

  • update_predictors may be unnecessary, and its use may produce undesired results. E.g., mimicking dev10 might need a workaround or extra postprocessing if update_predictors is used (since it uses lags of the original). [Nevermind, I missed a function and keep_mean. dev10 does seem easily representable.] [Still] an alternative to consider [taking advantage of tidymodels roles rather than externally tracking & assuming]:
    • Implement a step_rolling_mean (or some sort of step-like thing) to add grouped rolling mean columns & assign a requested role, defaulting to "predictor". (Might include a convenience feature to lag these as well.)
    • Implement a step_rolling_sd to add grouped rolling-sds-around-rolling-mean columns & assign a requested role, defaulting to "predictor".
  • If update_predictors is necessary, it needs clearer documentation & implementation.
  • cache_metadata can probably be avoided through use of dplyr_reconstruct or epi_slide.
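
For concreteness, here is a rough sketch of what the proposed steps would compute. Neither step_rolling_mean() nor step_rolling_sd() exists yet, and the column names are illustrative; this just spells out the computations via epiprocess::epi_slide (with multiple geos you'd group_by(geo_value) first):

```r
# Sketch only: the step_* wrappers are proposed, not real. This shows the
# underlying computations with epi_slide on a toy single-geo epi_df.
library(dplyr)
library(epiprocess)

edf <- tibble(
  geo_value = "ca",
  time_value = as.Date("2023-01-01") + 0:59,
  value = rnorm(60, mean = 10)
) %>%
  as_epi_df()

edf %>%
  # 7-day trailing average: before = 6 gives the window [t-6, t]
  epi_slide(value_7dav = mean(value), before = 6L) %>%
  # 28-day rolling sd of deviations around the 7dav
  epi_slide(value_sd28 = sd(value - value_7dav), before = 27L)
```

The step versions would additionally assign a requested tidymodels role (defaulting to "predictor") to the new columns.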

Do you think this refactor would be sensible, and do you have time to try it, or should I try to look for more concrete issues?

A more concrete thing: it looks like this is missing a tidyr::complete step (and maybe an arrange?).

(I'm not sure if there are some recipes invariants about the number/ordering of rows to worry about when combining the refactor above with tidyr::complete. If those are complicating things, there's mutate_scattered_ts. In contrast to the complete + current / epi_slide approach, it doesn't change the number of rows, though it may re-order them. It also supports across.

tibble::tibble(g = 1, time_value = c(1, 3:10), value = time_value) %>%
    bind_rows(tibble::tibble(g = 2, time_value = 11, value = 100)) %>%
    group_by(g) %>%
    mutate_scattered_ts(time_value, 1L, across(c("value"), list(mean = ~ data.table::frollmean(.x, 7L)))) %>%
    ungroup()

#> # A tibble: 10 × 4
#>        g time_value value value_mean
#>    <dbl>      <dbl> <dbl>      <dbl>
#>  1     1          1     1         NA
#>  2     1          3     3         NA
#>  3     1          4     4         NA
#>  4     1          5     5         NA
#>  5     1          6     6         NA
#>  6     1          7     7         NA
#>  7     1          8     8         NA
#>  8     1          9     9          6
#>  9     1         10    10          7
#> 10     2         11   100         NA

)

)
}
# and need to make sure we exclude the original variables as predictors
predictors <- update_predictors(epi_data, c(smooth_cols, sd_cols), predictors)
@brookslogan (Contributor), Dec 19, 2023

issue: Do we always want to do this? E.g., dev10 pairs lags of separate rolling means of the original with the sd estimate. [nvm, this is what seems to be happening here as well, though I need to check where the lags are specified] And how are the requested lags derived for these predictors? May make more sense to just keep everything and have the model specify what it wants. Or, in epipredict framework, have a smooth mean step that adds (lags of) smoothed means & assigns predictor roles, and sd step that works in a similar way, so dev10 could be a step_epi_lag + step_rolling_sd, while the implied behavior above could be a step_rolling_mean + step_rolling_sd. (thought: I'm wishing for a step_epi_slide and a step_mutate that allow .by, but these concrete steps would be good too I think. And same thing in epiprocess, probably want specific rolling mean and rolling sd helpers..)

@dsweber2 (Author)

Lags are applied to all predictors, and are specified per-model in the ..., which can be any of the arx args.

@brookslogan (Contributor)

Here we may want to apply different lagsets to the 7dav (7*(0:4) ish) and the sd covariate (maybe just 0) & ignore the original non-7dav'd predictors. Is that possible via the list-of-vectors form of lags?

@dsweber2 (Author)

ignore the original non-7dav'd predictors

this is what the update_predictors function is for. It drops the non-7dav'd versions from the list of predictors.

Is that possible via the list-of-vectors form of lags?

Vector or List. Positive integers enumerating lags to use in autoregressive-type models (in days). By default, an unnamed list of lags will be set to correspond to the order of the predictors.

Apparently, though I haven't actually tried it.
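
If it works as the quoted docstring suggests, the per-predictor lag sets from the comment above might look something like this (untested, as noted; arx_args_list is from epipredict, and the order of the list entries is assumed to match the order of the predictors):

```r
# Untested sketch of the list-of-vectors form of lags, one lag set per
# predictor in predictor order, per the docstring quoted above.
library(epipredict)

args <- arx_args_list(
  lags = list(
    7 * (0:4), # for the 7dav predictor: lags 0, 7, 14, 21, 28
    0L         # for the sd predictor: contemporaneous value only
  )
)
```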

@brookslogan (Contributor)

Looks like the current validation will probably complain if we try it.

issue: c(smooth_cols, sd_cols) is probably wrong when exactly one of smooth_cols and sd_cols is NULL. These variables have not been overwritten with get_trainable_names in this function.

@dsweber2 (Author)
First, thanks for the detailed commentary!

Implement a step_rolling_mean (or some sort of step-like thing) to add grouped rolling mean columns & assign a requested role, defaulting to "predictor". (Might include a convenience feature to lag these as well.)
Implement a step_rolling_sd to add grouped rolling-sds-around-rolling-mean columns & assign a requested role, defaulting to "predictor".

I do think this is an eventual goal. As with the latency adjustment, the turnaround time for adding things to epipredict is longer than writing them like this. If they perform well, we should do as you're saying.

cache_metadata can probably be avoided through use of dplyr_reconstruct or epi_slide.

The only reason I defined this function was that select wasn't actually supported for epi_dfs and epi_archives. Glad to see there are some built-in alternatives.

@dsweber2 (Author)

Re-implemented with epi_slide and added the stuff I marked as resolved.

Now I'm wondering: for days that are missing/NA, there are 3 main options I can see.

  1. what I was doing initially, which is ignore them completely and take 7 points regardless of date
  2. what epi_slide does by default, which is use them to determine the points used in the average
  3. in addition, impute the values for missing days based on the 7dav around them. This may be NA for particularly bad days.
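
To make the difference between options 1 and 2 concrete on gappy data (a toy series missing time_value 2, and a window of 3 rather than 7 for brevity): frollmean ignores dates entirely, while slider::slide_index_dbl is used here just to mimic epi_slide's date-indexed window.

```r
# Option 1 vs. option 2 on a series with a gap at time_value 2.
library(dplyr)

x <- tibble(time_value = c(1, 3, 4, 5), value = c(1, 3, 4, 5))

x %>% mutate(
  # Option 1: always average the last 3 observed points, dates ignored
  opt1 = data.table::frollmean(value, 3L),
  # Option 2: the window is [t-2, t] by date, so the gap shrinks the
  # number of points actually averaged
  opt2 = slider::slide_index_dbl(value, time_value, mean, .before = 2)
)
```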

@brookslogan commented Dec 20, 2023

  • For compatibility with dev10, I'd think the choice would be to complete with NAs, or to check for an nrow of 7 and otherwise output NA. The latter approach wouldn't impact the number of input & output rows and would automatically produce NAs for at least the first 6 times, similar to mutate_scattered_ts, so that might be nice. NA seems like the safe-ish-est easy option, and gaps probably aren't that common(?), so this might be preferred?
    • If latency varies by geo, then an imputation might let us do something less bad, but maybe this is the response of a latency-handling step/approach rather than a 7dav-er.
    • (Perhaps we're also missing a step_gap_fill or something like that if someone wants e.g. 0s?)
  • Would it be any easier/quicker to develop the steps if they don't have to immediately be put into epipredict? Maybe gradually port over there after testing here? [Re-reading above --- testing the model performance first also sounds like a plan.]

[Seems like a lot of this we may want to proceed & test before trying to refactor/tweak. I still need to finish reading some files & try to actually sanity check more concrete things.]

@brookslogan brookslogan self-requested a review December 20, 2023 19:43
@dsweber2 (Author)

For compatibility with dev10...

Ok, so I made the imputation thing a standalone issue in the icebox. For now, I'm just using default epi_slide behavior, which fills with NA and then removes the NA fills in the returned result.

Would it be any easier/quicker to develop the steps if they don't have to immediately be put into epipredict? Maybe gradually port over there after testing here? [Re-reading above --- testing the model performance first also sounds like a plan.]

Definitely. This is basically ready to be tested on real data, modulo anything you catch that isn't currently enumerated in the tests, whereas I don't think we'd have this operation for another month in epipredict, between the coordination problems and dealing with the complications that come from extra framework requirements.

@brookslogan (Contributor)

For now, I'm just using default epi_slide behavior, which fills with NA and then removes the NA fills in the returned result.

Warning: we want that to be the epi_slide default, but it's not yet.

library(dplyr)
library(epiprocess)
tibble(geo_value = 1, time_value = c(1, 4,5,6), value = time_value) %>%
  as_epi_df() %>%
  epi_slide(mv = mean(value), before = 1)
#> An `epi_df` object, 4 x 4 with metadata:
#> * geo_type  = hhs
#> * time_type = custom
#> * as_of     = 2023-12-22 07:58:58
#> 
#> # A tibble: 4 × 4
#>   geo_value time_value value    mv
#>       <dbl>      <dbl> <dbl> <dbl>
#> 1         1          1     1   1  
#> 2         1          4     4   4  
#> 3         1          5     5   4.5
#> 4         1          6     6   5.5

@brookslogan (Contributor) left a comment

Noted a few issues in *_width treatment and in some particular cases, + various minor suggestions + brainstorming. Lingering thoughts:

  • Things get really hard when you start trying to break them into finer parts. I had the same difficulty when thinking of trying this on make_test_forecaster. I think we're still not at an ideal form yet.
  • mean_cols and sd_cols should maybe sometime be tidyselections instead of char/NULL?
  • update_predictors could potentially remove some variables we don't want it to when things have similar names. I haven't thought about it enough to say it definitely happens, since it will probably go away if/when moving to a step-based approach.

...) {
# perform any preprocessing not supported by epipredict
# this is a temp fix until a real fix gets put into epipredict
epi_data <- clear_lastminute_nas(epi_data)
@brookslogan (Contributor)

question: why do we want to clear NAs? (Probably should add an explanatory comment. Note this is sort of an "anti-complete()" so even if we want complete()d stuff anywhere below then those things need to be careful. E.g., slide computation passed to epi_slide() should probably test nrow(.x) to decide whether it should output NA.)
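
The nrow guard suggested here might look like the following sketch, where edf stands in for the epi_df being slid over, and assuming the tidy-evaluation interface of epi_slide exposes the window data frame as .x:

```r
# Sketch: output NA unless the trailing 7-day window is actually complete,
# so complete()d-in gaps don't silently shrink the average.
# Assumes epi_slide's tidy-evaluation form exposes the window as .x.
edf %>%
  epi_slide(
    value_7dav = if (nrow(.x) == 7L) mean(.x$value) else NA_real_,
    before = 6L
  )
```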

@dsweber2 (Author)

lastminute here means NA in the final days. epipredict just chokes on that, when it should be treating the NAs as a latency adjustment.

@brookslogan (Contributor), Jan 3, 2024

Okay, that makes sense, though the current clear_lastminute_nas clears all NAs, not just lastminute ones. This relates to the lastminute-imputation & gap-treatment stuff. I think we discussed this elsewhere; the current approach probably works okay on hhs hosp data (and hopefully any chng data used?? idk there), but when adding to epipredict we should probably be a bit more careful, since it could be used on more problematic data sets.



# postprocessing supported by epipredict
postproc <- frosting()
postproc %<>% arx_postprocess(trainer, args_list)
@brookslogan (Contributor)

issue (with arx_postprocess...): ignores forecast_date passed through ...; doesn't respect target_date passed through ....

@dsweber2 (Author)

issue (with arx_postprocess...): ignores forecast_date passed through ...; doesn't respect target_date passed through ....

That's intended behavior. Both of those are set elsewhere: target_date by adjusting ahead, and forecast_date from the as_of of epi_data.

issue: c(smooth_cols, sd_cols) is probably wrong when exactly one of smooth_cols and sd_cols is NULL. These variables have not been overwritten with get_trainable_names in this function.

Yeah, I should probably just completely refactor the way I wrote update_predictors. I think it should work for now though.

@brookslogan (Contributor)

I think a quick fix here would be to just make sure forecast_date and target_date are NULL in args_list, or to move away from ... forwarding to manually handling/forwarding/forbidding each parameter. I assume you're not passing either of these parameters downstream, so maybe this could also be deferred to a separate issue.

@dsweber2 dsweber2 merged commit 993b571 into main Dec 23, 2023
@dshemetov dshemetov deleted the smoothedScaled branch December 23, 2023 22:38
Linked issue: implement smoothed+sd-aware forecaster