Add the 7dav we talked about along with the std #76
Conversation
(branch updated from 65cd9b4 to ea335c9)
This is a partial review. Sorry, I've probably gone too in the weeds about random things. The main things:

- `update_predictors` may be unnecessary and its use may produce undesired results. E.g., mimicking dev10 might need a workaround or extra postprocessing. [Nevermind, I missed a function; `update_predictors` is used, since it uses lags of the original, via `keep_mean`. dev10 does seem easily representable.] [Still] an alternative to consider [taking advantage of tidymodels roles rather than externally tracking & assuming]:
  - Implement a `step_rolling_mean` (or some sort of step-like thing) to add grouped rolling mean columns & assign a requested role, defaulting to "predictor". (Might include a convenience feature to lag these as well.)
  - Implement a `step_rolling_sd` to add grouped rolling-sds-around-rolling-mean columns & assign a requested role, defaulting to "predictor".
- If `update_predictors` is necessary, it needs clearer documentation & implementation.
- `cache_metadata` can probably be avoided through use of `dplyr_reconstruct` or `epi_slide`.

Do you think this refactor would be sensible, and do you have time to try it, or should I try to look for more concrete issues?
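The grouped rolling columns those proposed steps would add can be sketched with plain dplyr + data.table (a sketch of the assumed semantics only, not a recipes step; the `value_7dav`/`value_sd28` names and the "sd of deviations around the rolling mean" definition are my assumptions):

```r
library(dplyr)
library(data.table)

df <- tibble(geo_value = "ca", time_value = 1:40, value = as.numeric(1:40))

out <- df %>%
  group_by(geo_value) %>%
  mutate(
    # 7-day trailing mean; NA until a full window is available
    value_7dav = frollmean(value, 7L),
    # 28-day sd of the deviations around the rolling mean
    value_sd28 = frollapply(value - value_7dav, 28L, sd)
  ) %>%
  ungroup()
```

The step versions would additionally assign the "predictor" role to the new columns, which is the part that would make external predictor tracking (and `update_predictors`) unnecessary.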
A more concrete thing: it looks like this is missing a `tidyr::complete` step (and maybe an `arrange`?).

(I'm not sure if there are some `recipes` invariants about number/ordering of rows to worry about if combining the refactor above + `tidyr::complete`. If these are complicating things, there's `mutate_scattered_ts`. In contrast to the `complete` + current / `epi_slide` approach, this doesn't change the number of rows, though it may re-order them. It also supports `across`.
```r
tibble::tibble(g = 1, time_value = c(1, 3:10), value = time_value) %>%
  bind_rows(tibble::tibble(g = 2, time_value = 11, value = 100)) %>%
  group_by(g) %>%
  mutate_scattered_ts(time_value, 1L, across(c("value"), list(mean = ~ data.table::frollmean(.x, 7L)))) %>%
  ungroup()
#> # A tibble: 10 × 4
#>        g time_value value value_mean
#>    <dbl>      <dbl> <dbl>      <dbl>
#>  1     1          1     1         NA
#>  2     1          3     3         NA
#>  3     1          4     4         NA
#>  4     1          5     5         NA
#>  5     1          6     6         NA
#>  6     1          7     7         NA
#>  7     1          8     8         NA
#>  8     1          9     9          6
#>  9     1         10    10          7
#> 10     2         11   100         NA
```
)
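For contrast, a minimal generic sketch of the `complete` + `arrange` approach mentioned above (plain tidyr/dplyr, not the package's helpers; the toy data is made up):

```r
library(dplyr)
library(tidyr)

df <- tibble(g = 1, time_value = c(1, 3:5), value = c(1, 3:5))

filled <- df %>%
  group_by(g) %>%
  # fill in the missing days per group; new rows get NA values
  complete(time_value = full_seq(time_value, period = 1)) %>%
  arrange(time_value, .by_group = TRUE) %>%
  ungroup()
```

Unlike `mutate_scattered_ts`, this changes the number of rows (day 2 is added with `value = NA`), so window functions downstream see a gapless series.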
```r
  )
}
# and need to make sure we exclude the original variables as predictors
predictors <- update_predictors(epi_data, c(smooth_cols, sd_cols), predictors)
```
issue: Do we always want to do this? E.g., dev10 pairs lags of separate rolling means of the original with the sd estimate. [nvm, this is what seems to be happening here as well, though I need to check where the lags are specified] And how are the requested lags derived for these predictors? May make more sense to just keep everything and have the model specify what it wants. Or, in epipredict framework, have a smooth mean step that adds (lags of) smoothed means & assigns predictor roles, and sd step that works in a similar way, so dev10 could be a step_epi_lag + step_rolling_sd, while the implied behavior above could be a step_rolling_mean + step_rolling_sd. (thought: I'm wishing for a step_epi_slide
and a step_mutate
that allow .by
, but these concrete steps would be good too I think. And same thing in epiprocess
, probably want specific rolling mean and rolling sd helpers..)
Lags are applied to all predictors, and are specified per-model in the `...`, which can be any of the arx args.
Here we may want to apply different lagsets to the 7dav (`7*(0:4)` ish) and the sd covariate (maybe just 0) & ignore the original non-7dav'd predictors. Is that possible via the list-of-vectors form of `lags`?
> ignore the original non-7dav'd predictors

This is what the `update_predictors` function is for. It drops the non-7dav'd versions from the list of predictors.

> Is that possible via the list-of-vectors form of lags?

The docs say:

> Vector or List. Positive integers enumerating lags to use in autoregressive-type models (in days). By default, an unnamed list of lags will be set to correspond to the order of the predictors.

Apparently, though I haven't actually tried it.
Looks like the current validation will probably complain if we try it.
issue: `c(smooth_cols, sd_cols)` is probably wrong when exactly one of `smooth_cols` and `sd_cols` is `NULL`. These variables have not been overwritten with `get_trainable_names` in this function.
First, thanks for the detailed commentary!
I do think this is an eventual goal. As with the latency adjustment the turnaround time on adding things to epipredict is longer than writing them like this. If they perform well we should do as you're saying.
Only reason I defined this function was because
Re-implemented with `epi_slide` and added the stuff I marked as resolved. Now I'm wondering: for days that are missing/NA, there are 3 main options I can see.

[Seems like a lot of this we may want to proceed & test before trying to refactor/tweak. I still need to finish reading some files & try to actually sanity check more concrete things.]

Ok, so I made the imputation thing a standalone issue in the icebox. For now, I'm just using default

Definitely. This is basically ready to be tested on real data, modulo anything you catch that isn't currently enumerated in the tests, whereas I don't think we'd have this operation for another month in epipredict, between the coordination problems and dealing with the complications that come from extra framework requirements.

Warning: we want that to be the
Noted a few issues in `*_width` treatment and in some particular cases, + various minor suggestions + brainstorming. Lingering thoughts:

- Things get really hard when you start trying to break things into finer parts. Same difficulty when I was thinking of trying this on `make_test_forecaster`. Think we're still not to an ideal form yet.
- `mean_cols` and `sd_cols` should maybe sometime be tidyselections instead of char/`NULL`?
- `update_predictors` could potentially remove some variables we don't want it to when things have similar names. Haven't thought enough to say it definitely happens, since it will probably go away if/when moving to a `step`-based approach.
```r
    ...) {
  # perform any preprocessing not supported by epipredict
  # this is a temp fix until a real fix gets put into epipredict
  epi_data <- clear_lastminute_nas(epi_data)
```
question: why do we want to clear `NA`s? (Probably should add an explanatory comment. Note this is sort of an "anti-`complete()`", so even if we want `complete()`d stuff anywhere below then those things need to be careful. E.g., a slide computation passed to `epi_slide()` should probably test `nrow(.x)` to decide whether it should output `NA`.)
lastminute here means `NA` in the final days. epipredict just chokes on that, when it should be treating the `NA`s as a latency adjustment.
Okay, that makes sense, though the current `clear_lastminute_nas` clears all `NA`s, not just lastminute ones. Relates to lastminute-imputation & gap treatment stuff. I think we discussed this elsewhere, but the current approach probably works okay on hhs hosp data (and hopefully any chng data used?? idk there); when adding to epipredict we should probably be a bit more careful, since it could be used on more problematic data sets.
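A sketch of a trailing-only variant, assuming the usual `geo_value`/`time_value` columns (the function name is hypothetical, not the package's):

```r
library(dplyr)

# Drop only the trailing run of NA days per geo, keeping interior NAs.
drop_trailing_nas <- function(df, col) {
  df %>%
    group_by(geo_value) %>%
    arrange(time_value, .by_group = TRUE) %>%
    # a trailing NA is one where the value and everything after it are NA
    filter(!rev(cumall(is.na(rev(.data[[col]]))))) %>%
    ungroup()
}

df <- tibble(
  geo_value = "ca", time_value = 1:5,
  value = c(1, NA, 3, NA, NA) # interior NA at day 2; trailing NAs at days 4-5
)
res <- drop_trailing_nas(df, "value")
```

This keeps days 1-3 (including the interior NA) and drops only days 4-5.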
```r
# postprocessing supported by epipredict
postproc <- frosting()
postproc %<>% arx_postprocess(trainer, args_list)
```
issue (with `arx_postprocess`...): ignores `forecast_date` passed through `...`; doesn't respect `target_date` passed through `...`.
> issue (with `arx_postprocess`...): ignores `forecast_date` passed through `...`; doesn't respect `target_date` passed through `...`.

That's intended behavior. Both of those are set through adjusting `ahead` for `target_date` and the `as_of` of `epi_data` for the `forecast_date`.
> issue: c(smooth_cols, sd_cols) is probably wrong when exactly one of smooth_cols and sd_cols is NULL. These variables have not been overwritten with get_trainable_names in this function.

Yeah, I should probably just completely refactor the way I wrote `update_predictors`. I think it should work for now though.
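One possible shape for that refactor (the function name, suffix convention, and matching rule here are all hypothetical): only drop an original predictor when a derived column matches it exactly after stripping a known suffix, which sidesteps the similar-names pitfall mentioned elsewhere in this review.

```r
update_predictors_sketch <- function(derived_cols, predictors,
                                     suffixes = c("_7dav", "_sd28")) {
  # originals that a derived column was built from, by exact suffix-stripping
  bases <- unique(unlist(lapply(suffixes, function(s) {
    sub(paste0(s, "$"), "", derived_cols[endsWith(derived_cols, s)])
  })))
  # drop replaced originals, keep everything else, add the derived columns
  union(setdiff(predictors, bases), derived_cols)
}

res <- update_predictors_sketch(
  derived_cols = c("hhs_7dav", "hhs_sd28"),
  predictors = c("hhs", "chng")
)
# "hhs" is dropped (it has derived versions); "chng" survives untouched
```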
Think a quick fix here would be to just make sure `forecast_date` and `target_date` are `NULL` in `args_list`. Or to move away from `...` forwarding to manually handling/forwarding/forbidding each parameter. I assume you're not passing either of these parameters downstream, so maybe this could also be deferred to a separate Issue.
Closes #74. This adds a new forecaster that creates a 7-day-average column and a 28-day moving standard deviation column, and uses those as predictors. @brookslogan, I would like to hear your feedback about the implementation if you've got time.