-
Notifications
You must be signed in to change notification settings - Fork 8
Ndefries/autocompletion slide #415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
date_seq_list$pad_early_dates, | ||
date_seq_list$all_dates, | ||
date_seq_list$pad_late_dates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
full_date_seq
returns all_dates
and pad dates separately to support current epi_slide_mean
implementation, but epi_slide_mean
completion can be changed (combine all_dates and pad dates upfront, with or without switching to tidyr::complete
) to simplify the bit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some notes as I ran through. Unfortunately, there are more design questions I forgot about than I expected; maybe we can brainstorm on Slack/call.
# `complete` checks that fill types match existing column types. | ||
fill = fill, | ||
# Existing missings will be replaced by `fill`, too. | ||
explicit = TRUE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explicit = TRUE
doesn't seem very natural to me. Could we also make this a parameter? (And maybe use a more sane name&default? Tradeoff: I guess that could also surprise people if fill
makes them think we're mirroring tidyr defaults, or make them read more docs. Not sure what's preferable.)
) | ||
|
||
# `complete` strips epi_df format and metadata. Restore them. | ||
x <- reclass(x, attributes(x)$metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to use pre-recorded metadata beforehand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does/could metadata get modified by prior processing?
# - If a geo first appears halfway through the dataset, it will be | ||
# completed all the way back to the beginning of the data. | ||
x <- tidyr::complete(x, | ||
expand(x, nesting(!!key_cols_no_time), data.frame(time_value = all_dates)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ouch, this is complicated. Two issues. They feel a bit edge-case-y, but matter more if we're trying to have some consistency with epix_slide
, or maybe a future epi_slide_reframe
that doesn't try to match the existing keys in the output.
- If reporting starts for one epikey later on, or if there are large gaps in its reporting, then even without data revisions, you could have later epi_slides on expanded epi_dfs outputting different results for earlier ref_time_values, as they will be considering additional epikeys. If grouping by all epikeys, then maybe this can't happen for
epi_slide
(but could for an analogousepix_slide
). But if you don't group by anything, you might turn earlier national sums into a bunch of NAs. - Should we be using only the epikeys that have appeared, or the cross product of all seen geos x all seen age groups x etc.? If we have to pick one, I think the current approach of using only epikeys that have appeared is right; user can
complete
or join or whatever if they don't like that. (Again, forepi_slide
, this probably only matters about whether we make not-grouped-by-all-epikeys epi_slides output NAs...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing: I couldn't find where the matching-old-epikey-time_value/broadcasting part is. I wonder if it's using this completed x
. If it is, then the two questions above actually are more relevant for epi_slide
since it involves also changing the epikeysets output for earlier time values simply by adding an epikey in rows for later time_values, and would also mean outputting values for gaps, and maybe also NA's at the leading and trailing edges. Ideally we should consciously make decisions here, or make it configurable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(you may be able to get rid of the expand
call)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A full_seq may be missing in definition of all_dates. If there is a gap for all epikeys then completion may not fill it in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
full_date_seq
fn below should take care of that. The three cases (tsibble class or numeric, else time_step
not defined, else time_step
defined) all generate date sequences by stepping by time_step
(or similar) between the input data's min and max time_values. They don't look at unique, existing time_value
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key cols are broadcast backwards here:
expand(x, nesting(!!key_cols_no_time), data.frame(time_value = all_dates))
This wouldn't change output epikeysets for earlier time values. In the non-aggregating case, computations are both completed across and computed across all keycols, and we later (at the end of this function) filter out not-.real
obs. Epikeysets that were broadcast back in time will be removed by the filter.
In the aggregating case (e.g. states -> national sum), we're outputting all-new epikeysets. We should end up with one row per group. As long as the input group data contains one .real = TRUE
obs, the output should be kept (?)(I'm not sure how .real
is handled right now -- need to check/implement that logic).
# A helper column marking real observations. | ||
x$.real <- TRUE | ||
# Fill the `.real` column with `FALSE` for rows added during completion. | ||
fill$.real <- FALSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is .real
handling still computationally significant? Wonder if in autocomplete case it could just go away. Maybe an optimization to pursue in a later version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Individual calls like this are quite fast. The prior .real
handling was in f_wrapper
around the user-passed computation; it was slow in aggregate because it got called a ton. This should be pretty fast since it only happens once.
I'm curious what the alternative to the .real
col would be, if we still plan on dropping completed rows.
adbdcd4
to
6be7cff
Compare
a5c429a
to
56bed8c
Compare
Many of the ideas here were re-implemented in #519 (though the implementation there is closer to epi_slide_opt's complete-by-default). Move to close? |
Checklist
Please:
brookslogan, nmdefries.
DESCRIPTION
andNEWS.md
.Always increment the patch version number (the third number), unless you are
making a release PR from dev to main, in which case increment the minor
version number (the second number).
(backwards-incompatible changes to the documented interface) are noted.
Collect the changes under the next release number (e.g. if you are on
0.7.2, then write your changes under the 0.8 heading).
Change explanations for reviewer
By default, complete input data so that computations are always handed
n obs * m groups
. This currently only adds this feature forepi_slide
; we want to also add it forepi_slide_mean
andepix_slide
.Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch
Closes #256.