Skip to content

Ndefries/autocompletion slide #415

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from
Closed

Conversation

nmdefries
Copy link
Contributor

@nmdefries nmdefries commented Feb 7, 2024

Checklist

Please:

  • Make sure this PR is against "dev", not "main".
  • Request a review from one of the current epiprocess main reviewers:
    brookslogan, nmdefries.
  • Makes sure to bump the version number in DESCRIPTION and NEWS.md.
    Always increment the patch version number (the third number), unless you are
    making a release PR from dev to main, in which case increment the minor
    version number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    0.7.2, then write your changes under the 0.8 heading).

Change explanations for reviewer

By default, complete input data so that computations are always handed n obs * m groups. This currently only adds this feature for epi_slide; we want to also add it for epi_slide_mean and epix_slide.

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

Closes #256.

Comment on lines +282 to +273
date_seq_list$pad_early_dates,
date_seq_list$all_dates,
date_seq_list$pad_late_dates
Copy link
Contributor Author

@nmdefries nmdefries Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

full_date_seq returns all_dates and pad dates separately to support current epi_slide_mean implementation, but epi_slide_mean completion can be changed (combine all_dates and pad dates upfront, with or without switching to tidyr::complete) to simplify the bit here.

Copy link
Contributor

@brookslogan brookslogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes as I ran through. Unfortunately, there are more design questions I forgot about than I expected; maybe we can brainstorm on Slack/call.

# `complete` checks that fill types match existing column types.
fill = fill,
# Existing missings will be replaced by `fill`, too.
explicit = TRUE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicit = TRUE doesn't seem very natural to me. Could we also make this a parameter? (And maybe use a more sane name&default? Tradeoff: I guess that could also surprise people if fill makes them think we're mirroring tidyr defaults, or make them read more docs. Not sure what's preferable.)

)

# `complete` strips epi_df format and metadata. Restore them.
x <- reclass(x, attributes(x)$metadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to use pre-recorded metadata beforehand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does/could metadata get modified by prior processing?

# - If a geo first appears halfway through the dataset, it will be
# completed all the way back to the beginning of the data.
x <- tidyr::complete(x,
expand(x, nesting(!!key_cols_no_time), data.frame(time_value = all_dates)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch, this is complicated. Two issues. They feel a bit edge-case-y, but matter more if we're trying to have some consistency with epix_slide, or maybe a future epi_slide_reframe that doesn't try to match the existing keys in the output.

  • If reporting starts for one epikey later on, or if there are large gaps in its reporting, then even without data revisions, you could have later epi_slides on expanded epi_dfs outputting different results for earlier ref_time_values, as they will be considering additional epikeys. If grouping by all epikeys, then maybe this can't happen for epi_slide (but could for an analogous epix_slide). But if you don't group by anything, you might turn earlier national sums into a bunch of NAs.
  • Should we be using only the epikeys that have appeared, or the cross product of all seen geos x all seen age groups x etc.? If we have to pick one, I think the current approach of using only epikeys that have appeared is right; user can complete or join or whatever if they don't like that. (Again, for epi_slide, this probably only matters about whether we make not-grouped-by-all-epikeys epi_slides output NAs...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing: I couldn't find where the matching-old-epikey-time_value/broadcasting part is. I wonder if it's using this completed x. If it is, then the two questions above actually are more relevant for epi_slide since it involves also changing the epikeysets output for earlier time values simply by adding an epikey in rows for later time_values, and would also mean outputting values for gaps, and maybe also NA's at the leading and trailing edges. Ideally we should consciously make decisions here, or make it configurable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(you may be able to get rid of the expand call)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A full_seq may be missing in definition of all_dates. If there is a gap for all epikeys then completion may not fill it in.

Copy link
Contributor Author

@nmdefries nmdefries Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

full_date_seq fn below should take care of that. The three cases (tsibble class or numeric, else time_step not defined, else time_step defined) all generate date sequences by stepping by time_step (or similar) between the input data's min and max time_values. They don't look at unique, existing time_values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key cols are broadcast backwards here:

expand(x, nesting(!!key_cols_no_time), data.frame(time_value = all_dates))

This wouldn't change output epikeysets for earlier time values. In the non-aggregating case, computations are both completed across and computed across all keycols, and we later (at the end of this function) filter out not-.real obs. Epikeysets that were broadcast back in time will be removed by the filter.

In the aggregating case (e.g. states -> national sum), we're outputting all-new epikeysets. We should end up with one row per group. As long as the input group data contains one .real = TRUE obs, the output should be kept (?)(I'm not sure how .real is handled right now -- need to check/implement that logic).

# A helper column marking real observations.
x$.real <- TRUE
# Fill the `.real` column with `FALSE` for rows added during completion.
fill$.real <- FALSE
Copy link
Contributor

@brookslogan brookslogan Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is .real handling still computationally significant? Wonder if in autocomplete case it could just go away. Maybe an optimization to pursue in a later version.

Copy link
Contributor Author

@nmdefries nmdefries Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Individual calls like this are quite fast. The prior .real handling was in f_wrapper around the user-passed computation; it was slow in aggregate because it got called a ton. This should be pretty fast since it only happens once.

I'm curious what the alternative to the .real col would be, if we still plan on dropping completed rows.

@nmdefries nmdefries force-pushed the ndefries/autocompletion-slide branch from adbdcd4 to 6be7cff Compare February 8, 2024 23:12
@nmdefries nmdefries force-pushed the ndefries/specialized-slide-mean branch 2 times, most recently from a5c429a to 56bed8c Compare March 23, 2024 15:42
Base automatically changed from ndefries/specialized-slide-mean to dev March 26, 2024 17:34
@dshemetov
Copy link
Contributor

dshemetov commented Sep 18, 2024

Many of the ideas here were re-implemented in #519 (though the implementation there is closer to epi_slide_opt's complete-by-default). Move to close?

@nmdefries nmdefries closed this Sep 18, 2024
@dshemetov dshemetov deleted the ndefries/autocompletion-slide branch January 24, 2025 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide option to complete/skip incomplete windows, fill gaps in epi[x]_slide windows
3 participants