Ndefries/autocompletion slide #415

nmdefries · 2024-02-07T18:40:13Z

Checklist

Please:

Make sure this PR is against "dev", not "main".
Request a review from one of the current epiprocess main reviewers:
brookslogan, nmdefries.
Makes sure to bump the version number in DESCRIPTION and NEWS.md.
Always increment the patch version number (the third number), unless you are
making a release PR from dev to main, in which case increment the minor
version number (the second number).
Describe changes made in NEWS.md, making sure breaking changes
(backwards-incompatible changes to the documented interface) are noted.
Collect the changes under the next release number (e.g. if you are on
0.7.2, then write your changes under the 0.8 heading).

Change explanations for reviewer

By default, complete input data so that computations are always handed n obs * m groups. This currently only adds this feature for epi_slide; we want to also add it for epi_slide_mean and epix_slide.

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

Closes #256.

R/slide.R

nmdefries · 2024-02-07T19:06:07Z

R/slide.R

+      date_seq_list$pad_early_dates,
+      date_seq_list$all_dates,
+      date_seq_list$pad_late_dates


full_date_seq returns all_dates and pad dates separately to support current epi_slide_mean implementation, but epi_slide_mean completion can be changed (combine all_dates and pad dates upfront, with or without switching to tidyr::complete) to simplify the bit here.

brookslogan

Some notes as I ran through. Unfortunately, there are more design questions I forgot about than I expected; maybe we can brainstorm on Slack/call.

R/slide.R

brookslogan · 2024-02-07T19:07:30Z

R/slide.R

+        # `complete` checks that fill types match existing column types.
+        fill = fill,
+        # Existing missings will be replaced by `fill`, too.
+        explicit = TRUE


explicit = TRUE doesn't seem very natural to me. Could we also make this a parameter? (And maybe use a more sane name&default? Tradeoff: I guess that could also surprise people if fill makes them think we're mirroring tidyr defaults, or make them read more docs. Not sure what's preferable.)

brookslogan · 2024-02-07T19:08:26Z

R/slide.R

+      )
+
+    # `complete` strips epi_df format and metadata. Restore them.
+    x <- reclass(x, attributes(x)$metadata)


do we need to use pre-recorded metadata beforehand?

Does/could metadata get modified by prior processing?

brookslogan · 2024-02-07T19:23:34Z

R/slide.R

+    #  - If a geo first appears halfway through the dataset, it will be
+    #    completed all the way back to the beginning of the data.
+    x <- tidyr::complete(x,
+        expand(x, nesting(!!key_cols_no_time), data.frame(time_value = all_dates)),


Ouch, this is complicated. Two issues. They feel a bit edge-case-y, but matter more if we're trying to have some consistency with epix_slide, or maybe a future epi_slide_reframe that doesn't try to match the existing keys in the output.

If reporting starts for one epikey later on, or if there are large gaps in its reporting, then even without data revisions, you could have later epi_slides on expanded epi_dfs outputting different results for earlier ref_time_values, as they will be considering additional epikeys. If grouping by all epikeys, then maybe this can't happen for epi_slide (but could for an analogous epix_slide). But if you don't group by anything, you might turn earlier national sums into a bunch of NAs.

Should we be using only the epikeys that have appeared, or the cross product of all seen geos x all seen age groups x etc.? If we have to pick one, I think the current approach of using only epikeys that have appeared is right; user can complete or join or whatever if they don't like that. (Again, for epi_slide, this probably only matters about whether we make not-grouped-by-all-epikeys epi_slides output NAs...)

One more thing: I couldn't find where the matching-old-epikey-time_value/broadcasting part is. I wonder if it's using this completed x. If it is, then the two questions above actually are more relevant for epi_slide since it involves also changing the epikeysets output for earlier time values simply by adding an epikey in rows for later time_values, and would also mean outputting values for gaps, and maybe also NA's at the leading and trailing edges. Ideally we should consciously make decisions here, or make it configurable.

(you may be able to get rid of the expand call)

A full_seq may be missing in definition of all_dates. If there is a gap for all epikeys then completion may not fill it in.

full_date_seq fn below should take care of that. The three cases (tsibble class or numeric, else time_step not defined, else time_step defined) all generate date sequences by stepping by time_step (or similar) between the input data's min and max time_values. They don't look at unique, existing time_values.

Key cols are broadcast backwards here:

expand(x, nesting(!!key_cols_no_time), data.frame(time_value = all_dates))

This wouldn't change output epikeysets for earlier time values. In the non-aggregating case, computations are both completed across and computed across all keycols, and we later (at the end of this function) filter out not-.real obs. Epikeysets that were broadcast back in time will be removed by the filter.

In the aggregating case (e.g. states -> national sum), we're outputting all-new epikeysets. We should end up with one row per group. As long as the input group data contains one .real = TRUE obs, the output should be kept (?)(I'm not sure how .real is handled right now -- need to check/implement that logic).

R/slide.R

brookslogan · 2024-02-08T00:32:07Z

R/slide.R

+    # A helper column marking real observations.
+    x$.real <- TRUE
+    # Fill the `.real` column with `FALSE` for rows added during completion.
+    fill$.real <- FALSE


is .real handling still computationally significant? Wonder if in autocomplete case it could just go away. Maybe an optimization to pursue in a later version.

Individual calls like this are quite fast. The prior .real handling was in f_wrapper around the user-passed computation; it was slow in aggregate because it got called a ton. This should be pretty fast since it only happens once.

I'm curious what the alternative to the .real col would be, if we still plan on dropping completed rows.

dshemetov · 2024-09-18T18:24:57Z

Many of the ideas here were re-implemented in #519 (though the implementation there is closer to epi_slide_opt's complete-by-default). Move to close?

nmdefries commented Feb 7, 2024

View reviewed changes

R/slide.R Outdated Show resolved Hide resolved

nmdefries commented Feb 7, 2024

View reviewed changes

brookslogan reviewed Feb 7, 2024

View reviewed changes

nmdefries commented Feb 7, 2024

View reviewed changes

R/slide.R Outdated Show resolved Hide resolved

brookslogan reviewed Feb 8, 2024

View reviewed changes

nmdefries added 8 commits February 8, 2024 17:58

comments

ec1d347

basic date completion for epi_slide

c5f75f5

mark completed rows as not real; filter to remove before returning

8c28ca2

warnings and type checking

582f56a

check for unique key combos

8db2227

use reclass fn instead of bind_rows

7759985

remove "default" language from missing colnames warning

4147248

generalize full_date_seq unmappable type message

6be7cff

nmdefries force-pushed the ndefries/autocompletion-slide branch from adbdcd4 to 6be7cff Compare February 8, 2024 23:12

nmdefries force-pushed the ndefries/specialized-slide-mean branch 2 times, most recently from a5c429a to 56bed8c Compare March 23, 2024 15:42

Base automatically changed from ndefries/specialized-slide-mean to dev March 26, 2024 17:34

dshemetov mentioned this pull request Sep 12, 2024

wip: enforce epi_slide to work on grouped epi_dfs only #519

Merged

4 tasks

nmdefries closed this Sep 18, 2024

dshemetov deleted the ndefries/autocompletion-slide branch January 24, 2025 19:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ndefries/autocompletion slide #415

Ndefries/autocompletion slide #415

nmdefries commented Feb 7, 2024 •

edited

Loading

nmdefries Feb 7, 2024 •

edited

Loading

brookslogan left a comment

brookslogan Feb 7, 2024

brookslogan Feb 7, 2024

nmdefries Feb 7, 2024

brookslogan Feb 7, 2024

brookslogan Feb 7, 2024

brookslogan Feb 7, 2024

brookslogan Feb 7, 2024

nmdefries Feb 7, 2024 •

edited

Loading

nmdefries Feb 15, 2024

brookslogan Feb 8, 2024 •

edited

Loading

nmdefries Feb 8, 2024 •

edited

Loading

dshemetov commented Sep 18, 2024 •

edited

Loading

Ndefries/autocompletion slide #415

Ndefries/autocompletion slide #415

Conversation

nmdefries commented Feb 7, 2024 • edited Loading

Checklist

Change explanations for reviewer

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

nmdefries Feb 7, 2024 • edited Loading

Choose a reason for hiding this comment

brookslogan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nmdefries Feb 7, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brookslogan Feb 8, 2024 • edited Loading

Choose a reason for hiding this comment

nmdefries Feb 8, 2024 • edited Loading

Choose a reason for hiding this comment

dshemetov commented Sep 18, 2024 • edited Loading

nmdefries commented Feb 7, 2024 •

edited

Loading

nmdefries Feb 7, 2024 •

edited

Loading

nmdefries Feb 7, 2024 •

edited

Loading

brookslogan Feb 8, 2024 •

edited

Loading

nmdefries Feb 8, 2024 •

edited

Loading

dshemetov commented Sep 18, 2024 •

edited

Loading