Skip to content

Commit 10d5f55

Browse files
committed
Use dplyr_extending, don't ungroup in epix_slide, back to group_modify
* Implement `?dplyr_extending` and remove some now-unnecessary S3 methods for dplyr verbs, addressing #195, #223, and failing `epix_slide` test. * Don't ungroup `epix_slide` result. Update corresponding test. * Update NEWS.md. * Explicate `epix_slide` `count` derivation in comments and variable names. * Fix some desynced duplicated code in `epix_slide` and use `group_modify` again instead of `summarize` in order to keep slide computation input available as an `epi_df`.
1 parent 6e3f554 commit 10d5f55

9 files changed

+181
-161
lines changed

NAMESPACE

+4-7
Original file line numberDiff line numberDiff line change
@@ -1,26 +1,22 @@
11
# Generated by roxygen2: do not edit by hand
22

33
S3method("[",epi_df)
4-
S3method(arrange,epi_df)
4+
S3method("names<-",epi_df)
55
S3method(as_epi_df,data.frame)
66
S3method(as_epi_df,epi_df)
77
S3method(as_epi_df,tbl_df)
88
S3method(as_epi_df,tbl_ts)
99
S3method(as_tsibble,epi_df)
1010
S3method(dplyr_col_modify,col_modify_recorder_df)
11-
S3method(filter,epi_df)
11+
S3method(dplyr_col_modify,epi_df)
12+
S3method(dplyr_reconstruct,epi_df)
1213
S3method(group_by,epi_archive)
1314
S3method(group_by,epi_df)
1415
S3method(group_by,grouped_epi_archive)
1516
S3method(group_by_drop_default,grouped_epi_archive)
16-
S3method(group_modify,epi_df)
17-
S3method(mutate,epi_df)
1817
S3method(next_after,Date)
1918
S3method(next_after,integer)
2019
S3method(print,epi_df)
21-
S3method(relocate,epi_df)
22-
S3method(rename,epi_df)
23-
S3method(slice,epi_df)
2420
S3method(summary,epi_df)
2521
S3method(ungroup,epi_df)
2622
S3method(unnest,epi_df)
@@ -66,6 +62,7 @@ importFrom(data.table,set)
6662
importFrom(data.table,setkeyv)
6763
importFrom(dplyr,arrange)
6864
importFrom(dplyr,dplyr_col_modify)
65+
importFrom(dplyr,dplyr_reconstruct)
6966
importFrom(dplyr,filter)
7067
importFrom(dplyr,group_by)
7168
importFrom(dplyr,group_by_drop_default)

NEWS.md

+18
Original file line numberDiff line numberDiff line change
@@ -4,9 +4,27 @@ Note that `epiprocess` uses the [Semantic Versioning
44
("semver")](https://semver.org/) scheme for all release versions, but not for
55
development versions. A ".9999" suffix indicates a development version.
66

7+
## Breaking changes:
8+
9+
* `epix_slide` now keeps any grouping of `x` in its results, matching
10+
`epi_slide`. To obtain the old behavior, `dplyr::ungroup` the `epix_slide`
11+
result immediately.
12+
13+
## Potentially-breaking changes:
14+
15+
* Fixed `[` on grouped `epi_df`s to maintain the grouping if possible when
16+
dropping the `epi_df` class (e.g., when removing the `time_value` column).
17+
* Fixed `epi_df` operations to be more consistent about decaying into
18+
non-`epi_df`s when the result of the operation doesn't make sense as an
19+
`epi_df` (e.g., when removing the `time_value` column).
20+
* Changed `bind_rows` on grouped `epi_df`s to not drop the `epi_df` class. Like
21+
with ungrouped `epi_df`s, the metadata of the result is still simply taken
22+
from the first result, and may be inappropriate.
23+
724
## Cleanup:
825

926
* Added a `NEWS.md` file to track changes to the package.
27+
* Implemented `?dplyr::dplyr_extending` for `epi_df`s (#223).
1028

1129
# epiprocess 0.5.0:
1230

R/grouped_epi_archive.R

+59-29
Original file line numberDiff line numberDiff line change
@@ -215,24 +215,55 @@ grouped_epi_archive =
215215
# Symbolize column name
216216
new_col = sym(new_col_name)
217217

218-
# Key variable names, apart from time value and version
219-
key_vars = setdiff(key(private$ungrouped$DT), c("time_value", "version"))
218+
# Each computation is expected to output a data frame with either
219+
# one element/row total or one element/row per encountered
220+
# nongrouping, nontime, nonversion key value. These nongrouping,
221+
# nontime, nonversion key columns can be seen as the "effective" key
222+
# of the computation; the computation might return an object that
223+
# reports a different key or no key, but the "effective" key should
224+
# still be a valid unique key for the data, and is something that we
225+
# could use even with `.keep = FALSE`.
226+
comp_effective_key_vars =
227+
setdiff(key(private$ungrouped$DT),
228+
c(private$vars, "time_value", "version"))
220229

221230
# Computation for one group, one time value
222-
comp_one_grp = function(.data_group,
231+
comp_one_grp = function(.data_group, .group_key,
223232
f, ...,
224-
time_value,
225-
key_vars,
233+
ref_time_value,
234+
comp_effective_key_vars,
226235
new_col) {
227236
# Carry out the specified computation
228-
comp_value = f(.data_group, ...)
237+
comp_value = f(.data_group, .group_key, ...)
229238

230-
# Count the number of appearances of the reference time value.
231-
# Note: ideally, we want to directly count occurrences of the ref
232-
# time value but due to latency, this will often not appear in the
233-
# data group. So we count the number of unique key values, outside
234-
# of the time value column
235-
count = sum(!duplicated(.data_group[, key_vars]))
239+
# Calculate the number of output elements/rows we expect the
240+
# computation to output: one per distinct "effective computation
241+
# key variable" value encountered in the input. Note: this mirrors
242+
# how `epi_slide` does things if we're using unique keys, but can
243+
# diverge if using nonunique keys. The `epi_slide` approach of
244+
# counting occurrences of the `ref_time_value` in the `time_value`
245+
# column, which helps lines up the computation results with
246+
# corresponding rows of the input data, wouldn't quite apply here:
247+
# we'd want to line up with rows (from the same group) with
248+
# `version` matching the `ref_time_value`, but would still need to
249+
# summarize these rows somehow and drop the `time_value` input
250+
# column, but this summarization requires something like a
251+
# to-be-unique output key to determine a sensible number of rows
252+
# to output (and the contents of those rows).
253+
count =
254+
if (length(comp_effective_key_vars) != 0L) {
255+
sum(!duplicated(.data_group[, comp_effective_key_vars]))
256+
} else {
257+
# Same idea as above, but accounting for `duplicated` not
258+
# working as we want on 0 columns. (Should be the same as if
259+
# we were counting distinct values of a column defined as
260+
# `rep(val, target_n_rows)`.)
261+
if (nrow(.data_group) == 0L) {
262+
0L
263+
} else {
264+
1L
265+
}
266+
}
236267

237268
# If we get back an atomic vector
238269
if (is.atomic(comp_value)) {
@@ -241,7 +272,7 @@ grouped_epi_archive =
241272
}
242273
# If not a singleton, should be the right length, else abort
243274
else if (length(comp_value) != count) {
244-
Abort("If the slide computation returns an atomic vector, then it must have a single element, or else one element per appearance of the reference time value in the local window.")
275+
Abort('If the slide computation returns an atomic vector, then it must have either (a) a single element, or (b) one element per distinct combination of key variables, excluding the `time_value`, `version`, and grouping variables, that is present in the first argument to the computation.')
245276
}
246277
}
247278

@@ -256,7 +287,7 @@ grouped_epi_archive =
256287
}
257288
# Make into a list
258289
else {
259-
comp_value = split(comp_value, 1:nrow(comp_value))
290+
comp_value = split(comp_value, seq_len(nrow(comp_value)))
260291
}
261292
}
262293

@@ -265,10 +296,9 @@ grouped_epi_archive =
265296
Abort("The slide computation must return an atomic vector or a data frame.")
266297
}
267298

268-
# Note that we've already recycled comp value to make size stable,
269-
# so tibble() will just recycle time value appropriately
270-
return(tibble::tibble(time_value = time_value,
271-
!!new_col := comp_value))
299+
# Label every result row with the `ref_time_value`:
300+
return(tibble::tibble(time_value = rep(.env$ref_time_value, count),
301+
!!new_col := .env$comp_value))
272302
}
273303

274304
# If f is not missing, then just go ahead, slide by group
@@ -278,12 +308,12 @@ grouped_epi_archive =
278308
private$ungrouped$as_of(ref_time_value, min_time_value = ref_time_value - before) %>%
279309
dplyr::group_by(dplyr::across(tidyselect::all_of(private$vars)),
280310
.drop=private$drop) %>%
281-
dplyr::summarize(comp_one_grp(dplyr::cur_data_all(),
282-
f = f, ...,
283-
time_value = ref_time_value,
284-
key_vars = key_vars,
285-
new_col = new_col),
286-
.groups = groups)
311+
dplyr::group_modify(comp_one_grp,
312+
f = f, ...,
313+
ref_time_value = ref_time_value,
314+
comp_effective_key_vars = comp_effective_key_vars,
315+
new_col = new_col,
316+
.keep = TRUE)
287317
})
288318
}
289319

@@ -303,14 +333,14 @@ grouped_epi_archive =
303333

304334
x = purrr::map_dfr(ref_time_values, function(ref_time_value) {
305335
private$ungrouped$as_of(ref_time_value, min_time_value = ref_time_value - before) %>%
306-
dplyr::group_by(dplyr::across(tidyselect::all_of(private$vars))) %>%
336+
dplyr::group_by(dplyr::across(tidyselect::all_of(private$vars)),
337+
.drop=private$drop) %>%
307338
dplyr::group_modify(comp_one_grp,
308339
f = f, quo = quo,
309-
time_value = ref_time_value,
310-
key_vars = key_vars,
340+
ref_time_value = ref_time_value,
341+
comp_effective_key_vars = comp_effective_key_vars,
311342
new_col = new_col,
312-
.keep = TRUE) %>%
313-
dplyr::ungroup()
343+
.keep = TRUE)
314344
})
315345
}
316346

R/methods-epi_archive.R

+7-6
Original file line numberDiff line numberDiff line change
@@ -595,15 +595,16 @@ group_by.epi_archive = function(.data, ..., .add=FALSE, .drop=dplyr::group_by_dr
595595
#' from `before` time steps before a given `ref_time_value` through the last
596596
#' `time_value` available as of version `ref_time_value` (typically, this
597597
#' won't include `ref_time_value` itself, as observations about a particular
598-
#' time interval (e.g., day) are only published after that time interval ends);
599-
#' `epi_slide` windows extend from `before` time steps before a
598+
#' time interval (e.g., day) are only published after that time interval
599+
#' ends); `epi_slide` windows extend from `before` time steps before a
600600
#' `ref_time_value` through `after` time steps after `ref_time_value`.
601601
#' 2. Note that the outputs are a similar but different: `epix_slide()`
602-
#' returns only the grouping variables, `time_value`, and the new column(s)
603-
#' from the slide computation `f`, whereas `epi_slide()` returns all original
604-
#' variables plus the new columns from the slide computation.
602+
#' returns a tibble containing only the grouping variables, `time_value`, and
603+
#' the new column(s) from the slide computation `f`, whereas `epi_slide()`
604+
#' returns an `epi_df` with all original variables plus the new columns from
605+
#' the slide computation.
605606
#' Apart from this, the interfaces between `epix_slide()` and `epi_slide()` are
606-
#' the same.
607+
#' the same.
607608
#'
608609
#' Furthermore, the current function can be considerably slower than
609610
#' `epi_slide()`, for two reasons: (1) it must repeatedly fetch

0 commit comments

Comments
 (0)