% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/methods-epi_archive.R
\name{epix_slide}
\alias{epix_slide}
\title{Slide a function over variables in an \code{epi_archive} or \code{grouped_epi_archive}}
\usage{
epix_slide(
  x,
  f,
  ...,
  before,
  ref_time_values,
  time_step,
  new_col_name = "slide_value",
  as_list_col = FALSE,
  names_sep = "_",
  all_versions = FALSE
)
}
\arguments{
\item{x}{An \code{\link{epi_archive}} or \code{\link{grouped_epi_archive}} object. If ungrouped,
all data in \code{x} will be treated as part of a single data group.}
\item{f}{Function, formula, or missing; together with \code{...} specifies the
computation to slide. To "slide" means to apply a computation over a
sliding (a.k.a. "rolling") time window for each data group. The window is
determined by the \code{before} parameter described below. One time step is
typically one day or one week; see \code{\link{epi_slide}} details for more
explanation. If a function, \code{f} must take an \code{epi_df} with the same
column names as the archive's \code{DT}, minus the \code{version} column; followed
by a one-row tibble containing the values of the grouping variables for
the associated group; followed by a reference time value, usually as a
\code{Date} object; followed by any number of named arguments. If a formula,
\code{f} can operate directly on columns accessed via \code{.x$var} or \code{.$var}, as
in \code{~ mean(.x$var)} to compute the mean of a column \code{var} for each
group-\code{ref_time_value} combination. The group key can be accessed via
\code{.y} or \code{.group_key}, and the reference time value can be accessed via
\code{.z} or \code{.ref_time_value}. If \code{f} is missing, then \code{...} will specify the
computation.}
\item{...}{Additional arguments to pass to the function or formula specified
via \code{f}. Alternatively, if \code{f} is missing, then \code{...} is interpreted as an
expression for tidy evaluation. See details of \code{\link{epi_slide}}.}
\item{before}{How far \code{before} each \code{ref_time_value} should the sliding
window extend? If provided, should be a single, non-NA,
\link[vctrs:vec_cast]{integer-compatible} number of time steps. This window
endpoint is inclusive. For example, if \code{before = 7}, and one time step is
one day, then to produce a value for a \code{ref_time_value} of January 8, we
apply the given function or formula to data (for each group present) with
\code{time_value}s from January 1 onward, as they were reported on January 8.
For typical disease surveillance sources, this will not include any data
with a \code{time_value} of January 8, and, depending on the amount of reporting
latency, may not include January 7 or even earlier \code{time_value}s. (If
the archive were to hold nowcasts instead of regular surveillance
data, then we would indeed expect data for \code{time_value} January 8. If it
were to hold forecasts, then we would expect data for \code{time_value}s after
January 8, and the sliding window would extend as far after each
\code{ref_time_value} as needed to include all such \code{time_value}s.)}
\item{ref_time_values}{Reference time values / versions for sliding
computations; each element of this vector serves both as the anchor point
for the \code{time_value} window for the computation and as the \code{max_version}
\code{as_of} which we fetch data in this window. If missing, then this will be
set to a regularly-spaced sequence of values covering the range of
\code{version}s in the \code{DT} plus the \code{versions_end}; the spacing of values will
be guessed (using the GCD of the skips between values).}
\item{time_step}{Optional function used to define the meaning of one time
step, which if specified, overrides the default choice based on the
\code{time_value} column. This function must take a positive integer and return
an object of class \code{lubridate::period}. For example, we can use \code{time_step = lubridate::hours} in order to set the time step to be one hour (this
would only be meaningful if \code{time_value} is of class \code{POSIXct}).}
\item{new_col_name}{String indicating the name of the new column that will
contain the slide values. Default is "slide_value"; note that setting
\code{new_col_name} equal to an existing column name will overwrite this column.}
\item{as_list_col}{Should the slide results be held in a list column, or be
\link[tidyr:chop]{unchopped}/\link[tidyr:nest]{unnested}? Default is \code{FALSE},
in which case a list object returned by \code{f} would be unnested (using
\code{\link[tidyr:nest]{tidyr::unnest()}}), and, if the slide computations output data frames,
the names of the resulting columns are given by prepending \code{new_col_name}
to the names of the list elements.}
\item{names_sep}{String specifying the separator to use in \code{tidyr::unnest()}
when \code{as_list_col = FALSE}. Default is "_". Using \code{NULL} drops the prefix
from \code{new_col_name} entirely.}
\item{all_versions}{(Not the same as the \code{all_rows} parameter of \code{epi_slide}.) If
\code{all_versions = TRUE}, then \code{f} will be passed the version history (all
\code{version <= ref_time_value}) for rows having \code{time_value} between
\code{ref_time_value - before} and \code{ref_time_value}. Otherwise, \code{f} will be
passed only the most recent \code{version} for every unique \code{time_value}.
Default is \code{FALSE}.}
}
\value{
A tibble whose columns are: the grouping variables, \code{time_value},
containing the reference time values for the slide computation, and a
column named according to the \code{new_col_name} argument, containing the slide
values.
}
\description{
Slides a given function over variables in an \code{epi_archive} object. This
behaves similarly to \code{epi_slide()}, with the key exception that it is
version-aware: the sliding computation at any given reference time t is
performed on \strong{data that would have been available as of t}. See the
\href{https://cmu-delphi.github.io/epiprocess/articles/archive.html}{archive vignette} for
examples.
}
\details{
A few key distinctions between the current function and \code{epi_slide()}:
\enumerate{
\item In \code{f} functions for \code{epix_slide}, one should not assume that the input
data contain any rows with \code{time_value} matching the computation's
\code{ref_time_value} (accessible via \verb{attributes(<data>)$metadata$as_of}); for
typical epidemiological surveillance data, observations pertaining to a
particular time period (\code{time_value}) are first reported \code{as_of} some
instant after that time period has ended.
\item \code{epix_slide()} doesn't accept an \code{after} argument; its windows extend
from \code{before} time steps before a given \code{ref_time_value} through the last
\code{time_value} available as of version \code{ref_time_value} (typically, this
won't include \code{ref_time_value} itself, as observations about a particular
time interval (e.g., day) are only published after that time interval
ends); \code{epi_slide} windows extend from \code{before} time steps before a
\code{ref_time_value} through \code{after} time steps after \code{ref_time_value}.
\item The input class and columns are similar but different: \code{epix_slide}
(with the default \code{all_versions=FALSE}) keeps all columns and the
\code{epi_df}-ness of the first argument to each computation; \code{epi_slide} only
provides the grouping variables in the second input, and will convert the
first input into a regular tibble if the grouping variables include the
essential \code{geo_value} column. (With \code{all_versions=TRUE}, \code{epix_slide}
will provide an \code{epi_archive} rather than an \code{epi_df} to each
computation.)
\item The output class and columns are similar but different: \code{epix_slide()}
returns a tibble containing only the grouping variables, \code{time_value}, and
the new column(s) from the slide computations, whereas \code{epi_slide()}
returns an \code{epi_df} with all original variables plus the new columns from
the slide computations. (Both will mirror the grouping or ungroupedness of
their input, with one exception: \code{epi_archive}s can have trivial
(zero-variable) groupings, but these will be dropped in \code{epix_slide}
results as they are not supported by tibbles.)
\item There are no size stability checks or element/row recycling to maintain
size stability in \code{epix_slide}, unlike in \code{epi_slide}. (\code{epix_slide} is
roughly analogous to \code{\link[dplyr:group_map]{dplyr::group_modify}}, while \code{epi_slide} is roughly
analogous to \code{dplyr::mutate} followed by \code{dplyr::arrange}.) This is detailed
in the "advanced" vignette.
\item \code{all_rows} is not supported in \code{epix_slide}; since the slide
computations are allowed more flexibility in their outputs than in
\code{epi_slide}, we can't guess a good representation for missing computations
for excluded group-\code{ref_time_value} pairs.
\item The \code{ref_time_values} default for \code{epix_slide} is based on making an
evenly-spaced sequence out of the \code{version}s in the \code{DT} plus the
\code{versions_end}, rather than the \code{time_value}s.
}
Apart from the above distinctions, the interfaces of \code{epix_slide()} and
\code{epi_slide()} are the same.
Furthermore, the current function can be considerably slower than
\code{epi_slide()}, for two reasons: (1) it must repeatedly fetch
properly-versioned snapshots from the data archive (via its \code{as_of()}
method), and (2) it performs a "manual" sliding of sorts, and does not
benefit from the highly efficient \code{slider} package. For these reasons, it
should not be used in place of \code{epi_slide()} unless version-aware sliding
is necessary (as is its purpose).
Finally, this is simply a wrapper around the \code{slide()} method of the
\code{epi_archive} and \code{grouped_epi_archive} classes, so if \code{x} is an
object of either of these classes, then:
\if{html}{\out{<div class="sourceCode">}}\preformatted{epix_slide(x, new_var = comp(old_var), before = 119)
}\if{html}{\out{</div>}}
is equivalent to:
\if{html}{\out{<div class="sourceCode">}}\preformatted{x$slide(new_var = comp(old_var), before = 119)
}\if{html}{\out{</div>}}
}
\examples{
library(dplyr)
# Reference time points for which we want to compute slide values:
ref_time_values <- seq(as.Date("2020-06-01"),
                       as.Date("2020-06-15"),
                       by = "1 day")
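# Illustration only (an assumption about the idea, not the package's actual
# code): when `ref_time_values` is missing, the default is described as a
# regularly spaced sequence whose spacing is guessed from the GCD of the gaps
# between observed versions. The hypothetical helper `gcd2()` and the made-up
# `example_versions` below sketch that guess in base R:
example_versions <- as.Date(c("2020-06-01", "2020-06-08", "2020-06-22"))
gaps_in_days <- as.numeric(diff(example_versions))
# Euclid's algorithm, written with floor() to avoid the modulo operator,
# which would need escaping in an Rd file
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a - b * floor(a / b))
guessed_step <- Reduce(gcd2, gaps_in_days)
guessed_ref_time_values <- seq(min(example_versions), max(example_versions),
                               by = guessed_step)
guessed_ref_time_values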
# A simple (but not very useful) example (see the archive vignette for a more
# realistic one):
archive_cases_dv_subset \%>\%
  group_by(geo_value) \%>\%
  epix_slide(f = ~ mean(.x$case_rate_7d_av),
             before = 2,
             ref_time_values = ref_time_values,
             new_col_name = 'case_rate_7d_av_recent_av') \%>\%
  ungroup()
# We requested time windows that started 2 days before the corresponding time
# values. The actual number of `time_value`s in each computation depends on
# the reporting latency of the signal and the `time_value` range covered by
# the archive (2020-06-01 -- 2021-11-30 in this example). In this case, we have
# * 0 `time_value`s, for ref time 2020-06-01 --> the result is automatically
#   discarded
# * 1 `time_value`, for ref time 2020-06-02
# * 2 `time_value`s, for the rest of the results
# * never the 3 `time_value`s we would get from `epi_slide`, since, because
#   of data latency, we'll never have an observation with
#   `time_value == ref_time_value` as of `ref_time_value`.
# The example below shows this type of behavior in more detail.
# Examining characteristics of the data passed to each computation with
# `all_versions=FALSE`.
archive_cases_dv_subset \%>\%
  group_by(geo_value) \%>\%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        time_range = if (nrow(x) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("\%s -- \%s", min(x$time_value), max(x$time_value))
        },
        n = nrow(x),
        class1 = class(x)[[1L]]
      )
    },
    before = 5, all_versions = FALSE,
    ref_time_values = ref_time_values, names_sep = NULL) \%>\%
  ungroup() \%>\%
  arrange(geo_value, time_value)
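# A sketch of the tidy-evaluation form mentioned in the Details section: with
# `f` missing, the named expression in `...` is expected to specify both the
# computation and the name of the output column. This is intended to mirror
# the first example above; `grouped_archive` and `recent_av_tidy` are just
# illustrative variable names.
grouped_archive <- group_by(archive_cases_dv_subset, geo_value)
recent_av_tidy <- epix_slide(
  grouped_archive,
  case_rate_7d_av_recent_av = mean(case_rate_7d_av),
  before = 2,
  ref_time_values = ref_time_values
)
ungroup(recent_av_tidy)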
# --- Advanced: ---
# `epix_slide` with `all_versions=FALSE` (the default) applies a
# version-unaware computation to several versions of the data. We can also
# use `all_versions=TRUE` to apply a version-*aware* computation to several
# versions of the data, again looking at characteristics of the data passed
# to each computation. In this case, each computation should expect an
# `epi_archive` containing the relevant version data:
archive_cases_dv_subset \%>\%
  group_by(geo_value) \%>\%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        versions_start = if (nrow(x$DT) == 0L) {
          "NA (0 rows)"
        } else {
          toString(min(x$DT$version))
        },
        versions_end = x$versions_end,
        time_range = if (nrow(x$DT) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("\%s -- \%s", min(x$DT$time_value), max(x$DT$time_value))
        },
        n = nrow(x$DT),
        class1 = class(x)[[1L]]
      )
    },
    before = 5, all_versions = TRUE,
    ref_time_values = ref_time_values, names_sep = NULL) \%>\%
  ungroup() \%>\%
  # Focus on one geo_value so we can better see the columns above:
  filter(geo_value == "ca") \%>\%
  select(-geo_value)
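# A sketch of measuring reporting latency with the formula interface and the
# default `all_versions=FALSE`, using `.ref_time_value` as described for the
# `f` argument. For each group and version, it computes the number of days
# between the version and the latest `time_value` available as of that
# version. The zero-row guard is a defensive assumption, and
# `latency_by_version` and "latency_days" are illustrative names.
latency_by_version <- epix_slide(
  group_by(archive_cases_dv_subset, geo_value),
  f = ~ if (nrow(.x) == 0L) {
    NA_integer_
  } else {
    as.integer(.ref_time_value - max(.x$time_value))
  },
  before = 5,
  ref_time_values = ref_time_values,
  new_col_name = "latency_days"
)
ungroup(latency_by_version)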
}