man/epix_slide.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/methods-epi_archive.R
\name{epix_slide}
\alias{epix_slide}
\title{Slide a function over variables in an \code{epi_archive} or \code{grouped_epi_archive}}
\usage{
epix_slide(
  x,
  f,
  ...,
  before,
  ref_time_values,
  time_step,
  new_col_name = "slide_value",
  as_list_col = FALSE,
  names_sep = "_",
  all_versions = FALSE
)
}
\arguments{
\item{x}{An \code{\link{epi_archive}} or \code{\link{grouped_epi_archive}} object. If ungrouped,
all data in \code{x} will be treated as part of a single data group.}

\item{f}{Function, formula, or missing; together with \code{...} specifies the
computation to slide. To "slide" means to apply a computation over a
sliding (a.k.a. "rolling") time window for each data group. The window is
determined by the \code{before} parameter described below. One time step is
typically one day or one week; see \code{\link{epi_slide}} details for more
explanation. If a function, \code{f} must take an \code{epi_df} with the same
column names as the archive's \code{DT}, minus the \code{version} column; followed
by a one-row tibble containing the values of the grouping variables for
the associated group; followed by a reference time value, usually as a
\code{Date} object; followed by any number of named arguments. If a formula,
\code{f} can operate directly on columns accessed via \code{.x$var} or \code{.$var}, as
in \code{~ mean(.x$var)} to compute a mean of a column \code{var} for each
group-\code{ref_time_value} combination. The group key can be accessed via
\code{.y} or \code{.group_key}, and the reference time value can be accessed via
\code{.z} or \code{.ref_time_value}. If \code{f} is missing, then \code{...} will specify the
computation.}

\item{...}{Additional arguments to pass to the function or formula specified
via \code{f}. Alternatively, if \code{f} is missing, then \code{...} is interpreted as an
expression for tidy evaluation. See details of \code{\link{epi_slide}}.}

\item{before}{How far \code{before} each \code{ref_time_value} should the sliding
window extend? If provided, should be a single, non-NA,
\link[vctrs:vec_cast]{integer-compatible} number of time steps. This window
endpoint is inclusive. For example, if \code{before = 7}, and one time step is
one day, then to produce a value for a \code{ref_time_value} of January 8, we
apply the given function or formula to data (for each group present) with
\code{time_value}s from January 1 onward, as they were reported on January 8.
For typical disease surveillance sources, this will not include any data
with a \code{time_value} of January 8, and, depending on the amount of reporting
latency, may not include January 7 or even earlier \code{time_value}s. (If
instead the archive were to hold nowcasts instead of regular surveillance
data, then we would indeed expect data for \code{time_value} January 8. If it
were to hold forecasts, then we would expect data for \code{time_value}s after
January 8, and the sliding window would extend as far after each
\code{ref_time_value} as needed to include all such \code{time_value}s.)}

\item{ref_time_values}{Reference time values / versions for sliding
computations; each element of this vector serves both as the anchor point
for the \code{time_value} window for the computation and as the \code{max_version}
\code{as_of} which we fetch data in this window. If missing, then this will be
set to a regularly-spaced sequence of values covering the range of
\code{version}s in the \code{DT} plus the \code{versions_end}; the spacing of values will
be guessed (using the GCD of the skips between values).}

\item{time_step}{Optional function used to define the meaning of one time
step, which if specified, overrides the default choice based on the
\code{time_value} column. This function must take a positive integer and return
an object of class \code{lubridate::period}. For example, we can use \code{time_step = lubridate::hours} in order to set the time step to be one hour (this
would only be meaningful if \code{time_value} is of class \code{POSIXct}).}

\item{new_col_name}{String indicating the name of the new column that will
contain the derived (slide) values. Default is "slide_value"; note that setting
\code{new_col_name} equal to an existing column name will overwrite this column.}

\item{as_list_col}{Should the slide results be held in a list column, or be
\link[tidyr:chop]{unchopped}/\link[tidyr:nest]{unnested}? Default is \code{FALSE},
in which case a list object returned by \code{f} would be unnested (using
\code{\link[tidyr:nest]{tidyr::unnest()}}), and, if the slide computations output data frames,
the names of the resulting columns are given by prepending \code{new_col_name}
to the names of the list elements.}

\item{names_sep}{String specifying the separator to use in \code{tidyr::unnest()}
when \code{as_list_col = FALSE}. Default is "_". Using \code{NULL} drops the prefix
from \code{new_col_name} entirely.}

\item{all_versions}{(Not the same as the \code{all_rows} parameter of \code{epi_slide}.) If
\code{all_versions = TRUE}, then \code{f} will be passed the version history (all
\code{version <= ref_time_value}) for rows having \code{time_value} between
\code{ref_time_value - before} and \code{ref_time_value}. Otherwise, \code{f} will be
passed only the most recent \code{version} for every unique \code{time_value}.
Default is \code{FALSE}.}
}
\value{
A tibble whose columns are: the grouping variables, \code{time_value},
containing the reference time values for the slide computation, and a
column named according to the \code{new_col_name} argument, containing the slide
values.
}
\description{
Slides a given function over variables in an \code{epi_archive} object. This
behaves similarly to \code{epi_slide()}, with the key exception that it is
version-aware: the sliding computation at any given reference time t is
performed on \strong{data that would have been available as of t}. See the
\href{https://cmu-delphi.github.io/epiprocess/articles/archive.html}{archive vignette} for
examples.
}
\details{
A few key distinctions between the current function and \code{epi_slide()}:
\enumerate{
\item In \code{f} functions for \code{epix_slide}, one should not assume that the input
data contain any rows with \code{time_value} matching the computation's
\code{ref_time_value} (accessible via \verb{attributes(<data>)$metadata$as_of}); for
typical epidemiological surveillance data, observations pertaining to a
particular time period (\code{time_value}) are first reported \code{as_of} some
instant after that time period has ended.
\item \code{epix_slide()} doesn't accept an \code{after} argument; its windows extend
from \code{before} time steps before a given \code{ref_time_value} through the last
\code{time_value} available as of version \code{ref_time_value} (typically, this
won't include \code{ref_time_value} itself, as observations about a particular
time interval (e.g., day) are only published after that time interval
ends); \code{epi_slide} windows extend from \code{before} time steps before a
\code{ref_time_value} through \code{after} time steps after \code{ref_time_value}.
\item The input class and columns are similar but different: \code{epix_slide}
(with the default \code{all_versions=FALSE}) keeps all columns and the
\code{epi_df}-ness of the first argument to each computation; \code{epi_slide} only
provides the grouping variables in the second input, and will convert the
first input into a regular tibble if the grouping variables include the
essential \code{geo_value} column. (With \code{all_versions=TRUE}, \code{epix_slide} will
provide an \code{epi_archive} rather than an \code{epi_df} to each
computation.)
\item The output class and columns are similar but different: \code{epix_slide()}
returns a tibble containing only the grouping variables, \code{time_value}, and
the new column(s) from the slide computations, whereas \code{epi_slide()}
returns an \code{epi_df} with all original variables plus the new columns from
the slide computations. (Both will mirror the grouping or ungroupedness of
their input, with one exception: \code{epi_archive}s can have trivial
(zero-variable) groupings, but these will be dropped in \code{epix_slide}
results as they are not supported by tibbles.)
\item There are no size stability checks or element/row recycling to maintain
size stability in \code{epix_slide}, unlike in \code{epi_slide}. (\code{epix_slide} is
roughly analogous to \code{\link[dplyr:group_map]{dplyr::group_modify}}, while \code{epi_slide} is roughly
analogous to \code{dplyr::mutate} followed by \code{dplyr::arrange}.) This is detailed
in the "advanced" vignette.
\item \code{all_rows} is not supported in \code{epix_slide}; since the slide
computations are allowed more flexibility in their outputs than in
\code{epi_slide}, we can't guess a good representation for missing computations
for excluded group-\code{ref_time_value} pairs.
\item The \code{ref_time_values} default for \code{epix_slide} is based on making an
evenly-spaced sequence out of the \code{version}s in the \code{DT} plus the
\code{versions_end}, rather than the \code{time_value}s.
}
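
As a rough sketch of points 1 and 2 (not taken from the package's own
examples), a formula can use \code{.ref_time_value} to measure how far the most
recent available \code{time_value} lags behind a reference time. It assumes the
\code{archive_cases_dv_subset} example archive with \code{Date}-valued
\code{time_value}s; the column name \code{latency_days} is made up for illustration:

\if{html}{\out{<div class="sourceCode">}}\preformatted{library(dplyr)

archive_cases_dv_subset \%>\%
  group_by(geo_value) \%>\%
  epix_slide(
    # Days between the reference time and the latest `time_value` reported
    # as of that reference time:
    ~ as.integer(.ref_time_value - max(.x$time_value)),
    before = 5,
    ref_time_values = as.Date("2020-06-08"),
    new_col_name = "latency_days"
  ) \%>\%
  ungroup()
}\if{html}{\out{</div>}}

For typical surveillance data, one would expect \code{latency_days} to be
positive, reflecting the reporting delay described in point 1.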

Apart from the above distinctions, the interfaces of \code{epix_slide()} and
\code{epi_slide()} are the same.

Furthermore, the current function can be considerably slower than
\code{epi_slide()}, for two reasons: (1) it must repeatedly fetch
properly-versioned snapshots from the data archive (via its \code{as_of()}
method), and (2) it performs a "manual" sliding of sorts, and does not
benefit from the highly efficient \code{slider} package. For these reasons, it
should be used in place of \code{epi_slide()} only when
version-aware sliding is necessary (as is its purpose).
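
To illustrate reason (1), a single slide computation can be approximated by
hand: fetch the snapshot for one reference time with \code{epix_as_of()}, restrict
it to the window, and summarize. This is only a sketch, assuming that
\code{epix_as_of(x, max_version)} is available in the installed version and that
one time step is one day; \code{epix_slide()} has to repeat work of this kind for
every element of \code{ref_time_values}:

\if{html}{\out{<div class="sourceCode">}}\preformatted{library(dplyr)

ref_time_value <- as.Date("2020-06-08")
before <- 2

# Snapshot of the archive as it stood on the reference date:
snapshot <- epix_as_of(archive_cases_dv_subset, max_version = ref_time_value)

# Restrict to the window and compute one value per group:
snapshot \%>\%
  as_tibble() \%>\%
  filter(time_value >= ref_time_value - before) \%>\%
  group_by(geo_value) \%>\%
  summarize(slide_value = mean(case_rate_7d_av))
}\if{html}{\out{</div>}}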

Finally, this is simply a wrapper around the \code{slide()} method of the
\code{epi_archive} and \code{grouped_epi_archive} classes, so if \code{x} is an
object of either of these classes, then:

\if{html}{\out{<div class="sourceCode">}}\preformatted{epix_slide(x, new_var = comp(old_var), before = 119)
}\if{html}{\out{</div>}}

is equivalent to:

\if{html}{\out{<div class="sourceCode">}}\preformatted{x$slide(new_var = comp(old_var), before = 119)
}\if{html}{\out{</div>}}
}
\examples{
library(dplyr)

# Reference time points for which we want to compute slide values:
ref_time_values <- seq(as.Date("2020-06-01"),
                       as.Date("2020-06-15"),
                       by = "1 day")

# A simple (but not very useful) example (see the archive vignette for a more
# realistic one):
archive_cases_dv_subset \%>\%
  group_by(geo_value) \%>\%
  epix_slide(f = ~ mean(.x$case_rate_7d_av),
             before = 2,
             ref_time_values = ref_time_values,
             new_col_name = 'case_rate_7d_av_recent_av') \%>\%
  ungroup()
# We requested time windows that started 2 days before the corresponding time
# values. The actual number of `time_value`s in each computation depends on
# the reporting latency of the signal and `time_value` range covered by the
# archive (2020-06-01 -- 2021-11-30 in this example).  In this case, we have
# * 0 `time_value`s, for ref time 2020-06-01 --> the result is automatically
#                                                discarded
# * 1 `time_value`, for ref time 2020-06-02
# * 2 `time_value`s, for the rest of the results
# * never the 3 `time_value`s we would get from `epi_slide`, since, because
#   of data latency, we'll never have an observation
#   `time_value == ref_time_value` as of `ref_time_value`.
# The example below shows this type of behavior in more detail.

# Examining characteristics of the data passed to each computation with
# `all_versions=FALSE`.
archive_cases_dv_subset \%>\%
  group_by(geo_value) \%>\%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        time_range = if (nrow(x) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("\%s -- \%s", min(x$time_value), max(x$time_value))
        },
        n = nrow(x),
        class1 = class(x)[[1L]]
      )
    },
    before = 5, all_versions = FALSE,
    ref_time_values = ref_time_values, names_sep = NULL) \%>\%
  ungroup() \%>\%
  arrange(geo_value, time_value)
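
# A sketch of the same kind of computation as the first example, but with
# `f` missing so that `...` is interpreted via tidy evaluation. The new
# column takes its name from the named expression:
archive_cases_dv_subset \%>\%
  group_by(geo_value) \%>\%
  epix_slide(case_rate_7d_av_recent_av = mean(case_rate_7d_av),
             before = 2,
             ref_time_values = ref_time_values) \%>\%
  ungroup()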

# --- Advanced: ---

# `epix_slide` with `all_versions=FALSE` (the default) applies a
# version-unaware computation to several versions of the data. We can also
# use `all_versions=TRUE` to apply a version-*aware* computation to several
# versions of the data, again looking at characteristics of the data passed
# to each computation. In this case, each computation should expect an
# `epi_archive` containing the relevant version data:

archive_cases_dv_subset \%>\%
  group_by(geo_value) \%>\%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        versions_start = if (nrow(x$DT) == 0L) {
          "NA (0 rows)"
        } else {
          toString(min(x$DT$version))
        },
        versions_end = x$versions_end,
        time_range = if (nrow(x$DT) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("\%s -- \%s", min(x$DT$time_value), max(x$DT$time_value))
        },
        n = nrow(x$DT),
        class1 = class(x)[[1L]]
      )
    },
    before = 5, all_versions = TRUE,
    ref_time_values = ref_time_values, names_sep = NULL) \%>\%
  ungroup() \%>\%
  # Focus on one geo_value so we can better see the columns above:
  filter(geo_value == "ca") \%>\%
  select(-geo_value)

}