% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/archive.R
\name{epi_archive}
\alias{epi_archive}
\alias{new_epi_archive}
\alias{validate_epi_archive}
\alias{as_epi_archive}
\title{\code{epi_archive} object}
\usage{
new_epi_archive(
  x,
  geo_type = NULL,
  time_type = NULL,
  other_keys = NULL,
  additional_metadata = NULL,
  compactify = NULL,
  clobberable_versions_start = NULL,
  versions_end = NULL
)

validate_epi_archive(
  x,
  geo_type = NULL,
  time_type = NULL,
  other_keys = NULL,
  additional_metadata = NULL,
  compactify = NULL,
  clobberable_versions_start = NULL,
  versions_end = NULL
)

as_epi_archive(
  x,
  geo_type = NULL,
  time_type = NULL,
  other_keys = NULL,
  additional_metadata = NULL,
  compactify = NULL,
  clobberable_versions_start = NULL,
  .versions_end = NULL,
  ...,
  versions_end = .versions_end
)
}
\arguments{
\item{x}{A data.frame, data.table, or tibble, with columns \code{geo_value},
\code{time_value}, \code{version}, and any number of additional columns.}
\item{geo_type}{Type for the geo values. If missing, then the function will
attempt to infer it from the geo values present; if this fails, then it
will be set to "custom".}
\item{time_type}{Type for the time values. If missing, then the function will
attempt to infer it from the time values present; if this fails, then it
will be set to "custom".}
\item{other_keys}{Character vector specifying the names of variables in \code{x}
that should be considered key variables (in the language of \code{data.table})
apart from "geo_value", "time_value", and "version".}
\item{additional_metadata}{List of additional metadata to attach to the
\code{epi_archive} object. The metadata will have \code{geo_type} and \code{time_type}
fields; named entries from the passed list will be included as well.}
\item{compactify}{Optional; Boolean or \code{NULL}. \code{TRUE} will remove some
redundant rows, \code{FALSE} will not, and missing or \code{NULL} will remove
redundant rows, but issue a warning. See more information at \code{compactify}.}
\item{clobberable_versions_start}{Optional; \code{length}-1; either a value of the
same \code{class} and \code{typeof} as \code{x$version}, or an \code{NA} of any \code{class} and
\code{typeof}: specifically, either (a) the earliest version that could be
subject to "clobbering" (being overwritten with different update data, but
using the \emph{same} version tag as the old update data), or (b) \code{NA}, to
indicate that no versions are clobberable. There are a variety of reasons
why versions could be clobberable under routine circumstances, such as (a)
today's version of one/all of the columns being published after initially
being filled with \code{NA} or LOCF, (b) a buggy version of today's data being
published but then fixed and republished later in the day, or (c) data
pipeline delays (e.g., publisher uploading, periodic scraping, database
syncing, periodic fetching, etc.) that make events (a) or (b) reflected
later in the day (or even on a different day) than expected; potential
causes vary between different data pipelines. The default value is \code{NA},
which doesn't consider any versions to be clobberable. Another setting that
may be appropriate for some pipelines is \code{max_version_with_row_in(x)}.}
\item{versions_end}{Optional; length-1, same \code{class} and \code{typeof} as
\code{x$version}: what is the last version we have observed? The default is
\code{max_version_with_row_in(x)}, but values greater than this could also be
valid, and would indicate that we observed additional versions of the data
beyond \code{max(x$version)}, but they all contained empty updates. (The default
value of \code{clobberable_versions_start} does not fully trust these empty
updates, and assumes that any version \verb{>= max(x$version)} could be
clobbered.) If \code{nrow(x) == 0}, then this argument is mandatory.}
\item{.versions_end}{Position-based \code{versions_end}; it prevents a prefix-matched
argument such as \code{version = issue} from being assigned to \code{versions_end}
instead of being used to rename columns.}
\item{...}{Used for specifying column names, as in \code{\link[dplyr:rename]{dplyr::rename}}. For
example, \code{version = release_date}.}
}
\value{
An \code{epi_archive} object.
}
\description{
An \code{epi_archive} is an S3 class which contains a data table
along with several relevant pieces of metadata. The data table can be seen
as the full archive (version history) for some signal variables of
interest.
}
\details{
Epi Archive
An \code{epi_archive} contains a data table \code{DT}, of class \code{data.table}
from the \code{data.table} package, with (at least) the following columns:
\itemize{
\item \code{geo_value}: the geographic value associated with each row of measurements.
\item \code{time_value}: the time value associated with each row of measurements.
\item \code{version}: the time value specifying the version for each row of
measurements. For example, if in a given row the \code{version} is January 15,
2022 and \code{time_value} is January 14, 2022, then this row contains the
measurements of the data for January 14, 2022 that were available one day
later.
}
The data table \code{DT} has key variables \code{geo_value}, \code{time_value}, \code{version},
as well as any others (these can be specified when instantiating the
\code{epi_archive} object via the \code{other_keys} argument, and/or set by operating
on \code{DT} directly). Refer to the documentation for \code{as_epi_archive()} for
information and examples of relevant parameter names for an \code{epi_archive}
object. Note that there can only be a single row per unique combination of
key variables, and thus the key variables are critical for figuring out how
to generate a snapshot of data from the archive, as of a given version.
}
\section{Metadata}{
The following pieces of metadata are included as fields in an \code{epi_archive}
object:
\itemize{
\item \code{geo_type}: the type for the geo values.
\item \code{time_type}: the type for the time values.
\item \code{additional_metadata}: list of additional metadata for the data archive.
}
Unlike an \code{epi_df} object, metadata for an \code{epi_archive} object \code{x} can be
accessed (and altered) directly, as in \code{x$geo_type} or \code{x$time_type},
etc. Like an \code{epi_df} object, the \code{geo_type} and \code{time_type} fields in the
metadata of an \code{epi_archive} object are not currently used by any
downstream functions in the \code{epiprocess} package, and serve only as useful
bits of information to convey about the data set at hand.
}
\section{Generating Snapshots}{
An \code{epi_archive} object can be used to generate a snapshot of the data in
\code{epi_df} format, which represents the most up-to-date values of the signal
variables, as of the specified version. This is accomplished by calling
\code{epix_as_of()}.
}
\section{Sliding Computations}{
We can run a sliding computation over an \code{epi_archive} object, much like
\code{epi_slide()} does for an \code{epi_df} object. This is accomplished by calling
\code{epix_slide()} on an \code{epi_archive} object, which works similarly to
the way \code{epi_slide()} works for an \code{epi_df} object, but with one key
difference: it is version-aware. That is, for an \code{epi_archive} object, the
sliding computation at any given reference time point t is performed on
\strong{data that would have been available as of t}.
}
\examples{
# Simple ex. with necessary keys
tib <- tibble::tibble(
  geo_value = rep(c("ca", "hi"), each = 5),
  time_value = rep(seq(as.Date("2020-01-01"),
    by = 1, length.out = 5
  ), times = 2),
  version = rep(seq(as.Date("2020-01-02"),
    by = 1, length.out = 5
  ), times = 2),
  value = rnorm(10, mean = 2, sd = 1)
)

toy_epi_archive <- tib \%>\% as_epi_archive(
  geo_type = "state",
  time_type = "day"
)
toy_epi_archive
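
# A minimal sketch: read metadata fields directly from the archive built
# above, then take a snapshot with epix_as_of(). The version is passed by
# position because the name of that argument is an assumption that may
# differ between epiprocess releases.
toy_epi_archive$geo_type
toy_epi_archive$time_type
# Versions up to 2020-01-04 cover time values 2020-01-01 through 2020-01-03
# for both "ca" and "hi".
toy_snapshot <- epix_as_of(toy_epi_archive, as.Date("2020-01-04"))
toy_snapshot
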
# Ex. with an additional key for county
df <- data.frame(
  geo_value = c(replicate(2, "ca"), replicate(2, "fl")),
  county = c(1, 3, 2, 5),
  time_value = c(
    "2020-06-01",
    "2020-06-02",
    "2020-06-01",
    "2020-06-02"
  ),
  version = c(
    "2020-06-02",
    "2020-06-03",
    "2020-06-02",
    "2020-06-03"
  ),
  cases = c(1, 2, 3, 4),
  cases_rate = c(0.01, 0.02, 0.01, 0.05)
)

x <- df \%>\% as_epi_archive(
  geo_type = "state",
  time_type = "day",
  other_keys = "county"
)
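
# A minimal sketch: inspect the key of the underlying data.table with
# data.table::key(); it should list "county" alongside geo_value, time_value,
# and version.
data.table::key(x$DT)

# A minimal sketch of compactify, using hypothetical data: the second version
# repeats the value reported in the first, so compactify = TRUE drops it as
# redundant and the archive keeps only the rows where the value changed.
dup <- tibble::tibble(
  geo_value = "ca",
  time_value = as.Date("2020-01-01"),
  version = as.Date("2020-01-02") + 0:2,
  value = c(1, 1, 2)
)
compact_archive <- as_epi_archive(dup, compactify = TRUE)
compact_archive$DT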
}