<!-- README.md is generated from README.Rmd. Please edit that file -->

- # epipredict
+ # Epipredict

<!-- badges: start -->

[![R-CMD-check](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->

- **Note:** This package is currently in development and may not work as
- expected. Please file bug reports as issues in this repo, and we will do
- our best to address them quickly.
+ Epipredict is a framework for building transformation and forecasting
+ pipelines for epidemiological and other panel time-series datasets. In
+ addition to tools for building forecasting pipelines, it contains a
+ number of “canned” forecasters meant to run with little modification as
+ an easy way to get started forecasting.
+
+ It is designed to work well with
+ [`epiprocess`](https://cmu-delphi.github.io/epiprocess/), a utility for
+ handling various time series and geographic processing tools in an
+ epidemiological context. Both of the packages are meant to work well
+ with the panel data provided by
+ [`epidatr`](https://cmu-delphi.github.io/epidatr/).
+
+ If you are looking for more detail beyond the package documentation, see
+ our [forecasting
+ book](https://cmu-delphi.github.io/delphi-tooling-book/).

## Installation

- To install (unless you’re making changes to the package, use the stable
- version):
+ To install (unless you’re planning on contributing to package
+ development, we suggest using the stable version):

``` r
# Stable version
@@ -25,52 +38,14 @@ pak::pkg_install("cmu-delphi/epipredict@main")
pak::pkg_install("cmu-delphi/epipredict@dev")
```

- ## Documentation
-
- You can view documentation for the `main` branch at
- <https://cmu-delphi.github.io/epipredict>.
-
- ## Goals for `epipredict`
-
- **We hope to provide:**
-
- 1.  A set of basic, easy-to-use forecasters that work out of the box.
-     You should be able to do a reasonably limited amount of
-     customization on them. For the basic forecasters, we currently
-     provide:
-     - Baseline flatline forecaster
-     - Autoregressive forecaster
-     - Autoregressive classifier
-     - CDC FluSight flatline forecaster
- 2.  A framework for creating custom forecasters out of modular
-     components. There are four types of components:
-     - Preprocessor: do things to the data before model training
-     - Trainer: train a model on data, resulting in a fitted model object
-     - Predictor: make predictions, using a fitted model object
-     - Postprocessor: do things to the predictions before returning
+ The documentation for the stable version is at
+ <https://cmu-delphi.github.io/epipredict>, while the development version
+ is at <https://cmu-delphi.github.io/epipredict/dev>.

- **Target audiences:**
+ ## Motivating example

- - Basic. Has data, calls forecaster with default arguments.
- - Intermediate. Wants to examine changes to the arguments, take
-   advantage of built in flexibility.
- - Advanced. Wants to write their own forecasters. Maybe willing to build
-   up from some components.
-
- The Advanced user should find their task to be relatively easy. Examples
- of these tasks are illustrated in the [vignettes and
- articles](https://cmu-delphi.github.io/epipredict).
-
- See also the (in progress) [Forecasting
- Book](https://cmu-delphi.github.io/delphi-tooling-book/).
-
- ## Intermediate example
-
- The package comes with some built-in historical data for illustration,
- but up-to-date versions of this could be downloaded with the
- [`{epidatr}` package](https://cmu-delphi.github.io/epidatr/) and
- processed using
- [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/).[^1]
+ To demonstrate the kind of forecast epipredict can make, say we’re
+ predicting COVID deaths per 100k for each state on

``` r
forecast_date <- as.Date("2021-08-01")
@@ -95,17 +70,19 @@ cases <- pub_covidcast(
  signals = "confirmed_incidence_prop",
  time_type = "day",
  geo_type = "state",
-   time_values = epirange(20200601, 20220101),
-   geo_values = "*") |>
+   time_values = epirange(20200601, 20211231),
+   geo_values = "*"
+ ) |>
  select(geo_value, time_value, case_rate = value)

deaths <- pub_covidcast(
  source = "jhu-csse",
  signals = "deaths_incidence_prop",
  time_type = "day",
  geo_type = "state",
-   time_values = epirange(20200601, 20220101),
-   geo_values = "*") |>
+   time_values = epirange(20200601, 20211231),
+   geo_values = "*"
+ ) |>
  select(geo_value, time_value, death_rate = value)
cases_deaths <-
  full_join(cases, deaths, by = c("time_value", "geo_value")) |>
@@ -123,6 +100,7 @@ cases_deaths |>
```

<img src="man/figures/README-case_death-1.png" width="90%" style="display: block; margin: auto;" />
+
As with basically any dataset, there is some cleaning that we will need
to do to make it actually usable; we’ll use some utilities from
[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this.
@@ -131,7 +109,7 @@ First, to eliminate some of the noise coming from daily reporting, we do

``` r
cases_deaths <-
-   cases_deaths |>
+   cases_deaths |>
  group_by(geo_value) |>
  epi_slide(
    cases_7dav = mean(case_rate, na.rm = TRUE),
@@ -150,47 +128,54 @@ cases_deaths <-
  cases_deaths |>
  group_by(geo_value) |>
  mutate(
-     outlr_death_rate = detect_outlr_rm(time_value, death_rate, detect_negatives = TRUE),
-     outlr_case_rate = detect_outlr_rm(time_value, case_rate, detect_negatives = TRUE)
+     outlr_death_rate = detect_outlr_rm(
+       time_value, death_rate, detect_negatives = TRUE
+     ),
+     outlr_case_rate = detect_outlr_rm(
+       time_value, case_rate, detect_negatives = TRUE
+     )
  ) |>
  unnest(cols = starts_with("outlr"), names_sep = "_") |>
  ungroup() |>
  mutate(
    death_rate = outlr_death_rate_replacement,
-     case_rate = outlr_case_rate_replacement) |>
+     case_rate = outlr_case_rate_replacement
+   ) |>
  select(geo_value, time_value, case_rate, death_rate)
cases_deaths
- #> An `epi_df` object, 32,480 x 4 with metadata:
+ #> An `epi_df` object, 32,424 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
- #> * as_of     = 2022-05-31 12:08:25.791826
+ #> * as_of     = 2022-01-01
#>
- #> # A tibble: 20,496 × 4
- #>    geo_value time_value case_rate death_rate
- #>  * <chr>     <date>         <dbl>      <dbl>
- #>  1 ak        2020-12-31      35.9      0.158
- #>  2 al        2020-12-31      65.1      0.438
- #>  3 ar        2020-12-31      66.0      1.27
- #>  4 as        2020-12-31       0        0
- #>  5 az        2020-12-31      76.8      1.10
- #>  6 ca        2020-12-31      96.0      0.751
- #>  7 co        2020-12-31      35.8      0.649
- #>  8 ct        2020-12-31      52.1      0.819
- #>  9 dc        2020-12-31      31.0      0.601
- #> 10 de        2020-12-31      65.2      0.807
- #> # ℹ 20,486 more rows
+ #> # A tibble: 32,424 × 4
+ #>   geo_value time_value case_rate death_rate
+ #> * <chr>     <date>         <dbl>      <dbl>
+ #> 1 ak        2020-06-01      2.31          0
+ #> 2 ak        2020-06-02      1.94          0
+ #> 3 ak        2020-06-03      2.63          0
+ #> 4 ak        2020-06-04      2.59          0
+ #> 5 ak        2020-06-05      2.43          0
+ #> 6 ak        2020-06-06      2.35          0
+ #> # ℹ 32,418 more rows
```

- To create and train a simple auto-regressive forecaster to predict the
- death rate two weeks into the future using past (lagged) deaths and
- cases, we could use the following function.
+ </details>
+
+ After having downloaded and cleaned the data in `cases_deaths`, we plot
+ a subset of the states, noting the actual forecast date:
+
+ <details>
+ <summary>
+ Plot
+ </summary>

``` r
forecast_date_label <-
  tibble(
    geo_value = rep(plot_locations, 2),
-     source = c(rep("case_rate",4), rep("death_rate", 4)),
-     dates = rep(forecast_date - 7 * 2, 2 * length(plot_locations)),
+     source = c(rep("case_rate", 4), rep("death_rate", 4)),
+     dates = rep(forecast_date - 7 * 2, 2 * length(plot_locations)),
    heights = c(rep(150, 4), rep(1.0, 4))
  )
processed_data_plot <-
@@ -202,7 +187,10 @@ processed_data_plot <-
  facet_grid(source ~ geo_value, scale = "free") +
  geom_vline(aes(xintercept = forecast_date)) +
  geom_text(
-     data = forecast_date_label, aes(x = dates, label = "forecast\ndate", y = heights), size = 3, hjust = "right") +
+     data = forecast_date_label,
+     aes(x = dates, label = "forecast\ndate", y = heights),
+     size = 3, hjust = "right"
+   ) +
  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
@@ -222,25 +210,26 @@ four_week_ahead <- arx_forecaster(
  predictors = c("case_rate", "death_rate"),
  args_list = arx_args_list(
    lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
-     ahead = 14
+     ahead = 4 * 7
  )
)
- two_week_ahead
- #> ══ A basic forecaster of type ARX Forecaster ═══════════════════════════════
+ four_week_ahead
+ #> ══ A basic forecaster of type ARX Forecaster ════════════════════════════════
#>
- #> This forecaster was fit on 2024-11-11 11:38:31.
+ #> This forecaster was fit on 2025-01-24 14:47:38.
#>
#> Training data was an <epi_df> with:
#> • Geography: state,
#> • Time type: day,
- #> • Using data up-to-date as of: 2022-05-31 12:08:25.
+ #> • Using data up-to-date as of: 2022-01-01.
+ #> • With the last data available on 2021-08-01
#>
- #> ── Predictions ─────────────────────────────────────────────────────────────
+ #> ── Predictions ──────────────────────────────────────────────────────────────
#>
#> A total of 56 predictions are available for
#> • 56 unique geographic regions,
- #> • At forecast date: 2021-12-31,
- #> • For target date: 2022-01-14.
+ #> • At forecast date: 2021-08-01,
+ #> • For target date: 2021-08-29,
#>
```
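
To make the `lags` and `ahead` arguments above more concrete, here is a minimal base-R sketch of a lagged linear regression on a simulated single-location series. It is an illustration under stated assumptions, not epipredict's implementation: the data are simulated, `lag_by()` and `lead_by()` are hypothetical helpers, the first lag vector is assumed to apply to `case_rate` and the second to `death_rate`, and the quantile bands of the real forecaster are not reproduced.

``` r
# Simulated daily series for one location (toy values, not real data).
set.seed(1)
n <- 200
toy <- data.frame(
  time_value = as.Date("2021-01-01") + 0:(n - 1),
  case_rate = 50 + cumsum(rnorm(n)),
  death_rate = 1 + cumsum(rnorm(n, sd = 0.05))
)

# Hypothetical helpers: shift a vector k days back or forward, padding with NA.
lag_by <- function(x, k) c(rep(NA, k), head(x, length(x) - k))
lead_by <- function(x, k) c(tail(x, length(x) - k), rep(NA, k))

ahead <- 4 * 7
case_lags <- c(0, 1, 2, 3, 7, 14) # assumed to pair with case_rate
death_lags <- c(0, 7, 14)         # assumed to pair with death_rate

# Response: death_rate `ahead` days in the future.
# Predictors: lagged copies of both signals.
design <- data.frame(y = lead_by(toy$death_rate, ahead))
for (k in case_lags) {
  design[[paste0("case_lag", k)]] <- lag_by(toy$case_rate, k)
}
for (k in death_lags) {
  design[[paste0("death_lag", k)]] <- lag_by(toy$death_rate, k)
}

# lm() drops rows with missing values (the first 14 and the last 28 days).
fit <- lm(y ~ ., data = design)

# Point prediction made from the most recent day of data.
latest <- design[n, setdiff(names(design), "y")]
pred <- predict(fit, newdata = latest)
```

Here `pred` is a single point prediction for the day that lies `ahead` days after the last row of `toy`.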
@@ -271,15 +260,16 @@ narrow_data_plot <-
  facet_grid(source ~ geo_value, scale = "free") +
  geom_vline(aes(xintercept = forecast_date)) +
  geom_text(
-     data = forecast_date_label, aes(x = dates, label = "forecast\ndate", y = heights), size = 3, hjust = "right") +
+     data = forecast_date_label,
+     aes(x = dates, label = "forecast\ndate", y = heights),
+     size = 3, hjust = "right"
+   ) +
  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

- The fitted model here involved preprocessing the data to appropriately
- generate lagged predictors, estimating a linear model with `stats::lm()`
- and then postprocessing the results to be meaningful for epidemiological
- tasks. We can also examine the predictions.
+ Putting that together with a plot of the bands, and a plot of the median
+ prediction.

``` r
epiworkflow <- four_week_ahead$epi_workflow
@@ -293,16 +283,17 @@ forecast_plot <-
  epipredict:::plot_bands(
    restricted_predictions,
    levels = 0.9,
-     fill = primary) +
-   geom_point(data = restricted_predictions, aes(y = .data$value), color = secondary)
+     fill = primary
+   ) +
+   geom_point(data = restricted_predictions,
+              aes(y = .data$value),
+              color = secondary)
```

- The results above show a distributional forecast produced using data
- through the end of 2021 for the 14th of January 2022. A prediction for
- the death rate per 100K inhabitants is available for every state
- (`geo_value`) along with a 90% predictive interval.
+ </details>

<img src="man/figures/README-show-single-forecast-1.png" width="90%" style="display: block; margin: auto;" />
+
The yellow dot gives the median prediction, while the red interval gives
the 5-95% inter-quantile range. For this particular day and these
locations, the forecasts are relatively accurate, with the true data