@@ -59,7 +59,7 @@ from JHU data.
Creating the dataset using `{epidatr}` and `{epiprocess}`
</summary>

- This dataset can be found in the package as \<TODO DOESN’T EXIST\>; we
+ This dataset can be found in the package as `covid_case_death_rates`; we
demonstrate some of the typically ubiquitous cleaning operations needed
to be able to forecast. First we pull both jhu-csse cases and deaths
from the [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:
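
The hunk that follows shows only the tail of the deaths query; the matching cases pull sits outside this diff. As a rough sketch of what that call looks like, assuming the non-averaged `confirmed_incidence_prop` signal and a 2020-06 through 2021-12 window (neither is visible in these hunks):

``` r
library(epidatr)
library(dplyr)

# hypothetical sketch of the cases pull; the signal name and date range are assumed
cases <- pub_covidcast(
  source = "jhu-csse",
  signals = "confirmed_incidence_prop",
  geo_type = "state",
  time_type = "day",
  geo_values = "*",
  time_values = epirange(20200601, 20211231)
) |>
  select(geo_value, time_value, case_rate = value)
```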
@@ -84,26 +84,35 @@ deaths <- pub_covidcast(
  geo_values = "*"
) |>
  select(geo_value, time_value, death_rate = value)
+ ```
+
+ Since visualizing the results on every geography is somewhat
+ overwhelming, we’ll only train on a subset of 4.
+
+ ``` r
+ used_locations <- c("ca", "ma", "ny", "tx")
cases_deaths <-
  full_join(cases, deaths, by = c("time_value", "geo_value")) |>
+   filter(geo_value %in% used_locations) |>
  as_epi_df(as_of = as.Date("2022-01-01"))
- plot_locations <- c("ca", "ma", "ny", "tx")
# plotting the data as it was downloaded
cases_deaths |>
-   filter(geo_value %in% plot_locations) |>
-   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
-   ggplot(aes(x = time_value, y = value)) +
-   geom_line() +
-   facet_grid(source ~ geo_value, scale = "free") +
+   autoplot(
+     case_rate,
+     death_rate,
+     .color_by = "none"
+   ) +
+   facet_grid(.response_name ~ geo_value, scale = "free") +
  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

- <img src="man/figures/README-case_death-1.png" width="90%" style="display: block; margin: auto;" />
+ <img src="man/figures/README-date-1.png" width="90%" style="display: block; margin: auto;" />

As with basically any dataset, there is some cleaning that we will need
to do to make it actually usable; we’ll use some utilities from
[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this.
+
First, to eliminate some of the noise coming from daily reporting, we do
7 day averaging over a trailing window[^1]:

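
The averaging code itself is outside this hunk. As a stand-in illustration of what a trailing 7-day mean does (the README relies on {epiprocess} sliding helpers, whose exact arguments aren't shown here), a dplyr-only sketch might look like this; the `cases_deaths_smoothed` name is made up for the example:

``` r
library(dplyr)

# illustrative only: a trailing (right-aligned) 7-day mean per location;
# each day is averaged with the 6 days before it
trailing_mean <- function(x, k = 7) {
  vapply(seq_along(x), function(i) mean(x[max(1, i - k + 1):i]), numeric(1))
}

cases_deaths_smoothed <- cases_deaths |>
  group_by(geo_value) |>
  arrange(time_value, .by_group = TRUE) |>
  mutate(
    case_rate = trailing_mean(case_rate),
    death_rate = trailing_mean(death_rate)
  ) |>
  ungroup()
```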
@@ -129,10 +138,12 @@ cases_deaths <-
  group_by(geo_value) |>
  mutate(
    outlr_death_rate = detect_outlr_rm(
-       time_value, death_rate, detect_negatives = TRUE
+       time_value, death_rate,
+       detect_negatives = TRUE
    ),
    outlr_case_rate = detect_outlr_rm(
-       time_value, case_rate, detect_negatives = TRUE
+       time_value, case_rate,
+       detect_negatives = TRUE
    )
  ) |>
  unnest(cols = starts_with("outlr"), names_sep = "_") |>
@@ -142,22 +153,6 @@ cases_deaths <-
    case_rate = outlr_case_rate_replacement
  ) |>
  select(geo_value, time_value, case_rate, death_rate)
- cases_deaths
- #> An `epi_df` object, 32,424 x 4 with metadata:
- #> * geo_type  = state
- #> * time_type = day
- #> * as_of     = 2022-01-01
- #>
- #> # A tibble: 32,424 × 4
- #>   geo_value time_value case_rate death_rate
- #> * <chr>     <date>         <dbl>      <dbl>
- #> 1 ak        2020-06-01      2.31          0
- #> 2 ak        2020-06-02      1.94          0
- #> 3 ak        2020-06-03      2.63          0
- #> 4 ak        2020-06-04      2.59          0
- #> 5 ak        2020-06-05      2.43          0
- #> 6 ak        2020-06-06      2.35          0
- #> # ℹ 32,418 more rows
```

</details>
@@ -173,18 +168,19 @@ Plot
``` r
forecast_date_label <-
  tibble(
-     geo_value = rep(plot_locations, 2),
-     source = c(rep("case_rate", 4), rep("death_rate", 4)),
-     dates = rep(forecast_date - 7 * 2, 2 * length(plot_locations)),
+     geo_value = rep(used_locations, 2),
+     .response_name = c(rep("case_rate", 4), rep("death_rate", 4)),
+     dates = rep(forecast_date - 7 * 2, 2 * length(used_locations)),
    heights = c(rep(150, 4), rep(1.0, 4))
  )
processed_data_plot <-
  cases_deaths |>
-   filter(geo_value %in% plot_locations) |>
-   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
-   ggplot(aes(x = time_value, y = value)) +
-   geom_line() +
-   facet_grid(source ~ geo_value, scale = "free") +
+   autoplot(
+     case_rate,
+     death_rate,
+     .color_by = "none"
+   ) +
+   facet_grid(.response_name ~ geo_value, scale = "free") +
  geom_vline(aes(xintercept = forecast_date)) +
  geom_text(
    data = forecast_date_label,
@@ -216,7 +212,7 @@ four_week_ahead <- arx_forecaster(
four_week_ahead
#> ══ A basic forecaster of type ARX Forecaster ════════════════════════════════
#>
- #> This forecaster was fit on 2025-01-24 15:31:46.
+ #> This forecaster was fit on 2025-01-27 16:36:10.
#>
#> Training data was an <epi_df> with:
#> • Geography: state,
@@ -226,8 +222,8 @@ four_week_ahead
#>
#> ── Predictions ──────────────────────────────────────────────────────────────
#>
- #> A total of 56 predictions are available for
- #> • 56 unique geographic regions,
+ #> A total of 4 predictions are available for
+ #> • 4 unique geographic regions,
#> • At forecast date: 2021-08-01,
#> • For target date: 2021-08-29,
#>
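
Only part of the printed summary fits in these hunks. For working with the result programmatically, the later hunks suggest `four_week_ahead` is a list whose two main components are a fitted workflow and a predictions tibble; a quick, illustrative way to poke at it:

``` r
# components referenced by the plotting code further down
four_week_ahead$epi_workflow # fitted workflow (preprocessing, model, postprocessing)
four_week_ahead$predictions  # tibble of forecasts, one row per geo_value here
```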
@@ -246,58 +242,34 @@ Plotting the prediction intervals on our subset above[^3]:
Plot
</summary>

- This is the same kind of plot as `processed_data_plot` above, but with
- the past data narrowed somewhat
-
``` r
- narrow_data_plot <-
-   cases_deaths |>
-   filter(time_value > "2021-04-01") |>
-   filter(geo_value %in% plot_locations) |>
-   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
-   ggplot(aes(x = time_value, y = value)) +
-   geom_line() +
-   facet_grid(source ~ geo_value, scale = "free") +
+ epiworkflow <- four_week_ahead$epi_workflow
+ restricted_predictions <-
+   four_week_ahead$predictions |>
+   rename(time_value = target_date, value = .pred) |>
+   mutate(.response_name = "death_rate")
+ forecast_plot <-
+   four_week_ahead |>
+   autoplot(plot_data = cases_deaths) +
  geom_vline(aes(xintercept = forecast_date)) +
  geom_text(
-     data = forecast_date_label,
+     data = forecast_date_label %>% filter(.response_name == "death_rate"),
    aes(x = dates, label = "forecast\ndate", y = heights),
    size = 3, hjust = "right"
  ) +
  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

- Putting that together with a plot of the bands, and a plot of the median
- prediction.
-
- ``` r
- epiworkflow <- four_week_ahead$epi_workflow
- restricted_predictions <-
-   four_week_ahead$predictions |>
-   filter(geo_value %in% plot_locations) |>
-   rename(time_value = target_date, value = .pred) |>
-   mutate(source = "death_rate")
- forecast_plot <-
-   narrow_data_plot |>
-   epipredict:::plot_bands(
-     restricted_predictions,
-     levels = 0.9
-   ) +
-   geom_point(
-     data = restricted_predictions,
-     aes(y = .data$value)
-   )
- ```
-
</details>

<img src="man/figures/README-show-single-forecast-1.png" width="90%" style="display: block; margin: auto;" />

- The yellow dot gives the median prediction, while the red interval gives
- the 5-95% inter-quantile range. For this particular day and these
- locations, the forecasts are relatively accurate, with the true data
- being within the 25-75% interval. A couple of things to note:
+ The black dot gives the median prediction, while the blue intervals give
+ the 25-75%, the 10-90%, and 2.5-97.5% inter-quantile ranges. For this
+ particular day and these locations, the forecasts are relatively
+ accurate, with the true data being within the 25-75% interval. A couple
+ of things to note:

1.  Our methods are primarily direct forecasters; this means we don’t
    need to predict 1, 2,…, 27 days ahead to then predict 28 days ahead
@@ -312,10 +284,10 @@ being within the 25-75% interval. A couple of things to note:
If you encounter a bug or have a feature request, feel free to file an
[issue on our github
page](https://github.com/cmu-delphi/epipredict/issues). For other
- questions, feel free to contact [Daniel]([email protected]),
- [Logan]([email protected]), either via email or on the Insightnet
- slack.
+ questions, feel free to reach out to the authors, either via this
+ [contact
+ form](https://docs.google.com/forms/d/e/1FAIpQLScqgT1fKZr5VWBfsaSp-DNaN03aV6EoZU4YljIzHJ1Wl_zmtg/viewform),
+ email, or the Insightnet slack.

[^1]: This makes it so that any given day of the processed timeseries
    only depends on the previous week, which means that we avoid leaking