
Commit 34b018a

second readthrough of README.Rmd

1 parent 38b86e8

File tree: 2 files changed (+45 −38 lines)


README.Rmd (+20 −15)
@@ -116,7 +116,7 @@ Below the fold, we construct this dataset as an `epiprocess::epi_df` from JHU da
 <details>
 <summary> Creating the dataset using `{epidatr}` and `{epiprocess}` </summary>
 This dataset can be found in the package as <TODO DOESN'T EXIST>; we demonstrate some of the typically ubiquitous cleaning operations needed to be able to forecast.
-First we pull both jhu-csse cases and deaths from [`{epidatr}` package](https://cmu-delphi.github.io/epidatr/):
+First we pull both jhu-csse cases and deaths from [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:
 ```{r case_death}
 cases <- pub_covidcast(
   source = "jhu-csse",
@@ -141,7 +141,7 @@ cases_deaths <-
 plot_locations <- c("ca", "ma", "ny", "tx")
 # plotting the data as it was downloaded
 cases_deaths |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -152,7 +152,7 @@ cases_deaths |>
 As with basically any dataset, there is some cleaning that we will need to do to make it actually usable; we'll use some utilities from [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this.
 First, to eliminate some of the noise coming from daily reporting, we do 7 day averaging over a trailing window[^1]:

-[^1]: This makes it so that any given day of the new dataset only depends on the previous week, which means that we avoid leaking future values when making a forecast.
+[^1]: This makes it so that any given day of the processed timeseries only depends on the previous week, which means that we avoid leaking future values when making a forecast.

 ```{r smooth}
 cases_deaths <-
@@ -193,7 +193,6 @@ of the states, noting the actual forecast date:
 <details>
 <summary> Plot </summary>
 ```{r plot_locs}
-plot_locations <- c("ca", "ma", "ny", "tx")
 forecast_date_label <-
   tibble(
     geo_value = rep(plot_locations, 2),
@@ -203,7 +202,7 @@ forecast_date_label <-
   )
 processed_data_plot <-
   cases_deaths |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -219,7 +218,10 @@ processed_data_plot <-
 processed_data_plot
 ```

-To make a forecast, we will use a "canned" simple auto-regressive forecaster to predict the death rate four weeks into the future using past (lagged) deaths and cases
+To make a forecast, we will use a "canned" simple auto-regressive forecaster to predict the death rate four weeks into the future using lagged[^3] deaths and cases
+
+[^3]: lagged by 3 in this context meaning using the value from 3 days ago.
+
 ```{r make-forecasts, warning=FALSE}
 four_week_ahead <- arx_forecaster(
   cases_deaths |> filter(time_value <= forecast_date),
@@ -233,11 +235,13 @@ four_week_ahead <- arx_forecaster(
 four_week_ahead
 ```

-In this case, we have used a number of different lags for the case rate, while
-using zero, one and two weekly lags for the death rate (as predictors). `four_week_ahead` is both
-a fitted model object which could be used any time in the future to create
-different forecasts, as well as a set of predicted values (and prediction
-intervals) for each location 28 days after the forecast date.
+In this case, we have used 0-3 days, a week, and two week lags for the case
+rate, while using only zero, one and two weekly lags for the death rate (as
+predictors).
+The result `four_week_ahead` is both a fitted model object which could be used
+any time in the future to create different forecasts, as well as a set of
+predicted values (and prediction intervals) for each location 28 days after the
+forecast date.
 Plotting the prediction intervals on our subset above[^2]:

 [^2]: Alternatively, you could call `auto_plot(four_week_ahead)` to get the full collection of forecasts. This is too busy for the space we have for plotting here.
@@ -249,7 +253,7 @@ This is the same kind of plot as `processed_data_plot` above, but with the past
 narrow_data_plot <-
   cases_deaths |>
   filter(time_value > "2021-04-01") |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -273,15 +277,16 @@ forecast_plot <-
   narrow_data_plot |>
   epipredict:::plot_bands(
     restricted_predictions,
-    fill = "dodgerblue4") +
-  geom_point(data = restricted_predictions, aes(y = .data$value), color = "orange")
+    levels = 0.9,
+    fill = primary) +
+  geom_point(data = restricted_predictions, aes(y = .data$value), color = secondary)
 ```
 </details>

 ```{r show-single-forecast, warning=FALSE, echo=FALSE}
 forecast_plot
 ```
-The orange dot gives the median prediction, while the blue intervals give the 25-75%, 10-90%, and 2.5%-97.5% inter-quantile ranges.
+The yellow dot gives the median prediction, while the red interval gives the 5-95% inter-quantile range.
 For this particular day and these locations, the forecasts are relatively accurate, with the true data being within the 25-75% interval.
 A couple of things to note:

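The commit's new footnote pins down what "lagged" means in this README: a lag-k predictor for day t is simply the series value from day t−k. A minimal sketch of building such lag columns, using Python/pandas as a stand-in for the R/`{epipredict}` workflow (the data values and column names are illustrative, not from the package):

```python
import pandas as pd

# A toy daily death-rate series; a lag-k predictor for day t is the value at day t-k.
rates = pd.DataFrame({
    "time_value": pd.date_range("2021-04-01", periods=6, freq="D"),
    "death_rate": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Build lag-0 through lag-3 columns; shift(k) moves each value k rows forward,
# so row t holds the value from t-k days earlier (NaN where no history exists).
for k in [0, 1, 2, 3]:
    rates[f"lag_{k}"] = rates["death_rate"].shift(k)

# On the final day, lag_3 is the death rate from 3 days earlier.
print(rates["lag_0"].iloc[-1], rates["lag_3"].iloc[-1])  # 6.0 3.0
```

Weekly lags (7, 14) for the death rate would be built the same way, just with larger shifts.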
README.md (+25 −23)
@@ -62,7 +62,7 @@ Creating the dataset using `{epidatr}` and `{epiprocess}`
 This dataset can be found in the package as \<TODO DOESN’T EXIST\>; we
 demonstrate some of the typically ubiquitous cleaning operations needed
 to be able to forecast. First we pull both jhu-csse cases and deaths
-from [`{epidatr}` package](https://cmu-delphi.github.io/epidatr/):
+from [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:

 ``` r
 cases <- pub_covidcast(
@@ -88,7 +88,7 @@ cases_deaths <-
 plot_locations <- c("ca", "ma", "ny", "tx")
 # plotting the data as it was downloaded
 cases_deaths |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -163,7 +163,6 @@ Plot
 </summary>

 ``` r
-plot_locations <- c("ca", "ma", "ny", "tx")
 forecast_date_label <-
   tibble(
     geo_value = rep(plot_locations, 2),
@@ -173,7 +172,7 @@ forecast_date_label <-
   )
 processed_data_plot <-
   cases_deaths |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -191,7 +190,7 @@ processed_data_plot <-

 To make a forecast, we will use a “canned” simple auto-regressive
 forecaster to predict the death rate four weeks into the future using
-past (lagged) deaths and cases
+lagged[^2] deaths and cases

 ``` r
 four_week_ahead <- arx_forecaster(
@@ -223,13 +222,13 @@ four_week_ahead
 #>
 ```

-In this case, we have used a number of different lags for the case rate,
-while using zero, one and two weekly lags for the death rate (as
-predictors). `four_week_ahead` is both a fitted model object which could
-be used any time in the future to create different forecasts, as well as
-a set of predicted values (and prediction intervals) for each location
-28 days after the forecast date. Plotting the prediction intervals on
-our subset above[^2]:
+In this case, we have used 0-3 days, a week, and two week lags for the
+case rate, while using only zero, one and two weekly lags for the death
+rate (as predictors). The result `four_week_ahead` is both a fitted
+model object which could be used any time in the future to create
+different forecasts, as well as a set of predicted values (and
+prediction intervals) for each location 28 days after the forecast date.
+Plotting the prediction intervals on our subset above[^3]:

 <details>
 <summary>
@@ -243,7 +242,7 @@ the past data narrowed somewhat
 narrow_data_plot <-
   cases_deaths |>
   filter(time_value > "2021-04-01") |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -269,18 +268,18 @@ forecast_plot <-
   narrow_data_plot |>
   epipredict:::plot_bands(
     restricted_predictions,
-    fill = "dodgerblue4") +
-  geom_point(data = restricted_predictions, aes(y = .data$value), color = "orange")
+    levels = 0.9,
+    fill = primary) +
+  geom_point(data = restricted_predictions, aes(y = .data$value), color = secondary)
 ```

 </details>

 <img src="man/figures/README-show-single-forecast-1.png" width="90%" style="display: block; margin: auto;" />
-The orange dot gives the median prediction, while the blue intervals
-give the 25-75%, 10-90%, and 2.5%-97.5% inter-quantile ranges. For this
-particular day and these locations, the forecasts are relatively
-accurate, with the true data being within the 25-75% interval. A couple
-of things to note:
+The yellow dot gives the median prediction, while the red interval gives
+the 5-95% inter-quantile range. For this particular day and these
+locations, the forecasts are relatively accurate, with the true data
+being within the 25-75% interval. A couple of things to note:

 1. Our methods are primarily direct forecasters; this means we don’t
    need to predict 1, 2,…, 27 days ahead to then predict 28 days ahead
@@ -300,10 +299,13 @@ questions, feel free to contact [Daniel]([email protected]),
 [Logan]([email protected]), either via email or on the Insightnet
 slack.

-[^1]: This makes it so that any given day of the new dataset only
-    depends on the previous week, which means that we avoid leaking
+[^1]: This makes it so that any given day of the processed timeseries
+    only depends on the previous week, which means that we avoid leaking
     future values when making a forecast.

-[^2]: Alternatively, you could call `auto_plot(four_week_ahead)` to get
+[^2]: lagged by 3 in this context meaning using the value from 3 days
+    ago.
+
+[^3]: Alternatively, you could call `auto_plot(four_week_ahead)` to get
     the full collection of forecasts. This is too busy for the space we
     have for plotting here.
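The most repeated change across both files swaps the hard-coded `c("ca", "ma", "ny", "tx")` literal for the single `plot_locations` variable defined earlier, so the plotted location set is declared exactly once. The same define-once, filter-everywhere pattern, sketched in Python/pandas as a stand-in for the R pipeline (the data values are made up for illustration):

```python
import pandas as pd

# Declare the plot locations once; every downstream filter reuses this name,
# so changing the location set means editing a single line.
plot_locations = ["ca", "ma", "ny", "tx"]

cases_deaths = pd.DataFrame({
    "geo_value": ["ca", "fl", "ma", "ny", "tx", "wa"],
    "death_rate": [0.12, 0.20, 0.08, 0.15, 0.10, 0.05],
})

# Equivalent of R's filter(geo_value %in% plot_locations) in the diffs above.
subset = cases_deaths[cases_deaths["geo_value"].isin(plot_locations)]
print(sorted(subset["geo_value"]))  # the four declared locations
```

This is why the commit can also delete the duplicate `plot_locations <-` assignment inside the `plot_locs` chunk: one definition now serves every plot.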
