Skip to content

Commit 6ae998c

Browse files
committed
second readthrough of README.Rmd
1 parent 604173a commit 6ae998c

File tree

2 files changed

+318
-104
lines changed

2 files changed

+318
-104
lines changed

README.Rmd

+123-34
Original file line numberDiff line numberDiff line change
@@ -40,20 +40,46 @@ You can view documentation for the `main` branch at <https://cmu-delphi.github.i
4040

4141
## Goals for `epipredict`
4242

43-
**We hope to provide:**
44-
45-
1. A set of basic, easy-to-use forecasters that work out of the box. You should be able to do a reasonably limited amount of customization on them. For the basic forecasters, we currently provide:
46-
* Baseline flatline forecaster
47-
* Autoregressive forecaster
48-
* Autoregressive classifier
49-
* CDC FluSight flatline forecaster
50-
2. A framework for creating custom forecasters out of modular components. There are four types of components:
51-
* Preprocessor: do things to the data before model training
52-
* Trainer: train a model on data, resulting in a fitted model object
53-
* Predictor: make predictions, using a fitted model object
54-
* Postprocessor: do things to the predictions before returning
43+
<details>
44+
<summary> Creating the dataset using `{epidatr}` and `{epiprocess}` </summary>
45+
This dataset can be found in the package as <TODO DOESN'T EXIST>; we demonstrate some of the typically ubiquitous cleaning operations needed to be able to forecast.
46+
First we pull both jhu-csse cases and deaths from [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:
47+
```{r case_death}
48+
cases <- pub_covidcast(
49+
source = "jhu-csse",
50+
signals = "confirmed_incidence_prop",
51+
time_type = "day",
52+
geo_type = "state",
53+
time_values = epirange(20200601, 20220101),
54+
geo_values = "*") |>
55+
select(geo_value, time_value, case_rate = value)
56+
57+
deaths <- pub_covidcast(
58+
source = "jhu-csse",
59+
signals = "deaths_incidence_prop",
60+
time_type = "day",
61+
geo_type = "state",
62+
time_values = epirange(20200601, 20220101),
63+
geo_values = "*") |>
64+
select(geo_value, time_value, death_rate = value)
65+
cases_deaths <-
66+
full_join(cases, deaths, by = c("time_value", "geo_value")) |>
67+
as_epi_df(as_of = as.Date("2022-01-01"))
68+
plot_locations <- c("ca", "ma", "ny", "tx")
69+
# plotting the data as it was downloaded
70+
cases_deaths |>
71+
filter(geo_value %in% plot_locations) |>
72+
pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
73+
ggplot(aes(x = time_value, y = value)) +
74+
geom_line() +
75+
facet_grid(source ~ geo_value, scale = "free") +
76+
scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
77+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
78+
```
79+
As with basically any dataset, there is some cleaning that we will need to do to make it actually usable; we'll use some utilities from [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this.
80+
First, to eliminate some of the noise coming from daily reporting, we do 7 day averaging over a trailing window[^1]:
5581

56-
**Target audiences:**
82+
[^1]: This makes it so that any given day of the processed timeseries only depends on the previous week, which means that we avoid leaking future values when making a forecast.
5783

5884
* Basic. Has data, calls forecaster with default arguments.
5985
* Intermediate. Wants to examine changes to the arguments, take advantage of
@@ -86,6 +112,41 @@ covid_case_death_rates
86112

87113
To create and train a simple auto-regressive forecaster to predict the death rate two weeks into the future using past (lagged) deaths and cases, we could use the following function.
88114

115+
After having downloaded and cleaned the data in `cases_deaths`, we plot a subset
116+
of the states, noting the actual forecast date:
117+
118+
<details>
119+
<summary> Plot </summary>
120+
```{r plot_locs}
121+
forecast_date_label <-
122+
tibble(
123+
geo_value = rep(plot_locations, 2),
124+
source = c(rep("case_rate",4), rep("death_rate", 4)),
125+
dates = rep(forecast_date - 7*2, 2 * length(plot_locations)),
126+
heights = c(rep(150, 4), rep(1.0, 4))
127+
)
128+
processed_data_plot <-
129+
cases_deaths |>
130+
filter(geo_value %in% plot_locations) |>
131+
pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
132+
ggplot(aes(x = time_value, y = value)) +
133+
geom_line() +
134+
facet_grid(source ~ geo_value, scale = "free") +
135+
geom_vline(aes(xintercept = forecast_date)) +
136+
geom_text(
137+
data = forecast_date_label, aes(x=dates, label = "forecast\ndate", y = heights), size = 3, hjust = "right") +
138+
scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
139+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
140+
```
141+
</details>
142+
```{r show-processed-data, warning=FALSE, echo=FALSE}
143+
processed_data_plot
144+
```
145+
146+
To make a forecast, we will use a "canned" simple auto-regressive forecaster to predict the death rate four weeks into the future using lagged[^3] deaths and cases
147+
148+
[^3]: lagged by 3 in this context meaning using the value from 3 days ago.
149+
89150
```{r make-forecasts, warning=FALSE}
90151
two_week_ahead <- arx_forecaster(
91152
covid_case_death_rates,
@@ -99,28 +160,56 @@ two_week_ahead <- arx_forecaster(
99160
two_week_ahead
100161
```
101162

102-
In this case, we have used a number of different lags for the case rate, while
103-
only using 3 weekly lags for the death rate (as predictors). The result is both
104-
a fitted model object which could be used any time in the future to create
105-
different forecasts, as well as a set of predicted values (and prediction
106-
intervals) for each location 14 days after the last available time value in the
107-
data.
108-
109-
```{r print-model}
110-
two_week_ahead$epi_workflow
163+
In this case, we have used 0-3 days, a week, and two week lags for the case
164+
rate, while using only zero, one and two weekly lags for the death rate (as
165+
predictors).
166+
The result `four_week_ahead` is both a fitted model object which could be used
167+
any time in the future to create different forecasts, as well as a set of
168+
predicted values (and prediction intervals) for each location 28 days after the
169+
forecast date.
170+
Plotting the prediction intervals on our subset above[^2]:
171+
172+
[^2]: Alternatively, you could call `auto_plot(four_week_ahead)` to get the full collection of forecasts. This is too busy for the space we have for plotting here.
173+
174+
<details>
175+
<summary> Plot </summary>
176+
This is the same kind of plot as `processed_data_plot` above, but with the past data narrowed somewhat
177+
```{r}
178+
narrow_data_plot <-
179+
cases_deaths |>
180+
filter(time_value > "2021-04-01") |>
181+
filter(geo_value %in% plot_locations) |>
182+
pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
183+
ggplot(aes(x = time_value, y = value)) +
184+
geom_line() +
185+
facet_grid(source ~ geo_value, scale = "free") +
186+
geom_vline(aes(xintercept = forecast_date)) +
187+
geom_text(
188+
data = forecast_date_label, aes(x=dates, label = "forecast\ndate", y = heights), size = 3, hjust = "right") +
189+
scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
190+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
111191
```
112192

113-
The fitted model here involved preprocessing the data to appropriately generate
114-
lagged predictors, estimating a linear model with `stats::lm()` and then
115-
postprocessing the results to be meaningful for epidemiological tasks. We can
116-
also examine the predictions.
117-
118-
```{r show-preds}
119-
two_week_ahead$predictions
193+
Putting that together with a plot of the bands, and a plot of the median prediction.
194+
```{r plotting_forecast, warning=FALSE}
195+
epiworkflow <- four_week_ahead$epi_workflow
196+
restricted_predictions <-
197+
four_week_ahead$predictions |>
198+
filter(geo_value %in% plot_locations) |>
199+
rename(time_value = target_date, value = .pred) |>
200+
mutate(source = "death_rate")
201+
forecast_plot <-
202+
narrow_data_plot |>
203+
epipredict:::plot_bands(
204+
restricted_predictions,
205+
levels = 0.9,
206+
fill = primary) +
207+
geom_point(data = restricted_predictions, aes(y = .data$value), color = secondary)
120208
```
121209

122-
The results above show a distributional forecast produced using data through
123-
the end of 2021 for the 14th of January 2022. A prediction for the death rate
124-
per 100K inhabitants is available for every state (`geo_value`) along with a
125-
90% predictive interval.
126-
210+
```{r show-single-forecast, warning=FALSE, echo=FALSE}
211+
forecast_plot
212+
```
213+
The yellow dot gives the median prediction, while the red interval gives the 5-95% inter-quantile range.
214+
For this particular day and these locations, the forecasts are relatively accurate, with the true data being within the 25-75% interval.
215+
A couple of things to note:

0 commit comments

Comments
 (0)