
Commit 359f324

finished custom_workflows, reviewing backtesting
1 parent fc4dad6 commit 359f324


5 files changed: +148 -687 lines changed


_pkgdown.yml (+1 -2)
@@ -12,12 +12,11 @@ articles:
   contents:
   - epipredict
   - custom_epiworkflows
-  - preprocessing-and-models
   - backtesting
-  - arx-classifier
   - update
 - title: Advanced methods
   contents:
+  - arx-classifier
   - articles/smooth-qr
   - panel-data

vignettes/backtesting.Rmd (+23 -60)
@@ -22,8 +22,6 @@ library(magrittr)
 library(purrr)
 ```

-# Accurately backtesting forecasters
-
 Backtesting is a crucial step in the development of forecasting models. It
 involves testing the model on historical data to see how well it performs. This
 is important because it allows us to see how well the model generalizes to new
@@ -45,22 +43,17 @@ therein).

 In the `{epiprocess}` package, we provide `epix_slide()`, a function that allows
 a convenient way to perform version-aware forecasting by only using the data as
-it would have been available at forecast reference time. In
-`vignette("epi_archive", package = "epiprocess")`, we introduced the concept of
-an `epi_archive` and we demonstrated how to use `epix_slide()` to forecast the
-future using a simple quantile regression model. In this vignette, we will
-demonstrate how to use `epix_slide()` to backtest an auto-regressive forecaster
-on historical COVID-19 case data from the US and Canada. Instead of building a
-forecaster from scratch as we did in the previous vignette, we will use the
-`arx_forecaster()` function from the `{epipredict}` package.
+it would have been available at forecast reference time.
+In this vignette, we will demonstrate how to use `epix_slide()` to backtest an
+auto-regressive forecaster constructed using `arx_forecaster()` on historical
+COVID-19 case data from the US and Canada.

 ## Getting case data from US states into an `epi_archive`

-First, we download the version history (ie. archive) of the percentage of
-doctor's visits with CLI (COVID-like illness) computed from medical insurance
-claims and the number of new confirmed COVID-19 cases per 100,000 population
-(daily) for 6 states from the COVIDcast API (as used in the `epiprocess`
-vignette mentioned above).
+First, we create an `epi_archive()` to store the version history of the
+percentage of doctor's visits with CLI (COVID-like illness) computed from
+medical insurance claims and the number of new confirmed COVID-19 cases per
+100,000 population (daily) for 4 states.

 ```{r grab-epi-data}
 # Select the `percent_cli` column from the data archive
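For readers following the revised wording, here is a minimal sketch of how a versioned archive like `doctor_visits` is typically assembled with `{epidatr}` and `{epiprocess}`. It is illustrative only: the signal name, states, date ranges, and column selection below are assumptions, not the vignette's exact chunk (which the diff truncates).

```r
library(epidatr)    # pub_covidcast(), epirange()
library(epiprocess) # as_epi_archive()
library(dplyr)

# Request the full revision history ("issues") of a signal, then store it as an
# epi_archive so epix_as_of()/epix_slide() can replay what was known at any version.
doctor_visits <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  geo_type = "state",
  time_type = "day",
  geo_values = "ca,fl,ny,tx",             # illustrative 4-state subset
  time_values = epirange(20200601, 20211201),
  issues = epirange(20200601, 20211201)   # all versions, not just the latest
) %>%
  select(geo_value, time_value, version = issue, percent_cli = value) %>%
  as_epi_archive(compactify = TRUE)
```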
@@ -88,59 +81,25 @@ doctor_visits <- pub_covidcast(
   as_epi_archive(compactify = TRUE)
 ```

-## Backtesting a simple autoregressive forecaster
+`issues` is the name for `version` in the Epidata API.

-One of the most common use cases of `epiprocess::epi_archive()` object
-is for accurate model backtesting.
+__Note__: In the interest of computational speed, we only use the 4 state
+dataset here, but the full archive can be used in the same way and has performed
+well in the past.

-In this section we will:
+## Backtesting a simple autoregressive forecaster

-- develop a simple autoregressive forecaster that predicts the next value of the
-signal based on the current and past values of the signal itself, and
-- demonstrate how to slide this forecaster over the `epi_archive` object to
-produce forecasts at a few dates date, using version-unaware and -aware
-computations,
-- compare the two approaches.
+One of the most common use cases of the `epiprocess::epi_archive()` object is
+for accurate model backtesting.

 To start, let's use a simple autoregressive forecaster to predict the percentage
 of doctor's hospital visits with CLI (COVID-like illness) (`percent_cli`) in the
-future (we choose this target because of the dataset's pattern of substantial
-revisions; forecasting doctor's visits is an unusual forecasting target
-otherwise). While some AR models output single point forecasts, we will use
-quantile regression to produce a point prediction along with an 90\% uncertainty
-band, represented by a predictive quantiles at the 5\% and 95\% levels (lower
-and upper endpoints of the uncertainty band).
+future[^1].
+For increased accuracy, we will use quantile regression.

-The `arx_forecaster()` function wraps the autoregressive forecaster we need and
-comes with sensible defaults:

-- we specify the predicted outcome to be the percentage of doctor's visits with
-CLI (`percent_cli`),
-- we use a linear regression model as the engine,
-- the autoregressive features assume lags of 0, 7, and 14 days,
-- we forecast 7 days ahead.
-
-All these default settings and more can be seen by calling `arx_args_list()`:
-
-```{r}
-arx_args_list()
-```
-
-These can be modified as needed, by sending your desired arguments into
-`arx_forecaster(args_list = arx_args_list())`. For now we will use the defaults.
-
-__Note__: We will use a __geo-pooled approach__, where we train the model on
-data from all states and territories combined. This is because the data is quite
-similar across states, and pooling the data can help improve the accuracy of the
-forecasts, while also reducing the susceptibility of the model to noise. In the
-interest of computational speed, we only use the 6 state dataset here, but the
-full archive can be used in the same way and has performed well in the past.
-Implementation-wise, geo-pooling is achieved by not using `group_by(geo_value)`
-prior to `epix_slide()`. In other cases, grouping may be preferrable, so we
-leave it to the user to decide, but flag this modeling decision here.
-
-Let's use the `epix_as_of()` method to generate a snapshot of the archive at the
-last date, and then run the forecaster.
+As truth data, we'll use `epix_as_of()` to generate a snapshot of the archive at
+the last date, and then run the forecaster.

 ```{r}
 # Let's forecast 14 days prior to the last date in the archive, to compare.
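Since the revised text now relies on `arx_forecaster()`'s defaults instead of listing them, a brief sketch of the snapshot-then-forecast step may help. The lags, horizon, and use of `quantile_reg()` are assumptions that mirror the defaults and the quantile-regression choice mentioned above, not code copied from the vignette.

```r
library(epipredict)
library(epiprocess)

# Snapshot the archive at its final version; this acts as the "latest" data.
snapshot <- epix_as_of(doctor_visits, doctor_visits$versions_end)

# Run the canned autoregressive forecaster: quantile_reg() gives the quantile
# regression fit; lags of 0/7/14 days and a 7-day horizon match the defaults.
fcast <- arx_forecaster(
  snapshot,
  outcome = "percent_cli",
  predictors = "percent_cli",
  trainer = quantile_reg(),
  args_list = arx_args_list(lags = c(0, 7, 14), ahead = 7)
)
fcast$predictions
```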
@@ -464,3 +423,7 @@ ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) +
 ```

 </details>
+
+[^1]: (we choose this target because of the dataset's pattern of substantial
+revisions; forecasting doctor's visits is an unusual forecasting target
+otherwise)
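As a pointer to where the vignette is heading, here is a rough, assumption-laden outline of the version-aware backtest built on `epix_slide()`; the forecast dates are invented and the `.versions` argument name follows recent `{epiprocess}` releases, so treat this as a sketch rather than the vignette's actual code.

```r
library(epipredict)
library(epiprocess)
library(dplyr)

# Illustrative forecast reference dates at which to replay the archive.
forecast_dates <- seq(as.Date("2020-10-01"), as.Date("2021-03-01"), by = "1 month")

# epix_slide() hands the forecaster only the data that was available as of each
# version, which is what makes the backtest faithful to real-time conditions.
backtest_results <- doctor_visits %>%
  epix_slide(
    ~ arx_forecaster(
      .x,
      outcome = "percent_cli",
      predictors = "percent_cli",
      trainer = quantile_reg()
    )$predictions,
    .versions = forecast_dates
  )
```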

vignettes/custom_epiworkflows.Rmd (+121 -3)
@@ -301,8 +301,8 @@ growth_rate_recipe <- epi_recipe(
   step_epi_lag(case_rate, lag = c(0, 1, 2, 3, 7, 14)) |>
   step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
   step_epi_ahead(death_rate, ahead = 4 * 7) |>
-  step_growth_rate(death_rate) |>
   step_epi_naomit() |>
+  step_growth_rate(death_rate) |>
   step_training_window()
 ```

@@ -317,7 +317,9 @@ growth_rate_recipe |>
     death_rate, gr_7_rel_change_death_rate
   )
 ```
+
 And the role:
+
 ```{r growth_rate_roles}
 prepped <- growth_rate_recipe |>
   prep(training_data)
@@ -329,7 +331,9 @@ To demonstrate the changes in the layers that come along with it, we will use
 ```{r layer_and_fit}
 growth_rate_layers <- frosting() |>
   layer_predict() |>
-  layer_quantile_distn(quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)) |>
+  layer_quantile_distn(
+    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
+  ) |>
   layer_point_from_distn() |>
   layer_add_forecast_date() |>
   layer_add_target_date() |>
@@ -429,7 +433,121 @@ which are 2-3 orders of magnitude larger than the corresponding rates above.
 while `rate_rescaling` gives the denominator of the rate (our fit values were
 per 100,000).

-[^1]: Think of baking a cake, where adding the frosting is the last step in the process of actually baking.
+# Custom classifier workflow
+
+As a more complicated example of the kind of pipeline that you can build using
+this framework, here is an example of a hotspot prediction model, which predicts
+whether the case rates are increasing (`up`), decreasing (`down`) or flat
+(`flat`).
+This comes from a paper by McDonald, Bien, Green, Hu, et al.[^3], and roughly
+serves as an extension of `arx_classifier()`.
+
+First, we need to add a factor version of the `geo_value`, so that it can
+actually be used as a feature.
+
+```{r training_factor}
+training_data <-
+  training_data %>%
+  mutate(geo_value_factor = as.factor(geo_value))
+```
+
+Then we put together the recipe, using a combination of base `{recipes}`
+functions such as `add_role()` and `step_dummy()`, and `{epipredict}` functions
+such as `step_growth_rate()`.
+
+```{r class_recipe}
+classifier_recipe <- epi_recipe(training_data) %>%
+  add_role(time_value, new_role = "predictor") %>%
+  step_dummy(geo_value_factor) %>%
+  step_growth_rate(case_rate, role = "none", prefix = "gr_") %>%
+  step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) %>%
+  step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") %>%
+  # note recipes::step_cut() has a bug in it, or we could use that here
+  step_mutate(
+    response = cut(
+      ahead_7_gr_7_rel_change_case_rate,
+      breaks = c(-Inf, -0.2, 0.25, Inf) / 7, # division gives weekly not daily
+      labels = c("down", "flat", "up")
+    ),
+    role = "outcome"
+  ) %>%
+  step_rm(has_role("none"), has_role("raw")) %>%
+  step_epi_naomit()
+```
+
+
+Roughly, this adds as predictors:
+
+1. the time value (via `add_role()`)
+2. the `geo_value` (via `step_dummy()` and the `as.factor()` above)
+3. the growth rate, both at prediction time and lagged by one and two weeks
+
+The outcome is created by composing several steps together: `step_epi_ahead()`
+creates a column with the growth rate one week into the future, while
+`step_mutate()` creates a factor with the 3 values:
+
+$$
+Z_{\ell, t}=
+\begin{cases}
+\text{up}, & \text{if}\ Y^{\Delta}_{\ell, t} > 0.25 \\
+\text{down}, & \text{if}\ Y^{\Delta}_{\ell, t} < -0.20\\
+\text{flat}, & \text{otherwise}
+\end{cases}
+$$
+
+where $Y^{\Delta}_{\ell, t}$ is the growth rate at location $\ell$ and time $t$.
+`up` means that the `case_rate` has increased by at least 25%, while `down`
+means it has decreased by at least 20%.
+
+Note that both `step_growth_rate()` and `step_epi_ahead()` assign the role
+`none` explicitly; this is because they're used as intermediate steps to create
+both predictors and the outcome.
+`step_rm()` drops them after they're done, along with the original `raw` columns
+`death_rate` and `case_rate` (both `geo_value` and `time_value` are retained
+because their roles have been reassigned).
+
+
+To fit a classification model like this, we will need to use a parsnip model
+with mode classification; the simplest example is `multinom_reg()`.
+We don't actually need to apply any layers, so we can skip adding one to the `epi_workflow()`:
+
+```{r, warning=FALSE}
+wf <- epi_workflow(
+  classifier_recipe,
+  multinom_reg()
+) %>%
+  fit(training_data)
+
+forecast(wf) %>% filter(!is.na(.pred_class))
+```
+
+And comparing the result with the actual growth rates at that point:
+```{r growth_rate_results}
+growth_rates <- covid_case_death_rates |>
+  filter(geo_value %in% used_locations) |>
+  group_by(geo_value) |>
+  mutate(
+    # multiply by 7 to get to weekly equivalents
+    case_gr = growth_rate(x = time_value, y = case_rate) * 7
+  ) |>
+  ungroup()
+
+growth_rates |> filter(time_value == "2021-08-01")
+```
+
+So they're all increasing at significantly higher than 25% per week (36%-62%),
+which matches the classification.
+
+
+See the [tooling book](https://cmu-delphi.github.io/delphi-tooling-book/preprocessing-and-models.html) for a more in-depth discussion of this example.
+
+
+[^1]: Think of baking a cake, where adding the frosting is the last step in the
+process of actually baking.

 [^2]: Note that the frosting doesn't require any information about the training
 data, since the output of the model only depends on the model used.
+
+[^3]: McDonald, Bien, Green, Hu, et al. “Can auxiliary indicators improve
+COVID-19 forecasting and hotspot prediction?” Proceedings of the National
+Academy of Sciences 118.51 (2021): e2111453118. doi:10.1073/pnas.2111453118
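One hedged aside on the new recipe: the `breaks = c(-Inf, -0.2, 0.25, Inf) / 7` line is the easiest piece to misread, so here is a tiny standalone illustration of how dividing the weekly -20%/+25% thresholds by 7 makes them comparable to the daily growth rates that `step_growth_rate()` produces; the input values are made up.

```r
# Hypothetical daily relative growth rates for three locations.
daily_gr <- c(-0.05, 0.01, 0.06)

# Weekly thresholds (-20%, +25%) divided by 7 become daily cutoffs, mirroring
# the step_mutate() call in the recipe above.
cut(
  daily_gr,
  breaks = c(-Inf, -0.2, 0.25, Inf) / 7,
  labels = c("down", "flat", "up")
)
# -0.05 < -0.2/7, so "down"; 0.01 falls between the cutoffs, so "flat"; 0.06 > 0.25/7, so "up".
```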

vignettes/epipredict.Rmd (+3)
@@ -330,6 +330,9 @@ autoplot(
 
 The 8 graphs are all pairs of the `geo_values` (`"Quebec"` and `"British Columbia"`), `edu_quals` (`"Undergraduate degree"` and `"Professional degree"`), and age brackets (`"15 to 34 years"` and `"35 to 64 years"`).
 
+## Fitting a non-geo-pooled model
+The primary difference when avoiding geo-pooling is to first `group_by(geo_value)`
+before forecasting.
 # Anatomy of a canned forecaster
 ## Code object
 Let's dissect the forecaster we trained back on the [landing
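A minimal sketch of one way to read the new non-geo-pooled note: fit a separate forecaster per `geo_value` rather than one pooled model. `edf` and the `"outcome"` column below are placeholders, not objects from this vignette, and grouping-based variants may be closer to what the vignette ultimately does.

```r
library(epipredict)
library(dplyr)
library(purrr)

# `edf` is a placeholder epi_df; swap in the vignette's data and outcome column.
per_geo_predictions <- unique(edf$geo_value) |>
  map(\(g) {
    arx_forecaster(
      edf |> filter(geo_value == g),  # one location at a time: no geo-pooling
      outcome = "outcome",
      predictors = "outcome"
    )$predictions
  }) |>
  list_rbind()
```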
