
Commit 359f324

finished custom_workflows, reviewing backtesting
1 parent fc4dad6 commit 359f324


5 files changed: +148 -687 lines changed


_pkgdown.yml (+1 -2)
@@ -12,12 +12,11 @@ articles:
   contents:
   - epipredict
   - custom_epiworkflows
-  - preprocessing-and-models
   - backtesting
-  - arx-classifier
   - update
 - title: Advanced methods
   contents:
+  - arx-classifier
   - articles/smooth-qr
   - panel-data

vignettes/backtesting.Rmd (+23 -60)
@@ -22,8 +22,6 @@ library(magrittr)
 library(purrr)
 ```

-# Accurately backtesting forecasters
-
 Backtesting is a crucial step in the development of forecasting models. It
 involves testing the model on historical data to see how well it performs. This
 is important because it allows us to see how well the model generalizes to new
@@ -45,22 +43,17 @@ therein).

 In the `{epiprocess}` package, we provide `epix_slide()`, a function that allows
 a convenient way to perform version-aware forecasting by only using the data as
-it would have been available at forecast reference time. In
-`vignette("epi_archive", package = "epiprocess")`, we introduced the concept of
-an `epi_archive` and we demonstrated how to use `epix_slide()` to forecast the
-future using a simple quantile regression model. In this vignette, we will
-demonstrate how to use `epix_slide()` to backtest an auto-regressive forecaster
-on historical COVID-19 case data from the US and Canada. Instead of building a
-forecaster from scratch as we did in the previous vignette, we will use the
-`arx_forecaster()` function from the `{epipredict}` package.
+it would have been available at forecast reference time.
+In this vignette, we will demonstrate how to use `epix_slide()` to backtest an
+auto-regressive forecaster constructed using `arx_forecaster()` on historical
+COVID-19 case data from the US and Canada.

 ## Getting case data from US states into an `epi_archive`

-First, we download the version history (ie. archive) of the percentage of
-doctor's visits with CLI (COVID-like illness) computed from medical insurance
-claims and the number of new confirmed COVID-19 cases per 100,000 population
-(daily) for 6 states from the COVIDcast API (as used in the `epiprocess`
-vignette mentioned above).
+First, we create an `epi_archive()` to store the version history of the
+percentage of doctor's visits with CLI (COVID-like illness) computed from
+medical insurance claims and the number of new confirmed COVID-19 cases per
+100,000 population (daily) for 4 states.

 ```{r grab-epi-data}
 # Select the `percent_cli` column from the data archive
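For readers following the revised wording, here is a minimal sketch of how a versioned archive like `doctor_visits` is typically assembled with `{epidatr}` and `{epiprocess}`. It is illustrative only: the signal name, states, date ranges, and column selection below are assumptions, not the vignette's exact chunk (which the diff truncates).

```r
library(epidatr)    # pub_covidcast(), epirange()
library(epiprocess) # as_epi_archive()
library(dplyr)

# Request the full revision history ("issues") of a signal, then store it as an
# epi_archive so epix_as_of()/epix_slide() can replay what was known at any version.
doctor_visits <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  geo_type = "state",
  time_type = "day",
  geo_values = "ca,fl,ny,tx",             # illustrative 4-state subset
  time_values = epirange(20200601, 20211201),
  issues = epirange(20200601, 20211201)   # all versions, not just the latest
) %>%
  select(geo_value, time_value, version = issue, percent_cli = value) %>%
  as_epi_archive(compactify = TRUE)
```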
@@ -88,59 +81,25 @@ doctor_visits <- pub_covidcast(
   as_epi_archive(compactify = TRUE)
 ```

-## Backtesting a simple autoregressive forecaster
+`issues` is the name for `version` in the Epidata API.

-One of the most common use cases of `epiprocess::epi_archive()` object
-is for accurate model backtesting.
+__Note__: In the interest of computational speed, we only use the 4 state
+dataset here, but the full archive can be used in the same way and has performed
+well in the past.

-In this section we will:
+## Backtesting a simple autoregressive forecaster

-- develop a simple autoregressive forecaster that predicts the next value of the
-signal based on the current and past values of the signal itself, and
-- demonstrate how to slide this forecaster over the `epi_archive` object to
-produce forecasts at a few dates date, using version-unaware and -aware
-computations,
-- compare the two approaches.
+One of the most common use cases of the `epiprocess::epi_archive()` object is
+for accurate model backtesting.

 To start, let's use a simple autoregressive forecaster to predict the percentage
 of doctor's hospital visits with CLI (COVID-like illness) (`percent_cli`) in the
-future (we choose this target because of the dataset's pattern of substantial
-revisions; forecasting doctor's visits is an unusual forecasting target
-otherwise). While some AR models output single point forecasts, we will use
-quantile regression to produce a point prediction along with an 90\% uncertainty
-band, represented by a predictive quantiles at the 5\% and 95\% levels (lower
-and upper endpoints of the uncertainty band).
+future[^1].
+For increased accuracy, we will use quantile regression.

-The `arx_forecaster()` function wraps the autoregressive forecaster we need and
-comes with sensible defaults:

-- we specify the predicted outcome to be the percentage of doctor's visits with
-CLI (`percent_cli`),
-- we use a linear regression model as the engine,
-- the autoregressive features assume lags of 0, 7, and 14 days,
-- we forecast 7 days ahead.
-
-All these default settings and more can be seen by calling `arx_args_list()`:
-
-```{r}
-arx_args_list()
-```
-
-These can be modified as needed, by sending your desired arguments into
-`arx_forecaster(args_list = arx_args_list())`. For now we will use the defaults.
-
-__Note__: We will use a __geo-pooled approach__, where we train the model on
-data from all states and territories combined. This is because the data is quite
-similar across states, and pooling the data can help improve the accuracy of the
-forecasts, while also reducing the susceptibility of the model to noise. In the
-interest of computational speed, we only use the 6 state dataset here, but the
-full archive can be used in the same way and has performed well in the past.
-Implementation-wise, geo-pooling is achieved by not using `group_by(geo_value)`
-prior to `epix_slide()`. In other cases, grouping may be preferrable, so we
-leave it to the user to decide, but flag this modeling decision here.
-
-Let's use the `epix_as_of()` method to generate a snapshot of the archive at the
-last date, and then run the forecaster.
+As truth data, we'll use `epix_as_of()` to generate a snapshot of the archive at
+the last date, and then run the forecaster.

 ```{r}
 # Let's forecast 14 days prior to the last date in the archive, to compare.
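Since the revised text now relies on `arx_forecaster()`'s defaults instead of listing them, a brief sketch of the snapshot-then-forecast step may help. The lags, horizon, and use of `quantile_reg()` are assumptions that mirror the defaults and the quantile-regression choice mentioned above, not code copied from the vignette.

```r
library(epipredict)
library(epiprocess)

# Snapshot the archive at its final version; this acts as the "latest" data.
snapshot <- epix_as_of(doctor_visits, doctor_visits$versions_end)

# Run the canned autoregressive forecaster: quantile_reg() gives the quantile
# regression fit; lags of 0/7/14 days and a 7-day horizon match the defaults.
fcast <- arx_forecaster(
  snapshot,
  outcome = "percent_cli",
  predictors = "percent_cli",
  trainer = quantile_reg(),
  args_list = arx_args_list(lags = c(0, 7, 14), ahead = 7)
)
fcast$predictions
```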
@@ -464,3 +423,7 @@ ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) +
 ```

 </details>
+
+[^1]: (we choose this target because of the dataset's pattern of substantial
+revisions; forecasting doctor's visits is an unusual forecasting target
+otherwise)
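As a pointer to where the vignette is heading, here is a rough, assumption-laden outline of the version-aware backtest built on `epix_slide()`; the forecast dates are invented and the `.versions` argument name follows recent `{epiprocess}` releases, so treat this as a sketch rather than the vignette's actual code.

```r
library(epipredict)
library(epiprocess)
library(dplyr)

# Illustrative forecast reference dates at which to replay the archive.
forecast_dates <- seq(as.Date("2020-10-01"), as.Date("2021-03-01"), by = "1 month")

# epix_slide() hands the forecaster only the data that was available as of each
# version, which is what makes the backtest faithful to real-time conditions.
backtest_results <- doctor_visits %>%
  epix_slide(
    ~ arx_forecaster(
      .x,
      outcome = "percent_cli",
      predictors = "percent_cli",
      trainer = quantile_reg()
    )$predictions,
    .versions = forecast_dates
  )
```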

vignettes/custom_epiworkflows.Rmd (+121 -3)
@@ -301,8 +301,8 @@ growth_rate_recipe <- epi_recipe(
   step_epi_lag(case_rate, lag = c(0, 1, 2, 3, 7, 14)) |>
   step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
   step_epi_ahead(death_rate, ahead = 4 * 7) |>
-  step_growth_rate(death_rate) |>
   step_epi_naomit() |>
+  step_growth_rate(death_rate) |>
   step_training_window()
 ```

@@ -317,7 +317,9 @@ growth_rate_recipe |>
     death_rate, gr_7_rel_change_death_rate
   )
 ```
+
 And the role:
+
 ```{r growth_rate_roles}
 prepped <- growth_rate_recipe |>
   prep(training_data)
@@ -329,7 +331,9 @@ To demonstrate the changes in the layers that come along with it, we will use
 ```{r layer_and_fit}
 growth_rate_layers <- frosting() |>
   layer_predict() |>
-  layer_quantile_distn(quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)) |>
+  layer_quantile_distn(
+    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
+  ) |>
   layer_point_from_distn() |>
   layer_add_forecast_date() |>
   layer_add_target_date() |>
@@ -429,7 +433,121 @@ which are 2-3 orders of magnitude larger than the corresponding rates above.
 while `rate_rescaling` gives the denominator of the rate (our fit values were
 per 100,000).

-[^1]: Think of baking a cake, where adding the frosting is the last step in the process of actually baking.
+# Custom classifier workflow
+
+As a more complicated example of the kind of pipeline that you can build using
+this framework, here is an example of a hotspot prediction model, which predicts
+whether the case rates are increasing (`up`), decreasing (`down`) or flat
+(`flat`).
+This comes from a paper by McDonald, Bien, Green, Hu, et al.[^3], and roughly
+serves as an extension of `arx_classifier()`.
+
+First, we need to add a factor version of the `geo_value`, so that it can
+actually be used as a feature.
+
+```{r training_factor}
+training_data <-
+  training_data %>%
+  mutate(geo_value_factor = as.factor(geo_value))
+```
+
+Then we put together the recipe, using a combination of base `{recipes}`
+functions such as `add_role()` and `step_dummy()`, and `{epipredict}` functions
+such as `step_growth_rate()`.
+
+```{r class_recipe}
+classifier_recipe <- epi_recipe(training_data) %>%
+  add_role(time_value, new_role = "predictor") %>%
+  step_dummy(geo_value_factor) %>%
+  step_growth_rate(case_rate, role = "none", prefix = "gr_") %>%
+  step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) %>%
+  step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") %>%
+  # note recipes::step_cut() has a bug in it, or we could use that here
+  step_mutate(
+    response = cut(
+      ahead_7_gr_7_rel_change_case_rate,
+      breaks = c(-Inf, -0.2, 0.25, Inf) / 7, # division gives weekly not daily
+      labels = c("down", "flat", "up")
+    ),
+    role = "outcome"
+  ) %>%
+  step_rm(has_role("none"), has_role("raw")) %>%
+  step_epi_naomit()
+```
+
+
+Roughly, this adds as predictors:
+
+1. the time value (via `add_role()`)
+2. the `geo_value` (via `step_dummy()` and the `as.factor()` above)
+3. the growth rate, both at prediction time and lagged by one and two weeks
+
+The outcome is created by composing several steps together: `step_epi_ahead()`
+creates a column with the growth rate one week into the future, while
+`step_mutate()` creates a factor with the 3 values:
+
+$$
+Z_{\ell, t}=
+\begin{cases}
+\text{up}, & \text{if}\ Y^{\Delta}_{\ell, t} > 0.25 \\
+\text{down}, & \text{if}\ Y^{\Delta}_{\ell, t} < -0.20\\
+\text{flat}, & \text{otherwise}
+\end{cases}
+$$
+
+where $Y^{\Delta}_{\ell, t}$ is the growth rate at location $\ell$ and time $t$.
+`up` means that the `case_rate` has increased by at least 25%, while `down`
+means it has decreased by at least 20%.
+
+Note that both `step_growth_rate()` and `step_epi_ahead()` assign the role
+`none` explicitly; this is because they're used as intermediate steps to create
+both predictors and the outcome.
+`step_rm()` drops them after they're done, along with the original `raw` columns
+`death_rate` and `case_rate` (both `geo_value` and `time_value` are retained
+because their roles have been reassigned).
+
+
+To fit a classification model like this, we will need to use a parsnip model
+with mode classification; the simplest example is `multinom_reg()`.
+We don't actually need to apply any layers, so we can skip adding one to the `epi_workflow()`:
+
+```{r, warning=FALSE}
+wf <- epi_workflow(
+  classifier_recipe,
+  multinom_reg()
+) %>%
+  fit(training_data)
+
+forecast(wf) %>% filter(!is.na(.pred_class))
+```
+
+And comparing the result with the actual growth rates at that point:
+```{r growth_rate_results}
+growth_rates <- covid_case_death_rates |>
+  filter(geo_value %in% used_locations) |>
+  group_by(geo_value) |>
+  mutate(
+    # multiply by 7 to get to weekly equivalents
+    case_gr = growth_rate(x = time_value, y = case_rate) * 7
+  ) |>
+  ungroup()
+
+growth_rates |> filter(time_value == "2021-08-01")
+```
+
+So they're all increasing at significantly higher than 25% per week (36%-62%),
+which matches the classification.
+
+
+See the [tooling book](https://cmu-delphi.github.io/delphi-tooling-book/preprocessing-and-models.html) for a more in-depth discussion of this example.
+
+
+[^1]: Think of baking a cake, where adding the frosting is the last step in the
+process of actually baking.

 [^2]: Note that the frosting doesn't require any information about the training
 data, since the output of the model only depends on the model used.
+
+[^3]: McDonald, Bien, Green, Hu, et al. “Can auxiliary indicators improve
+COVID-19 forecasting and hotspot prediction?” Proceedings of the National
+Academy of Sciences 118.51 (2021): e2111453118. doi:10.1073/pnas.2111453118
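One hedged aside on the new recipe: the `breaks = c(-Inf, -0.2, 0.25, Inf) / 7` line is the easiest piece to misread, so here is a tiny standalone illustration of how dividing the weekly -20%/+25% thresholds by 7 makes them comparable to the daily growth rates that `step_growth_rate()` produces; the input values are made up.

```r
# Hypothetical daily relative growth rates for three locations.
daily_gr <- c(-0.05, 0.01, 0.06)

# Weekly thresholds (-20%, +25%) divided by 7 become daily cutoffs, mirroring
# the step_mutate() call in the recipe above.
cut(
  daily_gr,
  breaks = c(-Inf, -0.2, 0.25, Inf) / 7,
  labels = c("down", "flat", "up")
)
# -0.05 < -0.2/7, so "down"; 0.01 falls between the cutoffs, so "flat"; 0.06 > 0.25/7, so "up".
```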

vignettes/epipredict.Rmd (+3)
@@ -330,6 +330,9 @@ autoplot(
 
 The 8 graphs are all pairs of the `geo_values` (`"Quebec"` and `"British Columbia"`), `edu_quals` (`"Undergraduate degree"` and `"Professional degree"`), and age brackets (`"15 to 34 years"` and `"35 to 64 years"`).
 
+## Fitting a non-geo-pooled model
+The primary difference when avoiding geo-pooling is to first `group_by(geo_value)`
+before forecasting.
 # Anatomy of a canned forecaster
 ## Code object
 Let's dissect the forecaster we trained back on the [landing
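A minimal sketch of one way to read the new non-geo-pooled note: fit a separate forecaster per `geo_value` rather than one pooled model. `edf` and the `"outcome"` column below are placeholders, not objects from this vignette, and grouping-based variants may be closer to what the vignette ultimately does.

```r
library(epipredict)
library(dplyr)
library(purrr)

# `edf` is a placeholder epi_df; swap in the vignette's data and outcome column.
per_geo_predictions <- unique(edf$geo_value) |>
  map(\(g) {
    arx_forecaster(
      edf |> filter(geo_value == g),  # one location at a time: no geo-pooling
      outcome = "outcome",
      predictors = "outcome"
    )$predictions
  }) |>
  list_rbind()
```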
