
Commit 34b018a

second readthrough of README.Rmd

1 parent 38b86e8

File tree: 2 files changed (+45 −38 lines)


README.Rmd (+20 −15)
@@ -116,7 +116,7 @@ Below the fold, we construct this dataset as an `epiprocess::epi_df` from JHU da
 <details>
 <summary> Creating the dataset using `{epidatr}` and `{epiprocess}` </summary>
 This dataset can be found in the package as <TODO DOESN'T EXIST>; we demonstrate some of the typically ubiquitous cleaning operations needed to be able to forecast.
-First we pull both jhu-csse cases and deaths from [`{epidatr}` package](https://cmu-delphi.github.io/epidatr/):
+First we pull both jhu-csse cases and deaths from [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:
 ```{r case_death}
 cases <- pub_covidcast(
   source = "jhu-csse",
@@ -141,7 +141,7 @@ cases_deaths <-
 plot_locations <- c("ca", "ma", "ny", "tx")
 # plotting the data as it was downloaded
 cases_deaths |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -152,7 +152,7 @@ cases_deaths |>
 As with basically any dataset, there is some cleaning that we will need to do to make it actually usable; we'll use some utilities from [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this.
 First, to eliminate some of the noise coming from daily reporting, we do 7 day averaging over a trailing window[^1]:

-[^1]: This makes it so that any given day of the new dataset only depends on the previous week, which means that we avoid leaking future values when making a forecast.
+[^1]: This makes it so that any given day of the processed timeseries only depends on the previous week, which means that we avoid leaking future values when making a forecast.

 ```{r smooth}
 cases_deaths <-
@@ -193,7 +193,6 @@ of the states, noting the actual forecast date:
 <details>
 <summary> Plot </summary>
 ```{r plot_locs}
-plot_locations <- c("ca", "ma", "ny", "tx")
 forecast_date_label <-
   tibble(
     geo_value = rep(plot_locations, 2),
@@ -203,7 +202,7 @@ forecast_date_label <-
   )
 processed_data_plot <-
   cases_deaths |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -219,7 +218,10 @@ processed_data_plot <-
 processed_data_plot
 ```

-To make a forecast, we will use a "canned" simple auto-regressive forecaster to predict the death rate four weeks into the future using past (lagged) deaths and cases
+To make a forecast, we will use a "canned" simple auto-regressive forecaster to predict the death rate four weeks into the future using lagged[^3] deaths and cases
+
+[^3]: lagged by 3 in this context meaning using the value from 3 days ago.
+
 ```{r make-forecasts, warning=FALSE}
 four_week_ahead <- arx_forecaster(
   cases_deaths |> filter(time_value <= forecast_date),
@@ -233,11 +235,13 @@ four_week_ahead <- arx_forecaster(
 four_week_ahead
 ```

-In this case, we have used a number of different lags for the case rate, while
-using zero, one and two weekly lags for the death rate (as predictors). `four_week_ahead` is both
-a fitted model object which could be used any time in the future to create
-different forecasts, as well as a set of predicted values (and prediction
-intervals) for each location 28 days after the forecast date.
+In this case, we have used 0-3 days, a week, and two week lags for the case
+rate, while using only zero, one and two weekly lags for the death rate (as
+predictors).
+The result `four_week_ahead` is both a fitted model object which could be used
+any time in the future to create different forecasts, as well as a set of
+predicted values (and prediction intervals) for each location 28 days after the
+forecast date.
 Plotting the prediction intervals on our subset above[^2]:

 [^2]: Alternatively, you could call `auto_plot(four_week_ahead)` to get the full collection of forecasts. This is too busy for the space we have for plotting here.
@@ -249,7 +253,7 @@ This is the same kind of plot as `processed_data_plot` above, but with the past
 narrow_data_plot <-
   cases_deaths |>
   filter(time_value > "2021-04-01") |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -273,15 +277,16 @@ forecast_plot <-
   narrow_data_plot |>
   epipredict:::plot_bands(
     restricted_predictions,
-    fill = "dodgerblue4") +
-  geom_point(data = restricted_predictions, aes(y = .data$value), color = "orange")
+    levels = 0.9,
+    fill = primary) +
+  geom_point(data = restricted_predictions, aes(y = .data$value), color = secondary)
 ```
 </details>

 ```{r show-single-forecast, warning=FALSE, echo=FALSE}
 forecast_plot
 ```
-The orange dot gives the median prediction, while the blue intervals give the 25-75%, 10-90%, and 2.5%-97.5% inter-quantile ranges.
+The yellow dot gives the median prediction, while the red interval gives the 5-95% inter-quantile range.
 For this particular day and these locations, the forecasts are relatively accurate, with the true data being within the 25-75% interval.
 A couple of things to note:

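The commit's new footnote pins down what "lagged" means in this README: a lag-k predictor for day t is simply the series value from day t−k. A minimal sketch of building such lag columns, using Python/pandas as a stand-in for the R/`{epipredict}` workflow (the data values and column names are illustrative, not from the package):

```python
import pandas as pd

# A toy daily death-rate series; a lag-k predictor for day t is the value at day t-k.
rates = pd.DataFrame({
    "time_value": pd.date_range("2021-04-01", periods=6, freq="D"),
    "death_rate": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Build lag-0 through lag-3 columns; shift(k) moves each value k rows forward,
# so row t holds the value from t-k days earlier (NaN where no history exists).
for k in [0, 1, 2, 3]:
    rates[f"lag_{k}"] = rates["death_rate"].shift(k)

# On the final day, lag_3 is the death rate from 3 days earlier.
print(rates["lag_0"].iloc[-1], rates["lag_3"].iloc[-1])  # 6.0 3.0
```

Weekly lags (7, 14) for the death rate would be built the same way, just with larger shifts.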
README.md (+25 −23)
@@ -62,7 +62,7 @@ Creating the dataset using `{epidatr}` and `{epiprocess}`
 This dataset can be found in the package as \<TODO DOESN’T EXIST\>; we
 demonstrate some of the typically ubiquitous cleaning operations needed
 to be able to forecast. First we pull both jhu-csse cases and deaths
-from [`{epidatr}` package](https://cmu-delphi.github.io/epidatr/):
+from [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:

 ``` r
 cases <- pub_covidcast(
@@ -88,7 +88,7 @@ cases_deaths <-
 plot_locations <- c("ca", "ma", "ny", "tx")
 # plotting the data as it was downloaded
 cases_deaths |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -163,7 +163,6 @@ Plot
 </summary>

 ``` r
-plot_locations <- c("ca", "ma", "ny", "tx")
 forecast_date_label <-
   tibble(
     geo_value = rep(plot_locations, 2),
@@ -173,7 +172,7 @@ forecast_date_label <-
   )
 processed_data_plot <-
   cases_deaths |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -191,7 +190,7 @@ processed_data_plot <-

 To make a forecast, we will use a “canned” simple auto-regressive
 forecaster to predict the death rate four weeks into the future using
-past (lagged) deaths and cases
+lagged[^2] deaths and cases

 ``` r
 four_week_ahead <- arx_forecaster(
@@ -223,13 +222,13 @@ four_week_ahead
 #>
 ```

-In this case, we have used a number of different lags for the case rate,
-while using zero, one and two weekly lags for the death rate (as
-predictors). `four_week_ahead` is both a fitted model object which could
-be used any time in the future to create different forecasts, as well as
-a set of predicted values (and prediction intervals) for each location
-28 days after the forecast date. Plotting the prediction intervals on
-our subset above[^2]:
+In this case, we have used 0-3 days, a week, and two week lags for the
+case rate, while using only zero, one and two weekly lags for the death
+rate (as predictors). The result `four_week_ahead` is both a fitted
+model object which could be used any time in the future to create
+different forecasts, as well as a set of predicted values (and
+prediction intervals) for each location 28 days after the forecast date.
+Plotting the prediction intervals on our subset above[^3]:

 <details>
 <summary>
@@ -243,7 +242,7 @@ the past data narrowed somewhat
 narrow_data_plot <-
   cases_deaths |>
   filter(time_value > "2021-04-01") |>
-  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  filter(geo_value %in% plot_locations) |>
   pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
   ggplot(aes(x = time_value, y = value)) +
   geom_line() +
@@ -269,18 +268,18 @@ forecast_plot <-
   narrow_data_plot |>
   epipredict:::plot_bands(
     restricted_predictions,
-    fill = "dodgerblue4") +
-  geom_point(data = restricted_predictions, aes(y = .data$value), color = "orange")
+    levels = 0.9,
+    fill = primary) +
+  geom_point(data = restricted_predictions, aes(y = .data$value), color = secondary)
 ```

 </details>

 <img src="man/figures/README-show-single-forecast-1.png" width="90%" style="display: block; margin: auto;" />
-The orange dot gives the median prediction, while the blue intervals
-give the 25-75%, 10-90%, and 2.5%-97.5% inter-quantile ranges. For this
-particular day and these locations, the forecasts are relatively
-accurate, with the true data being within the 25-75% interval. A couple
-of things to note:
+The yellow dot gives the median prediction, while the red interval gives
+the 5-95% inter-quantile range. For this particular day and these
+locations, the forecasts are relatively accurate, with the true data
+being within the 25-75% interval. A couple of things to note:

 1. Our methods are primarily direct forecasters; this means we don’t
    need to predict 1, 2,…, 27 days ahead to then predict 28 days ahead
@@ -300,10 +299,13 @@ questions, feel free to contact [Daniel]([email protected]),
 [Logan]([email protected]), either via email or on the Insightnet
 slack.

-[^1]: This makes it so that any given day of the new dataset only
-    depends on the previous week, which means that we avoid leaking
+[^1]: This makes it so that any given day of the processed timeseries
+    only depends on the previous week, which means that we avoid leaking
     future values when making a forecast.

-[^2]: Alternatively, you could call `auto_plot(four_week_ahead)` to get
+[^2]: lagged by 3 in this context meaning using the value from 3 days
+    ago.
+
+[^3]: Alternatively, you could call `auto_plot(four_week_ahead)` to get
     the full collection of forecasts. This is too busy for the space we
     have for plotting here.
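The most repeated change across both files swaps the hard-coded `c("ca", "ma", "ny", "tx")` literal for the single `plot_locations` variable defined earlier, so the plotted location set is declared exactly once. The same define-once, filter-everywhere pattern, sketched in Python/pandas as a stand-in for the R pipeline (the data values are made up for illustration):

```python
import pandas as pd

# Declare the plot locations once; every downstream filter reuses this name,
# so changing the location set means editing a single line.
plot_locations = ["ca", "ma", "ny", "tx"]

cases_deaths = pd.DataFrame({
    "geo_value": ["ca", "fl", "ma", "ny", "tx", "wa"],
    "death_rate": [0.12, 0.20, 0.08, 0.15, 0.10, 0.05],
})

# Equivalent of R's filter(geo_value %in% plot_locations) in the diffs above.
subset = cases_deaths[cases_deaths["geo_value"].isin(plot_locations)]
print(sorted(subset["geo_value"]))  # the four declared locations
```

This is why the commit can also delete the duplicate `plot_locations <-` assignment inside the `plot_locs` chunk: one definition now serves every plot.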
