cmu-delphi
diff --git a/‎README.Rmd
Lines changed: 123 additions & 34 deletions b/‎README.Rmd
@@ -40,20 +40,46 @@ You can view documentation for the `main` branch at <https://cmu-delphi.github.i
 
 ## Goals for `epipredict`
 
-**We hope to provide:**
-
-1. A set of basic, easy-to-use forecasters that work out of the box. You should be able to do a reasonably limited amount of customization on them. For the basic forecasters, we currently provide:
-    * Baseline flatline forecaster
-    * Autoregressive forecaster
-    * Autoregressive classifier
-    * CDC FluSight flatline forecaster
-2. A framework for creating custom forecasters out of modular components. There are four types of components:
-    * Preprocessor: do things to the data before model training
-    * Trainer: train a model on data, resulting in a fitted model object
-    * Predictor: make predictions, using a fitted model object
-    * Postprocessor: do things to the predictions before returning
+<details>
+<summary> Creating the dataset using `{epidatr}` and `{epiprocess}` </summary>
+This dataset can be found in the package as <TODO DOESN'T EXIST>; here we demonstrate some of the cleaning operations that are typically needed before forecasting.
+First we pull both `jhu-csse` cases and deaths using the [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:
+```{r case_death}
+cases <- pub_covidcast(
+  source = "jhu-csse",
+  signals = "confirmed_incidence_prop",
+  time_type = "day",
+  geo_type = "state",
+  time_values = epirange(20200601, 20220101),
+  geo_values = "*") |>
+  select(geo_value, time_value, case_rate = value)
+
+deaths <- pub_covidcast(
+  source = "jhu-csse",
+  signals = "deaths_incidence_prop",
+  time_type = "day",
+  geo_type = "state",
+  time_values = epirange(20200601, 20220101),
+  geo_values = "*") |>
+  select(geo_value, time_value, death_rate = value)
+cases_deaths <-
+  full_join(cases, deaths, by = c("time_value", "geo_value")) |>
+  as_epi_df(as_of = as.Date("2022-01-01"))
+plot_locations <- c("ca", "ma", "ny", "tx")
+# plotting the data as it was downloaded
+cases_deaths |>
+  filter(geo_value %in% plot_locations) |>
+  pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
+  ggplot(aes(x = time_value, y = value)) +
+  geom_line() +
+  facet_grid(source ~ geo_value, scale = "free") +
+  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+```
+As with nearly any dataset, some cleaning is needed to make it usable; we'll use some utilities from [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this.
+First, to reduce some of the noise from daily reporting, we take a 7-day average over a trailing window[^1]:
 
-**Target audiences:**
+[^1]: Each day of the processed time series then depends only on the previous week, so we avoid leaking future values when making a forecast.
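+
+As a rough illustration of what a trailing window means (base R on made-up numbers, not the `{epiprocess}` call used in this README), `stats::filter()` with `sides = 1` averages each day with the six days before it:
+
+```r
+# Toy daily series; the values are invented for illustration only.
+x <- c(3, 5, 4, 6, 8, 7, 9, 12, 10, 11)
+# Equal weights over 7 days; sides = 1 uses only current and past values.
+trailing_avg <- as.numeric(stats::filter(x, rep(1 / 7, 7), sides = 1))
+# The first six entries are NA because a full week of history is not yet available.
+trailing_avg
+```
+
+No entry of `trailing_avg` uses a value from a later day, which is what prevents future values from leaking into a forecast.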
 
 * Basic. Has data, calls forecaster with default arguments.
 * Intermediate. Wants to examine changes to the arguments, take advantage of
@@ -86,6 +112,41 @@ covid_case_death_rates
 
 To create and train a simple auto-regressive forecaster to predict the death rate two weeks into the future using past (lagged) deaths and cases, we could use the following function.
 
+After having downloaded and cleaned the data in `cases_deaths`, we plot a subset
+of the states, noting the actual forecast date:
+
+<details>
+<summary> Plot </summary>
+```{r plot_locs}
+forecast_date_label <-
+  tibble(
+    geo_value = rep(plot_locations, 2),
+    source = c(rep("case_rate", 4), rep("death_rate", 4)),
+    dates = rep(forecast_date - 7 * 2, 2 * length(plot_locations)),
+    heights = c(rep(150, 4), rep(1.0, 4))
+  )
+processed_data_plot <-
+  cases_deaths |>
+  filter(geo_value %in% plot_locations) |>
+  pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
+  ggplot(aes(x = time_value, y = value)) +
+  geom_line() +
+  facet_grid(source ~ geo_value, scale = "free") +
+  geom_vline(aes(xintercept = forecast_date)) +
+  geom_text(
+    data = forecast_date_label,
+    aes(x = dates, label = "forecast\ndate", y = heights),
+    size = 3, hjust = "right"
+  ) +
+  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+```
+</details>
+```{r show-processed-data, warning=FALSE, echo=FALSE}
+processed_data_plot
+```
+
+To make a forecast, we will use a "canned" simple auto-regressive forecaster to predict the death rate four weeks into the future using lagged[^3] deaths and cases.
+
+[^3]: In this context, "lagged by 3" means using the value from 3 days ago.
+
 ```{r make-forecasts, warning=FALSE}
 two_week_ahead <- arx_forecaster(
   covid_case_death_rates,
@@ -99,28 +160,56 @@ two_week_ahead <- arx_forecaster(
 two_week_ahead
 ```
 
-In this case, we have used a number of different lags for the case rate, while
-only using 3 weekly lags for the death rate (as predictors). The result is both
-a fitted model object which could be used any time in the future to create
-different forecasts, as well as a set of predicted values (and prediction
-intervals) for each location 14 days after the last available time value in the
-data.
-
-```{r print-model}
-two_week_ahead$epi_workflow
+In this case, we have used lags of 0-3 days, one week, and two weeks for the
+case rate, while using only lags of zero, one, and two weeks for the death rate
+(as predictors).
+The result `four_week_ahead` contains both a fitted model object, which could be
+used any time in the future to create different forecasts, and a set of
+predicted values (and prediction intervals) for each location 28 days after the
+forecast date.
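+
+To make the lag sets above concrete, here is a small base-R sketch on a made-up series; `lag_by()` is a hypothetical helper written for this illustration, not an `{epipredict}` function:
+
+```r
+# Hypothetical helper: shift a vector back by k days, padding the start with NA.
+lag_by <- function(x, k) c(rep(NA, k), head(x, length(x) - k))
+# Toy death-rate series over 20 days (invented values).
+death_rate <- (1:20) / 10
+# Lags of zero, one, and two weeks, as used for the death rate.
+death_lags <- c(0, 7, 14)
+lagged <- sapply(death_lags, function(k) lag_by(death_rate, k))
+colnames(lagged) <- paste0("lag_", death_lags)
+# Each row holds the predictor values available on that day.
+tail(lagged, 3)
+```
+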
+We plot the prediction intervals on the subset of locations above[^2]:
+
+[^2]: Alternatively, you could call `autoplot(four_week_ahead)` to get the full collection of forecasts. This is too busy for the space we have for plotting here.
+
+<details>
+<summary> Plot </summary>
+This is the same kind of plot as `processed_data_plot` above, but with the past data narrowed somewhat.
+```{r}
+narrow_data_plot <-
+  cases_deaths |>
+  filter(time_value > "2021-04-01") |>
+  filter(geo_value %in% plot_locations) |>
+  pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
+  ggplot(aes(x = time_value, y = value)) +
+  geom_line() +
+  facet_grid(source ~ geo_value, scale = "free") +
+  geom_vline(aes(xintercept = forecast_date)) +
+  geom_text(
+    data = forecast_date_label,
+    aes(x = dates, label = "forecast\ndate", y = heights),
+    size = 3, hjust = "right"
+  ) +
+  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
 ```
 
-The fitted model here involved preprocessing the data to appropriately generate
-lagged predictors, estimating a linear model with `stats::lm()` and then
-postprocessing the results to be meaningful for epidemiological tasks. We can
-also examine the predictions.
-
-```{r show-preds}
-two_week_ahead$predictions
+We combine that with a plot of the prediction bands and a point for the median prediction:
+```{r plotting_forecast, warning=FALSE}
+epiworkflow <- four_week_ahead$epi_workflow
+restricted_predictions <-
+  four_week_ahead$predictions |>
+  filter(geo_value %in% plot_locations) |>
+  rename(time_value = target_date, value = .pred) |>
+  mutate(source = "death_rate")
+forecast_plot <-
+  narrow_data_plot |>
+  epipredict:::plot_bands(
+    restricted_predictions,
+    levels = 0.9,
+    fill = primary) +
+  geom_point(data = restricted_predictions, aes(y = .data$value), color = secondary)
 ```
 
-The results above show a distributional forecast produced using data through
-the end of 2021 for the 14th of January 2022. A prediction for the death rate
-per 100K inhabitants is available for every state (`geo_value`) along with a
-90% predictive interval.
-
+```{r show-single-forecast, warning=FALSE, echo=FALSE}
+forecast_plot
+```
+The yellow dot gives the median prediction, while the red interval gives the 5-95% inter-quantile range.
+For this particular day and these locations, the forecasts are relatively accurate, with the true data falling within the 25-75% interval.
+A couple of things to note: