
Commit e81a326

committed: README rewrite first draft
1 parent abcdd0b commit e81a326

12 files changed: +585 -235 lines changed

DEVELOPMENT.md

+10-4
@@ -32,13 +32,19 @@ Commands for developing the documentation site:
 # Basic build and preview
 R -e 'pkgdown::clean_site()'
 R -e 'devtools::document()'
-R -e 'pkgdown::build_site()'
+R -e 'pkgdown::build_site(lazy = TRUE, examples = FALSE, devel = TRUE, preview = FALSE)'
 ```
 
 If you work without R Studio and want to iterate on documentation, you might
-find [this
-script](https://gist.github.com/gadenbuie/d22e149e65591b91419e41ea5b2e0621)
-helpful.
+find `Rscript inst/pkgdown-watch.R` helpful for keeping a live-updating version of the website. Note that you need to have `c("pkgdown", "servr", "devtools", "here", "cli", "fs")` installed.
+
+### Index/homepage
+
+Because we are using an `Rmd` to make the index, figures are sometimes not added or updated. When in doubt, run `pkgdown::clean_cache()`, `pkgdown::clean_site()`, and delete the following directories/files (paths relative to the project directory):
+- `readme_files`
+- `readme_cache`
+- `docs/dev/reference/figures/`
+- `man/figures/`
 
 ## Versioning
 
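For reference, a minimal sketch of the workflow described above, run from the package root (the `install.packages()` call and the use of `fs::dir_delete()` are assumptions based on the listed dependencies, not part of the diff):

```r
# One-time setup: packages that inst/pkgdown-watch.R expects to be available.
install.packages(c("pkgdown", "servr", "devtools", "here", "cli", "fs"))

# Clear pkgdown caches and the figure directories listed above, then rebuild.
pkgdown::clean_cache()
pkgdown::clean_site()
stale <- c("readme_files", "readme_cache", "docs/dev/reference/figures", "man/figures")
for (d in stale) {
  if (fs::dir_exists(d)) fs::dir_delete(d) # paths relative to the project directory
}
devtools::document()
pkgdown::build_site(lazy = TRUE, examples = FALSE, devel = TRUE, preview = FALSE)
```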
README.Rmd

+241-67
@@ -5,26 +5,95 @@ output: github_document
 <!-- README.md is generated from README.Rmd. Please edit that file -->
 
 ```{r, include = FALSE}
-options(width = 76)
 knitr::opts_chunk$set(
-  collapse = TRUE,
-  comment = "#>",
   fig.path = "man/figures/README-",
-  out.width = "100%"
+  digits = 3,
+  comment = "#>",
+  collapse = TRUE,
+  cache = TRUE,
+  dev.args = list(bg = "transparent"),
+  dpi = 300,
+  cache.lazy = FALSE,
+  out.width = "90%",
+  fig.align = "center",
+  fig.width = 9,
+  fig.height = 6
+)
+ggplot2::theme_set(ggplot2::theme_bw())
+options(
+  dplyr.print_min = 6,
+  dplyr.print_max = 6,
+  pillar.max_footer_lines = 2,
+  pillar.min_chars = 15,
+  stringr.view_n = 6,
+  pillar.bold = TRUE,
+  width = 77
 )
 ```
+```{r pkgs, include=FALSE, echo=FALSE}
+library(epipredict)
+library(epidatr)
+library(data.table)
+library(dplyr)
+library(tidyr)
+library(ggplot2)
+library(magrittr)
+library(purrr)
+library(scales)
+```
+
+```{r coloration, include=FALSE, echo=FALSE}
+base <- "#002676"
+primary <- "#941120"
+secondary <- "#f9c80e"
+tertiary <- "#177245"
+fourth_colour <- "#A393BF"
+fifth_colour <- "#2e8edd"
+colvec <- c(base = base, primary = primary, secondary = secondary,
+            tertiary = tertiary, fourth_colour = fourth_colour,
+            fifth_colour = fifth_colour)
+library(epiprocess)
+suppressMessages(library(tidyverse))
+theme_update(legend.position = "bottom", legend.title = element_blank())
+delphi_pal <- function(n) {
+  if (n > 6L) warning("Not enough colors in this palette!")
+  unname(colvec)[1:n]
+}
+scale_fill_delphi <- function(..., aesthetics = "fill") {
+  discrete_scale(aesthetics = aesthetics, palette = delphi_pal, ...)
+}
+scale_color_delphi <- function(..., aesthetics = "color") {
+  discrete_scale(aesthetics = aesthetics, palette = delphi_pal, ...)
+}
+scale_colour_delphi <- scale_color_delphi
+```
 
-# epipredict
+# Epipredict
 
 <!-- badges: start -->
 [![R-CMD-check](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml)
 <!-- badges: end -->
 
-**Note:** This package is currently in development and may not work as expected. Please file bug reports as issues in this repo, and we will do our best to address them quickly.
+Epipredict is a framework for building transformation and forecasting pipelines
+for epidemiological and other panel time-series datasets.
+In addition to tools for building forecasting pipelines, it contains a number of
+"canned" forecasters meant to run with little modification as an easy way to get
+started forecasting.
+
+It is designed to work well with
+[`epiprocess`](https://cmu-delphi.github.io/epiprocess/), a package providing
+various time-series and geographic processing tools in an epidemiological
+context.
+Both packages are meant to work well with the panel data provided by
+[`epidatr`](https://cmu-delphi.github.io/epidatr/).
+
+If you are looking for more detail beyond the package documentation, see our
+[forecasting book](https://cmu-delphi.github.io/delphi-tooling-book/).
 
 ## Installation
 
-To install (unless you're making changes to the package, use the stable version):
+To install (unless you're planning on contributing to package development, we
+suggest using the stable version):
 
 ```r
 # Stable version
@@ -33,94 +102,199 @@ pak::pkg_install("cmu-delphi/epipredict@main")
 # Dev version
 pak::pkg_install("cmu-delphi/epipredict@dev")
 ```
+The documentation for the stable version is at <https://cmu-delphi.github.io/epipredict>, while the development version is at <https://cmu-delphi.github.io/epipredict/dev>.
 
-## Documentation
-
-You can view documentation for the `main` branch at <https://cmu-delphi.github.io/epipredict>.
 
-## Goals for `epipredict`
+## Motivating example
 
-**We hope to provide:**
-
-1. A set of basic, easy-to-use forecasters that work out of the box. You should be able to do a reasonably limited amount of customization on them. For the basic forecasters, we currently provide:
-    * Baseline flatline forecaster
-    * Autoregressive forecaster
-    * Autoregressive classifier
-    * CDC FluSight flatline forecaster
-2. A framework for creating custom forecasters out of modular components. There are four types of components:
-    * Preprocessor: do things to the data before model training
-    * Trainer: train a model on data, resulting in a fitted model object
-    * Predictor: make predictions, using a fitted model object
-    * Postprocessor: do things to the predictions before returning
-
-**Target audiences:**
+To demonstrate the kind of forecast epipredict can make, say we're predicting COVID deaths per 100k for each state on
+```{r fc_date}
+forecast_date <- as.Date("2021-08-01")
+```
+Below the fold, we construct this dataset as an `epiprocess::epi_df` from JHU data.
 
-* Basic. Has data, calls forecaster with default arguments.
-* Intermediate. Wants to examine changes to the arguments, take advantage of
-  built in flexibility.
-* Advanced. Wants to write their own forecasters. Maybe willing to build up
-  from some components.
+<details>
+<summary> Creating the dataset using `{epidatr}` and `{epiprocess}` </summary>
+This dataset can be found in the package as <TODO DOESN'T EXIST>; we demonstrate some of the typical cleaning operations needed before we can forecast.
+First we pull both jhu-csse cases and deaths from the [`{epidatr}` package](https://cmu-delphi.github.io/epidatr/):
+```{r case_death}
+cases <- pub_covidcast(
+  source = "jhu-csse",
+  signals = "confirmed_incidence_prop",
+  time_type = "day",
+  geo_type = "state",
+  time_values = epirange(20200601, 20220101),
+  geo_values = "*") |>
+  select(geo_value, time_value, case_rate = value)
 
-The Advanced user should find their task to be relatively easy. Examples of
-these tasks are illustrated in the [vignettes and articles](https://cmu-delphi.github.io/epipredict).
+deaths <- pub_covidcast(
+  source = "jhu-csse",
+  signals = "deaths_incidence_prop",
+  time_type = "day",
+  geo_type = "state",
+  time_values = epirange(20200601, 20220101),
+  geo_values = "*") |>
+  select(geo_value, time_value, death_rate = value)
+cases_deaths <-
+  full_join(cases, deaths, by = c("time_value", "geo_value")) |>
+  as_epi_df(as_of = as.Date("2022-01-01"))
+plot_locations <- c("ca", "ma", "ny", "tx")
+# plotting the data as it was downloaded
+cases_deaths |>
+  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
+  ggplot(aes(x = time_value, y = value)) +
+  geom_line() +
+  facet_grid(source ~ geo_value, scale = "free") +
+  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+```
+As with basically any dataset, there is some cleaning we need to do to make it usable; we'll use some utilities from [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this.
+First, to eliminate some of the noise coming from daily reporting, we do 7-day averaging over a trailing window[^1]:
 
-See also the (in progress) [Forecasting Book](https://cmu-delphi.github.io/delphi-tooling-book/).
+[^1]: This makes it so that any given day of the new dataset only depends on the previous week, which means that we avoid leaking future values when making a forecast.
 
-## Intermediate example
+```{r smooth}
+cases_deaths <-
+  cases_deaths |>
+  group_by(geo_value) |>
+  epi_slide(
+    cases_7dav = mean(case_rate, na.rm = TRUE),
+    death_rate_7dav = mean(death_rate, na.rm = TRUE),
+    .window_size = 7
+  ) |>
+  ungroup() |>
+  mutate(case_rate = NULL, death_rate = NULL) |>
+  rename(case_rate = cases_7dav, death_rate = death_rate_7dav)
+```
 
-The package comes with some built-in historical data for illustration, but
-up-to-date versions of this could be downloaded with the
-[`{epidatr}` package](https://cmu-delphi.github.io/epidatr/)
-and processed using
-[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/).[^1]
+Then we trim outliers, most importantly negative values:
+```{r outlier}
+cases_deaths <-
+  cases_deaths |>
+  group_by(geo_value) |>
+  mutate(
+    outlr_death_rate = detect_outlr_rm(time_value, death_rate, detect_negatives = TRUE),
+    outlr_case_rate = detect_outlr_rm(time_value, case_rate, detect_negatives = TRUE)
+  ) |>
+  unnest(cols = starts_with("outlr"), names_sep = "_") |>
+  ungroup() |>
+  mutate(
+    death_rate = outlr_death_rate_replacement,
+    case_rate = outlr_case_rate_replacement) |>
+  select(geo_value, time_value, case_rate, death_rate)
+cases_deaths
+```
+</details>
 
-[^1]: Other epidemiological signals for non-Covid related illnesses are also
-available with [`{epidatr}`](https://github.com/cmu-delphi/epidatr) which
-interfaces directly to Delphi's
-[Epidata API](https://cmu-delphi.github.io/delphi-epidata/)
+After having downloaded and cleaned the data in `cases_deaths`, we plot a subset
+of the states, noting the actual forecast date:
 
-```{r epidf, message=FALSE}
-library(epipredict)
-covid_case_death_rates
+<details>
+<summary> Plot </summary>
+```{r plot_locs}
+plot_locations <- c("ca", "ma", "ny", "tx")
+forecast_date_label <-
+  tibble(
+    geo_value = rep(plot_locations, 2),
+    source = c(rep("case_rate", 4), rep("death_rate", 4)),
+    dates = rep(forecast_date - 7 * 2, 2 * length(plot_locations)),
+    heights = c(rep(150, 4), rep(1.0, 4))
+  )
+processed_data_plot <-
+  cases_deaths |>
+  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
+  ggplot(aes(x = time_value, y = value)) +
+  geom_line() +
+  facet_grid(source ~ geo_value, scale = "free") +
+  geom_vline(aes(xintercept = forecast_date)) +
+  geom_text(
+    data = forecast_date_label, aes(x = dates, label = "forecast\ndate", y = heights), size = 3, hjust = "right") +
+  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+```
+</details>
+```{r show-processed-data, warning=FALSE, echo=FALSE}
+processed_data_plot
 ```
 
-To create and train a simple auto-regressive forecaster to predict the death rate two weeks into the future using past (lagged) deaths and cases, we could use the following function.
-
+To make a forecast, we will use a "canned" simple auto-regressive forecaster to predict the death rate four weeks into the future using past (lagged) deaths and cases:
 ```{r make-forecasts, warning=FALSE}
-two_week_ahead <- arx_forecaster(
-  covid_case_death_rates,
+four_week_ahead <- arx_forecaster(
+  cases_deaths |> filter(time_value <= forecast_date),
   outcome = "death_rate",
   predictors = c("case_rate", "death_rate"),
   args_list = arx_args_list(
     lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
-    ahead = 14
+    ahead = 4 * 7
   )
 )
-two_week_ahead
+four_week_ahead
 ```
 
 In this case, we have used a number of different lags for the case rate, while
-only using 3 weekly lags for the death rate (as predictors). The result is both
+using zero, one, and two weekly lags for the death rate (as predictors). `four_week_ahead` is both
 a fitted model object which could be used any time in the future to create
 different forecasts, as well as a set of predicted values (and prediction
-intervals) for each location 14 days after the last available time value in the
-data.
+intervals) for each location 28 days after the forecast date.
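A quick sketch of accessing those two pieces (both accessors are used later in this README; printing them here is purely for orientation):

```r
# The returned object bundles the fitted pipeline and the forecasts themselves.
four_week_ahead$epi_workflow # preprocessing + model + postprocessing, reusable for later forecasts
four_week_ahead$predictions  # predicted death rates (and intervals) per geo_value
```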
+Plotting the prediction intervals on our subset above[^2]:
+
+[^2]: Alternatively, you could call `autoplot(four_week_ahead)` to get the full collection of forecasts. This is too busy for the space we have for plotting here.
 
-```{r print-model}
-two_week_ahead$epi_workflow
+<details>
+<summary> Plot </summary>
+This is the same kind of plot as `processed_data_plot` above, but with the past data narrowed somewhat:
+```{r}
+narrow_data_plot <-
+  cases_deaths |>
+  filter(time_value > "2021-04-01") |>
+  filter(geo_value %in% c("ca", "ma", "ny", "tx")) |>
+  pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
+  ggplot(aes(x = time_value, y = value)) +
+  geom_line() +
+  facet_grid(source ~ geo_value, scale = "free") +
+  geom_vline(aes(xintercept = forecast_date)) +
+  geom_text(
+    data = forecast_date_label, aes(x = dates, label = "forecast\ndate", y = heights), size = 3, hjust = "right") +
+  scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
 ```
 
-The fitted model here involved preprocessing the data to appropriately generate
-lagged predictors, estimating a linear model with `stats::lm()` and then
-postprocessing the results to be meaningful for epidemiological tasks. We can
-also examine the predictions.
+Putting that together with a plot of the prediction bands and of the median prediction:
+```{r plotting_forecast, warning=FALSE}
+epiworkflow <- four_week_ahead$epi_workflow
+restricted_predictions <-
+  four_week_ahead$predictions |>
+  filter(geo_value %in% plot_locations) |>
+  rename(time_value = target_date, value = .pred) |>
+  mutate(source = "death_rate")
+forecast_plot <-
+  narrow_data_plot |>
+  epipredict:::plot_bands(
+    restricted_predictions,
+    fill = "dodgerblue4") +
+  geom_point(data = restricted_predictions, aes(y = .data$value), color = "orange")
+```
+</details>
 
-```{r show-preds}
-two_week_ahead$predictions
+```{r show-single-forecast, warning=FALSE, echo=FALSE}
+forecast_plot
 ```
+The orange dot gives the median prediction, while the blue intervals give the 25-75%, 10-90%, and 2.5-97.5% inter-quantile ranges.
+For this particular day and these locations, the forecasts are relatively accurate, with the true data falling within the 25-75% interval.
+A couple of things to note:
 
-The results above show a distributional forecast produced using data through
-the end of 2021 for the 14th of January 2022. A prediction for the death rate
-per 100K inhabitants is available for every state (`geo_value`) along with a
-90% predictive interval.
+1. Our methods are primarily direct forecasters; this means we don't need to
+   predict 1, 2, ..., 27 days ahead in order to predict 28 days ahead.
+2. All of our existing engines are geo-pooled, meaning the training data is
+   shared across geographies. This has the advantage of increasing the amount of
+   available training data, with the restriction that the data needs to be on
+   comparable scales, such as rates.
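As a sketch of the first point, the same canned forecaster can be called once per horizon, with each fit made directly at its target rather than by iterating shorter-horizon predictions (this reuses `cases_deaths` and `forecast_date` from above; the particular horizons are arbitrary):

```r
# Direct forecasts at several horizons; each call is independent of the others.
horizons <- c(7, 14, 21, 28)
multi_horizon_predictions <- lapply(horizons, function(h) {
  arx_forecaster(
    cases_deaths |> filter(time_value <= forecast_date),
    outcome = "death_rate",
    predictors = c("case_rate", "death_rate"),
    args_list = arx_args_list(ahead = h)
  )$predictions
})
```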
 
+## Getting Help
+If you encounter a bug or have a feature request, feel free to file an [issue on
+our GitHub page](https://github.com/cmu-delphi/epipredict/issues).
+For other
+questions, feel free to contact [Daniel]([email protected]), [David]([email protected]), [Dmitry]([email protected]), or
+[Logan]([email protected]), either via email or on the Insightnet Slack.
