diff --git a/DESCRIPTION b/DESCRIPTION index 81a35b30e..0fcae7990 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -39,7 +39,7 @@ Imports: lifecycle, lubridate, magrittr, - recipes (>= 1.0.4), + recipes (>= 1.1.1), rlang (>= 1.1.0), stats, tibble, @@ -53,6 +53,7 @@ Suggests: epidatr (>= 1.0.0), fs, grf, + here, knitr, poissonreg, purrr,
diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index f710c2842..c52032cda 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -35,10 +35,14 @@ R -e 'devtools::document()' R -e 'pkgdown::build_site()' ``` +Note that sometimes the caches from either `pkgdown` or `knitr` can cause +difficulties. To clear those, run `make` with either `clean_knitr`, +`clean_site`, or `clean` (which does both). + If you work without R Studio and want to iterate on documentation, you might find [this script](https://gist.github.com/gadenbuie/d22e149e65591b91419e41ea5b2e0621) -helpful. +helpful. For updating references, you will need to manually call `pkgdown::build_reference()`. ## Versioning
diff --git a/Makefile b/Makefile new file mode 100644 index 000000000..9f5790aca --- /dev/null +++ b/Makefile @@ -0,0 +1,14 @@ +## +# epipredict docs build +# + +# knitr doesn't actually clean its own cache properly; this just deletes any of +# the article knitr caches in vignettes or the base +clean_knitr: + rm -r *_cache; rm -r vignettes/*_cache +clean_site: + Rscript -e "pkgdown::clean_cache(); pkgdown::clean_site()" +# this combines both of the above +clean: clean_knitr clean_site + +# end
diff --git a/NEWS.md b/NEWS.md index de698ee96..512de8d76 100644 --- a/NEWS.md +++ b/NEWS.md @@ -38,6 +38,7 @@ Pre-1.0.0 numbering scheme: 0.x will indicate releases, while 0.0.x will indicat - Replace `dist_quantiles()` with `hardhat::quantile_pred()` - Allow `quantile()` to threshold to an interval if desired (#434) - `arx_forecaster()` detects if there's enough data to predict +- Add `plot_data` to `autoplot` so that forecasts can be plotted against the values they're predicting ## Bug fixes @@ -69,7 +70,7 @@ Pre-1.0.0 numbering scheme: 0.x will indicate releases, while 0.0.x will indicat - training window step debugged - `min_train_window` argument removed from canned forecasters - add forecasters -- implement postprocessing +- implement post-processing - vignettes avaliable - arx_forecaster - pkgdown
diff --git a/R/arx_classifier.R b/R/arx_classifier.R index bc8783610..9122435b1 100644 --- a/R/arx_classifier.R +++ b/R/arx_classifier.R @@ -1,8 +1,106 @@ #' Direct autoregressive classifier with covariates #' -#' This is an autoregressive classification model for -#' [epiprocess::epi_df][epiprocess::as_epi_df] data. It does "direct" forecasting, meaning -#' that it estimates a class at a particular target horizon. +#' +#' @description +#' This is an autoregressive classification model for continuous data. It does +#' "direct" forecasting, meaning that it estimates a class at a particular +#' target horizon. +#' +#' @details +#' The `arx_classifier()` is an autoregressive classification model for `epi_df` +#' data that is used to predict a discrete class for each case under +#' consideration. It is a direct forecaster in that it estimates the classes +#' at a specific horizon or ahead value. +#' +#' To get a sense of how the `arx_classifier()` works, let's consider a simple +#' example with minimal inputs. For this, we will use the built-in +#' `covid_case_death_rates` that contains confirmed COVID-19 cases and deaths +#' from JHU CSSE for all states from Dec 31, 2020 to Dec 31, 2021.
From this, +#' we'll take a subset of data for five states from June 4, 2021 to December +#' 31, 2021. Our objective is to predict whether the case rates are increasing +#' when considering the 0, 7, and 14 day case rates: +#' +#' ```{r} +#' jhu <- covid_case_death_rates %>% +#' filter( +#' time_value >= "2021-06-04", +#' time_value <= "2021-12-31", +#' geo_value %in% c("ca", "fl", "tx", "ny", "nj") +#' ) +#' +#' out <- arx_classifier(jhu, outcome = "case_rate", predictors = "case_rate") +#' +#' out$predictions +#' ``` +#' +#' The key takeaway from the predictions is that there are two prediction +#' classes: `(-Inf, 0.25]` and `(0.25, Inf)`. This is because, for classification, +#' the classes must be discrete. The discretization of the +#' real-valued outcome is controlled by the `breaks` argument, which defaults +#' to `0.25`. Such breaks will be automatically extended to cover the entire +#' real line. For example, the default break of `0.25` is silently extended to +#' `breaks = c(-Inf, .25, Inf)` and, therefore, results in two classes: +#' `(-Inf, 0.25]` and `(0.25, Inf)`. These two classes are used to discretize +#' the outcome. The conversion of the outcome to such classes is handled +#' internally. So if discrete classes already exist for the outcome in the +#' `epi_df`, then we recommend coding a classifier from scratch using the +#' `epi_workflow` framework for more control. +#' +#' The `trainer` is a `parsnip` model describing the type of estimation such +#' that `mode = "classification"` is enforced. The two typical trainers that +#' are used are `parsnip::logistic_reg()` for two classes or +#' `parsnip::multinom_reg()` for more than two classes. +#' +#' ```{r} +#' workflows::extract_spec_parsnip(out$epi_workflow) +#' ``` +#' +#' From the parsnip model specification, we can see that the trainer used is +#' logistic regression, which is expected for our binary outcome. More +#' complicated trainers like `parsnip::naive_Bayes()` or +#' `parsnip::rand_forest()` may also be used (however, we will stick to the +#' basics in this gentle introduction to the classifier). +#' +#' If you use the default trainer of logistic regression for binary +#' classification and you decide against using the default break of 0.25, then +#' you should only input one break so that there are two classification bins +#' to properly dichotomize the outcome. For example, let's set a break of 0.5 +#' instead of relying on the default of 0.25. We can do this by passing 0.5 to +#' the `breaks` argument in `arx_class_args_list()` as follows: +#' +#' ```{r} +#' out_break_0.5 <- arx_classifier( +#' jhu, +#' outcome = "case_rate", +#' predictors = "case_rate", +#' args_list = arx_class_args_list( +#' breaks = 0.5 +#' ) +#' ) +#' +#' out_break_0.5$predictions +#' ``` +#' Indeed, we can observe that the two `.pred_class` values are now `(-Inf, 0.5]` and +#' `(0.5, Inf)`. See `help(arx_class_args_list)` for other available +#' modifications. +#' +#' Additional arguments that may be supplied to `arx_class_args_list()` include +#' the expected `lags` and `ahead` arguments for an autoregressive-type model. +#' These have default values of 0, 7, and 14 days for the lags of the +#' predictors and 7 days ahead of the forecast date for predicting the +#' outcome. There is also `n_training` to indicate the upper bound for the +#' number of training rows per key.
If you would like some practice with using +#' this, then remove the filtering command to obtain data within "2021-06-04" +#' and "2021-12-31" and instead set `n_training` to be the number of days +#' between these two dates, inclusive of the end points. The end result +#' should be the same. In addition to `n_training`, there are `forecast_date` +#' and `target_date` to specify the date that the forecast is created and the +#' date it is intended for, respectively. We will not dwell on such arguments +#' here as they are not unique to this classifier or absolutely essential to +#' understanding how it operates. The remaining arguments will be discussed +#' organically, as they are needed to serve our purposes. For information on +#' any remaining arguments that are not discussed here, please see the function +#' documentation for a complete list and their definitions. #' #' @inheritParams arx_forecaster #' @param outcome A character (scalar) specifying the outcome (in the #' `epi_df`). @@ -68,9 +166,7 @@ arx_classifier <- function( } forecast_date <- args_list$forecast_date %||% forecast_date_default target_date <- args_list$target_date %||% (forecast_date + args_list$ahead) - preds <- forecast( - wf, - ) %>% + preds <- forecast(wf) %>% as_tibble() %>% select(-time_value) @@ -249,7 +345,7 @@ arx_class_epi_workflow <- function( #' be created using growth rates (as the predictors are) or lagged #' differences. The second case is closer to the requirements for the #' [2022-23 CDC Flusight Hospitalization Experimental Target](https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md). -#' See the Classification Vignette for details of how to create a reasonable +#' See the [Classification chapter from the forecasting book](https://cmu-delphi.github.io/delphi-tooling-book/arx-classifier.html) for details of how to create a reasonable #' baseline for this case. Selecting `"growth_rate"` (the default) uses #' [epiprocess::growth_rate()] to create the outcome using some of the #' additional arguments below. Choosing `"lag_difference"` instead simply
diff --git a/R/arx_forecaster.R b/R/arx_forecaster.R index f988490fd..56034bffa 100644 --- a/R/arx_forecaster.R +++ b/R/arx_forecaster.R @@ -1,26 +1,29 @@ #' Direct autoregressive forecaster with covariates #' #' This is an autoregressive forecasting model for -#' [epiprocess::epi_df][epiprocess::as_epi_df] data. It does "direct" forecasting, meaning -#' that it estimates a model for a particular target horizon. +#' [epiprocess::epi_df][epiprocess::as_epi_df] data. It does "direct" +#' forecasting, meaning that it estimates a model for a particular target +#' horizon of `outcome` based on the lags of the `predictors`. See the [Get +#' started vignette](../articles/epipredict.html) for some worked examples and +#' the [Custom epi_workflows vignette](../articles/custom_epiworkflows.html) for +#' a recreation using a custom `epi_workflow()`. #' #' #' @param epi_data An `epi_df` object #' @param outcome A character (scalar) specifying the outcome (in the `epi_df`). #' @param predictors A character vector giving column(s) of predictor variables. -#' This defaults to the `outcome`. However, if manually specified, only those variables -#' specifically mentioned will be used. (The `outcome` will not be added.) -#' By default, equals the outcome. If manually specified, does not add the -#' outcome variable, so make sure to specify it.
-#' @param trainer A `{parsnip}` model describing the type of estimation. -#' For now, we enforce `mode = "regression"`. -#' @param args_list A list of customization arguments to determine -#' the type of forecasting model. See [arx_args_list()]. +#' This defaults to the `outcome`. However, if manually specified, only those +#' variables specifically mentioned will be used; the `outcome` will not be +#' automatically added, so make sure to specify it if you want it included. +#' @param trainer A `{parsnip}` model describing the type of estimation. For +#' now, we enforce `mode = "regression"`. +#' @param args_list A list of customization arguments to determine the type of +#' forecasting model. See [arx_args_list()]. #' -#' @return A list with (1) `predictions` an `epi_df` of predicted values -#' and (2) `epi_workflow`, a list that encapsulates the entire estimation -#' workflow +#' @return An `arx_fcast`, with the fields `predictions` and `epi_workflow`. +#' `predictions` is an `epi_df` of predicted values while `epi_workflow` is +#' the fitted workflow used to make those predictions. #' @export #' @seealso [arx_fcast_epi_workflow()], [arx_args_list()] #' @@ -29,15 +32,18 @@ #' dplyr::filter(time_value >= as.Date("2021-12-01")) #' #' out <- arx_forecaster( -#' jhu, "death_rate", +#' jhu, +#' "death_rate", #' c("case_rate", "death_rate") #' ) #' -#' out <- arx_forecaster(jhu, "death_rate", +#' out <- arx_forecaster(jhu, +#' "death_rate", #' c("case_rate", "death_rate"), #' trainer = quantile_reg(), #' args_list = arx_args_list(quantile_levels = 1:9 / 10) #' ) +#' out arx_forecaster <- function( epi_data, outcome, @@ -60,7 +66,7 @@ arx_forecaster <- function( forecast_date <- args_list$forecast_date %||% forecast_date_default - preds <- forecast(wf, forecast_date = forecast_date) %>% + preds <- forecast(wf) %>% as_tibble() %>% select(-time_value) @@ -262,10 +268,13 @@ arx_fcast_epi_workflow <- function( #' @param quantile_levels Vector or `NULL`. A vector of probabilities to produce #' prediction intervals. These are created by computing the quantiles of #' training residuals. A `NULL` value will result in point forecasts only. -#' @param symmetrize Logical. The default `TRUE` calculates -#' symmetric prediction intervals. This argument only applies when -#' residual quantiles are used. It is not applicable with -#' `trainer = quantile_reg()`, for example. +#' @param symmetrize Logical. The default `TRUE` calculates symmetric prediction +#' intervals. This argument only applies when residual quantiles are used. It +#' is not applicable with `trainer = quantile_reg()`, for example. This is +#' achieved by including both the residuals and their negations. Typically, one +#' would only want non-symmetric quantiles when increasing trajectories are +#' quite different from decreasing ones, such as a strictly positive variable +#' near zero. #' @param nonneg Logical. The default `TRUE` enforces nonnegative predictions #' by hard-thresholding at 0. #' @param quantile_by_key Character vector. Groups residuals by listed keys
diff --git a/R/autoplot.R b/R/autoplot.R index efad56ffe..236419c4e 100644 --- a/R/autoplot.R +++ b/R/autoplot.R @@ -16,6 +16,8 @@ ggplot2::autoplot #' @param object,x An `epi_workflow` #' @param predictions A data frame with predictions. If `NULL`, only the #' original data is shown. +#' @param plot_data An epi_df of the data to plot against.
This is for the case +#' where you have the actual results to compare the forecast against. #' @param .levels A numeric vector of levels to plot for any prediction bands. #' More than 3 levels begins to be difficult to see. #' @param ... Ignored @@ -81,7 +83,9 @@ NULL #' @export #' @rdname autoplot-epipred autoplot.epi_workflow <- function( - object, predictions = NULL, + object, + predictions = NULL, + plot_data = NULL, .levels = c(.5, .8, .9), ..., .color_by = c("all_keys", "geo_value", "other_keys", ".response", "all", "none"), .facet_by = c(".response", "other_keys", "all_keys", "geo_value", "all", "none"), @@ -109,31 +113,39 @@ autoplot.epi_workflow <- function( } keys <- c("geo_value", "time_value", "key") mold_roles <- names(mold$extras$roles) - edf <- bind_cols(mold$extras$roles[mold_roles %in% keys], y) - if (starts_with_impl("ahead_", names(y))) { + # extract the relevant column names for plotting + if (starts_with_impl("ahead_", names(y)) || starts_with_impl("lag_", names(y))) { old_name_y <- unlist(strsplit(names(y), "_")) - shift <- as.numeric(old_name_y[2]) new_name_y <- paste(old_name_y[-c(1:2)], collapse = "_") - edf <- rename(edf, !!new_name_y := !!names(y)) - } else if (starts_with_impl("lag_", names(y))) { - old_name_y <- unlist(strsplit(names(y), "_")) - shift <- -as.numeric(old_name_y[2]) - new_name_y <- paste(old_name_y[-c(1:2)], collapse = "_") - edf <- rename(edf, !!new_name_y := !!names(y)) } else { new_name_y <- names(y) - shift <- 0 } - - edf <- mutate(edf, time_value = time_value + shift) - other_keys <- key_colnames(object, exclude = c("geo_value", "time_value")) - edf <- as_epi_df(edf, - as_of = object$fit$meta$as_of, - other_keys = other_keys - ) + if (is.null(plot_data)) { + # the outcome has shifted, so we need to shift it forward (or back) + # by the corresponding amount + plot_data <- bind_cols(mold$extras$roles[mold_roles %in% keys], y) + if (starts_with_impl("ahead_", names(y))) { + shift <- as.numeric(old_name_y[2]) + } else if (starts_with_impl("lag_", names(y))) { + old_name_y <- unlist(strsplit(names(y), "_")) + shift <- -as.numeric(old_name_y[2]) + } else { + new_name_y <- names(y) + shift <- 0 + } + plot_data <- rename(plot_data, !!new_name_y := !!names(y)) + if (!is.null(shift)) { + plot_data <- mutate(plot_data, time_value = time_value + shift) + } + other_keys <- setdiff(key_colnames(object), c("geo_value", "time_value")) + plot_data <- as_epi_df(plot_data, + as_of = object$fit$meta$as_of, + other_keys = other_keys + ) + } if (is.null(predictions)) { return(autoplot( - edf, new_name_y, + plot_data, new_name_y, .color_by = .color_by, .facet_by = .facet_by, .base_color = .base_color, .facet_filter = {{ .facet_filter }} )) @@ -145,27 +157,27 @@ autoplot.epi_workflow <- function( } predictions <- rename(predictions, time_value = target_date) } - pred_cols_ok <- hardhat::check_column_names(predictions, key_colnames(edf)) + pred_cols_ok <- hardhat::check_column_names(predictions, key_colnames(plot_data)) if (!pred_cols_ok$ok) { cli_warn(c( "`predictions` is missing required variables: {.var {pred_cols_ok$missing_names}}.", i = "Plotting the original data." 
)) return(autoplot( - edf, !!new_name_y, + plot_data, !!new_name_y, .color_by = .color_by, .facet_by = .facet_by, .base_color = .base_color, .facet_filter = {{ .facet_filter }} )) } # First we plot the history, always faceted by everything - bp <- autoplot(edf, !!new_name_y, + bp <- autoplot(plot_data, !!new_name_y, .color_by = "none", .facet_by = "all_keys", .base_color = "black", .facet_filter = {{ .facet_filter }} ) # Now, prepare matching facets in the predictions - ek <- epi_keys_only(edf) + ek <- epi_keys_only(plot_data) predictions <- predictions %>% mutate( .facets = interaction(!!!rlang::syms(as.list(ek)), sep = " / "), @@ -203,7 +215,7 @@ autoplot.epi_workflow <- function( #' @export #' @rdname autoplot-epipred autoplot.canned_epipred <- function( - object, ..., + object, plot_data = NULL, ..., .color_by = c("all_keys", "geo_value", "other_keys", ".response", "all", "none"), .facet_by = c(".response", "other_keys", "all_keys", "geo_value", "all", "none"), .base_color = "dodgerblue4", @@ -218,7 +230,7 @@ autoplot.canned_epipred <- function( predictions <- object$predictions %>% rename(time_value = target_date) - autoplot(ewf, predictions, + autoplot(ewf, predictions, plot_data, ..., .color_by = .color_by, .facet_by = .facet_by, .base_color = .base_color, .facet_filter = {{ .facet_filter }} ) diff --git a/R/climatological_forecaster.R b/R/climatological_forecaster.R index e29d7d4a3..592fcccc0 100644 --- a/R/climatological_forecaster.R +++ b/R/climatological_forecaster.R @@ -115,10 +115,42 @@ climatological_forecaster <- function(epi_data, mean = function(x, w) mean(x, na.rm = TRUE), median = function(x, w) stats::median(x, na.rm = TRUE) ) - # get the point predictions keys <- key_colnames(epi_data, exclude = "time_value") - epi_data <- epi_data %>% mutate(.idx = time_aggr(time_value), .weights = 1) - climate_center <- epi_data %>% + # Get the prediction geo and .idx for the target date(s) + predictions <- epi_data %>% + select(all_of(keys)) %>% + dplyr::distinct() %>% + mutate(forecast_date = forecast_date, .idx = time_aggr(forecast_date)) + predictions <- + map(horizon, ~ { + predictions %>% + mutate(.idx = .idx + .x, target_date = forecast_date + ttype_dur(.x)) + }) %>% + purrr::list_rbind() %>% + mutate( + .idx = .idx %% modulus, + .idx = dplyr::case_when(.idx == 0 ~ modulus, TRUE ~ .idx) + ) + # get the distinct .idx for the target date(s) + distinct_target_idx <- predictions$.idx %>% unique() + # get all of the idx's within the window of the target .idxs + entries <- map(distinct_target_idx, \(idx) within_window(idx, window_size, modulus)) %>% + do.call(c, .) %>% + unique() + # for the center, we need those within twice the window, since for each point + # we're subtracting out the center to generate the quantiles + entries_double_window <- map(entries, \(idx) within_window(idx, window_size, modulus)) %>% + do.call(c, .) 
%>% + unique() + + epi_data_target <- + epi_data %>% + mutate(.idx = time_aggr(time_value), .weights = 1) + # get the point predictions + climate_center <- + epi_data_target %>% + filter(.idx %in% entries_double_window) %>% + mutate(.idx = time_aggr(time_value), .weights = 1) %>% select(.idx, .weights, all_of(c(outcome, keys))) %>% dplyr::reframe( roll_modular_multivec( @@ -136,7 +168,10 @@ climatological_forecaster <- function(epi_data, probs = args_list$quantile_levels, na.rm = TRUE, type = 8 ))) } - climate_quantiles <- epi_data %>% + # add on the centers and subtract them out before computing the quantiles + climate_quantiles <- + epi_data_target %>% + filter(.idx %in% entries) %>% left_join(climate_center, by = c(".idx", keys)) %>% mutate({{ outcome }} := !!sym_outcome - .pred) %>% select(.idx, .weights, all_of(c(outcome, args_list$quantile_by_key))) %>% @@ -147,31 +182,17 @@ climatological_forecaster <- function(epi_data, ), .by = all_of(args_list$quantile_by_key) ) %>% - rename(.pred_distn = climate_pred) %>% - mutate(.pred_distn = hardhat::quantile_pred(do.call(rbind, .pred_distn), args_list$quantile_levels)) + mutate(.pred_distn = hardhat::quantile_pred(do.call(rbind, climate_pred), args_list$quantile_levels)) %>% + select(-climate_pred) # combine them together climate_table <- climate_center %>% - left_join(climate_quantiles, by = c(".idx", args_list$quantile_by_key)) %>% + inner_join(climate_quantiles, by = c(".idx", args_list$quantile_by_key)) %>% mutate(.pred_distn = .pred_distn + .pred) - # create the predictions - predictions <- epi_data %>% - select(all_of(keys)) %>% - dplyr::distinct() %>% - mutate(forecast_date = forecast_date, .idx = time_aggr(forecast_date)) - predictions <- map(horizon, ~ { - predictions %>% - mutate(.idx = .idx + .x, target_date = forecast_date + ttype_dur(.x)) - }) %>% - purrr::list_rbind() %>% - mutate( - .idx = .idx %% modulus, - .idx = dplyr::case_when(.idx == 0 ~ modulus, TRUE ~ .idx) - ) %>% + predictions <- predictions %>% left_join(climate_table, by = c(".idx", keys)) %>% select(-.idx) if (args_list$nonneg) { - predictions <- mutate( - predictions, + predictions <- predictions %>% mutate( .pred = snap(.pred, 0, Inf), .pred_distn = snap(.pred_distn, 0, Inf) ) diff --git a/R/epi_recipe.R b/R/epi_recipe.R index dae445f53..3f90e40b1 100644 --- a/R/epi_recipe.R +++ b/R/epi_recipe.R @@ -232,9 +232,10 @@ is_epi_recipe <- function(x) { -#' Add an `epi_recipe` to a workflow +#' Given an `epi_recipe`, add it to, remove it from, or update it in an +#' `epi_workflow` #' -#' @seealso [workflows::add_recipe()] +#' @description #' - `add_recipe()` specifies the terms of the model and any preprocessing that #' is required through the usage of a recipe. #' @@ -244,9 +245,9 @@ is_epi_recipe <- function(x) { #' recipe with the new one. #' #' @details -#' `add_epi_recipe` has the same behaviour as -#' [workflows::add_recipe()] but sets a different -#' default blueprint to automatically handle [epiprocess::epi_df][epiprocess::as_epi_df] data. +#' `add_epi_recipe()` has the same behaviour as [workflows::add_recipe()] but +#' sets a different default blueprint to automatically handle +#' `epiprocess::epi_df()` data. #' #' @param x A `workflow` or `epi_workflow` #' @@ -265,6 +266,7 @@ is_epi_recipe <- function(x) { #' `x`, updated with a new recipe preprocessor. 
#' #' @export +#' @seealso [workflows::add_recipe()] #' @examples #' jhu <- covid_case_death_rates %>% #' filter(time_value > "2021-08-01")
diff --git a/R/epi_workflow.R b/R/epi_workflow.R index 81b443e7b..6aac401b2 100644 --- a/R/epi_workflow.R +++ b/R/epi_workflow.R @@ -1,19 +1,20 @@ #' Create an epi_workflow #' #' This is a container object that unifies preprocessing, fitting, prediction, -#' and postprocessing for predictive modeling on epidemiological data. It extends -#' the functionality of a [workflows::workflow()] to handle the typical panel -#' data structures found in this field. This extension is handled completely -#' internally, and should be invisible to the user. For all intents and purposes, -#' this operates exactly like a [workflows::workflow()]. For more details -#' and numerous examples, see there. +#' and post-processing for predictive modeling on epidemiological data. It +#' extends the functionality of a [workflows::workflow()] to handle the typical +#' panel data structures found in this field. This extension is handled +#' completely internally, and should be invisible to the user. For all intents +#' and purposes, this operates exactly like a [workflows::workflow()]. For some +#' `{epipredict}` specific examples, see the [custom epiworkflows +#' vignette](../articles/custom_epiworkflows.html). #' #' @inheritParams workflows::workflow #' @param postprocessor An optional postprocessor to add to the workflow. #' Currently only `frosting` is allowed using, `add_frosting()`. #' #' @return A new `epi_workflow` object. -#' @seealso workflows::workflow +#' @seealso [workflows::workflow()] #' @importFrom rlang is_null #' @importFrom stats predict #' @importFrom generics fit @@ -62,9 +63,9 @@ is_epi_workflow <- function(x) { #' Fit an `epi_workflow` object #' #' @description -#' This is the `fit()` method for an `epi_workflow` object that +#' This is the `fit()` method for an `epi_workflow()` object that #' estimates parameters for a given model from a set of data. -#' Fitting an `epi_workflow` involves two main steps, which are +#' Fitting an `epi_workflow()` involves two main steps, which are #' preprocessing the data and fitting the underlying parsnip model. #' #' @inheritParams workflows::fit.workflow @@ -79,7 +80,7 @@ is_epi_workflow <- function(x) { #' @return The `epi_workflow` object, updated with a fit parsnip #' model in the `object$fit$fit` slot. #' -#' @seealso workflows::fit-workflow +#' @seealso [workflows::fit.workflow()] #' #' @name fit-epi_workflow #' @export @@ -111,20 +112,20 @@ fit.epi_workflow <- function(object, data, ..., control = workflows::control_wor #' Predict from an epi_workflow #' #' @description -#' This is the `predict()` method for a fit epi_workflow object. The nice thing -#' about predicting from an epi_workflow is that it will: +#' This is the `predict()` method for a fit epi_workflow object. The three steps that this implements are: #' -#' - Preprocess `new_data` using the preprocessing method specified when the -#' workflow was created and fit. This is accomplished using -#' [hardhat::forge()], which will apply any formula preprocessing or call -#' [recipes::bake()] if a recipe was supplied. +#' - Preprocessing `new_data` using the preprocessing method specified when the +#' epi_workflow was created and fit. This is accomplished using +#' `recipes::bake()` if a recipe was supplied. Note that this is a slightly +#' different `bake` operation than the one occurring during the fit.
Any `step` +#' that has `skip = TRUE` isn't applied during prediction; for example, in +#' `step_epi_naomit()`, `all_outcomes()` isn't `NA`-omitted, since doing so +#' would drop the exact `time_values` we are trying to predict. #' -#' - Call [parsnip::predict.model_fit()] for you using the underlying fit +#' - Calling `parsnip::predict.model_fit()` for you using the underlying fit #' parsnip model. #' -#' - Ensure that the returned object is an [epiprocess::epi_df][epiprocess::as_epi_df] where -#' possible. Specifically, the output will have `time_value` and -#' `geo_value` columns as well as the prediction. +#' - `slather()` any frosting that has been included in the `epi_workflow`. #' #' @param object An epi_workflow that has been fit by #' [workflows::fit.workflow()] @@ -136,7 +137,7 @@ fit.epi_workflow <- function(object, data, ..., control = workflows::control_wor #' #' @return #' A data frame of model predictions, with as many rows as `new_data` has. -#' If `new_data` is an `epi_df` or a data frame with `time_value` or +#' If `new_data` is an `epi_df()` or a data frame with `time_value` or #' `geo_value` columns, then the result will have those as well. #' #' @name predict-epi_workflow @@ -177,6 +178,11 @@ predict.epi_workflow <- function(object, new_data, type = NULL, opts = list(), . #' Augment data with predictions #' +#' `augment()`, unlike `forecast()`, has the goal of modifying the training +#' data, rather than just producing new forecasts. It predicts on +#' `new_data`, which produces a prediction for most `time_values`, and then +#' adds `.pred` as a column to `new_data`, returning the resulting join. +#' #' @param x A trained epi_workflow #' @param new_data A epi_df of predictors #' @param ... Arguments passed on to the predict method. @@ -228,26 +234,31 @@ print.epi_workflow <- function(x, ...) { } -#' Produce a forecast from an epi workflow +#' Produce a forecast from just an epi workflow +#' +#' `forecast.epi_workflow()` forecasts by restricting the training data to the +#' latest available data, and predicting on that. It binds together +#' `get_test_data()` and `predict()`. #' #' @param object An epi workflow. #' @param ... Not used. -#' @param n_recent Integer or NULL. If filling missing data with locf = TRUE, -#' how far back are we willing to tolerate missing data? Larger values allow -#' more filling. The default NULL will determine this from the the recipe. For -#' example, suppose n_recent = 3, then if the 3 most recent observations in any -#' geo_value are all NA’s, we won’t be able to fill anything, and an error -#' message will be thrown. (See details.) -#' @param forecast_date By default, this is set to the maximum time_value in x. -#' But if there is data latency such that recent NA's should be filled, this may -#' be after the last available time_value. #' #' @return A forecast tibble. #' #' @export -forecast.epi_workflow <- function(object, ..., n_recent = NULL, forecast_date = NULL) { - rlang::check_dots_empty() - +#' @examples +#' jhu <- covid_case_death_rates %>% +#' filter(time_value > "2021-08-01") +#' +#' r <- epi_recipe(jhu) %>% +#' step_epi_lag(death_rate, lag = c(0, 7, 14)) %>% +#' step_epi_ahead(death_rate, ahead = 7) %>% +#' step_epi_naomit() +#' +#' epi_workflow(r, parsnip::linear_reg()) %>% +#' fit(jhu) %>% +#' forecast() forecast.epi_workflow <- function(object, ...)
{ if (!object$trained) { cli_abort(c( "You cannot `forecast()` a {.cls workflow} that has not been trained.",
diff --git a/R/extrapolate_quantiles.R b/R/extrapolate_quantiles.R index c7a9a3b6b..ec68d8256 100644 --- a/R/extrapolate_quantiles.R +++ b/R/extrapolate_quantiles.R @@ -1,4 +1,16 @@ -#' Summarize a distribution with a set of quantiles +#' Extrapolate the quantiles to new quantile levels +#' +#' This both interpolates between quantile levels already defined in `x` and +#' extrapolates quantiles outside their bounds. The interpolation method is +#' determined by the `middle` argument of the `quantile()` method, which can be +#' either `"cubic"` for a (Hyman) cubic spline interpolation, or `"linear"` for +#' simple linear interpolation. +#' +#' There is only one extrapolation method for values greater than the largest +#' known quantile level or smaller than the smallest known quantile level. It +#' assumes a roughly exponential tail, whose decay rate and offset are derived +#' from the slope of the two most extreme quantile levels on a logistic scale. +#' See the internal function `tail_extrapolate()` for the exact implementation. #' #' This function takes a `quantile_pred` vector and returns the same #' type of object, expanded to include @@ -20,7 +32,9 @@ #' @examples #' dstn <- quantile_pred(rbind(1:4, 8:11), c(.2, .4, .6, .8)) #' # extra quantiles are appended -#' as_tibble(extrapolate_quantiles(dstn, probs = c(.25, 0.5, .75))) +#' as_tibble(extrapolate_quantiles(dstn, probs = c(0.25, 0.5, 0.75))) +#' +#' extrapolate_quantiles(dstn, probs = c(0.0001, 0.25, 0.5, 0.75, 0.99999)) extrapolate_quantiles <- function(x, probs, replace_na = TRUE, ...) { UseMethod("extrapolate_quantiles") }
diff --git a/R/flatline_forecaster.R b/R/flatline_forecaster.R index bf7ecb5b0..617d703e7 100644 --- a/R/flatline_forecaster.R +++ b/R/flatline_forecaster.R @@ -1,18 +1,41 @@ #' Predict the future with today's value #' -#' This is a simple forecasting model for -#' [epiprocess::epi_df][epiprocess::as_epi_df] data. It uses the most recent -#' observation as the -#' forecast for any future date, and produces intervals based on the quantiles -#' of the residuals of such a "flatline" forecast over all available training -#' data. +#' @description This is a simple forecasting model for +#' [epiprocess::epi_df][epiprocess::as_epi_df] data. It uses the most recent +#' observation as the forecast for any future date, and produces intervals +#' based on the quantiles of the residuals of such a "flatline" forecast over +#' all available training data. #' #' By default, the predictive intervals are computed separately for each -#' combination of key values (`geo_value` + any additional keys) in the -#' `epi_data` argument. +#' combination of key values (`geo_value` + any additional keys) in the +#' `epi_data` argument. #' #' This forecaster is very similar to that used by the -#' [COVID19ForecastHub](https://covid19forecasthub.org) +#' [COVID19ForecastHub](https://covid19forecasthub.org) +#' +#' @details +#' Here is (roughly) the code for the `flatline_forecaster()` applied to the +#' `case_rate` for `epidatasets::covid_case_death_rates`.
+#' +#' ```{r} +#' jhu <- covid_case_death_rates %>% +#' filter(time_value > "2021-11-01", geo_value %in% c("ak", "ca", "ny")) +#' r <- epi_recipe(covid_case_death_rates) %>% +#' step_epi_ahead(case_rate, ahead = 7, skip = TRUE) %>% +#' recipes::update_role(case_rate, new_role = "predictor") %>% +#' recipes::add_role(all_of(key_colnames(jhu)), new_role = "predictor") +#' +#' f <- frosting() %>% +#' layer_predict() %>% +#' layer_residual_quantiles() %>% +#' layer_add_forecast_date() %>% +#' layer_add_target_date() %>% +#' layer_threshold(starts_with(".pred")) +#' +#' eng <- linear_reg() %>% set_engine("flatline") +#' wf <- epi_workflow(r, eng, f) %>% fit(jhu) +#' preds <- forecast(wf) +#' ``` #' #' @param epi_data An [epiprocess::epi_df][epiprocess::as_epi_df] #' @param outcome A scalar character for the column name we wish to predict. diff --git a/R/frosting.R b/R/frosting.R index cb0fa916e..65115cbcf 100644 --- a/R/frosting.R +++ b/R/frosting.R @@ -1,4 +1,5 @@ -#' Add frosting to a workflow +#' Given a `frosting()`, add it to, remove it from, or update it in an +#' `epi_workflow` #' #' @param x A workflow #' @param frosting A frosting object created using `frosting()`. @@ -246,10 +247,10 @@ new_frosting <- function() { } -#' Create frosting for postprocessing predictions +#' Create frosting for post-processing predictions #' -#' This generates a postprocessing container (much like `recipes::recipe()`) -#' to hold steps for postprocessing predictions. +#' This generates a post-processing container (much like `recipes::recipe()`) +#' to hold steps for post-processing predictions. #' #' The arguments are currently placeholders and must be NULL #' @@ -260,7 +261,7 @@ new_frosting <- function() { #' @export #' #' @examples -#' # Toy example to show that frosting can be created and added for postprocessing +#' # Toy example to show that frosting can be created and added for post-processing #' f <- frosting() #' wf <- epi_workflow() %>% add_frosting(f) #' @@ -322,9 +323,9 @@ extract_frosting.epi_workflow <- function(x, ...) { } } -#' Apply postprocessing to a fitted workflow +#' Apply post-processing to a fitted workflow #' -#' This function is intended for internal use. It implements postprocessing +#' This function is intended for internal use. It implements post-processing #' inside of the `predict()` method for a fitted workflow. #' #' @param workflow An object of class workflow @@ -342,7 +343,7 @@ apply_frosting <- function(workflow, ...) { apply_frosting.default <- function(workflow, components, ...) { if (has_postprocessor(workflow)) { cli_abort(c( - "Postprocessing is only available for epi_workflows currently.", + "Post-processing is only available for epi_workflows currently.", i = "Can you use `epi_workflow()` instead of `workflow()`?" )) } diff --git a/R/get_test_data.R b/R/get_test_data.R index 442272a2f..fd01f10e2 100644 --- a/R/get_test_data.R +++ b/R/get_test_data.R @@ -1,10 +1,9 @@ #' Get test data for prediction based on longest lag period #' -#' Based on the longest lag period in the recipe, -#' `get_test_data()` creates an [epi_df][epiprocess::as_epi_df] -#' with columns `geo_value`, `time_value` -#' and other variables in the original dataset, -#' which will be used to create features necessary to produce forecasts. +#' If `predict()` is given the full training dataset, it will produce a forecast +#' for every day which has enough data. For most cases, this is far more +#' forecasts than is necessary. 
`get_test_data()` is designed to restrict the given dataset to the minimum amount needed to produce a forecast on the `forecast_date`. +#' Primarily this is based on the longest lag period in the recipe. #' #' The minimum required (recent) data to produce a forecast is equal to #' the maximum lag requested (on any predictor) plus the longest horizon diff --git a/R/layer_add_forecast_date.R b/R/layer_add_forecast_date.R index 3e62bafb0..72bc33703 100644 --- a/R/layer_add_forecast_date.R +++ b/R/layer_add_forecast_date.R @@ -1,4 +1,4 @@ -#' Postprocessing step to add the forecast date +#' Post-processing step to add the forecast date #' #' @param frosting a `frosting` postprocessor #' @param forecast_date The forecast date to add as a column to the `epi_df`. @@ -7,7 +7,7 @@ #' values. If there is a `step_adjust_latency` step present, it uses the #' `forecast_date` as set in that function. Otherwise, it uses the maximum #' `time_value` across the data used for pre-processing, fitting the model, -#' and postprocessing. +#' and post-processing. #' @param id a random id string #' #' @return an updated `frosting` postprocessor @@ -15,9 +15,9 @@ #' @details To use this function, either specify a forecast date or leave the #' forecast date unspecifed here. In the latter case, the forecast date will #' be set as the maximum time value from the data used in pre-processing, -#' fitting the model, and postprocessing. In any case, when the forecast date is +#' fitting the model, and post-processing. In any case, when the forecast date is #' less than the maximum `as_of` value (from the data used pre-processing, -#' model fitting, and postprocessing), an appropriate warning will be thrown. +#' model fitting, and post-processing), an appropriate warning will be thrown. #' #' @export #' @examples diff --git a/R/layer_add_target_date.R b/R/layer_add_target_date.R index bd97862ca..8c60dfbfc 100644 --- a/R/layer_add_target_date.R +++ b/R/layer_add_target_date.R @@ -1,4 +1,4 @@ -#' Postprocessing step to add the target date +#' Post-processing step to add the target date #' #' @param frosting a `frosting` postprocessor #' @param target_date The target date to add as a column to the `epi_df`. If @@ -6,21 +6,22 @@ #' `step_adjust_latency` or in a `layer_forecast_date`), then it is the #' forecast date plus `ahead` (from `step_epi_ahead` in the `epi_recipe`). #' Otherwise, it is the maximum `time_value` (from the data used in -#' pre-processing, fitting the model, and postprocessing) plus `ahead`, where +#' pre-processing, fitting the model, and post-processing) plus `ahead`, where #' `ahead` has been specified in preprocessing. The user may override these by #' specifying a target date of their own (of the form "yyyy-mm-dd"). #' @param id a random id string #' #' @return an updated `frosting` postprocessor #' -#' @details By default, this function assumes that a value for `ahead` -#' has been specified in a preprocessing step (most likely in -#' `step_epi_ahead`). Then, `ahead` is added to the `forecast_date` -#' in the test data to get the target date. `forecast_date` can be set in 3 ways: -#' 1. `step_adjust_latency`, which typically uses the training `epi_df`'s `as_of` -#' 2. `layer_add_forecast_date`, which inherits from 1 if not manually specifed -#' 3. if none of those are the case, it is simply the maximum `time_value` over -#' every dataset used (prep, training, and prediction). 
+#' @details By default, this function assumes that a value for `ahead` has been +#' specified in a preprocessing step (most likely in `step_epi_ahead`). Then, +#' `ahead` is added to the `forecast_date` in the test data to get the target +#' date. `forecast_date` itself can be set in 3 ways: +#' 1. The default `forecast_date` is simply the maximum `time_value` over every +#' dataset used (prep, training, and prediction). +#' 2. If `step_adjust_latency` is present, it will typically use the training +#' `epi_df`'s `as_of`. +#' 3. `layer_add_forecast_date`, which inherits from 2 if not manually specified. #' #' @export #' @examples
diff --git a/R/layer_point_from_distn.R b/R/layer_point_from_distn.R index a67e3e079..07f524470 100644 --- a/R/layer_point_from_distn.R +++ b/R/layer_point_from_distn.R @@ -1,6 +1,6 @@ #' Converts distributional forecasts to point forecasts #' -#' This function adds a postprocessing layer to extract a point forecast from +#' This function adds a post-processing layer to extract a point forecast from #' a distributional forecast. NOTE: With default arguments, this will remove #' information, so one should usually call this AFTER `layer_quantile_distn()` #' or set the `name` argument to something specific.
diff --git a/R/layer_population_scaling.R b/R/layer_population_scaling.R index a3183a0ae..91aae77c6 100644 --- a/R/layer_population_scaling.R +++ b/R/layer_population_scaling.R @@ -1,15 +1,15 @@ #' Convert per-capita predictions to raw scale #' -#' `layer_population_scaling` creates a specification of a frosting layer -#' that will "undo" per-capita scaling. Typical usage would -#' load a dataset that contains state-level population, and use it to convert -#' predictions made from a rate-scale model to raw scale by multiplying by -#' the population. -#' Although, it is worth noting that there is nothing special about "population". -#' The function can be used to scale by any variable. Population is the -#' standard use case in the epidemiology forecasting scenario. Any value -#' passed will *multiply* the selected variables while the `rate_rescaling` -#' argument is a common *divisor* of the selected variables. +#' `layer_population_scaling` creates a specification of a frosting layer that +#' will "undo" per-capita scaling done in `step_population_scaling()`. Typical +#' usage would set `df` to be a dataset that contains state-level population, +#' and use it to convert predictions made from a rate-scale model back to the +#' raw scale by multiplying by the population (a pairing sketched below). +#' It is worth noting, though, that there is nothing special about +#' "population", and the function can be used to scale by any variable. +#' Population is the standard use case in the epidemiology forecasting scenario. +#' Any value passed will *multiply* the selected variables while the +#' `rate_rescaling` argument is a common *divisor* of the selected variables. #' #' @param frosting a `frosting` postprocessor. The layer will be added to the #' sequence of operations for this frosting. #' @param ... <[`tidy-select`][dplyr::dplyr_tidy_select]> One or more unquoted #' for this step. See [recipes::selections()] for more details. #' @param df a data frame that contains the population data to be used for #' inverting the existing scaling. -#' @param by A (possibly named) character vector of variables to join by. +#' @param by A (possibly named) character vector of variables by which to join +#' `df` onto the `epi_df`. #' #' If `NULL`, the default, the function will try to infer a reasonable set of #' columns. First, it will try to join by all variables in the test data with
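To illustrate the step/layer pairing sketched above, here is a minimal, hypothetical example of scaling down before fitting and back up after predicting. The toy `pop_df` (with a `states` column matching `geo_value`) is invented for illustration, and the argument names (`df`, `df_pop_col`, `by`, `suffix`, `create_new`) follow the documentation of `step_population_scaling()` and `layer_population_scaling()`:

```r
library(epipredict)
library(dplyr)

# hypothetical population table; `states` lines up with `geo_value`
pop_df <- data.frame(states = c("ca", "ny"), value = c(39000000, 19500000))

jhu <- covid_case_death_rates %>%
  filter(time_value > "2021-08-01", geo_value %in% c("ca", "ny"))

r <- epi_recipe(jhu) %>%
  # divide by population before fitting (per-capita scale)
  step_population_scaling(
    case_rate,
    df = pop_df, df_pop_col = "value",
    by = c("geo_value" = "states"), suffix = "_scaled"
  ) %>%
  step_epi_lag(case_rate_scaled, lag = c(0, 7, 14)) %>%
  step_epi_ahead(case_rate_scaled, ahead = 7) %>%
  step_epi_naomit()

f <- frosting() %>%
  layer_predict() %>%
  layer_naomit(.pred) %>%
  # multiply predictions back by population, "undoing" the scaling
  layer_population_scaling(
    .pred,
    df = pop_df, df_pop_col = "value",
    by = c("geo_value" = "states"), create_new = FALSE
  )

wf <- epi_workflow(r, parsnip::linear_reg(), f) %>% fit(jhu)
forecast(wf)
```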
First, it will try to join by all variables in the test data with diff --git a/R/layer_predict.R b/R/layer_predict.R index 0f4c33e11..623ab3391 100644 --- a/R/layer_predict.R +++ b/R/layer_predict.R @@ -1,11 +1,11 @@ -#' Prediction layer for postprocessing +#' Prediction layer for post-processing #' #' Implements prediction on a fitted `epi_workflow`. One may want different #' types of prediction, and to potentially apply this after some amount of -#' postprocessing. This would typically be the first layer in a `frosting` +#' post-processing. This would typically be the first layer in a `frosting` #' postprocessor. #' -#' @seealso `parsnip::predict.model_fit()` +#' @seealso [parsnip::predict.model_fit()] #' #' @inheritParams parsnip::predict.model_fit #' @param frosting a frosting object diff --git a/R/layer_predictive_distn.R b/R/layer_predictive_distn.R index 824593f8d..8b6e170ab 100644 --- a/R/layer_predictive_distn.R +++ b/R/layer_predictive_distn.R @@ -5,9 +5,11 @@ #' This function calculates an _approximation_ to a parametric predictive #' distribution. Predictive distributions from linear models require #' `x* (X'X)^{-1} x*` -#' along with the degrees of freedom. This function approximates both. It -#' should be reasonably accurate for models fit using `lm` when the new point -#' `x*` isn't too far from the bulk of the data. +#' along with the degrees of freedom. This function approximates both. It should +#' be reasonably accurate for models fit using `lm` when the new point `x*` +#' isn't too far from the bulk of the data. Outside of that specific case, it is +#' recommended to use `layer_residual_quantiles()`, or if you are working with a +#' model that produces distributional predictions, use `layer_quantile_distn()`. #' #' @param frosting a `frosting` postprocessor #' @param ... Unused, include for consistency with other layers. diff --git a/R/layer_quantile_distn.R b/R/layer_quantile_distn.R index e07713e00..a0af380a2 100644 --- a/R/layer_quantile_distn.R +++ b/R/layer_quantile_distn.R @@ -1,6 +1,8 @@ #' Returns predictive quantiles #' #' This function calculates quantiles when the prediction was _distributional_. +#' If the model producing the forecast is not distributional, it is recommended +#' to use `layer_residual_quantiles()` instead. #' #' Currently, the only distributional modes/engines are #' * `quantile_reg()` diff --git a/R/layer_residual_quantiles.R b/R/layer_residual_quantiles.R index 6b32d2921..779bf36b6 100644 --- a/R/layer_residual_quantiles.R +++ b/R/layer_residual_quantiles.R @@ -1,11 +1,20 @@ #' Creates predictions based on residual quantiles #' +#' This function calculates quantiles based on the empirical quantiles of the +#' model's residuals. If the model producing the forecast is distributional, it +#' is recommended to use `layer_residual_quantiles()` instead, as those will be +#' significantly more accurate. +#' #' @param frosting a `frosting` postprocessor #' @param ... Unused, include for consistency with other layers. #' @param quantile_levels numeric vector of probabilities with values in (0,1) #' referring to the desired quantile. Note that 0.5 will always be included #' even if left out by the user. -#' @param symmetrize logical. If `TRUE` then interval will be symmetric. +#' @param symmetrize logical. If `TRUE` then the interval will be symmetric. +#' This is achieved by including both the residuals and their negations. 
+#' Typically, one would only want non-symmetric quantiles when increasing +#' trajectories are quite different from decreasing ones, such as a strictly +#' positive variable near zero. #' @param by_key A character vector of keys to group the residuals by before #' calculating quantiles. The default, `c()` performs no grouping. #' @param name character. The name for the output column. @@ -91,7 +100,7 @@ slather.layer_residual_quantiles <- return(components) } - s <- ifelse(object$symmetrize, -1, NA) + symmetric <- ifelse(object$symmetrize, -1, NA) r <- grab_residuals(the_fit, components) ## Handle any grouping requests @@ -126,7 +135,7 @@ slather.layer_residual_quantiles <- r <- r %>% summarize(dstn = quantile_pred(matrix(quantile( - c(.resid, s * .resid), + c(.resid, symmetric * .resid), probs = object$quantile_levels, na.rm = TRUE ), nrow = 1), quantile_levels = object$quantile_levels)) # Check for NA
diff --git a/R/layer_threshold_preds.R b/R/layer_threshold_preds.R index 2869fff07..9e8d3bbae 100644 --- a/R/layer_threshold_preds.R +++ b/R/layer_threshold_preds.R @@ -1,8 +1,16 @@ #' Lower and upper thresholds for predicted values #' -#' This postprocessing step is used to set prediction values that are -#' smaller than the lower threshold or higher than the upper threshold equal -#' to the threshold values. +#' This post-processing step is used to set prediction values that are smaller +#' than the lower threshold or higher than the upper threshold equal to the +#' threshold values. +#' +#' @details +#' Making case count predictions strictly positive is a typical example usage. +#' It can be called before or after the quantiles are created using +#' `layer_quantile_distn()`, since the quantiles are an inherent part of the +#' result from `layer_predict()` for distributional models, but must be called +#' after `layer_residual_quantiles()`, since the quantiles for that case don't +#' exist until after that layer. #' #' @param frosting a `frosting` postprocessor #' @param ... <[`tidy-select`][dplyr::dplyr_tidy_select]> One or more unquoted
diff --git a/R/layer_unnest.R b/R/layer_unnest.R index a6fc9f0af..ffac480e9 100644 --- a/R/layer_unnest.R +++ b/R/layer_unnest.R @@ -1,5 +1,12 @@ #' Unnest prediction list-cols #' +#' For any model that produces forecasts for multiple outcomes, such as multiple +#' aheads, the resulting prediction is a list of forecasts inside a column of +#' the prediction tibble, which is not an ideal format. This layer "lengthens" +#' the result, moving each outcome to a separate row, in the same manner as +#' `tidyr::unnest()` would. At the moment, the only such engine is +#' `smooth_quantile_reg()`. +#' #' @param frosting a `frosting` postprocessor #' @param ... <[`tidy-select`][dplyr::dplyr_tidy_select]> One or more unquoted #' expressions separated by commas.
Variable names can be used as if they @@ -9,6 +16,35 @@ #' #' @return an updated `frosting` postprocessor #' @export +#' @examples +#' jhu <- covid_case_death_rates %>% +#' filter(time_value > "2021-11-01", geo_value %in% c("ak", "ca", "ny")) +#' +#' aheads <- 1:7 +#' +#' r <- epi_recipe(jhu) %>% +#' step_epi_lag(death_rate, lag = c(0, 7, 14)) %>% +#' step_epi_ahead(death_rate, ahead = aheads) %>% +#' step_epi_naomit() +#' +#' wf <- epi_workflow( +#' r, +#' smooth_quantile_reg( +#' quantile_levels = c(.05, .1, .25, .5, .75, .9, .95), +#' outcome_locations = aheads +#' ) +#' ) %>% +#' fit(jhu) +#' +#' f <- frosting() %>% +#' layer_predict() %>% +#' layer_naomit() %>% +#' layer_unnest(.pred) +#' +#' wf1 <- wf %>% add_frosting(f) +#' +#' p <- forecast(wf1) +#' p layer_unnest <- function(frosting, ..., id = rand_id("unnest")) { arg_is_chr_scalar(id) diff --git a/R/layers.R b/R/layers.R index 01cb19dba..b35dceaf2 100644 --- a/R/layers.R +++ b/R/layers.R @@ -178,11 +178,14 @@ detect_layer.workflow <- function(x, name, ...) { #' Spread a layer of frosting on a fitted workflow #' -#' Slathering frosting means to implement a postprocessing layer. When -#' creating a new postprocessing layer, you must implement an S3 method -#' for this function -#' -#' @param object a workflow with `frosting` postprocessing steps +#' Slathering frosting means to implement a post-processing layer. It is the +#' post-processing equivalent of `bake` for a recipe. Given a layer, it applies +#' the actual transformation of that layer. When creating a new post-processing +#' layer, you must implement an S3 method for this function. Generally, you will +#' not need to call this function directly, as it will be used indirectly during +#' `predict`. +#' +#' @param object a workflow with `frosting` post-processing steps #' @param components a list of components containing model information. These #' will be updated and returned by the layer. These should be #' * `mold` - the output of calling `hardhat::mold()` on the workflow. This @@ -200,7 +203,8 @@ detect_layer.workflow <- function(x, name, ...) { #' #' @param ... additional arguments used by methods. Currently unused. #' -#' @return The `components` list. In the same format after applying any updates. +#' @return The `components` list, in the same format as before, after applying +#' any updates. #' @export slather <- function(object, components, workflow, new_data, ...) { UseMethod("slather") diff --git a/R/pivot_quantiles.R b/R/pivot_quantiles.R index e4b5b3320..75fff6e3e 100644 --- a/R/pivot_quantiles.R +++ b/R/pivot_quantiles.R @@ -34,9 +34,9 @@ nested_quantiles <- function(x) { #' Pivot a column containing `quantile_pred` longer #' -#' A column that contains `quantile_pred` will be "lengthened" with -#' the quantile levels serving as 1 column and the values as another. If -#' multiple columns are selected, these will be prefixed with the column name. +#' Selected columns that contain `quantile_pred` will be "lengthened" with the +#' `quantile_level`s in one column and the `value`s in another. If multiple +#' columns are selected, these will be prefixed with the column name. #' #' @param .data A data frame, or a data frame extension such as a tibble or #' epi_df. @@ -68,10 +68,10 @@ pivot_quantiles_longer <- function(.data, ...) { #' Pivot a column containing `quantile_pred` wider #' -#' Any selected columns that contain `quantile_pred` will be "widened" with -#' the "taus" (quantile) serving as names and the values in the data frame. 
-#' When pivoting multiple columns, the original column name will be used as -#' a prefix. +#' Any selected columns that contain `quantile_pred` will be "widened" with the +#' "taus" (quantile levels) serving as column names and the values in the corresponding +#' column. When pivoting multiple columns, the original column name will be +#' used as a prefix. #' #' @inheritParams pivot_quantiles_longer #'
diff --git a/R/step_adjust_latency.R b/R/step_adjust_latency.R index ae9db6ef2..5b9db2995 100644 --- a/R/step_adjust_latency.R +++ b/R/step_adjust_latency.R @@ -26,15 +26,16 @@ #' ) %>% #' as_epi_df(as_of = as.Date("2015-01-14")) #' ``` -#' If we're looking to predict the value on the 15th, forecasting from the 14th (the `as_of` date above), -#' there are two issues we will need to address: +#' If we're looking to predict the value on the 15th, forecasting from the 14th +#' (the `as_of` date above), there are two issues we will need to address: #' 1. `"ca"` is latent by 2 days, whereas `"ma"` is latent by 1 -#' 2. if we want to use `b` as an exogenous variable, for `"ma"` it is latent by 3 days instead of just 1. +#' 2. if we want to use `b` as an exogenous variable, for `"ma"` it is latent by +#' 3 days instead of just 1. #' -#' Regardless of `method`, `epi_keys_checked="geo_value"` guarantees that the -#' difference between `"ma"` and `"ca"` is accounted for by making the -#' latency adjustment at least 2. For some comparison, here's what the various -#' methods will do: +#' Regardless of `method`, `epi_keys_checked="geo_value"` guarantees that the +#' difference between `"ma"` and `"ca"` is accounted for by making the latency +#' adjustment at least 2. For some comparison, here's what the various methods +#' will do: #' #' ## `locf` #' Short for "last observation carried forward", `locf` assumes that every day
diff --git a/R/step_climate.R b/R/step_climate.R index 6e2817faf..fa505298d 100644 --- a/R/step_climate.R +++ b/R/step_climate.R @@ -338,57 +338,74 @@ print.step_climate <- function(x, width = max(20, options()$width - 30), ...) { } #' group col by .idx values and sum windows around each .idx value -#' @param .idx the relevant periodic part of time value, e.g. the week number -#' @param col the list of values indexed by `.idx` -#' @param weights how much to weigh each particular datapoint -#' @param aggr the aggregation function, probably Quantile, mean or median +#' @param idx_in the relevant periodic part of time value, e.g.
the week number, +#' limited to the relevant range +#' @param col the list of values indexed by `idx_in` +#' @param weights how much to weigh each particular datapoint (also indexed by +#' `idx_in`) +#' @param aggr the aggregation function, probably Quantile, mean, or median #' @param window_size the number of .idx entries before and after to include in #' the aggregation -#' @param modulus the maximum value of `.idx` +#' @param modulus the number of days/weeks/months in the year, not including any +#' leap days/weeks #' @importFrom lubridate %m-% #' @keywords internal -roll_modular_multivec <- function(col, .idx, weights, aggr, window_size, modulus) { - tib <- tibble(col = col, weights = weights, .idx = .idx) |> +roll_modular_multivec <- function(col, idx_in, weights, aggr, window_size, modulus) { + # make a tibble where data gives the list of all datapoints with the + # corresponding .idx + tib <- tibble(col = col, weights = weights, .idx = idx_in) |> arrange(.idx) |> tidyr::nest(data = c(col, weights), .by = .idx) - out <- double(modulus + 1) - for (iter in seq_along(out)) { - # +1 from 1-indexing - entries <- (iter - window_size):(iter + window_size) %% modulus - entries[entries == 0] <- modulus - # note that because we are 1-indexing, we're looking for indices that are 1 - # larger than the actual day/week in the year - if (modulus == 365) { - # we need to grab just the window around the leap day on the leap day - if (iter == 366) { - # there's an extra data point in front of the leap day - entries <- (59 - window_size):(59 + window_size - 1) %% modulus - entries[entries == 0] <- modulus - # adding in the leap day itself - entries <- c(entries, 999) - } else if ((59 %in% entries) || (60 %in% entries)) { - # if we're on the Feb/March boundary for daily data, we need to add in the - # leap day data - entries <- c(entries, 999) - } - } else if (modulus == 52) { - # we need to grab just the window around the leap week on the leap week - if (iter == 53) { - entries <- (53 - window_size):(53 + window_size - 1) %% 52 - entries[entries == 0] <- 52 - entries <- c(entries, 999) - } else if ((52 %in% entries) || (1 %in% entries)) { - # if we're on the year boundary for weekly data, we need to add in the - # leap week data (which is the extra week at the end) - entries <- c(entries, 999) - } - } - out[iter] <- with( + # storage for the results, includes all possible time indexes + out <- tibble(.idx = c(1:modulus, 999), climate_pred = double(modulus + 1)) + for (tib_idx in tib$.idx) { + entries <- within_window(tib_idx, window_size, modulus) + out$climate_pred[out$.idx == tib_idx] <- with( purrr::list_rbind(tib %>% filter(.idx %in% entries) %>% pull(data)), aggr(col, weights) ) } - tibble(.idx = unique(tib$.idx), climate_pred = out[seq_len(nrow(tib))]) + # filter to only the ones we actually computed + out %>% filter(.idx %in% idx_in) +} + +#' generate the idx values within `window_size` of `target_idx` given that our +#' time value is of the type matching modulus +#' @param target_idx the time index which we're drawing the window around +#' @param window_size the size of the window on one side of `target_idx` +#' @param modulus the number of days/weeks/months in the year, not including any leap days/weeks +#' @keywords internal +within_window <- function(target_idx, window_size, modulus) { + entries <- (target_idx - window_size):(target_idx + window_size) %% modulus + entries[entries == 0] <- modulus + # note that because we are 1-indexing, we're looking for indices that are 1 + # larger than 
the actual day/week in the year
+  if (modulus == 365) {
+    # we need to grab just the window around the leap day on the leap day
+    if (target_idx == 999) {
+      # there's an extra data point in front of the leap day
+      entries <- (59 - window_size):(59 + window_size - 1) %% modulus
+      entries[entries == 0] <- modulus
+      # adding in the leap day itself
+      entries <- c(entries, 999)
+    } else if ((59 %in% entries) || (60 %in% entries)) {
+      # if we're on the Feb/March boundary for daily data, we need to add in the
+      # leap day data
+      entries <- c(entries, 999)
+    }
+  } else if (modulus == 52) {
+    # we need to grab just the window around the leap week on the leap week
+    if (target_idx == 999) {
+      entries <- (53 - window_size):(53 + window_size - 1) %% 52
+      entries[entries == 0] <- 52
+      entries <- c(entries, 999)
+    } else if ((52 %in% entries) || (1 %in% entries)) {
+      # if we're on the year boundary for weekly data, we need to add in the
+      # leap week data (which is the extra week at the end)
+      entries <- c(entries, 999)
+    }
+  }
+  entries
+}
diff --git a/R/step_epi_naomit.R b/R/step_epi_naomit.R
index bfe8a4faa..0544bc5f9 100644
--- a/R/step_epi_naomit.R
+++ b/R/step_epi_naomit.R
@@ -2,10 +2,15 @@
 #'
 #' @param recipe Recipe to be used for omission steps
 #'
-#' @return Omits NA's from both predictors and outcomes at training time
-#' to fit the model. Also only omits associated predictors and not
-#' outcomes at prediction time due to lack of response and avoidance
-#' of data loss.
+#' @return Omits NA's from both predictors and outcomes at training time to fit
+#' the model. At prediction time, it omits only rows with NA's in the
+#' predictors (not the outcomes), since the response is not yet available and
+#' we want to avoid unnecessary data loss. Given a `recipe`, adding this step
+#' is equivalent to
+#' ```{r, eval=FALSE}
+#' recipe %>%
+#'   recipes::step_naomit(all_predictors(), skip = FALSE) %>%
+#'   recipes::step_naomit(all_outcomes(), skip = TRUE)
+#' ```
 #' @export
 #' @examples
 #' covid_case_death_rates %>%
diff --git a/R/step_epi_shift.R b/R/step_epi_shift.R
index ae4bd3f31..862c224e5 100644
--- a/R/step_epi_shift.R
+++ b/R/step_epi_shift.R
@@ -1,12 +1,11 @@
 #' Create a shifted predictor
 #'
 #' `step_epi_lag` and `step_epi_ahead` create a *specification* of a recipe step
-#' that will add new columns of shifted data. The former will created a lag
-#' column, while the latter will create a lead column. Shifted data will
-#' by default include NA values where the shift was induced.
-#' These can be properly removed with [step_epi_naomit()], or you may
-#' specify an alternative filler value with the `default`
-#' argument.
+#' that will add new columns of shifted data. `step_epi_lag` will create
+#' a lagged `predictor` column, while `step_epi_ahead` will create a leading
+#' `outcome` column. Shifted data will by default include NA values where the
+#' shift was induced. These can be properly removed with [step_epi_naomit()],
+#' or you may specify an alternative filler value with the `default` argument.
 #'
 #'
 #' @param recipe A recipe object. The step will be added to the
@@ -30,8 +29,14 @@
 #' @param id A unique identifier for the step
 #' @template step-return
 #'
-#' @details The step assumes that the data are already _in the proper sequential
-#' order_ for shifting.
+#' @details The step assumes that the data's `time_value` column is already _in
+#' the proper sequential order_ for shifting.
+#'
+#' Our `lag/ahead` functions respect the `geo_value` and `other_keys` of the
+#' `epi_df`, and allow for discontiguous `time_value`s.
Both of these features
+#' are noticeably lacking from `recipes::step_lag()`.
+#' Our `lag/ahead` functions also appropriately adjust the amount of data to
+#' avoid accidentally dropping recent predictors from the test data.
 #'
 #' The `prefix` and `id` arguments are unchangeable to ensure that the code runs
 #' properly and to avoid inconsistency with naming. For `step_epi_ahead`, they
diff --git a/R/step_epi_slide.R b/R/step_epi_slide.R
index 4cf0e7acf..564d525d9 100644
--- a/R/step_epi_slide.R
+++ b/R/step_epi_slide.R
@@ -1,8 +1,9 @@
 #' Calculate a rolling window transformation
 #'
-#' `step_epi_slide()` creates a *specification* of a recipe step
-#' that will generate one or more new columns of derived data by "sliding"
-#' a computation along existing data.
+#' `step_epi_slide()` creates a *specification* of a recipe step that will
+#' generate one or more new columns of derived data by "sliding" a computation
+#' along existing data. This is a wrapper around `epiprocess::epi_slide()`
+#' to allow its use within an `epi_recipe()`.
 #'
 #' @inheritParams step_epi_lag
 #' @param .f A function in one of the following formats:
diff --git a/R/step_growth_rate.R b/R/step_growth_rate.R
index 5497c2957..80b5bf682 100644
--- a/R/step_growth_rate.R
+++ b/R/step_growth_rate.R
@@ -1,7 +1,8 @@
 #' Calculate a growth rate
 #'
-#' `step_growth_rate()` creates a *specification* of a recipe step
-#' that will generate one or more new columns of derived data.
+#' `step_growth_rate()` creates a *specification* of a recipe step that will
+#' generate one or more new columns of derived data. This is a wrapper around
+#' `epiprocess::growth_rate()` to allow its use within an `epi_recipe()`.
 #'
 #'
 #' @inheritParams step_epi_lag
diff --git a/R/step_lag_difference.R b/R/step_lag_difference.R
index 2b0af00f2..1c38b0659 100644
--- a/R/step_lag_difference.R
+++ b/R/step_lag_difference.R
@@ -1,7 +1,14 @@
 #' Calculate a lagged difference
 #'
-#' `step_lag_difference()` creates a *specification* of a recipe step
-#' that will generate one or more new columns of derived data.
+#' `step_lag_difference()` creates a *specification* of a recipe step that will
+#' generate one or more new columns of derived data. For each column in the
+#' specification, `step_lag_difference()` will calculate the difference
+#' between the values at a distance of `horizon`. For example, with
+#' `horizon = 1`, this would simply be the difference between adjacent days.
+#'
+#' Much like `step_epi_lag()`, this step works with the actual time values (so
+#' if there are gaps it will fill with `NA` values), and respects the grouping
+#' inherent in the `epi_df` as specified by `geo_value` and `other_keys`.
 #'
 #'
 #' @inheritParams step_epi_lag
diff --git a/R/step_population_scaling.R b/R/step_population_scaling.R
index bb60d039a..cd3a54f83 100644
--- a/R/step_population_scaling.R
+++ b/R/step_population_scaling.R
@@ -1,20 +1,22 @@
 #' Convert raw scale predictions to per-capita
 #'
-#' `step_population_scaling` creates a specification of a recipe step
-#' that will perform per-capita scaling. Typical usage would
-#' load a dataset that contains state-level population, and use it to convert
-#' predictions made from a raw scale model to rate-scale by dividing by
-#' the population.
-#' Although, it is worth noting that there is nothing special about "population".
-#' The function can be used to scale by any variable. Population is the
-#' standard use case in the epidemiology forecasting scenario.
Any value
-#' passed will *divide* the selected variables while the `rate_rescaling`
-#' argument is a common *multiplier* of the selected variables.
+#' `step_population_scaling()` creates a specification of a recipe step that
+#' will perform per-capita scaling. Typical usage would set `df` to be a dataset
+#' that contains state-level population, and use it to convert predictions made
+#' from a raw scale model to rate-scale by dividing by the population. It is
+#' worth noting, though, that there is nothing special about "population": the
+#' function can be used to scale by any variable. Population is simply the
+#' standard use case in the epidemiology forecasting scenario.
+#' Any value passed will *divide* the selected variables while the
+#' `rate_rescaling` argument is a common *multiplier* of the selected variables.
 #'
 #' @inheritParams step_epi_lag
-#' @param df a data frame that contains the population data to be used for
-#' inverting the existing scaling.
-#' @param by A (possibly named) character vector of variables to join by.
+#' @param role For model terms created by this step, what analysis role should
+#' they be assigned?
+#' @param df a data frame containing the scaling data (such as population). The
+#' target column is divided by the value in `df_pop_col`.
+#' @param by A (possibly named) character vector of variables to join `df` onto
+#' the `epi_df` by.
 #'
 #' If `NULL`, the default, the function will try to infer a reasonable set of
 #' columns. First, it will try to join by all variables in the training/test
@@ -41,7 +43,7 @@
 #' @param rate_rescaling Sometimes raw scales are "per 100K" or "per 1M".
 #' Adjustments can be made here. For example, if the original
 #' scale is "per 100K", then set `rate_rescaling = 1e5` to get rates.
-#' @param create_new TRUE to create a new column and keep the original column
+#' @param create_new `TRUE` to create a new column and keep the original column
 #' in the `epi_df`
 #' @param suffix a character. The suffix added to the column name if
 #' `create_new = TRUE`. Default to "_scaled".
diff --git a/R/step_training_window.R b/R/step_training_window.R
index eafc076c7..51361b4f9 100644
--- a/R/step_training_window.R
+++ b/R/step_training_window.R
@@ -14,8 +14,12 @@
 #' @inheritParams step_epi_lag
 #' @template step-return
 #'
-#' @details Note that `step_epi_lead()` and `step_epi_lag()` should come
-#' after any filtering step.
+#' @details It is recommended to do this after any `step_epi_ahead()`,
+#' `step_epi_lag()`, or `step_epi_naomit()` steps. If `step_training_window()`
+#' happens first, there will be fewer than `n_training` examples remaining,
+#' since either leading or lagging will introduce `NA`'s that are later removed
+#' by `step_epi_naomit()`. Typical usage will have this function applied after
+#' every other step.
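+#'
+#' For example, a minimal sketch of the recommended ordering (a hypothetical
+#' recipe using the built-in `covid_case_death_rates`; shifting and NA-removal
+#' first, the window restriction last; not run):
+#' ```{r, eval=FALSE}
+#' epi_recipe(covid_case_death_rates) %>%
+#'   step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
+#'   step_epi_ahead(death_rate, ahead = 7) %>%
+#'   step_epi_naomit() %>%
+#'   step_training_window(n_recent = 365)
+#' ```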
#' #' @export #' diff --git a/README.Rmd b/README.Rmd index 73cedbeaa..1adc09bc6 100644 --- a/README.Rmd +++ b/README.Rmd @@ -7,24 +7,94 @@ output: github_document ```{r, include = FALSE} options(width = 76) knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>", fig.path = "man/figures/README-", - out.width = "100%" + digits = 3, + comment = "#>", + collapse = TRUE, + cache = TRUE, + dev.args = list(bg = "transparent"), + dpi = 300, + cache.lazy = FALSE, + out.width = "90%", + fig.align = "center", + fig.width = 9, + fig.height = 6 +) +ggplot2::theme_set(ggplot2::theme_bw()) +options( + dplyr.print_min = 6, + dplyr.print_max = 6, + pillar.max_footer_lines = 2, + pillar.min_chars = 15, + stringr.view_n = 6, + pillar.bold = TRUE, + width = 77 +) +``` +```{r pkgs, include=FALSE, echo=FALSE} +library(epipredict) +library(epidatr) +library(data.table) +library(dplyr) +library(tidyr) +library(ggplot2) +library(magrittr) +library(purrr) +library(scales) +``` + +```{r coloration, include=FALSE, echo=FALSE} +base <- "#002676" +primary <- "#941120" +secondary <- "#f9c80e" +tertiary <- "#177245" +fourth_colour <- "#A393BF" +fifth_colour <- "#2e8edd" +colvec <- c( + base = base, primary = primary, secondary = secondary, + tertiary = tertiary, fourth_colour = fourth_colour, + fifth_colour = fifth_colour ) +library(epiprocess) +suppressMessages(library(tidyverse)) +theme_update(legend.position = "bottom", legend.title = element_blank()) +delphi_pal <- function(n) { + if (n > 6L) warning("Not enough colors in this palette!") + unname(colvec)[1:n] +} +scale_fill_delphi <- function(..., aesthetics = "fill") { + discrete_scale(aesthetics = aesthetics, palette = delphi_pal, ...) +} +scale_color_delphi <- function(..., aesthetics = "color") { + discrete_scale(aesthetics = aesthetics, palette = delphi_pal, ...) +} +scale_colour_delphi <- scale_color_delphi ``` -# epipredict +# Epipredict [![R-CMD-check](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml) -**Note:** This package is currently in development and may not work as expected. Please file bug reports as issues in this repo, and we will do our best to address them quickly. +[`{epipredict}`](https://cmu-delphi.github.io/epipredict/) is a framework for building transformation and forecasting pipelines for epidemiological and other panel time-series datasets. +In addition to tools for building forecasting pipelines, it contains a number of “canned” forecasters meant to run with little modification as an easy way to get started forecasting. + +It is designed to work well with +[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/), a utility for time series handling and geographic processing in an epidemiological context. +Both of the packages are meant to work well with the panel data provided by +[`{epidatr}`](https://cmu-delphi.github.io/epidatr/). +Pre-compiled example datasets are also available in +[`{epidatasets}`](https://cmu-delphi.github.io/epidatasets/). + +If you are looking for detail beyond the package documentation, see our +[forecasting book](https://cmu-delphi.github.io/delphi-tooling-book/). + ## Installation -To install (unless you're making changes to the package, use the stable version): +Unless you’re planning on contributing to package development, we suggest using the stable version. 
+To install, run: ```r # Stable version @@ -34,93 +104,255 @@ pak::pkg_install("cmu-delphi/epipredict@main") pak::pkg_install("cmu-delphi/epipredict@dev") ``` -## Documentation +The documentation for the stable version is at +, while the development version is at +. + + +## Motivating example -You can view documentation for the `main` branch at . -## Goals for `epipredict` +
+ Required packages -**We hope to provide:** +```{r install, run = FALSE} +library(epipredict) +library(epidatr) +library(epiprocess) +library(dplyr) +library(ggplot2) +``` +
-1. A set of basic, easy-to-use forecasters that work out of the box. You should be able to do a reasonably limited amount of customization on them. For the basic forecasters, we currently provide: - * Baseline flatline forecaster - * Autoregressive forecaster - * Autoregressive classifier - * CDC FluSight flatline forecaster -2. A framework for creating custom forecasters out of modular components. There are four types of components: - * Preprocessor: do things to the data before model training - * Trainer: train a model on data, resulting in a fitted model object - * Predictor: make predictions, using a fitted model object - * Postprocessor: do things to the predictions before returning +To demonstrate using [`{epipredict}`](https://cmu-delphi.github.io/epipredict/) for forecasting, say we want to +predict COVID-19 deaths per 100k people for each of a subset of states -**Target audiences:** +```{r subset_geos} +used_locations <- c("ca", "ma", "ny", "tx") +``` -* Basic. Has data, calls forecaster with default arguments. -* Intermediate. Wants to examine changes to the arguments, take advantage of -built in flexibility. -* Advanced. Wants to write their own forecasters. Maybe willing to build up -from some components. +on -The Advanced user should find their task to be relatively easy. Examples of -these tasks are illustrated in the [vignettes and articles](https://cmu-delphi.github.io/epipredict). +```{r fc_date} +forecast_date <- as.Date("2021-08-01") +``` -See also the (in progress) [Forecasting Book](https://cmu-delphi.github.io/delphi-tooling-book/). +We will be using a subset of +[Johns Hopkins Center for Systems Science and Engineering deaths data](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html). +Below the fold, we pull the dataset from the epidata API and clean it. -## Intermediate example +
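+One practical note before the download: the Epidata API may rate-limit
+anonymous usage, so if the `pub_covidcast()` calls below complain, you may need
+a (free) API key. A sketch of providing one for the session (assuming the
+`DELPHI_EPIDATA_KEY` environment variable read by recent `{epidatr}` releases):
+
+```{r api-key, eval=FALSE}
+# make the key visible to {epidatr} for this session only;
+# see the {epidatr} documentation for persistent options
+Sys.setenv(DELPHI_EPIDATA_KEY = "<your key here>")
+```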
+ Creating the dataset using `{epidatr}` and `{epiprocess}` -The package comes with some built-in historical data for illustration, but -up-to-date versions of this could be downloaded with the -[`{epidatr}` package](https://cmu-delphi.github.io/epidatr/) -and processed using -[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/).[^1] +This section is intended to demonstrate some of the ubiquitous cleaning operations needed to be able to forecast. +A subset of the dataset prepared here is also included ready-to-go in [`{epipredict}`](https://cmu-delphi.github.io/epipredict/) as `covid_case_death_rates`. -[^1]: Other epidemiological signals for non-Covid related illnesses are also -available with [`{epidatr}`](https://github.com/cmu-delphi/epidatr) which -interfaces directly to Delphi's -[Epidata API](https://cmu-delphi.github.io/delphi-epidata/) +First we pull both `jhu-csse` cases and deaths data from the +[Delphi API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html) using the +[`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package: -```{r epidf, message=FALSE} -library(epipredict) -covid_case_death_rates +```{r case_death, warning = FALSE, eval = TRUE} +cases <- pub_covidcast( + source = "jhu-csse", + signals = "confirmed_7dav_incidence_prop", + time_type = "day", + geo_type = "state", + time_values = epirange(20200601, 20211231), + geo_values = "*" +) |> + select(geo_value, time_value, case_rate = value) + +deaths <- pub_covidcast( + source = "jhu-csse", + signals = "deaths_7dav_incidence_prop", + time_type = "day", + geo_type = "state", + time_values = epirange(20200601, 20211231), + geo_values = "*" +) |> + select(geo_value, time_value, death_rate = value) +cases_deaths <- + full_join(cases, deaths, by = c("time_value", "geo_value")) |> + filter(geo_value %in% used_locations) |> + as_epi_df(as_of = as.Date("2022-01-01")) +``` + +Since visualizing the results on every geography is somewhat overwhelming, +we’ll only train on a subset of locations. + +```{r date, warning = FALSE} +# plotting the data as it was downloaded +cases_deaths |> + autoplot( + case_rate, + death_rate, + .color_by = "none" + ) + + facet_grid( + rows = vars(.response_name), + cols = vars(geo_value), + scale = "free" + ) + + scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +``` + +As with the typical dataset, we will need to do some cleaning to +make it actually usable; we’ll use some utilities from +[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this. +Specifically we'll trim outliers, especially negative values: + +```{r outlier} +cases_deaths <- + cases_deaths |> + group_by(geo_value) |> + mutate( + outlr_death_rate = detect_outlr_rm( + time_value, death_rate, + detect_negatives = TRUE + ), + outlr_case_rate = detect_outlr_rm( + time_value, case_rate, + detect_negatives = TRUE + ) + ) |> + unnest(cols = starts_with("outlr"), names_sep = "_") |> + ungroup() |> + mutate( + death_rate = outlr_death_rate_replacement, + case_rate = outlr_case_rate_replacement + ) |> + select(geo_value, time_value, case_rate, death_rate) ``` +
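+As a quick sanity check on the cleaning (a sketch; assumes the chunk above has
+run), neither rate column should contain negative values any more:
+
+```{r negative-check, eval=FALSE}
+cases_deaths |>
+  summarise(
+    min_case_rate = min(case_rate, na.rm = TRUE),
+    min_death_rate = min(death_rate, na.rm = TRUE)
+  )
+```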
+ +After downloading and cleaning the cases and deaths data, we can plot +a subset of the states, marking the desired forecast date: -To create and train a simple auto-regressive forecaster to predict the death rate two weeks into the future using past (lagged) deaths and cases, we could use the following function. +
+ Plot + +```{r plot_locs} +forecast_date_label <- + tibble( + geo_value = rep(used_locations, 2), + .response_name = c(rep("case_rate", 4), rep("death_rate", 4)), + dates = rep(forecast_date - 7 * 2, 2 * length(used_locations)), + heights = c(rep(150, 4), rep(0.75, 4)) + ) +processed_data_plot <- + cases_deaths |> + filter(geo_value %in% used_locations) |> + autoplot( + case_rate, + death_rate, + .color_by = "none" + ) + + facet_grid( + rows = vars(.response_name), + cols = vars(geo_value), + scale = "free" + ) + + geom_vline(aes(xintercept = forecast_date)) + + geom_text( + data = forecast_date_label, + aes(x = dates, label = "forecast\ndate", y = heights), + size = 3, hjust = "right" + ) + + scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +``` +
+```{r show-processed-data, warning=FALSE, echo=FALSE}
+processed_data_plot
+```
+
+To make a forecast, we will use a simple “canned” auto-regressive forecaster to
+predict the death rate four weeks into the future using lagged[^3] deaths and
+cases.
+
+[^3]: "lagged by 3" in this context means using the value from 3 days ago.
 
 ```{r make-forecasts, warning=FALSE}
-two_week_ahead <- arx_forecaster(
-  covid_case_death_rates,
+four_week_ahead <- arx_forecaster(
+  cases_deaths |> filter(time_value <= forecast_date),
   outcome = "death_rate",
   predictors = c("case_rate", "death_rate"),
   args_list = arx_args_list(
     lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
-    ahead = 14
+    ahead = 4 * 7,
+    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
   )
 )
-two_week_ahead
+four_week_ahead
 ```
 
-In this case, we have used a number of different lags for the case rate, while
-only using 3 weekly lags for the death rate (as predictors). The result is both
-a fitted model object which could be used any time in the future to create
-different forecasts, as well as a set of predicted values (and prediction
-intervals) for each location 14 days after the last available time value in the
-data.
+In this model setup, the predictors are the case rate lagged by 0-3 days, one
+week, and two weeks, and the death rate lagged by 0, 1, and 2 weeks.
+The result `four_week_ahead` is both a fitted model object, which could be used
+any time in the future to create different forecasts, and a set of predicted
+values (and prediction intervals) for each location 28 days after the forecast
+date.
+
+Plotting the prediction intervals on the true values for our location subset[^2]:
 
-```{r print-model}
-two_week_ahead$epi_workflow
+[^2]: Alternatively, you could call `autoplot(four_week_ahead, plot_data =
+  cases_deaths)` to get the full collection of forecasts. This is too busy for
+  the space we have for plotting here.
+
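+Before looking at the plot, it can also be useful to inspect the model that was
+actually fit; a minimal sketch (assumes the `make-forecasts` chunk above has
+run):
+
+```{r inspect-fit, eval=FALSE}
+# the fitted parsnip model inside the trained epi_workflow
+four_week_ahead$epi_workflow |>
+  workflows::extract_fit_parsnip()
+```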
+ Plot +```{r plotting_forecast, warning=FALSE} +epiworkflow <- four_week_ahead$epi_workflow +restricted_predictions <- + four_week_ahead$predictions |> + rename(time_value = target_date, value = .pred) |> + mutate(.response_name = "death_rate") +forecast_plot <- + four_week_ahead |> + autoplot(plot_data = cases_deaths) + + geom_vline(aes(xintercept = forecast_date)) + + geom_text( + data = forecast_date_label %>% filter(.response_name == "death_rate"), + aes(x = dates, label = "forecast\ndate", y = heights), + size = 3, hjust = "right" + ) + + scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) ``` +
-The fitted model here involved preprocessing the data to appropriately generate -lagged predictors, estimating a linear model with `stats::lm()` and then -postprocessing the results to be meaningful for epidemiological tasks. We can -also examine the predictions. +```{r show-single-forecast, warning=FALSE, echo=FALSE} +forecast_plot +``` -```{r show-preds} -two_week_ahead$predictions +And as a tibble of quantile level-value pairs: +```{r pivot_wider} +four_week_ahead$predictions |> + select(-.pred) |> + pivot_quantiles_longer(.pred_distn) |> + select(geo_value, forecast_date, target_date, quantile = .pred_distn_quantile_level, value = .pred_distn_value) ``` -The results above show a distributional forecast produced using data through -the end of 2021 for the 14th of January 2022. A prediction for the death rate -per 100K inhabitants is available for every state (`geo_value`) along with a -90% predictive interval. +The yellow dot gives the median prediction, while the blue intervals give the +25-75%, the 10-90%, and 2.5-97.5% inter-quantile ranges[^4]. +For this particular day and these locations, the forecasts are relatively +accurate, with the true data being at least within the 10-90% interval. +A couple of things to note: + +1. `epipredict` methods are primarily direct forecasters; this means we don't need to + predict 1, 2,..., 27 days ahead to then predict 28 days ahead. +2. All of our existing engines are geo-pooled, meaning the training data is + shared across geographies. This has the advantage of increasing the amount of + available training data, with the restriction that the data needs to be on + comparable scales, such as rates. + +## Getting Help +If you encounter a bug or have a feature request, feel free to file an [issue on +our GitHub page](https://github.com/cmu-delphi/epipredict/issues). +For other questions, feel free to reach out to the authors, either via this +[contact form](https://docs.google.com/forms/d/e/1FAIpQLScqgT1fKZr5VWBfsaSp-DNaN03aV6EoZU4YljIzHJ1Wl_zmtg/viewform), +email, or the InsightNet Slack. +[^4]: Note that these are not the same quantiles that we fit when creating + `four_week_ahead`. They are extrapolated from those quantiles using `extrapolate_quantiles()` (which assumes an exponential decay in the tails). diff --git a/README.md b/README.md index 1f24bab2b..24093478a 100644 --- a/README.md +++ b/README.md @@ -1,21 +1,35 @@ -# epipredict +# Epipredict [![R-CMD-check](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml) -**Note:** This package is currently in development and may not work as -expected. Please file bug reports as issues in this repo, and we will do -our best to address them quickly. +[`{epipredict}`](https://cmu-delphi.github.io/epipredict/) is a +framework for building transformation and forecasting pipelines for +epidemiological and other panel time-series datasets. In addition to +tools for building forecasting pipelines, it contains a number of +“canned” forecasters meant to run with little modification as an easy +way to get started forecasting. + +It is designed to work well with +[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/), a utility +for time series handling and geographic processing in an epidemiological +context. Both of the packages are meant to work well with the panel data +provided by [`{epidatr}`](https://cmu-delphi.github.io/epidatr/). 
+Pre-compiled example datasets are also available in +[`{epidatasets}`](https://cmu-delphi.github.io/epidatasets/). + +If you are looking for detail beyond the package documentation, see our +[forecasting book](https://cmu-delphi.github.io/delphi-tooling-book/). ## Installation -To install (unless you’re making changes to the package, use the stable -version): +Unless you’re planning on contributing to package development, we +suggest using the stable version. To install, run: ``` r # Stable version @@ -25,190 +39,309 @@ pak::pkg_install("cmu-delphi/epipredict@main") pak::pkg_install("cmu-delphi/epipredict@dev") ``` -## Documentation +The documentation for the stable version is at +, while the development version +is at . -You can view documentation for the `main` branch at -. +## Motivating example -## Goals for `epipredict` +
+ +Required packages + -**We hope to provide:** +``` r +library(epipredict) +library(epidatr) +library(epiprocess) +library(dplyr) +library(ggplot2) +``` -1. A set of basic, easy-to-use forecasters that work out of the box. - You should be able to do a reasonably limited amount of - customization on them. For the basic forecasters, we currently - provide: - - Baseline flatline forecaster - - Autoregressive forecaster - - Autoregressive classifier - - CDC FluSight flatline forecaster -2. A framework for creating custom forecasters out of modular - components. There are four types of components: - - Preprocessor: do things to the data before model training - - Trainer: train a model on data, resulting in a fitted model object - - Predictor: make predictions, using a fitted model object - - Postprocessor: do things to the predictions before returning +
-**Target audiences:** +To demonstrate using +[`{epipredict}`](https://cmu-delphi.github.io/epipredict/) for +forecasting, say we want to predict COVID-19 deaths per 100k people for +each of a subset of states -- Basic. Has data, calls forecaster with default arguments. -- Intermediate. Wants to examine changes to the arguments, take - advantage of built in flexibility. -- Advanced. Wants to write their own forecasters. Maybe willing to build - up from some components. +``` r +used_locations <- c("ca", "ma", "ny", "tx") +``` -The Advanced user should find their task to be relatively easy. Examples -of these tasks are illustrated in the [vignettes and -articles](https://cmu-delphi.github.io/epipredict). +on -See also the (in progress) [Forecasting -Book](https://cmu-delphi.github.io/delphi-tooling-book/). +``` r +forecast_date <- as.Date("2021-08-01") +``` + +We will be using a subset of [Johns Hopkins Center for Systems Science +and Engineering deaths +data](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html). +Below the fold, we pull the dataset from the epidata API and clean it. -## Intermediate example +
+ +Creating the dataset using `{epidatr}` and `{epiprocess}` + -The package comes with some built-in historical data for illustration, -but up-to-date versions of this could be downloaded with the -[`{epidatr}` package](https://cmu-delphi.github.io/epidatr/) and -processed using -[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/).[^1] +This section is intended to demonstrate some of the ubiquitous cleaning +operations needed to be able to forecast. A subset of the dataset +prepared here is also included ready-to-go in +[`{epipredict}`](https://cmu-delphi.github.io/epipredict/) as +`covid_case_death_rates`. + +First we pull both `jhu-csse` cases and deaths data from the [Delphi +API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html) +using the [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package: ``` r -library(epipredict) -covid_case_death_rates -#> An `epi_df` object, 20,496 x 4 with metadata: -#> * geo_type = state -#> * time_type = day -#> * as_of = 2023-03-10 -#> -#> # A tibble: 20,496 × 4 -#> geo_value time_value case_rate death_rate -#> * -#> 1 ak 2020-12-31 35.9 0.158 -#> 2 al 2020-12-31 65.1 0.438 -#> 3 ar 2020-12-31 66.0 1.27 -#> 4 as 2020-12-31 0 0 -#> 5 az 2020-12-31 76.8 1.10 -#> 6 ca 2020-12-31 95.9 0.755 -#> 7 co 2020-12-31 37.8 0.376 -#> 8 ct 2020-12-31 52.1 0.819 -#> 9 dc 2020-12-31 31.0 0.601 -#> 10 de 2020-12-31 64.3 0.912 -#> # ℹ 20,486 more rows +cases <- pub_covidcast( + source = "jhu-csse", + signals = "confirmed_7dav_incidence_prop", + time_type = "day", + geo_type = "state", + time_values = epirange(20200601, 20211231), + geo_values = "*" +) |> + select(geo_value, time_value, case_rate = value) + +deaths <- pub_covidcast( + source = "jhu-csse", + signals = "deaths_7dav_incidence_prop", + time_type = "day", + geo_type = "state", + time_values = epirange(20200601, 20211231), + geo_values = "*" +) |> + select(geo_value, time_value, death_rate = value) +cases_deaths <- + full_join(cases, deaths, by = c("time_value", "geo_value")) |> + filter(geo_value %in% used_locations) |> + as_epi_df(as_of = as.Date("2022-01-01")) +``` + +Since visualizing the results on every geography is somewhat +overwhelming, we’ll only train on a subset of locations. + +``` r +# plotting the data as it was downloaded +cases_deaths |> + autoplot( + case_rate, + death_rate, + .color_by = "none" + ) + + facet_grid( + rows = vars(.response_name), + cols = vars(geo_value), + scale = "free" + ) + + scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +``` + + + +As with the typical dataset, we will need to do some cleaning to make it +actually usable; we’ll use some utilities from +[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this. +Specifically we’ll trim outliers, especially negative values: + +``` r +cases_deaths <- + cases_deaths |> + group_by(geo_value) |> + mutate( + outlr_death_rate = detect_outlr_rm( + time_value, death_rate, + detect_negatives = TRUE + ), + outlr_case_rate = detect_outlr_rm( + time_value, case_rate, + detect_negatives = TRUE + ) + ) |> + unnest(cols = starts_with("outlr"), names_sep = "_") |> + ungroup() |> + mutate( + death_rate = outlr_death_rate_replacement, + case_rate = outlr_case_rate_replacement + ) |> + select(geo_value, time_value, case_rate, death_rate) +``` + +
+ +After downloading and cleaning the cases and deaths data, we can plot a +subset of the states, marking the desired forecast date: + +
+ +Plot + + +``` r +forecast_date_label <- + tibble( + geo_value = rep(used_locations, 2), + .response_name = c(rep("case_rate", 4), rep("death_rate", 4)), + dates = rep(forecast_date - 7 * 2, 2 * length(used_locations)), + heights = c(rep(150, 4), rep(0.75, 4)) + ) +processed_data_plot <- + cases_deaths |> + filter(geo_value %in% used_locations) |> + autoplot( + case_rate, + death_rate, + .color_by = "none" + ) + + facet_grid( + rows = vars(.response_name), + cols = vars(geo_value), + scale = "free" + ) + + geom_vline(aes(xintercept = forecast_date)) + + geom_text( + data = forecast_date_label, + aes(x = dates, label = "forecast\ndate", y = heights), + size = 3, hjust = "right" + ) + + scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) ``` -To create and train a simple auto-regressive forecaster to predict the -death rate two weeks into the future using past (lagged) deaths and -cases, we could use the following function. +
+
+
+
+To make a forecast, we will use a simple “canned” auto-regressive
+forecaster to predict the death rate four weeks into the future using
+lagged[^1] deaths and cases.
 
 ``` r
-two_week_ahead <- arx_forecaster(
-  covid_case_death_rates,
+four_week_ahead <- arx_forecaster(
+  cases_deaths |> filter(time_value <= forecast_date),
   outcome = "death_rate",
   predictors = c("case_rate", "death_rate"),
   args_list = arx_args_list(
     lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
-    ahead = 14
+    ahead = 4 * 7,
+    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
   )
 )
-two_week_ahead
-#> ══ A basic forecaster of type ARX Forecaster ═══════════════════════════════
+four_week_ahead
+#> ══ A basic forecaster of type ARX Forecaster ════════════════════════════════
 #>
-#> This forecaster was fit on 2025-02-11 12:32:56.
+#> This forecaster was fit on 2025-03-03 14:43:07.
 #>
 #> Training data was an <epi_df> with:
 #> • Geography: state,
 #> • Time type: day,
-#> • Using data up-to-date as of: 2023-03-10.
-#> • With the last data available on 2021-12-31
+#> • Using data up-to-date as of: 2022-01-01.
+#> • With the last data available on 2021-08-01
 #>
-#> ── Predictions ─────────────────────────────────────────────────────────────
+#> ── Predictions ──────────────────────────────────────────────────────────────
 #>
-#> A total of 56 predictions are available for
-#> • 56 unique geographic regions,
-#> • At forecast date: 2021-12-31,
-#> • For target date: 2022-01-14,
+#> A total of 4 predictions are available for
+#> • 4 unique geographic regions,
+#> • At forecast date: 2021-08-01,
+#> • For target date: 2021-08-29,
 #>
 ```
 
-In this case, we have used a number of different lags for the case rate,
-while only using 3 weekly lags for the death rate (as predictors). The
-result is both a fitted model object which could be used any time in the
-future to create different forecasts, as well as a set of predicted
-values (and prediction intervals) for each location 14 days after the
-last available time value in the data.
+In this model setup, the predictors are the case rate lagged by 0-3
+days, one week, and two weeks, and the death rate lagged by 0, 1, and 2
+weeks. The result `four_week_ahead` is both a fitted model object, which
+could be used any time in the future to create different forecasts, and
+a set of predicted values (and prediction intervals) for each location
+28 days after the forecast date.
+
+Plotting the prediction intervals on the true values for our location
+subset[^2]:
+
+ +Plot + ``` r -two_week_ahead$epi_workflow -#> -#> ══ Epi Workflow [trained] ══════════════════════════════════════════════════ -#> Preprocessor: Recipe -#> Model: linear_reg() -#> Postprocessor: Frosting -#> -#> ── Preprocessor ──────────────────────────────────────────────────────────── -#> -#> 6 Recipe steps. -#> 1. step_epi_lag() -#> 2. step_epi_lag() -#> 3. step_epi_ahead() -#> 4. step_naomit() -#> 5. step_naomit() -#> 6. step_training_window() -#> -#> ── Model ─────────────────────────────────────────────────────────────────── -#> -#> Call: -#> stats::lm(formula = ..y ~ ., data = data) -#> -#> Coefficients: -#> (Intercept) lag_0_case_rate lag_1_case_rate lag_2_case_rate -#> -0.0071026 0.0040340 0.0007863 0.0003699 -#> lag_3_case_rate lag_7_case_rate lag_14_case_rate lag_0_death_rate -#> 0.0012887 0.0011980 0.0002527 0.1348573 -#> lag_7_death_rate lag_14_death_rate -#> 0.1479274 0.1067074 -#> -#> ── Postprocessor ─────────────────────────────────────────────────────────── -#> -#> 5 Frosting layers. -#> 1. layer_predict() -#> 2. layer_residual_quantiles() -#> 3. layer_add_forecast_date() -#> 4. layer_add_target_date() -#> 5. layer_threshold() -#> +epiworkflow <- four_week_ahead$epi_workflow +restricted_predictions <- + four_week_ahead$predictions |> + rename(time_value = target_date, value = .pred) |> + mutate(.response_name = "death_rate") +forecast_plot <- + four_week_ahead |> + autoplot(plot_data = cases_deaths) + + geom_vline(aes(xintercept = forecast_date)) + + geom_text( + data = forecast_date_label %>% filter(.response_name == "death_rate"), + aes(x = dates, label = "forecast\ndate", y = heights), + size = 3, hjust = "right" + ) + + scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) ``` -The fitted model here involved preprocessing the data to appropriately -generate lagged predictors, estimating a linear model with `stats::lm()` -and then postprocessing the results to be meaningful for epidemiological -tasks. We can also examine the predictions. +
+
+
+
+And as a tibble of quantile level-value pairs:
 
 ``` r
-two_week_ahead$predictions
-#> # A tibble: 56 × 5
-#>    geo_value .pred .pred_distn        forecast_date target_date
-#>    <chr>     <dbl> <dist>             <date>        <date>
-#>  1 ak        0.450 quantiles(0.45)[7] 2021-12-31    2022-01-14
-#>  2 al        0.602 quantiles(0.6)[7]  2021-12-31    2022-01-14
-#>  3 ar        0.694 quantiles(0.69)[7] 2021-12-31    2022-01-14
-#>  4 as        0     quantiles(0)[7]    2021-12-31    2022-01-14
-#>  5 az        0.699 quantiles(0.7)[7]  2021-12-31    2022-01-14
-#>  6 ca        0.592 quantiles(0.59)[7] 2021-12-31    2022-01-14
-#>  7 co        1.47  quantiles(1.47)[7] 2021-12-31    2022-01-14
-#>  8 ct        1.08  quantiles(1.08)[7] 2021-12-31    2022-01-14
-#>  9 dc        2.14  quantiles(2.14)[7] 2021-12-31    2022-01-14
-#> 10 de        1.13  quantiles(1.13)[7] 2021-12-31    2022-01-14
-#> # ℹ 46 more rows
+four_week_ahead$predictions |>
+  select(-.pred) |>
+  pivot_quantiles_longer(.pred_distn) |>
+  select(geo_value, forecast_date, target_date, quantile = .pred_distn_quantile_level, value = .pred_distn_value)
+#> # A tibble: 20 × 5
+#>   geo_value forecast_date target_date quantile  value
+#>   <chr>     <date>        <date>         <dbl>  <dbl>
+#> 1 ca        2021-08-01    2021-08-29      0.1  0.198
+#> 2 ca        2021-08-01    2021-08-29      0.25 0.285
+#> 3 ca        2021-08-01    2021-08-29      0.5  0.345
+#> 4 ca        2021-08-01    2021-08-29      0.75 0.405
+#> 5 ca        2021-08-01    2021-08-29      0.9  0.491
+#> 6 ma        2021-08-01    2021-08-29      0.1  0.0277
+#> # ℹ 14 more rows
 ```
 
-The results above show a distributional forecast produced using data
-through the end of 2021 for the 14th of January 2022. A prediction for
-the death rate per 100K inhabitants is available for every state
-(`geo_value`) along with a 90% predictive interval.
+The yellow dot gives the median prediction, while the blue intervals
+give the 25-75%, the 10-90%, and 2.5-97.5% inter-quantile ranges[^3].
+For this particular day and these locations, the forecasts are
+relatively accurate, with the true data being at least within the 10-90%
+interval. A couple of things to note:
+
+1. `epipredict` methods are primarily direct forecasters; this means we
+   don’t need to predict 1, 2,…, 27 days ahead to then predict 28 days
+   ahead.
+2. All of our existing engines are geo-pooled, meaning the training
+   data is shared across geographies. This has the advantage of
+   increasing the amount of available training data, with the
+   restriction that the data needs to be on comparable scales, such as
+   rates.
+
+## Getting Help
+
+If you encounter a bug or have a feature request, feel free to file an
+[issue on our GitHub
+page](https://github.com/cmu-delphi/epipredict/issues). For other
+questions, feel free to reach out to the authors, either via this
+[contact
+form](https://docs.google.com/forms/d/e/1FAIpQLScqgT1fKZr5VWBfsaSp-DNaN03aV6EoZU4YljIzHJ1Wl_zmtg/viewform),
+email, or the InsightNet Slack.
+
+[^1]: "lagged by 3" in this context means using the value from 3 days
+    ago.
+
+[^2]: Alternatively, you could call
+    `autoplot(four_week_ahead, plot_data = cases_deaths)` to get the
+    full collection of forecasts. This is too busy for the space we have
+    for plotting here.
+
+[^3]: Note that these are not the same quantiles that we fit when
+    creating `four_week_ahead`. They are extrapolated from those
+    quantiles using `extrapolate_quantiles()` (which assumes an
+    exponential decay in the tails).
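+The last footnote mentions `extrapolate_quantiles()`; a minimal sketch of
+doing that extrapolation by hand (assumes `four_week_ahead` from the example
+above, and that the new quantile levels are passed via `probs`):
+
+``` r
+library(dplyr)
+# widen the quantile_pred column with two additional extreme levels
+four_week_ahead$predictions |>
+  mutate(.pred_distn = extrapolate_quantiles(.pred_distn, probs = c(0.025, 0.975)))
+```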
diff --git a/_pkgdown.yml b/_pkgdown.yml index 814bf6aa4..32da5cae7 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -4,21 +4,27 @@ development: mode: devel template: + light-switch: true package: delphidocs -articles: - - title: Get started - navbar: ~ - contents: - - epipredict - - preprocessing-and-models - - backtesting - - arx-classifier - - update - - title: Advanced methods - contents: - - articles/smooth-qr - - panel-data +navbar: + structure: + left: [intro, workflows, backtesting, reference, articles, news] + right: [search, github, lightswitch] + components: + workflows: + text: Epiworkflows + href: articles/custom_epiworkflows.html + backtesting: + text: Backtesting + href: articles/backtesting.html + articles: + text: Articles + menu: + - text: Using the add/update/remove/adjust functions + href: articles/update.html + - text: Using epipredict on non-epidemic panel data + href: articles/panel-data.html home: links: @@ -42,62 +48,85 @@ reference: - contains("forecaster") - contains("classifier") - - title: Forecaster modifications + - subtitle: Forecaster modifications desc: Constructors to modify forecaster arguments and utilities to produce `epi_workflow` objects contents: - contains("args_list") - contains("_epi_workflow") - - title: Helper functions for Hub submission + ########################## + - title: Steps and Layers + + - subtitle: Epi recipe preprocessing steps + desc: > + Note that any `{recipes}` + [`step`](https://recipes.tidymodels.org/reference/index.html) is also valid contents: - - flusight_hub_formatter + - starts_with("step_") - - title: Parsnip engines - desc: Prediction methods not available elsewhere + - subtitle: Frosting post-processing layers contents: - - quantile_reg - - smooth_quantile_reg - - grf_quantiles + - starts_with("layer_") - - title: Custom panel data forecasting workflows + ########################## + - title: Epiworkflows + - subtitle: Basic forecasting workflow functions contents: - epi_recipe - epi_workflow - add_epi_recipe - - adjust_epi_recipe - - Add_model - - predict.epi_workflow - fit.epi_workflow - - augment.epi_workflow - - forecast.epi_workflow - - - title: Epi recipe preprocessing steps - contents: - - starts_with("step_") - - contains("bake") - - title: Epi recipe verification checks - contents: - - check_enough_data - - - title: Forecast postprocessing - desc: Create a series of postprocessing operations + - subtitle: Forecast post-processing workflow functions + desc: Create and apply series of post-processing operations contents: - frosting - ends_with("_frosting") - - get_test_data - tidy.frosting + - contains("slather") - - title: Frosting layers + - subtitle: Prediction + desc: Methods for prediction and modifying predictions contents: - - contains("layer") - - contains("slather") + - predict.epi_workflow + - augment.epi_workflow + - get_test_data + - forecast.epi_workflow + - subtitle: Modifying forecasting epiworkflows + desc: > + Modify or inspect an existing recipe, workflow, or frosting. 
See also [the + article on the topic](../articles/update.html) + contents: + - adjust_epi_recipe + - Add_model + - add_layer + - layer-processors + - update.layer + + ########################## - title: Automatic forecast visualization contents: - autoplot.epi_workflow - autoplot.canned_epipred - - title: Utilities for quantile distribution processing + ########################## + - title: Parsnip engines + desc: > + Prediction methods not available in the [general parsnip + repository](https://www.tidymodels.org/find/parsnip/) + contents: + - quantile_reg + - smooth_quantile_reg + - grf_quantiles + + ########################## + - title: Utilities + contents: + - flusight_hub_formatter + - clean_f_name + - check_enough_data + + - subtitle: Utilities for quantile distribution processing contents: - dist_quantiles - contains("quantile_pred") @@ -105,7 +134,3 @@ reference: - nested_quantiles - weighted_interval_score - starts_with("pivot_quantiles") - - - title: Other utilities - contents: - - clean_f_name diff --git a/inst/pkgdown-watch.R b/inst/pkgdown-watch.R new file mode 100644 index 000000000..bd23406a3 --- /dev/null +++ b/inst/pkgdown-watch.R @@ -0,0 +1,65 @@ +# Run with: Rscript pkgdown-watch.R +# +# Modifying this: https://gist.github.com/gadenbuie/d22e149e65591b91419e41ea5b2e0621 +# - Removed docopts cli interface and various configs/features I didn't need. +# - Sped up reference building by not running examples. +# +# Note that the `pattern` regex is case sensitive, so make sure your Rmd files +# end in `.Rmd` and not `.rmd`. +# +# Also I had issues with `pkgdown::build_reference()` not working, so I just run +# it manually when I need to. + +rlang::check_installed(c("pkgdown", "servr", "devtools", "here", "cli", "fs")) +library(pkgdown) +pkg <- pkgdown::as_pkgdown(here::here()) +devtools::document(here::here()) +devtools::build_readme() +pkgdown::build_articles(pkg) +pkgdown::build_site(pkg, lazy = FALSE, examples = FALSE, devel = TRUE, preview = FALSE) + +servr::httw( + dir = here::here("docs"), + watch = here::here(), + pattern = "[.](Rm?d|y?ml|s[ac]ss|css|js)$", + handler = function(files) { + devtools::load_all() + + files_rel <- fs::path_rel(files, start = getwd()) + cli::cli_inform("{cli::col_yellow('Updated')} {.val {files_rel}}") + + articles <- grep("vignettes.+Rmd$", files, value = TRUE) + + if (length(articles) == 1) { + name <- fs::path_ext_remove(fs::path_rel(articles, fs::path(pkg$src_path, "vignettes"))) + pkgdown::build_article(name, pkg) + } else if (length(articles) > 1) { + pkgdown::build_articles(pkg, preview = FALSE) + } + + refs <- grep("man.+R(m?d)?$", files, value = TRUE) + if (length(refs)) { + # Doesn't work for me, so I run it manually. 
+ # pkgdown::build_reference(pkg) # nolint: commented_code_linter + } + + pkgdown <- grep("pkgdown", files, value = TRUE) + if (length(pkgdown) && !pkgdown %in% c(articles, refs)) { + pkgdown::init_site(pkg) + } + + pkgdown_index <- grep("index[.]Rmd$", files_rel, value = TRUE) + if (length(pkgdown_index)) { + devtools::build_rmd(pkgdown_index) + pkgdown::build_home(pkg) + } + + readme <- grep("README[.]rmd$", files, value = TRUE, ignore.case = TRUE) + if (length(readme)) { + devtools::build_readme() + pkgdown::build_site(pkg, lazy = TRUE, examples = FALSE, devel = TRUE, preview = FALSE) + } + + cli::cli_alert("Site rebuild done!") + } +) diff --git a/man/add_epi_recipe.Rd b/man/add_epi_recipe.Rd index 0135cfd6f..f5cb32a73 100644 --- a/man/add_epi_recipe.Rd +++ b/man/add_epi_recipe.Rd @@ -4,7 +4,8 @@ \alias{add_epi_recipe} \alias{remove_epi_recipe} \alias{update_epi_recipe} -\title{Add an \code{epi_recipe} to a workflow} +\title{Given an \code{epi_recipe}, add it to, remove it from, or update it in an +\code{epi_workflow}} \usage{ add_epi_recipe(x, recipe, ..., blueprint = default_epi_recipe_blueprint()) @@ -30,12 +31,18 @@ might be done automatically by the underlying model.} \code{x}, updated with a new recipe preprocessor. } \description{ -Add an \code{epi_recipe} to a workflow +\itemize{ +\item \code{add_recipe()} specifies the terms of the model and any preprocessing that +is required through the usage of a recipe. +\item \code{remove_recipe()} removes the recipe as well as any downstream objects +\item \code{update_recipe()} first removes the recipe, then replaces the previous +recipe with the new one. +} } \details{ -\code{add_epi_recipe} has the same behaviour as -\code{\link[workflows:add_recipe]{workflows::add_recipe()}} but sets a different -default blueprint to automatically handle \link[epiprocess:epi_df]{epiprocess::epi_df} data. +\code{add_epi_recipe()} has the same behaviour as \code{\link[workflows:add_recipe]{workflows::add_recipe()}} but +sets a different default blueprint to automatically handle +\code{epiprocess::epi_df()} data. } \examples{ jhu <- covid_case_death_rates \%>\% @@ -65,11 +72,4 @@ workflow } \seealso{ \code{\link[workflows:add_recipe]{workflows::add_recipe()}} -\itemize{ -\item \code{add_recipe()} specifies the terms of the model and any preprocessing that -is required through the usage of a recipe. -\item \code{remove_recipe()} removes the recipe as well as any downstream objects -\item \code{update_recipe()} first removes the recipe, then replaces the previous -recipe with the new one. -} } diff --git a/man/add_frosting.Rd b/man/add_frosting.Rd index 825747487..6a38b359e 100644 --- a/man/add_frosting.Rd +++ b/man/add_frosting.Rd @@ -4,7 +4,8 @@ \alias{add_frosting} \alias{remove_frosting} \alias{update_frosting} -\title{Add frosting to a workflow} +\title{Given a \code{frosting()}, add it to, remove it from, or update it in an +\code{epi_workflow}} \usage{ add_frosting(x, frosting, ...) @@ -23,7 +24,8 @@ update_frosting(x, frosting, ...) 
\code{x}, updated with a new frosting postprocessor
 }
 \description{
-Add frosting to a workflow
+Given a \code{frosting()}, add it to, remove it from, or update it in an
+\code{epi_workflow}
 }
 \examples{
 jhu <- covid_case_death_rates \%>\%
diff --git a/man/apply_frosting.Rd b/man/apply_frosting.Rd
index ef18796cc..ece3261e8 100644
--- a/man/apply_frosting.Rd
+++ b/man/apply_frosting.Rd
@@ -5,7 +5,7 @@
 \alias{apply_frosting.default}
 \alias{apply_frosting.epi_recipe}
 \alias{apply_frosting.epi_workflow}
-\title{Apply postprocessing to a fitted workflow}
+\title{Apply post-processing to a fitted workflow}
 \usage{
 apply_frosting(workflow, ...)
 
@@ -39,6 +39,6 @@ and predict on}
 \code{\link[=slather]{slather()}} for supported layers}
 }
 \description{
-This function is intended for internal use. It implements postprocessing
+This function is intended for internal use. It implements post-processing
 inside of the \code{predict()} method for a fitted workflow.
 }
diff --git a/man/arx_args_list.Rd b/man/arx_args_list.Rd
index 650c4a614..ca00bcc5c 100644
--- a/man/arx_args_list.Rd
+++ b/man/arx_args_list.Rd
@@ -66,10 +66,13 @@ latency is large. If this is \code{FALSE}, that warning is turned off.}
 
 prediction intervals. These are created by computing the quantiles of
 training residuals. A \code{NULL} value will result in point forecasts only.}
 
-\item{symmetrize}{Logical. The default \code{TRUE} calculates
-symmetric prediction intervals. This argument only applies when
-residual quantiles are used. It is not applicable with
-\code{trainer = quantile_reg()}, for example.}
+\item{symmetrize}{Logical. The default \code{TRUE} calculates symmetric prediction
+intervals. This argument only applies when residual quantiles are used. It
+is not applicable with \code{trainer = quantile_reg()}, for example. This is
+achieved by including both the residuals and their negation. Typically, one
+would only want non-symmetric quantiles when increasing trajectories are
+quite different from decreasing ones, such as a strictly positive variable
+near zero.}
 
 \item{nonneg}{Logical. The default \code{TRUE} enforces nonnegative predictions
 by hard-thresholding at 0.}
diff --git a/man/arx_class_args_list.Rd b/man/arx_class_args_list.Rd
index 40bb48ca9..7359c8764 100644
--- a/man/arx_class_args_list.Rd
+++ b/man/arx_class_args_list.Rd
@@ -67,7 +67,7 @@ be created using growth rates (as the predictors are) or lagged
 differences. The second case is closer to the requirements for the
 \href{https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md}{2022-23 CDC Flusight Hospitalization Experimental Target}.
-See the Classification Vignette for details of how to create a reasonable
+See the \href{https://cmu-delphi.github.io/delphi-tooling-book/arx-classifier.html}{Classification chapter of the forecasting book} for details of how to create a reasonable
 baseline for this case. Selecting \code{"growth_rate"} (the default) uses
 \code{\link[epiprocess:growth_rate]{epiprocess::growth_rate()}} to create the outcome using some of the additional
 arguments below.
Choosing \code{"lag_difference"} instead simply diff --git a/man/arx_class_epi_workflow.Rd b/man/arx_class_epi_workflow.Rd index 9b048463c..be4ecedef 100644 --- a/man/arx_class_epi_workflow.Rd +++ b/man/arx_class_epi_workflow.Rd @@ -23,10 +23,10 @@ If discrete classes are already in the \code{epi_df}, it is recommended to code up a classifier from scratch using \code{\link[=epi_recipe]{epi_recipe()}}.} \item{predictors}{A character vector giving column(s) of predictor variables. -This defaults to the \code{outcome}. However, if manually specified, only those variables -specifically mentioned will be used. (The \code{outcome} will not be added.) -By default, equals the outcome. If manually specified, does not add the -outcome variable, so make sure to specify it.} +This defaults to the \code{outcome}. However, if manually specified, only those +variables specifically mentioned will be used. (The \code{outcome} will not be +added.) By default, equals the outcome. If manually specified, does not +add the outcome variable, so make sure to specify it.} \item{trainer}{A \code{{parsnip}} model describing the type of estimation. For now, we enforce \code{mode = "classification"}. Typical values are diff --git a/man/arx_classifier.Rd b/man/arx_classifier.Rd index d78700df3..61cb8db8c 100644 --- a/man/arx_classifier.Rd +++ b/man/arx_classifier.Rd @@ -23,10 +23,10 @@ If discrete classes are already in the \code{epi_df}, it is recommended to code up a classifier from scratch using \code{\link[=epi_recipe]{epi_recipe()}}.} \item{predictors}{A character vector giving column(s) of predictor variables. -This defaults to the \code{outcome}. However, if manually specified, only those variables -specifically mentioned will be used. (The \code{outcome} will not be added.) -By default, equals the outcome. If manually specified, does not add the -outcome variable, so make sure to specify it.} +This defaults to the \code{outcome}. However, if manually specified, only those +variables specifically mentioned will be used. (The \code{outcome} will not be +added.) By default, equals the outcome. If manually specified, does not +add the outcome variable, so make sure to specify it.} \item{trainer}{A \code{{parsnip}} model describing the type of estimation. For now, we enforce \code{mode = "classification"}. Typical values are @@ -43,9 +43,124 @@ and (2) \code{epi_workflow}, a list that encapsulates the entire estimation workflow } \description{ -This is an autoregressive classification model for -\link[epiprocess:epi_df]{epiprocess::epi_df} data. It does "direct" forecasting, meaning -that it estimates a class at a particular target horizon. +This is an autoregressive classification model for continuous data. It does +"direct" forecasting, meaning that it estimates a class at a particular +target horizon. +} +\details{ +The \code{arx_classifier()} is an autoregressive classification model for \code{epi_df} +data that is used to predict a discrete class for each case under +consideration. It is a direct forecaster in that it estimates the classes +at a specific horizon or ahead value. + +To get a sense of how the \code{arx_classifier()} works, let's consider a simple +example with minimal inputs. For this, we will use the built-in +\code{covid_case_death_rates} that contains confirmed COVID-19 cases and deaths +from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, +we'll take a subset of data for five states over June 4, 2021 to December +31, 2021. 
Our objective is to predict whether the case rates are increasing +when considering the 0, 7 and 14 day case rates: + +\if{html}{\out{
}}\preformatted{jhu <- covid_case_death_rates \%>\% + filter( + time_value >= "2021-06-04", + time_value <= "2021-12-31", + geo_value \%in\% c("ca", "fl", "tx", "ny", "nj") + ) + +out <- arx_classifier(jhu, outcome = "case_rate", predictors = "case_rate") + +out$predictions +#> # A tibble: 5 x 4 +#> geo_value .pred_class forecast_date target_date +#> +#> 1 ca (-Inf,0.25] 2021-12-31 2022-01-07 +#> 2 fl (-Inf,0.25] 2021-12-31 2022-01-07 +#> 3 nj (-Inf,0.25] 2021-12-31 2022-01-07 +#> 4 ny (-Inf,0.25] 2021-12-31 2022-01-07 +#> 5 tx (-Inf,0.25] 2021-12-31 2022-01-07 +}\if{html}{\out{
}} + +The key takeaway from the predictions is that there are two prediction +classes: \verb{(-Inf, 0.25]} and \verb{(0.25, Inf)}. This is because for our goal of +classification the classes must be discrete. The discretization of the +real-valued outcome is controlled by the \code{breaks} argument, which defaults +to \code{0.25}. Such breaks will be automatically extended to cover the entire +real line. For example, the default break of \code{0.25} is silently extended to +\code{breaks = c(-Inf, .25, Inf)} and, therefore, results in two classes: +\verb{(-Inf, 0.25]} and \verb{(0.25, Inf)}. These two classes are used to discretize +the outcome. The conversion of the outcome to such classes is handled +internally. So if discrete classes already exist for the outcome in the +\code{epi_df}, then we recommend coding a classifier from scratch using the +\code{epi_workflow} framework for more control. + +The \code{trainer} is a \code{parsnip} model describing the type of estimation such +that \code{mode = "classification"} is enforced. The two typical trainers that +are used are \code{parsnip::logistic_reg()} for two classes or +\code{parsnip::multinom_reg()} for more than two classes. + +\if{html}{\out{
}}\preformatted{workflows::extract_spec_parsnip(out$epi_workflow) +#> Logistic Regression Model Specification (classification) +#> +#> Computational engine: glm +}\if{html}{\out{
}} + +From the parsnip model specification, we can see that the trainer used is +logistic regression, which is expected for our binary outcome. More +complicated trainers like \code{parsnip::naive_Bayes()} or +\code{parsnip::rand_forest()} may also be used (however, we will stick to the +basics in this gentle introduction to the classifier). + +If you use the default trainer of logistic regression for binary +classification and you decide against using the default break of 0.25, then +you should only input one break so that there are two classification bins +to properly dichotomize the outcome. For example, let's set a break of 0.5 +instead of relying on the default of 0.25. We can do this by passing 0.5 to +the \code{breaks} argument in \code{arx_class_args_list()} as follows: + +\if{html}{\out{
}}\preformatted{out_break_0.5 <- arx_classifier( + jhu, + outcome = "case_rate", + predictors = "case_rate", + args_list = arx_class_args_list( + breaks = 0.5 + ) +) +#> Warning: glm.fit: algorithm did not converge +#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred + +out_break_0.5$predictions +#> # A tibble: 5 x 4 +#> geo_value .pred_class forecast_date target_date +#> +#> 1 ca (-Inf,0.5] 2021-12-31 2022-01-07 +#> 2 fl (-Inf,0.5] 2021-12-31 2022-01-07 +#> 3 nj (-Inf,0.5] 2021-12-31 2022-01-07 +#> 4 ny (-Inf,0.5] 2021-12-31 2022-01-07 +#> 5 tx (-Inf,0.5] 2021-12-31 2022-01-07 +}\if{html}{\out{
}} + +Indeed, we can observe that the two \code{.pred_class} are now (-Inf, 0.5] and +(0.5, Inf). See \code{help(arx_class_args_list)} for other available +modifications. + +Additional arguments that may be supplied to \code{arx_class_args_list()} include +the expected \code{lags} and \code{ahead} arguments for an autoregressive-type model. +These have default values of 0, 7, and 14 days for the lags of the +predictors and 7 days ahead of the forecast date for predicting the +outcome. There is also \code{n_training} to indicate the upper bound for the +number of training rows per key. If you would like some practice with using +this, then remove the filtering command to obtain data within "2021-06-04" +and "2021-12-31" and instead set \code{n_training} to be the number of days +between these two dates, inclusive of the end points. The end results +should be the same. In addition to \code{n_training}, there are \code{forecast_date} +and \code{target_date} to specify the date that the forecast is created and +intended, respectively. We will not dwell on such arguments here as they +are not unique to this classifier or absolutely essential to understanding +how it operates. See \code{arx_class_args_list()} for a complete list of the +remaining arguments and their definitions. } \examples{ tiny_geos <- c("as", "mp", "vi", "gu", "pr") diff --git a/man/arx_fcast_epi_workflow.Rd b/man/arx_fcast_epi_workflow.Rd index 87d503bc6..4aa97ff45 100644 --- a/man/arx_fcast_epi_workflow.Rd +++ b/man/arx_fcast_epi_workflow.Rd @@ -15,21 +15,20 @@ arx_fcast_epi_workflow( \arguments{ \item{epi_data}{An \code{epi_df} object} -\item{outcome}{A character (scalar) specifying the outcome (in the -\code{epi_df}).} +\item{outcome}{A character (scalar) specifying the outcome (in the \code{epi_df}).} \item{predictors}{A character vector giving column(s) of predictor variables. -This defaults to the \code{outcome}. However, if manually specified, only those variables -specifically mentioned will be used. (The \code{outcome} will not be added.) -By default, equals the outcome. If manually specified, does not add the -outcome variable, so make sure to specify it.} +This defaults to the \code{outcome}. However, if manually specified, only those +variables specifically mentioned will be used. (The \code{outcome} will not be +added.) By default, equals the outcome. If manually specified, does not +add the outcome variable, so make sure to specify it.} \item{trainer}{A \code{{parsnip}} model describing the type of estimation. For now, we enforce \code{mode = "regression"}. May be \code{NULL} if you'd like to decide later.} -\item{args_list}{A list of customization arguments to determine -the type of forecasting model. See \code{\link[=arx_args_list]{arx_args_list()}}.} +\item{args_list}{A list of customization arguments to determine the type of +forecasting model. See \code{\link[=arx_args_list]{arx_args_list()}}.} } \value{ An unfitted \code{epi_workflow}.
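+Since the result is unfitted, you must fit it before forecasting. A minimal
+sketch (assuming the \code{jhu} subset defined in the examples below; the object
+names here are illustrative only, not part of the documented API):
+
+\preformatted{wf <- arx_fcast_epi_workflow(
+  jhu,
+  outcome = "death_rate",
+  predictors = c("case_rate", "death_rate")
+)
+wf \%>\% fit(jhu) \%>\% forecast()
+}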
diff --git a/man/arx_forecaster.Rd b/man/arx_forecaster.Rd index ff820b8c8..85c25c115 100644 --- a/man/arx_forecaster.Rd +++ b/man/arx_forecaster.Rd @@ -15,45 +15,50 @@ arx_forecaster( \arguments{ \item{epi_data}{An \code{epi_df} object} -\item{outcome}{A character (scalar) specifying the outcome (in the -\code{epi_df}).} +\item{outcome}{A character (scalar) specifying the outcome (in the \code{epi_df}).} \item{predictors}{A character vector giving column(s) of predictor variables. -This defaults to the \code{outcome}. However, if manually specified, only those variables -specifically mentioned will be used. (The \code{outcome} will not be added.) -By default, equals the outcome. If manually specified, does not add the -outcome variable, so make sure to specify it.} +This defaults to the \code{outcome}. However, if manually specified, only those +variables specifically mentioned will be used. (The \code{outcome} will not be +added.) By default, equals the outcome. If manually specified, does not +add the outcome variable, so make sure to specify it.} -\item{trainer}{A \code{{parsnip}} model describing the type of estimation. -For now, we enforce \code{mode = "regression"}.} +\item{trainer}{A \code{{parsnip}} model describing the type of estimation. For +now, we enforce \code{mode = "regression"}.} -\item{args_list}{A list of customization arguments to determine -the type of forecasting model. See \code{\link[=arx_args_list]{arx_args_list()}}.} +\item{args_list}{A list of customization arguments to determine the type of +forecasting model. See \code{\link[=arx_args_list]{arx_args_list()}}.} } \value{ -A list with (1) \code{predictions} an \code{epi_df} of predicted values -and (2) \code{epi_workflow}, a list that encapsulates the entire estimation -workflow +An \code{arx_fcast}, with the fields \code{predictions} and \code{epi_workflow}. +\code{predictions} is an \code{epi_df} of predicted values while \code{epi_workflow} is +the fitted workflow used to make those predictions } \description{ This is an autoregressive forecasting model for -\link[epiprocess:epi_df]{epiprocess::epi_df} data. It does "direct" forecasting, meaning -that it estimates a model for a particular target horizon. +\link[epiprocess:epi_df]{epiprocess::epi_df} data. It does "direct" +forecasting, meaning that it estimates a model for a particular target +horizon of \code{outcome} based on the lags of the \code{predictors}. See the \href{../articles/epipredict.html}{Get started vignette} for some worked examples and +the \href{../articles/custom_epiworkflows.html}{Custom epi_workflows vignette} for a +recreation using a custom \code{epi_workflow()}.
} \examples{ jhu <- covid_case_death_rates \%>\% dplyr::filter(time_value >= as.Date("2021-12-01")) out <- arx_forecaster( - jhu, "death_rate", + jhu, + "death_rate", c("case_rate", "death_rate") ) -out <- arx_forecaster(jhu, "death_rate", +out <- arx_forecaster(jhu, + "death_rate", c("case_rate", "death_rate"), trainer = quantile_reg(), args_list = arx_args_list(quantile_levels = 1:9 / 10) ) +out } \seealso{ \code{\link[=arx_fcast_epi_workflow]{arx_fcast_epi_workflow()}}, \code{\link[=arx_args_list]{arx_args_list()}} diff --git a/man/augment.epi_workflow.Rd b/man/augment.epi_workflow.Rd index 8007a4d30..cbc807ebb 100644 --- a/man/augment.epi_workflow.Rd +++ b/man/augment.epi_workflow.Rd @@ -17,5 +17,8 @@ new_data with additional columns containing the predicted values } \description{ -Augment data with predictions +\code{augment()}, unlike \code{forecast()}, has the goal of modifying the training +data, rather than just producing new forecasts. It does a prediction on +\code{new_data}, which will produce a prediction for most \code{time_values}, and then +adds \code{.pred} as a column to \code{new_data} and returns the resulting join. } diff --git a/man/autoplot-epipred.Rd b/man/autoplot-epipred.Rd index f4657967f..e92864664 100644 --- a/man/autoplot-epipred.Rd +++ b/man/autoplot-epipred.Rd @@ -11,6 +11,7 @@ \method{autoplot}{epi_workflow}( object, predictions = NULL, + plot_data = NULL, .levels = c(0.5, 0.8, 0.9), ..., .color_by = c("all_keys", "geo_value", "other_keys", ".response", "all", "none"), @@ -23,6 +24,7 @@ \method{autoplot}{canned_epipred}( object, + plot_data = NULL, ..., .color_by = c("all_keys", "geo_value", "other_keys", ".response", "all", "none"), .facet_by = c(".response", "other_keys", "all_keys", "geo_value", "all", "none"), @@ -42,6 +44,9 @@ \item{predictions}{A data frame with predictions. If \code{NULL}, only the original data is shown.} +\item{plot_data}{An epi_df of the data to plot against. This is for the case +where you have the actual results to compare the forecast against.} + \item{.levels}{A numeric vector of levels to plot for any prediction bands. More than 3 levels begins to be difficult to see.} diff --git a/man/cdc_baseline_args_list.Rd b/man/cdc_baseline_args_list.Rd index 4a8c13113..2c557e5dc 100644 --- a/man/cdc_baseline_args_list.Rd +++ b/man/cdc_baseline_args_list.Rd @@ -51,10 +51,13 @@ These samples are spaced evenly on the (0, 1) scale, F_X(x) resulting in linear interpolation on the X scale. This is achieved with \code{\link[stats:quantile]{stats::quantile()}} Type 7 (the default for that function).} -\item{symmetrize}{Logical. The default \code{TRUE} calculates -symmetric prediction intervals. This argument only applies when -residual quantiles are used. It is not applicable with -\code{trainer = quantile_reg()}, for example.} +\item{symmetrize}{Logical. The default \code{TRUE} calculates symmetric prediction +intervals. This argument only applies when residual quantiles are used. It +is not applicable with \code{trainer = quantile_reg()}, for example. This is +achieved by including both the residuals and their negation. Typically, one +would only want non-symmetric quantiles when increasing trajectories are +quite different from decreasing ones, such as a strictly positive variable +near zero.} \item{nonneg}{Logical. Force all predictive intervals be non-negative.
Because non-negativity is forced \emph{before} propagating forward, this diff --git a/man/climate_args_list.Rd b/man/climate_args_list.Rd index 3a889e5c7..bb1ef7eab 100644 --- a/man/climate_args_list.Rd +++ b/man/climate_args_list.Rd @@ -46,10 +46,13 @@ rolling average, centered at each day.} prediction intervals. These are created by computing the quantiles of training residuals. A \code{NULL} value will result in point forecasts only.} -\item{symmetrize}{Logical. The default \code{TRUE} calculates -symmetric prediction intervals. This argument only applies when -residual quantiles are used. It is not applicable with -\code{trainer = quantile_reg()}, for example.} +\item{symmetrize}{Logical. The default \code{TRUE} calculates symmetric prediction +intervals. This argument only applies when residual quantiles are used. It +is not applicable with \code{trainer = quantile_reg()}, for example. This is +achieved by including both the residuals and their negation. Typically, one +would only want non-symmetric quantiles when increasing trajectories are +quite different from decreasing ones, such as a strictly positive variable +near zero.} \item{nonneg}{Logical. The default \code{TRUE} enforces nonnegative predictions by hard-thresholding at 0.} diff --git a/man/epi_workflow.Rd b/man/epi_workflow.Rd index 59e3d5c8f..c84626363 100644 --- a/man/epi_workflow.Rd +++ b/man/epi_workflow.Rd @@ -25,12 +25,12 @@ A new \code{epi_workflow} object. } \description{ This is a container object that unifies preprocessing, fitting, prediction, -and postprocessing for predictive modeling on epidemiological data. It extends -the functionality of a \code{\link[workflows:workflow]{workflows::workflow()}} to handle the typical panel -data structures found in this field. This extension is handled completely -internally, and should be invisible to the user. For all intents and purposes, -this operates exactly like a \code{\link[workflows:workflow]{workflows::workflow()}}. For more details -and numerous examples, see there. +and post-processing for predictive modeling on epidemiological data. It +extends the functionality of a \code{\link[workflows:workflow]{workflows::workflow()}} to handle the typical +panel data structures found in this field. This extension is handled +completely internally, and should be invisible to the user. For all intents +and purposes, this operates exactly like a \code{\link[workflows:workflow]{workflows::workflow()}}. For some +\code{{epipredict}}-specific examples, see the \href{../articles/custom_epiworkflows.html}{custom epiworkflows vignette}. } \examples{ jhu <- covid_case_death_rates @@ -46,5 +46,5 @@ wf <- epi_workflow(r, parsnip::linear_reg()) wf } \seealso{ -workflows::workflow +\code{\link[workflows:workflow]{workflows::workflow()}} } diff --git a/man/extrapolate_quantiles.Rd b/man/extrapolate_quantiles.Rd index bd460dbe9..5b6b97ff0 100644 --- a/man/extrapolate_quantiles.Rd +++ b/man/extrapolate_quantiles.Rd @@ -2,7 +2,7 @@ % Please edit documentation in R/extrapolate_quantiles.R \name{extrapolate_quantiles} \alias{extrapolate_quantiles} -\title{Summarize a distribution with a set of quantiles} +\title{Extrapolate the quantiles to new quantile levels} \usage{ extrapolate_quantiles(x, probs, replace_na = TRUE, ...) } @@ -22,6 +22,19 @@ of \code{x} will now have a superset of the original \code{quantile_values} (the union of those and \code{probs}).
} \description{ +This both interpolates between quantile levels already defined in \code{x} and +extrapolates quantiles outside their bounds. The interpolation method is +determined by the \code{middle} argument, which can be either \code{"cubic"} +for a (Hyman) cubic spline interpolation, or \code{"linear"} for simple linear +interpolation. +} +\details{ +There is only one extrapolation method for values greater than the largest +known quantile level or smaller than the smallest known quantile level. It +assumes a roughly exponential tail, whose decay rate and offset are derived +from the slope of the two most extreme quantile levels on a logistic scale. +See the internal function \code{tail_extrapolate()} for the exact implementation. + This function takes a \code{quantile_pred} vector and returns the same type of object, expanded to include \emph{additional} quantiles computed at \code{probs}. If you want behaviour more @@ -31,5 +44,7 appropriate. \examples{ dstn <- quantile_pred(rbind(1:4, 8:11), c(.2, .4, .6, .8)) # extra quantiles are appended -as_tibble(extrapolate_quantiles(dstn, probs = c(.25, 0.5, .75))) +as_tibble(extrapolate_quantiles(dstn, probs = c(0.25, 0.5, 0.75))) + +extrapolate_quantiles(dstn, probs = c(0.0001, 0.25, 0.5, 0.75, 0.99999)) } diff --git a/man/figures/README-date-1.png b/man/figures/README-date-1.png new file mode 100644 index 000000000..b66ec04c8 Binary files /dev/null and b/man/figures/README-date-1.png differ diff --git a/man/figures/README-show-processed-data-1.png b/man/figures/README-show-processed-data-1.png new file mode 100644 index 000000000..e312c9fd4 Binary files /dev/null and b/man/figures/README-show-processed-data-1.png differ diff --git a/man/figures/README-show-single-forecast-1.png b/man/figures/README-show-single-forecast-1.png new file mode 100644 index 000000000..8756795de Binary files /dev/null and b/man/figures/README-show-single-forecast-1.png differ diff --git a/man/fit-epi_workflow.Rd b/man/fit-epi_workflow.Rd index 83b3b9f51..ef0f4ff40 100644 --- a/man/fit-epi_workflow.Rd +++ b/man/fit-epi_workflow.Rd @@ -22,9 +22,9 @@ The \code{epi_workflow} object, updated with a fit parsnip model in the \code{object$fit$fit} slot. } \description{ -This is the \code{fit()} method for an \code{epi_workflow} object that +This is the \code{fit()} method for an \code{epi_workflow()} object that estimates parameters for a given model from a set of data. -Fitting an \code{epi_workflow} involves two main steps, which are +Fitting an \code{epi_workflow()} involves two main steps, which are preprocessing the data and fitting the underlying parsnip model. } \examples{ @@ -40,5 +40,5 @@ wf } \seealso{ -workflows::fit-workflow +\code{\link[workflows:fit-workflow]{workflows::fit-workflow()}} } diff --git a/man/flatline_args_list.Rd b/man/flatline_args_list.Rd index 626bcb6f1..7c9bb0399 100644 --- a/man/flatline_args_list.Rd +++ b/man/flatline_args_list.Rd @@ -44,10 +44,13 @@ will determine this automatically as \code{forecast_date + ahead}.} prediction intervals. These are created by computing the quantiles of training residuals. A \code{NULL} value will result in point forecasts only.} -\item{symmetrize}{Logical. The default \code{TRUE} calculates -symmetric prediction intervals. This argument only applies when -residual quantiles are used. It is not applicable with -\code{trainer = quantile_reg()}, for example.} +\item{symmetrize}{Logical. The default \code{TRUE} calculates symmetric prediction +intervals.
This argument only applies when residual quantiles are used. It +is not applicable with \code{trainer = quantile_reg()}, for example. This is +achieved by including both the residuals and their negation. Typically, one +would only want non-symmetric quantiles when increasing trajectories are +quite different from decreasing ones, such as a strictly positive variable +near zero.} \item{nonneg}{Logical. The default \code{TRUE} enforces nonnegative predictions by hard-thresholding at 0.} diff --git a/man/flatline_forecaster.Rd b/man/flatline_forecaster.Rd index f78e2f931..cc789bac2 100644 --- a/man/flatline_forecaster.Rd +++ b/man/flatline_forecaster.Rd @@ -21,12 +21,10 @@ ahead (unique horizon) for each unique combination of \code{key_vars}. \description{ This is a simple forecasting model for \link[epiprocess:epi_df]{epiprocess::epi_df} data. It uses the most recent -observation as the -forecast for any future date, and produces intervals based on the quantiles -of the residuals of such a "flatline" forecast over all available training -data. -} -\details{ +observation as the forecast for any future date, and produces intervals +based on the quantiles of the residuals of such a "flatline" forecast over +all available training data. + By default, the predictive intervals are computed separately for each combination of key values (\code{geo_value} + any additional keys) in the \code{epi_data} argument. @@ -34,6 +32,29 @@ combination of key values (\code{geo_value} + any additional keys) in the This forecaster is very similar to that used by the \href{https://covid19forecasthub.org}{COVID19ForecastHub} } +\details{ +Here is (roughly) the code for the \code{flatline_forecaster()} applied to the +\code{case_rate} for \code{epidatasets::covid_case_death_rates}. + +\if{html}{\out{
}}\preformatted{jhu <- covid_case_death_rates \%>\% + filter(time_value > "2021-11-01", geo_value \%in\% c("ak", "ca", "ny")) +r <- epi_recipe(covid_case_death_rates) \%>\% + step_epi_ahead(case_rate, ahead = 7, skip = TRUE) \%>\% + recipes::update_role(case_rate, new_role = "predictor") \%>\% + recipes::add_role(all_of(key_colnames(jhu)), new_role = "predictor") + +f <- frosting() \%>\% + layer_predict() \%>\% + layer_residual_quantiles() \%>\% + layer_add_forecast_date() \%>\% + layer_add_target_date() \%>\% + layer_threshold(starts_with(".pred")) + +eng <- linear_reg() \%>\% set_engine("flatline") +wf <- epi_workflow(r, eng, f) \%>\% fit(jhu) +preds <- forecast(wf) +}\if{html}{\out{
}} +} \examples{ jhu <- covid_case_death_rates \%>\% dplyr::filter(time_value >= as.Date("2021-12-01")) diff --git a/man/forecast.epi_workflow.Rd b/man/forecast.epi_workflow.Rd index 22f8cf4bb..feae32f68 100644 --- a/man/forecast.epi_workflow.Rd +++ b/man/forecast.epi_workflow.Rd @@ -2,29 +2,33 @@ % Please edit documentation in R/epi_workflow.R \name{forecast.epi_workflow} \alias{forecast.epi_workflow} -\title{Produce a forecast from an epi workflow} +\title{Produce a forecast from just an epi workflow} \usage{ -\method{forecast}{epi_workflow}(object, ..., n_recent = NULL, forecast_date = NULL) +\method{forecast}{epi_workflow}(object, ...) } \arguments{ \item{object}{An epi workflow.} \item{...}{Not used.} - -\item{n_recent}{Integer or NULL. If filling missing data with locf = TRUE, -how far back are we willing to tolerate missing data? Larger values allow -more filling. The default NULL will determine this from the the recipe. For -example, suppose n_recent = 3, then if the 3 most recent observations in any -geo_value are all NA’s, we won’t be able to fill anything, and an error -message will be thrown. (See details.)} - -\item{forecast_date}{By default, this is set to the maximum time_value in x. -But if there is data latency such that recent NA's should be filled, this may -be after the last available time_value.} } \value{ A forecast tibble. } \description{ -Produce a forecast from an epi workflow +\code{forecast.epi_workflow} restricts the training data to the latest +available data and predicts on that. It binds together +\code{get_test_data()} and \code{predict()}. +} +\examples{ +jhu <- covid_case_death_rates \%>\% + filter(time_value > "2021-08-01") + +r <- epi_recipe(jhu) \%>\% + step_epi_lag(death_rate, lag = c(0, 7, 14)) \%>\% + step_epi_ahead(death_rate, ahead = 7) \%>\% + step_epi_naomit() + +epi_workflow(r, parsnip::linear_reg()) \%>\% + fit(jhu) \%>\% + forecast() } diff --git a/man/frosting.Rd b/man/frosting.Rd index e36100bae..c00875125 100644 --- a/man/frosting.Rd +++ b/man/frosting.Rd @@ -2,7 +2,7 @@ % Please edit documentation in R/frosting.R \name{frosting} \alias{frosting} -\title{Create frosting for postprocessing predictions} +\title{Create frosting for post-processing predictions} \usage{ frosting(layers = NULL, requirements = NULL) } @@ -15,14 +15,14 @@ frosting(layers = NULL, requirements = NULL) A frosting object. } \description{ -This generates a postprocessing container (much like \code{recipes::recipe()}) -to hold steps for postprocessing predictions. +This generates a post-processing container (much like \code{recipes::recipe()}) +to hold steps for post-processing predictions. } \details{ The arguments are currently placeholders and must be NULL } \examples{ -# Toy example to show that frosting can be created and added for postprocessing +# Toy example to show that frosting can be created and added for post-processing f <- frosting() wf <- epi_workflow() \%>\% add_frosting(f) diff --git a/man/get_test_data.Rd b/man/get_test_data.Rd index 16359b9c3..3c3317067 100644 --- a/man/get_test_data.Rd +++ b/man/get_test_data.Rd @@ -17,11 +17,10 @@ An object of the same type as \code{x} with columns \code{geo_value}, \code{time keys, as well other variables in the original dataset.
} \description{ -Based on the longest lag period in the recipe, -\code{get_test_data()} creates an \link[epiprocess:epi_df]{epi_df} -with columns \code{geo_value}, \code{time_value} -and other variables in the original dataset, -which will be used to create features necessary to produce forecasts. +If \code{predict()} is given the full training dataset, it will produce a forecast +for every day which has enough data. For most cases, this is far more +forecasts than is necessary. \code{get_test_data()} is designed to restrict the given dataset to the minimum amount needed to produce a forecast on the \code{forecast_date}. +Primarily this is based on the longest lag period in the recipe. } \details{ The minimum required (recent) data to produce a forecast is equal to diff --git a/man/layer_add_forecast_date.Rd b/man/layer_add_forecast_date.Rd index 14c7864bc..6370962cc 100644 --- a/man/layer_add_forecast_date.Rd +++ b/man/layer_add_forecast_date.Rd @@ -2,7 +2,7 @@ % Please edit documentation in R/layer_add_forecast_date.R \name{layer_add_forecast_date} \alias{layer_add_forecast_date} -\title{Postprocessing step to add the forecast date} +\title{Post-processing step to add the forecast date} \usage{ layer_add_forecast_date( frosting, @@ -19,7 +19,7 @@ that when the forecast date is left unspecified, it is set to one of two values. If there is a \code{step_adjust_latency} step present, it uses the \code{forecast_date} as set in that function. Otherwise, it uses the maximum \code{time_value} across the data used for pre-processing, fitting the model, -and postprocessing.} +and post-processing.} \item{id}{a random id string} } @@ -27,15 +27,15 @@ and postprocessing.} an updated \code{frosting} postprocessor } \description{ -Postprocessing step to add the forecast date +Post-processing step to add the forecast date } \details{ To use this function, either specify a forecast date or leave the forecast date unspecifed here. In the latter case, the forecast date will be set as the maximum time value from the data used in pre-processing, -fitting the model, and postprocessing. In any case, when the forecast date is +fitting the model, and post-processing. In any case, when the forecast date is less than the maximum \code{as_of} value (from the data used pre-processing, -model fitting, and postprocessing), an appropriate warning will be thrown. +model fitting, and post-processing), an appropriate warning will be thrown. } \examples{ jhu <- covid_case_death_rates \%>\% diff --git a/man/layer_add_target_date.Rd b/man/layer_add_target_date.Rd index c9f43c5f1..f10178898 100644 --- a/man/layer_add_target_date.Rd +++ b/man/layer_add_target_date.Rd @@ -2,7 +2,7 @@ % Please edit documentation in R/layer_add_target_date.R \name{layer_add_target_date} \alias{layer_add_target_date} -\title{Postprocessing step to add the target date} +\title{Post-processing step to add the target date} \usage{ layer_add_target_date( frosting, @@ -18,7 +18,7 @@ there's a forecast date specified upstream (either in a \code{step_adjust_latency} or in a \code{layer_forecast_date}), then it is the forecast date plus \code{ahead} (from \code{step_epi_ahead} in the \code{epi_recipe}). Otherwise, it is the maximum \code{time_value} (from the data used in -pre-processing, fitting the model, and postprocessing) plus \code{ahead}, where +pre-processing, fitting the model, and post-processing) plus \code{ahead}, where \code{ahead} has been specified in preprocessing. 
The user may override these by specifying a target date of their own (of the form "yyyy-mm-dd").} @@ -28,18 +28,19 @@ specifying a target date of their own (of the form "yyyy-mm-dd").} \value{ an updated \code{frosting} postprocessor } \description{ -Postprocessing step to add the target date +Post-processing step to add the target date } \details{ -By default, this function assumes that a value for \code{ahead} -has been specified in a preprocessing step (most likely in -\code{step_epi_ahead}). Then, \code{ahead} is added to the \code{forecast_date} -in the test data to get the target date. \code{forecast_date} can be set in 3 ways: +By default, this function assumes that a value for \code{ahead} has been +specified in a preprocessing step (most likely in \code{step_epi_ahead}). Then, +\code{ahead} is added to the \code{forecast_date} in the test data to get the target +date. \code{forecast_date} itself can be set in 3 ways: \enumerate{ -\item \code{step_adjust_latency}, which typically uses the training \code{epi_df}'s \code{as_of} -\item \code{layer_add_forecast_date}, which inherits from 1 if not manually specifed -\item if none of those are the case, it is simply the maximum \code{time_value} over -every dataset used (prep, training, and prediction). +\item The default \code{forecast_date} is simply the maximum \code{time_value} over every +dataset used (prep, training, and prediction). +\item If \code{step_adjust_latency} is present, it will typically use the training +\code{epi_df}'s \code{as_of} +\item \code{layer_add_forecast_date}, which inherits from 2 if not manually specified } } \examples{ diff --git a/man/layer_point_from_distn.Rd b/man/layer_point_from_distn.Rd index 329f62d5b..3e770e912 100644 --- a/man/layer_point_from_distn.Rd +++ b/man/layer_point_from_distn.Rd @@ -28,7 +28,7 @@ will overwrite the \code{.pred} column, removing the distribution information.} an updated \code{frosting} postprocessor. } \description{ -This function adds a postprocessing layer to extract a point forecast from +This function adds a post-processing layer to extract a point forecast from a distributional forecast. NOTE: With default arguments, this will remove information, so one should usually call this AFTER \code{layer_quantile_distn()} or set the \code{name} argument to something specific. } diff --git a/man/layer_population_scaling.Rd b/man/layer_population_scaling.Rd index 1b4c3d7d9..25164bbd9 100644 --- a/man/layer_population_scaling.Rd +++ b/man/layer_population_scaling.Rd @@ -26,12 +26,13 @@ for this step. See \code{\link[recipes:selections]{recipes::selections()}} for m \item{df}{a data frame that contains the population data to be used for inverting the existing scaling.} -\item{by}{A (possibly named) character vector of variables to join by. +\item{by}{A (possibly named) character vector of variables to join \code{df} onto +the \code{epi_df} by. If \code{NULL}, the default, the function will try to infer a reasonable set of columns. First, it will try to join by all variables in the test data with @@ -67,16 +68,16 @@ in the \code{epi_df}.} an updated \code{frosting} postprocessor } \description{ -\code{layer_population_scaling} creates a specification of a frosting layer -that will "undo" per-capita scaling. Typical usage would -load a dataset that contains state-level population, and use it to convert -predictions made from a rate-scale model to raw scale by multiplying by -the population. -Although, it is worth noting that there is nothing special about "population".
-The function can be used to scale by any variable. Population is the -standard use case in the epidemiology forecasting scenario. Any value -passed will \emph{multiply} the selected variables while the \code{rate_rescaling} -argument is a common \emph{divisor} of the selected variables. +\code{layer_population_scaling} creates a specification of a frosting layer that +will "undo" per-capita scaling done in \code{step_population_scaling()}. Typical +usage would set \code{df} to be a dataset that contains state-level population, +and use it to convert predictions made from a rate-scale model back to raw +scale by multiplying by the population. +It is worth noting, though, that there is nothing special about +"population", and the function can be used to scale by any variable. +Population is the standard use case in the epidemiology forecasting scenario. +Any value passed will \emph{multiply} the selected variables while the +\code{rate_rescaling} argument is a common \emph{divisor} of the selected variables. } \examples{ jhu <- cases_deaths_subset \%>\% diff --git a/man/layer_predict.Rd b/man/layer_predict.Rd index 9a7f12a46..d678c9fae 100644 --- a/man/layer_predict.Rd +++ b/man/layer_predict.Rd @@ -2,7 +2,7 @@ % Please edit documentation in R/layer_predict.R \name{layer_predict} \alias{layer_predict} -\title{Prediction layer for postprocessing} +\title{Prediction layer for post-processing} \usage{ layer_predict( frosting, @@ -54,7 +54,7 @@ An updated \code{frosting} object \description{ Implements prediction on a fitted \code{epi_workflow}. One may want different types of prediction, and to potentially apply this after some amount of -postprocessing. This would typically be the first layer in a \code{frosting} +post-processing. This would typically be the first layer in a \code{frosting} postprocessor. } \examples{ @@ -84,5 +84,5 @@ p2 <- predict(wf2, latest) p2 } \seealso{ -\code{parsnip::predict.model_fit()} +\code{\link[parsnip:predict.model_fit]{parsnip::predict.model_fit()}} } diff --git a/man/layer_predictive_distn.Rd b/man/layer_predictive_distn.Rd index 0e4b17cdb..7f5464513 100644 --- a/man/layer_predictive_distn.Rd +++ b/man/layer_predictive_distn.Rd @@ -37,7 +37,9 @@ residual quantiles added to the prediction This function calculates an \emph{approximation} to a parametric predictive distribution. Predictive distributions from linear models require \verb{x* (X'X)^\{-1\} x*} -along with the degrees of freedom. This function approximates both. It -should be reasonably accurate for models fit using \code{lm} when the new point -\verb{x*} isn't too far from the bulk of the data. +along with the degrees of freedom. This function approximates both. It should +be reasonably accurate for models fit using \code{lm} when the new point \verb{x*} +isn't too far from the bulk of the data. Outside of that specific case, it is +recommended to use \code{layer_residual_quantiles()}, or if you are working with a +model that produces distributional predictions, use \code{layer_quantile_distn()}. } diff --git a/man/layer_quantile_distn.Rd b/man/layer_quantile_distn.Rd index 21c38b812..9ce3cd57d 100644 --- a/man/layer_quantile_distn.Rd +++ b/man/layer_quantile_distn.Rd @@ -32,6 +32,8 @@ quantiles will be added to the predictions. } \description{ This function calculates quantiles when the prediction was \emph{distributional}. +If the model producing the forecast is not distributional, it is recommended +to use \code{layer_residual_quantiles()} instead.
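+
+A minimal sketch of where this layer sits in a \code{frosting} (illustrative
+only; it assumes the \code{epi_workflow} uses a distributional trainer such as
+\code{quantile_reg()}):
+
+\preformatted{f <- frosting() \%>\%
+  layer_predict() \%>\%
+  layer_quantile_distn() \%>\%
+  layer_point_from_distn()
+}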
} \details{ Currently, the only distributional modes/engines are diff --git a/man/layer_residual_quantiles.Rd b/man/layer_residual_quantiles.Rd index 16e16cea8..e879677f3 100644 --- a/man/layer_residual_quantiles.Rd +++ b/man/layer_residual_quantiles.Rd @@ -23,7 +23,11 @@ layer_residual_quantiles( referring to the desired quantile. Note that 0.5 will always be included even if left out by the user.} -\item{symmetrize}{logical. If \code{TRUE} then interval will be symmetric.} +\item{symmetrize}{logical. If \code{TRUE} then the interval will be symmetric. +This is achieved by including both the residuals and their negations. +Typically, one would only want non-symmetric quantiles when increasing +trajectories are quite different from decreasing ones, such as a strictly +positive variable near zero.} \item{by_key}{A character vector of keys to group the residuals by before calculating quantiles. The default, \code{c()} performs no grouping.} @@ -37,7 +41,10 @@ an updated \code{frosting} postprocessor with additional columns of the residual quantiles added to the prediction } \description{ -Creates predictions based on residual quantiles +This function calculates quantiles based on the empirical quantiles of the +model's residuals. If the model producing the forecast is distributional, it +is recommended to use \code{layer_quantile_distn()} instead, as those will be +significantly more accurate. } \examples{ jhu <- covid_case_death_rates \%>\% diff --git a/man/layer_threshold.Rd b/man/layer_threshold.Rd index 875390ced..77f0a9aa2 100644 --- a/man/layer_threshold.Rd +++ b/man/layer_threshold.Rd @@ -35,9 +35,17 @@ Default value is \code{Inf}.} an updated \code{frosting} postprocessor } \description{ -This postprocessing step is used to set prediction values that are -smaller than the lower threshold or higher than the upper threshold equal -to the threshold values. +This post-processing step is used to set prediction values that are smaller +than the lower threshold or higher than the upper threshold equal to the +threshold values. +} +\details{ +Making case count predictions strictly positive is a typical example usage. +It can be called before or after the quantiles are created using +\code{layer_quantile_distn()} since the quantiles are an inherent part of the +result from \code{layer_predict()} for distributional models, but must be called +after \code{layer_residual_quantiles()}, since the quantiles for that case don't +exist until after that layer. } \examples{ jhu <- covid_case_death_rates \%>\% diff --git a/man/layer_unnest.Rd b/man/layer_unnest.Rd index 3c25608a6..dfa1f4dc1 100644 --- a/man/layer_unnest.Rd +++ b/man/layer_unnest.Rd @@ -20,5 +20,40 @@ be used to select a range of variables.} an updated \code{frosting} postprocessor } \description{ -Unnest prediction list-cols +For any model that produces forecasts for multiple outcomes, such as multiple +aheads, the resulting prediction is a list of forecasts inside a column of +the prediction tibble, which is not an ideal format. This layer "lengthens" +the result, moving each outcome to a separate row, in the same manner as +\code{tidyr::unnest()} would. At the moment, the only such engine is +\code{smooth_quantile_reg()}.
+} +\examples{ +jhu <- covid_case_death_rates \%>\% + filter(time_value > "2021-11-01", geo_value \%in\% c("ak", "ca", "ny")) + +aheads <- 1:7 + +r <- epi_recipe(jhu) \%>\% + step_epi_lag(death_rate, lag = c(0, 7, 14)) \%>\% + step_epi_ahead(death_rate, ahead = aheads) \%>\% + step_epi_naomit() + +wf <- epi_workflow( + r, + smooth_quantile_reg( + quantile_levels = c(.05, .1, .25, .5, .75, .9, .95), + outcome_locations = aheads + ) +) \%>\% + fit(jhu) + +f <- frosting() \%>\% + layer_predict() \%>\% + layer_naomit() \%>\% + layer_unnest(.pred) + +wf1 <- wf \%>\% add_frosting(f) + +p <- forecast(wf1) +p } diff --git a/man/pivot_quantiles_longer.Rd b/man/pivot_quantiles_longer.Rd index b872d0c97..3c112b12f 100644 --- a/man/pivot_quantiles_longer.Rd +++ b/man/pivot_quantiles_longer.Rd @@ -19,9 +19,9 @@ can be selected for this operation.} An object of the same class as \code{.data}. } \description{ -A column that contains \code{quantile_pred} will be "lengthened" with -the quantile levels serving as 1 column and the values as another. If -multiple columns are selected, these will be prefixed with the column name. +Selected columns that contain \code{quantile_pred} will be "lengthened" with the +\code{quantile_level}s in one column and the \code{value}s in another. If multiple +columns are selected, these will be prefixed with the column name. } \examples{ d1 <- quantile_pred(rbind(1:3, 2:4), 1:3 / 4) diff --git a/man/pivot_quantiles_wider.Rd b/man/pivot_quantiles_wider.Rd index 01f85a70a..bba08e292 100644 --- a/man/pivot_quantiles_wider.Rd +++ b/man/pivot_quantiles_wider.Rd @@ -19,10 +19,10 @@ can be selected for this operation.} An object of the same class as \code{.data} } \description{ -Any selected columns that contain \code{quantile_pred} will be "widened" with -the "taus" (quantile) serving as names and the values in the data frame. -When pivoting multiple columns, the original column name will be used as -a prefix. +Any selected columns that contain \code{quantile_pred} will be "widened" with the +"taus" (quantile) serving as column names and the values in the corresponding +column. When pivoting multiple columns, the original column name will be +used as a prefix. } \examples{ d1 <- quantile_pred(rbind(1:3, 2:4), 1:3 / 4) diff --git a/man/predict-epi_workflow.Rd b/man/predict-epi_workflow.Rd index 0b605d556..c04097c47 100644 --- a/man/predict-epi_workflow.Rd +++ b/man/predict-epi_workflow.Rd @@ -47,22 +47,22 @@ time points at which the survival probability or hazard is estimated. } \value{ A data frame of model predictions, with as many rows as \code{new_data} has. -If \code{new_data} is an \code{epi_df} or a data frame with \code{time_value} or +If \code{new_data} is an \code{epi_df()} or a data frame with \code{time_value} or \code{geo_value} columns, then the result will have those as well. } \description{ -This is the \code{predict()} method for a fit epi_workflow object. The nice thing -about predicting from an epi_workflow is that it will: +This is the \code{predict()} method for a fit epi_workflow object. The 3 steps that this implements are: \itemize{ -\item Preprocess \code{new_data} using the preprocessing method specified when the -workflow was created and fit. This is accomplished using -\code{\link[hardhat:forge]{hardhat::forge()}}, which will apply any formula preprocessing or call -\code{\link[recipes:bake]{recipes::bake()}} if a recipe was supplied. 
-\item Call \code{\link[parsnip:predict.model_fit]{parsnip::predict.model_fit()}} for you using the underlying fit +\item Preprocessing \code{new_data} using the preprocessing method specified when the +epi_workflow was created and fit. This is accomplished using +\code{recipes::bake()} if a recipe was supplied. Note that this is a slightly +different \code{bake} operation than the one occurring during the fit. Any \code{step} +that has \code{skip = TRUE} isn't applied during prediction; for example in +\code{step_epi_naomit()}, \code{all_outcomes()} isn't \code{NA}-omitted, since doing so +would drop the exact \code{time_values} we are trying to predict. +\item Calling \code{parsnip::predict.model_fit()} for you using the underlying fit parsnip model. -\item Ensure that the returned object is an \link[epiprocess:epi_df]{epiprocess::epi_df} where -possible. Specifically, the output will have \code{time_value} and -\code{geo_value} columns as well as the prediction. +\item \code{slather()} any frosting that has been included in the \code{epi_workflow}. } } \examples{ diff --git a/man/roll_modular_multivec.Rd b/man/roll_modular_multivec.Rd index b01304cc6..83c32aa41 100644 --- a/man/roll_modular_multivec.Rd +++ b/man/roll_modular_multivec.Rd @@ -4,21 +4,24 @@ \alias{roll_modular_multivec} \title{group col by .idx values and sum windows around each .idx value} \usage{ -roll_modular_multivec(col, .idx, weights, aggr, window_size, modulus) +roll_modular_multivec(col, idx_in, weights, aggr, window_size, modulus) } \arguments{ -\item{col}{the list of values indexed by \code{.idx}} +\item{col}{the list of values indexed by \code{idx_in}} -\item{.idx}{the relevant periodic part of time value, e.g. the week number} +\item{idx_in}{the relevant periodic part of time value, e.g. the week number, +limited to the relevant range} -\item{weights}{how much to weigh each particular datapoint} +\item{weights}{how much to weigh each particular datapoint (also indexed by +\code{idx_in})} -\item{aggr}{the aggregation function, probably Quantile, mean or median} +\item{aggr}{the aggregation function, probably Quantile, mean, or median} \item{window_size}{the number of .idx entries before and after to include in the aggregation} -\item{modulus}{the maximum value of \code{.idx}} +\item{modulus}{the number of days/weeks/months in the year, not including any +leap days/weeks} } \description{ group col by .idx values and sum windows around each .idx value diff --git a/man/slather.Rd b/man/slather.Rd index dd556b629..3219d0e32 100644 --- a/man/slather.Rd +++ b/man/slather.Rd @@ -7,7 +7,7 @@ slather(object, components, workflow, new_data, ...) } \arguments{ -\item{object}{a workflow with \code{frosting} postprocessing steps} +\item{object}{a workflow with \code{frosting} post-processing steps} \item{components}{a list of components containing model information. These will be updated and returned by the layer. These should be @@ -31,10 +31,14 @@ and predict on} \item{...}{additional arguments used by methods. Currently unused.} } \value{ -The \code{components} list. In the same format after applying any updates. +The \code{components} list, in the same format as before, after applying +any updates. } \description{ -Slathering frosting means to implement a postprocessing layer. When -creating a new postprocessing layer, you must implement an S3 method -for this function +Slathering frosting means to implement a post-processing layer. It is the +post-processing equivalent of \code{bake} for a recipe.
Given a layer, it applies +the actual transformation of that layer. When creating a new post-processing +layer, you must implement an S3 method for this function. Generally, you will +not need to call this function directly, as it will be used indirectly during +\code{predict}. } diff --git a/man/step_adjust_latency.Rd b/man/step_adjust_latency.Rd index 9e1bafbd5..c909452eb 100644 --- a/man/step_adjust_latency.Rd +++ b/man/step_adjust_latency.Rd @@ -107,17 +107,18 @@ demonstrate some of the subtleties, let's consider a toy dataset: as_epi_df(as_of = as.Date("2015-01-14")) }\if{html}{\out{}} -If we're looking to predict the value on the 15th, forecasting from the 14th (the \code{as_of} date above), -there are two issues we will need to address: +If we're looking to predict the value on the 15th, forecasting from the 14th +(the \code{as_of} date above), there are two issues we will need to address: \enumerate{ \item \code{"ca"} is latent by 2 days, whereas \code{"ma"} is latent by 1 -\item if we want to use \code{b} as an exogenous variable, for \code{"ma"} it is latent by 3 days instead of just 1. +\item if we want to use \code{b} as an exogenous variable, for \code{"ma"} it is latent by +3 days instead of just 1. } -Regardless of \code{method}, \code{epi_keys_checked="geo_value"} guarantees that the -difference between \code{"ma"} and \code{"ca"} is accounted for by making the -latency adjustment at least 2. For some comparison, here's what the various -methods will do: +Regardless of \code{method}, \code{epi_keys_checked="geo_value"} guarantees that the +difference between \code{"ma"} and \code{"ca"} is accounted for by making the latency +adjustment at least 2. For some comparison, here's what the various methods +will do: \subsection{\code{locf}}{ Short for "last observation carried forward", \code{locf} assumes that every day @@ -267,8 +268,8 @@ while this will not: \if{html}{\out{
}}\preformatted{toy_recipe <- epi_recipe(toy_df) \%>\% step_epi_lag(a, lag=0) \%>\% step_adjust_latency(a, method = "extend_lags") -#> Warning: If `method` is "extend_lags" or "locf", then the previous `step_epi_lag`s won't work with -#> modified data. +#> Warning: If `method` is "extend_lags" or "locf", then the previous +#> `step_epi_lag`s won't work with modified data. }\if{html}{\out{
}} If you create columns that you then apply lags to (such as diff --git a/man/step_epi_naomit.Rd b/man/step_epi_naomit.Rd index faf7484da..aa9208d89 100644 --- a/man/step_epi_naomit.Rd +++ b/man/step_epi_naomit.Rd @@ -10,10 +10,15 @@ step_epi_naomit(recipe) \item{recipe}{Recipe to be used for omission steps} } \value{ -Omits NA's from both predictors and outcomes at training time -to fit the model. Also only omits associated predictors and not -outcomes at prediction time due to lack of response and avoidance -of data loss. +Omits NA's from both predictors and outcomes at training time to fit +the model. Also only omits associated predictors and not outcomes at +prediction time due to lack of response and avoidance of data loss. Given a +\code{recipe}, this step is literally equivalent to + +\if{html}{\out{
}}\preformatted{ recipe \%>\% + recipes::step_naomit(all_predictors(), skip = FALSE) \%>\% + recipes::step_naomit(all_outcomes(), skip = TRUE) +}\if{html}{\out{
}} } \description{ Unified NA omission wrapper function for recipes } diff --git a/man/step_epi_shift.Rd b/man/step_epi_shift.Rd index 867410360..3291d95dd 100644 --- a/man/step_epi_shift.Rd +++ b/man/step_epi_shift.Rd @@ -61,16 +61,21 @@ sequence of any existing operations. } \description{ \code{step_epi_lag} and \code{step_epi_ahead} create a \emph{specification} of a recipe step -that will add new columns of shifted data. The former will created a lag -column, while the latter will create a lead column. Shifted data will -by default include NA values where the shift was induced. -These can be properly removed with \code{\link[=step_epi_naomit]{step_epi_naomit()}}, or you may -specify an alternative filler value with the \code{default} -argument. +that will add new columns of shifted data. The \code{step_epi_lag} will create +a lagged \code{predictor} column, while \code{step_epi_ahead} will create a leading +\code{outcome} column. Shifted data will by default include NA values where the +shift was induced. These can be properly removed with \code{\link[=step_epi_naomit]{step_epi_naomit()}}, +or you may specify an alternative filler value with the \code{default} argument. } \details{ -The step assumes that the data are already \emph{in the proper sequential -order} for shifting. +The step assumes that the data's \code{time_value} column is already \emph{in +the proper sequential order} for shifting. + +Our \code{lag/ahead} functions respect the \code{geo_value} and \code{other_keys} of the +\code{epi_df}, and allow for discontiguous \code{time_value}s. Both of these features +are noticeably lacking from \code{recipes::step_lag()}. +Our \code{lag/ahead} functions also appropriately adjust the amount of data to +avoid accidentally dropping recent predictors from the test data. The \code{prefix} and \code{id} arguments are unchangeable to ensure that the code runs properly and to avoid inconsistency with naming. For \code{step_epi_ahead}, they diff --git a/man/step_epi_slide.Rd b/man/step_epi_slide.Rd index 4e7ba8bed..1a104c1bc 100644 --- a/man/step_epi_slide.Rd +++ b/man/step_epi_slide.Rd @@ -75,9 +75,10 @@ An updated version of \code{recipe} with the new step added to the sequence of any existing operations. } \description{ -\code{step_epi_slide()} creates a \emph{specification} of a recipe step -that will generate one or more new columns of derived data by "sliding" -a computation along existing data. +\code{step_epi_slide()} creates a \emph{specification} of a recipe step that will +generate one or more new columns of derived data by "sliding" a computation +along existing data. This is a wrapper around \code{epiprocess::epi_slide()} +to allow its use within an \code{epi_recipe()}. } \examples{ jhu <- covid_case_death_rates \%>\% diff --git a/man/step_growth_rate.Rd b/man/step_growth_rate.Rd index e76da1100..999d818ed 100644 --- a/man/step_growth_rate.Rd +++ b/man/step_growth_rate.Rd @@ -68,8 +68,9 @@ An updated version of \code{recipe} with the new step added to the sequence of any existing operations. } \description{ -\code{step_growth_rate()} creates a \emph{specification} of a recipe step -that will generate one or more new columns of derived data. +\code{step_growth_rate()} creates a \emph{specification} of a recipe step that will +generate one or more new columns of derived data. This is a wrapper around +\code{epiprocess::growth_rate()} to allow its use within an \code{epi_recipe()}.
} \examples{ tiny_geos <- c("as", "mp", "vi", "gu", "pr") diff --git a/man/step_lag_difference.Rd b/man/step_lag_difference.Rd index 6151bee84..325bcf05c 100644 --- a/man/step_lag_difference.Rd +++ b/man/step_lag_difference.Rd @@ -43,8 +43,16 @@ An updated version of \code{recipe} with the new step added to the sequence of any existing operations. } \description{ -\code{step_lag_difference()} creates a \emph{specification} of a recipe step -that will generate one or more new columns of derived data. +\code{step_lag_difference()} creates a \emph{specification} of a recipe step that will +generate one or more new columns of derived data. For each column in the +specification, \code{step_lag_difference()} will calculate the difference +between the values at a distance of \code{horizon}. For example, with +\code{horizon=1}, this would simply be the difference between adjacent days. +} +\details{ +Much like \code{step_epi_lag()} this step works with the actual time values (so if +there are gaps it will fill with \code{NA} values), and respects the grouping +inherent in the \code{epi_df()} as specified by \code{geo_value} and \code{other_keys}. } \examples{ r <- epi_recipe(covid_case_death_rates) \%>\% diff --git a/man/step_population_scaling.Rd b/man/step_population_scaling.Rd index 427e896ed..7340762ac 100644 --- a/man/step_population_scaling.Rd +++ b/man/step_population_scaling.Rd @@ -26,12 +26,13 @@ sequence of operations for this recipe.} for this step. See \code{\link[recipes:selections]{recipes::selections()}} for more details.} \item{role}{For model terms created by this step, what analysis role should -they be assigned? \code{lag} is default a predictor while \code{ahead} is an outcome.} +they be assigned?} -\item{df}{a data frame that contains the population data to be used for -inverting the existing scaling.} +\item{df}{a data frame containing the scaling data (such as population). The +target column is divided by the value in \code{df_pop_col}.} -\item{by}{A (possibly named) character vector of variables to join by. +\item{by}{A (possibly named) character vector of variables to join \code{df} onto +the \code{epi_df} by. If \code{NULL}, the default, the function will try to infer a reasonable set of columns. First, it will try to join by all variables in the training/test @@ -61,7 +62,7 @@ This should be one column.} Adjustments can be made here. For example, if the original scale is "per 100K", then set \code{rate_rescaling = 1e5} to get rates.} -\item{create_new}{TRUE to create a new column and keep the original column +\item{create_new}{\code{TRUE} to create a new column and keep the original column in the \code{epi_df}} \item{suffix}{a character. The suffix added to the column name if @@ -80,16 +81,15 @@ the computations for subsequent operations.} Scales raw data by the population } \description{ -\code{step_population_scaling} creates a specification of a recipe step -that will perform per-capita scaling. Typical usage would -load a dataset that contains state-level population, and use it to convert -predictions made from a raw scale model to rate-scale by dividing by -the population. -Although, it is worth noting that there is nothing special about "population". -The function can be used to scale by any variable. Population is the -standard use case in the epidemiology forecasting scenario. Any value -passed will \emph{divide} the selected variables while the \code{rate_rescaling} -argument is a common \emph{multiplier} of the selected variables. 
+\code{step_population_scaling()} creates a specification of a recipe step that +will perform per-capita scaling. Typical usage would set \code{df} to be a dataset +that contains state-level population, and use it to convert predictions made +from a raw scale model to rate-scale by dividing by the population. +It is worth noting, though, that there is nothing special about +"population", and the function can be used to scale by any variable. +Population is the standard use case in the epidemiology forecasting scenario. +Any value passed will \emph{divide} the selected variables while the +\code{rate_rescaling} argument is a common \emph{multiplier} of the selected variables. } \examples{ jhu <- cases_deaths_subset \%>\% diff --git a/man/step_training_window.Rd b/man/step_training_window.Rd index 42f6b9a95..86c9a2952 100644 --- a/man/step_training_window.Rd +++ b/man/step_training_window.Rd @@ -40,8 +40,12 @@ observations in \code{time_value} per group, where the groups are formed based on the remaining \code{epi_keys}. } \details{ -Note that \code{step_epi_lead()} and \code{step_epi_lag()} should come -after any filtering step. +It is recommended to do this after any \code{step_epi_ahead()}, +\code{step_epi_lag()}, or \code{step_epi_naomit()} steps. If \code{step_training_window()} +happens first, there will be fewer than \code{n_training} remaining examples, +since either leading or lagging will introduce \code{NA}'s later removed by +\code{step_epi_naomit()}. Typical usage will have this function applied after +every other step. } \examples{ tib <- tibble( diff --git a/tests/testthat/test-population_scaling.R b/tests/testthat/test-population_scaling.R index f2efde3c0..f31473f96 100644 --- a/tests/testthat/test-population_scaling.R +++ b/tests/testthat/test-population_scaling.R @@ -88,8 +88,8 @@ expect_equal(ncol(b), 5L) }) -## Postprocessing -test_that("Postprocessing workflow works and values correct", { +## Post-processing +test_that("Post-processing workflow works and values correct", { jhu <- epidatasets::cases_deaths_subset %>% dplyr::filter(time_value > "2021-11-01", geo_value %in% c("ca", "ny")) %>% dplyr::select(geo_value, time_value, cases) @@ -149,7 +149,7 @@ expect_equal(p$.pred_scaled, p$.pred * c(2, 3)) }) -test_that("Postprocessing to get cases from case rate", { +test_that("Post-processing to get cases from case rate", { jhu <- covid_case_death_rates %>% dplyr::filter(time_value > "2021-11-01", geo_value %in% c("ca", "ny")) %>% dplyr::select(geo_value, time_value, case_rate) diff --git a/tests/testthat/test-step_adjust_latency.R b/tests/testthat/test-step_adjust_latency.R index 80e31dc17..99709ecaf 100644 --- a/tests/testthat/test-step_adjust_latency.R +++ b/tests/testthat/test-step_adjust_latency.R @@ -1,6 +1,6 @@ library(dplyr) # Test ideas that were dropped: -# - "epi_adjust_latency works correctly when there's gaps in the timeseries" +# - "epi_adjust_latency works correctly when there's gaps in the time-series" # - "epi_adjust_latency extend_ahead uses the same adjustment when predicting on new data after being baked" # - "`step_adjust_latency` only allows one instance of itself" # - "data with epi_df shorn off works" diff --git a/tests/testthat/test-step_climate.R b/tests/testthat/test-step_climate.R index 3dca9ec67..231908b98 100644 --- a/tests/testthat/test-step_climate.R +++ b/tests/testthat/test-step_climate.R @@ -110,7
+110,7 @@ test_that("prep/bake steps create the correct training data with an incomplete y r <- epi_recipe(x) %>% step_climate(y, time_type = "epiweek") p <- prep(r, x) - expected_res <- tibble(.idx = c(1:44, 999), climate_y = c(2, 3, 3, 4:25, 25, 25, 25:12, 12, 11, 11, 10)) + expected_res <- tibble(.idx = c(1:44, 999), climate_y = c(2, 3, 3, 4:25, 25, 25, 25:12, 12, 11, 11, 2)) expect_equal(p$steps[[1]]$climate_table, expected_res) b <- bake(p, new_data = NULL) diff --git a/vignettes/articles/smooth-qr.Rmd b/vignettes/articles/smooth-qr.Rmd deleted file mode 100644 index ec07272aa..000000000 --- a/vignettes/articles/smooth-qr.Rmd +++ /dev/null @@ -1,544 +0,0 @@ ---- -title: "Smooth quantile regression" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{Smooth quantile regression} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r setup, include = FALSE} -knitr::opts_chunk$set( - collapse = FALSE, - comment = "#>", - warning = FALSE, - message = FALSE, - out.width = "100%" -) -``` - -# Introducing smooth quantile regression - -Whereas other time-series forecasting examples in this package have used -(direct) models for single horizons, in multi-period forecasting, the goal is to -(directly) forecast several horizons simultaneously. This is useful in -epidemiological applications where decisions are based on the trend of a signal. - -The idea underlying smooth quantile regression is that the set of forecast targets can -be approximated by a smooth curve. This novel approach from -[Tuzhilina et al., 2022](https://arxiv.org/abs/2202.09723) -enforces smoothness across the -horizons and can be applied to point estimation by regression or interval -prediction by quantile regression. Our focus in this vignette is the latter. - -# Built-in function for smooth quantile regression and its parameters - -The built-in smooth quantile regression function, `smooth_quantile_reg()`, -provides a model specification for smooth quantile regression that works under -the tidymodels framework. It has the following parameters and default values: - -```{r, eval = FALSE} -smooth_quantile_reg( - mode = "regression", - engine = "smoothqr", - outcome_locations = NULL, - quantile_levels = 0.5, - degree = 3L -) ``` - -For smooth quantile regression, the type of model or `mode` is regression. - -The only `engine` that is currently supported is `smooth_qr()` from the -[`smoothqr` package](https://dajmcdon.github.io/smoothqr/). - -The `outcome_locations` indicate the multiple horizon (i.e., ahead) values. These -should be specified by the user. - -The `quantile_levels` parameter is a vector of values that indicates the -quantiles to be estimated. The default is the median (0.5 quantile). - -The `degree` parameter indicates the degree of the polynomials used for -smoothing of the response. It should be no more than the number of aheads. If -the degree is precisely equal to the number of aheads, then there is no -smoothing. To better understand this parameter and how it works, we should look -to its origins and how it is used in the model. - -# Model form - -Smooth quantile regression is linear auto-regressive, with the key feature being -a transformation that forces the coefficients to satisfy a smoothing constraint. -The purpose of this is for each model coefficient to be a smooth function of -ahead values, and so each such coefficient is set to be a linear combination of -smooth basis functions (such as a spline or a polynomial).
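To make the smoothing constraint concrete, consider a small worked instance (a sketch assuming a plain monomial basis; the engine's actual basis construction may differ, e.g. it may use an orthogonalized polynomial basis). With $h_i(a) = a^{i-1}$, each lag-$l$ coefficient becomes

$$
\beta_l(a) = \sum_{i=1}^{d} \theta_{il} h_i(a) = \theta_{1l} + \theta_{2l} a + \dots + \theta_{dl} a^{d-1},
$$

so choosing $d = 3$ forces every coefficient to trace a quadratic in the ahead value $a$: this is exactly the smoothness being enforced across horizons.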
- -The `degree` parameter controls the number of these polynomials used. It should -be no greater than the number of responses. This is a tuning parameter, and so -it can be chosen by performing a grid search with cross-validation. Intuitively, -$d = 1$ corresponds to the constant model, $d = 2$ gives straight line -forecasts, while $d = 3$ gives quadratic forecasts. Since a degree of 3 was -found to work well in the tested applications (see Section 9 of -[Tuzhilina et al., 2022](https://arxiv.org/abs/2202.09723)), -it is the default value. - -# Demonstration of smooth quantile regression - -```{r, message = FALSE} -library(epipredict) -library(dplyr) -library(purrr) -library(ggplot2) -theme_set(theme_bw()) -``` - -We will now apply smooth quantile regression on the real data used for COVID-19 -forecasting. The built-in dataset we will use is a subset of JHU daily data on -state cases and deaths. This sample data ranges from Dec. 31, 2020 to -Dec. 31, 2021. - -```{r} -edf <- epidatasets::covid_case_death_rates -``` - -We will set the forecast date to be November 30, 2021 so that we can produce -forecasts for target dates of 1 to 28 days ahead. We construct our test data, -`tedf` from the days beyond this. - -```{r} -fd <- as.Date("2021-11-30") - -tedf <- edf %>% filter(time_value >= fd) -``` - -We will use the most recent 3 months worth of data up to the forecast date for -training. - -```{r} -edf <- edf %>% filter(time_value < fd, time_value >= fd - 90L) -``` - -And for plotting our focus will be on a subset of two states - California and -Utah. - -```{r} -geos <- c("ut", "ca") -``` - -Suppose that our goal with this data is to predict COVID-19 death rates at -several horizons for each state. On day $t$, we want to predict new deaths $y$ -that are $a = 1,\dots, 28$ days ahead at locations $j$ using the death rates -from today, 1 week ago, and 2 weeks ago. So for each location, we'll predict the -median (0.5 quantile) for each of the target dates by using -$$ -\hat{y}_{j}(t+a) = \alpha(a) + \sum_{l = 0}^2 \beta_{l}(a) y_{j}(t - 7l) -$$ -where $\beta_{l}(a) = \sum_{i=1}^d \theta_{il} h_i(a)$ is the smoothing -constraint where ${h_1(a), \dots, h_d(a)}$ are the set of smooth basis functions -and $d$ is a hyperparameter that manages the flexibility of $\beta_{l}(a)$. -Remember that the goal is to have each $\beta_{l}(a)$ be a smooth function of -the aheads and that is achieved through imposing the smoothing constraint. - -Note that this model is intended to be simple and straightforward. Our only -modification to this model is to add case rates as another predictive feature -(we will leave it to the reader to incorporate additional features beyond this -and the historical response values). We can update the basic model to incorporate -the $k = 2$ predictive features of case and death rates for each location $j$, -$x_j(t) = (x_{j1}(t), x_{j2}(t))$ as follows: - -$$ -\hat{y}_{j}(t+a) = \alpha(a) + \sum_{k = 1}^2 \sum_{l = 0}^2 \beta_{kl}(a) x_{jk}(t - 7l) -$$ -where $\beta_{kl}(a) = \sum_{i=1}^d \theta_{ikl} h_i(a)$. - -Now, we will create our own forecaster from scratch by building up an -`epi_workflow` (there is no canned forecaster that is currently available). -Building our own forecaster allows for customization and control over the -pre-processing and post-processing actions we wish to take. - -The pre-processing steps we take in our `epi_recipe` are simply to lag the -predictor (by 0, 7, and 14 days) and lead the response by the multiple aheads -specified by the function user.
- -The post-processing layers we add to our `frosting` are nearly as simple. We -first predict, unnest the prediction list-cols, omit NAs from them, and enforce -that they are greater than 0. - -The third component of an `epi_workflow`, the model, is smooth quantile -regression, which has three main arguments - the quantiles, aheads, and degree. - -After creating our `epi_workflow` with these components, we get our test data -based on the longest lag period and make the predictions. - -We input our forecaster into a function for ease of use. - -```{r} -smooth_fc <- function(x, aheads = 1:28, degree = 3L, quantiles = 0.5, fd) { - rec <- epi_recipe(x) %>% - step_epi_lag(case_rate, lag = c(0, 7, 14)) %>% - step_epi_lag(death_rate, lag = c(0, 7, 14)) %>% - step_epi_ahead(death_rate, ahead = aheads) - - f <- frosting() %>% - layer_predict() %>% - layer_unnest(.pred) %>% - layer_naomit(distn) %>% - layer_add_forecast_date() %>% - layer_threshold(distn) - - ee <- smooth_quantile_reg( - quantile_levels = quantiles, - outcome_locations = aheads, - degree = degree - ) - - ewf <- epi_workflow(rec, ee, f) - - the_fit <- ewf %>% fit(x) - - latest <- get_test_data(rec, x) - - preds <- predict(the_fit, new_data = latest) %>% - mutate(forecast_date = fd, target_date = fd + ahead) %>% - select(geo_value, target_date, distn, ahead) %>% - pivot_quantiles_wider(distn) - - preds -} -``` - -```{r load-stored-preds, echo=FALSE} -smooth_preds_list <- readRDS("smooth-qr_smooth_preds_list.rds") -baseline_preds <- readRDS("smooth-qr_baseline_preds.rds") -smooth_preds <- smooth_preds_list %>% - filter(degree == 3L) %>% - select(geo_value:ahead, `0.5`) -``` - -Notice that we allow the function user to specify the aheads, degree, and -quantile as they may want to change these parameter values. We also allow for -input of the forecast date as we fixed that at the onset of this demonstration. - -We now can produce smooth quantile regression predictions for our problem: - -```{r, eval = FALSE} -smooth_preds <- smooth_fc(edf, fd = fd) -smooth_preds -``` - -```{r, echo=FALSE} -smooth_preds -smooth_preds <- smooth_preds_list %>% - filter(degree == 3L) %>% - select(-degree) -``` - - -Most often, we're not going to want to limit ourselves to just predicting the -median value as there is uncertainty about the predictions, so let's try to -predict several different quantiles in addition to the median: - -```{r, eval = FALSE} -several_quantiles <- c(.1, .25, .5, .75, .9) -smooth_preds <- smooth_fc(edf, quantiles = several_quantiles, fd = fd) -smooth_preds -``` - -```{r, echo = FALSE} -several_quantiles <- c(.1, .25, .5, .75, .9) -smooth_preds -``` - -We can see that we have different columns for the different quantile -predictions. - -Let's visualize these results for the sample of two states. We will create a -simple plotting function, under which the median predictions are an orange line -and the surrounding quantiles are blue bands around this. For comparison, we -will include the actual values over time as a black line.
- -```{r} -plot_preds <- function(preds, geos_to_plot = NULL, train_test_dat, fd) { - if (!is.null(geos_to_plot)) { - preds <- preds %>% filter(geo_value %in% geos_to_plot) - train_test_dat <- train_test_dat %>% filter(geo_value %in% geos_to_plot) - } - - ggplot(preds) + - geom_ribbon(aes(target_date, ymin = `0.1`, ymax = `0.9`), - fill = "cornflowerblue", alpha = .8 - ) + - geom_ribbon(aes(target_date, ymin = `0.25`, ymax = `0.75`), - fill = "#00488E", alpha = .8 - ) + - geom_line(data = train_test_dat, aes(time_value, death_rate)) + - geom_line(aes(target_date, `0.5`), color = "orange") + - geom_vline(xintercept = fd) + - facet_wrap(~geo_value) + - scale_x_date(name = "", date_labels = "%b %Y", date_breaks = "2 months") + - ylab("Deaths per 100K inhabitants") -} -``` - -Since we would like to plot the actual death rates for these states over time, -we bind the training and testing data together and input this into our plotting -function as follows: - -```{r, warning = FALSE} -plot_preds(smooth_preds, geos, bind_rows(tedf, edf), fd) -``` - -We can see that the predictions are smooth curves for each state, as expected -when using smooth quantile regression. In addition, while the curvature of the -forecasts matches that of the truth, the forecasts do not look remarkably -accurate. - -## Varying the degrees parameter - -We can test the impact of different degrees by using the `map()` function. -Noting that this may take some time to run, let's try out all degrees from 1 -to 7: - -```{r, eval = FALSE} -smooth_preds_list <- map(1:7, function(x) { - smooth_fc( - edf, - degree = x, - quantiles = c(.1, .25, .5, .75, .9), - fd = fd - ) %>% - mutate(degree = x) -}) %>% list_rbind() -``` - -One way to quantify the impact of these on the forecasting is to look at the -mean absolute error (MAE) or mean squared error (MSE) over the degrees. We can -select the degree that results in the lowest MAE. - -Since the MAE compares the predicted values to the actual values, we will first -join the test data to the predicted data for our comparisons: -```{r, message = FALSE} -tedf_sub <- tedf %>% - rename(target_date = time_value, actual = death_rate) %>% - select(geo_value, target_date, actual) -``` - -And then compute the MAE for each of the degrees: -```{r, message = FALSE} -smooth_preds_df_deg <- smooth_preds_list %>% - left_join(tedf_sub, by = c("geo_value", "target_date")) %>% - group_by(degree) %>% - mutate(error = abs(`0.5` - actual)) %>% - summarise(mean = mean(error)) - -# Arrange the MAE from smallest to largest -smooth_preds_df_deg %>% arrange(mean) -``` - -Instead of just looking at the raw numbers, let's create a simple line plot to -visualize how the MAE changes over degrees for this data: - -```{r} -ggplot(smooth_preds_df_deg, aes(degree, mean)) + - geom_line() + - xlab("Degrees of freedom") + - ylab("Mean MAE") -``` - -We can see that the degree that results in the lowest MAE is 3. Hence, we could -pick this degree for future forecasting work on this data. - -## A brief comparison between smoothing and no smoothing - -Now, we will briefly compare the results from using smooth quantile regression -to those obtained without smoothing. The latter approach amounts to ordinary -quantile regression to get predictions for the intended target date. The main -drawback is that it ignores the fact that the responses all represent the same -signal, just for different ahead values.
In contrast, the smooth quantile -regression approach utilizes this information about the data structure - the -fact that the aheads are not independent of each other, but are -naturally related over time by a smooth curve. - -To get the basic quantile regression results we can utilize the forecaster that -we've already built. We can simply set the degree to be the number of ahead -values to re-run the code without smoothing. - -```{r, eval = FALSE} -baseline_preds <- smooth_fc( - edf, - degree = 28L, quantiles = several_quantiles, fd = fd -) -``` - -And we can produce the corresponding plot to inspect the predictions obtained -under the baseline model: - -```{r, warning = FALSE} -plot_preds(baseline_preds, geos, bind_rows(tedf, edf), fd) -``` - -Unlike for smooth quantile regression, the resulting forecasts are not smooth -curves, but rather jagged and irregular in shape. - -For a more formal comparison between the two approaches, we could compare the -test performance in terms of accuracy through calculating either the MAE or -MSE, where the performance measure of choice is calculated over all -times and locations for each ahead value: - -```{r, message = FALSE} -baseline_preds_mae_df <- baseline_preds %>% - left_join(tedf_sub, by = c("geo_value", "target_date")) %>% - group_by(ahead) %>% - mutate(error = abs(`0.5` - actual)) %>% - summarise(mean = mean(error)) %>% - mutate(type = "baseline") - -smooth_preds_mae_df <- smooth_preds %>% - left_join(tedf_sub, by = c("geo_value", "target_date")) %>% - group_by(ahead) %>% - mutate(error = abs(`0.5` - actual)) %>% - summarise(mean = mean(error)) %>% - mutate(type = "smooth") - -preds_mae_df <- bind_rows(baseline_preds_mae_df, smooth_preds_mae_df) - -ggplot(preds_mae_df, aes(ahead, mean, color = type)) + - geom_line() + - xlab("Ahead") + - ylab("Mean MAE") + - scale_color_manual(values = c("darkred", "#063970"), name = "") -``` - -or over all aheads, times, and locations for a single numerical summary. - -```{r} -mean(baseline_preds_mae_df$mean) -mean(smooth_preds_mae_df$mean) -``` - -The former shows that forecasts for the immediate future and for the distant -future are more inaccurate for both models under consideration. The latter shows -that the smooth quantile regression model and baseline models perform very -similarly overall, with the smooth quantile regression model only slightly -beating the baseline model in terms of overall average MAE. - -One other commonly used metric is the Weighted Interval Score -(WIS, [Bracher et al., 2021](https://arxiv.org/pdf/2005.12881.pdf)), -which is a scoring rule based on the population quantiles. The point is to -score the interval, whereas MAE only evaluates the accuracy of the point -forecast. - -Let $F$ be a forecast composed of predicted quantiles $q_{\tau}$ for the set of -quantile levels $\tau$. Then, in terms of the predicted quantiles, the WIS for -target variable $Y$ is represented as follows -([McDonald et al., 2021](https://www.pnas.org/doi/full/10.1073/pnas.2111453118)): - -$$ -WIS(F, Y) = 2 \sum_{\tau} \phi_{\tau} (Y - q_{\tau}) -$$ -where $\phi_{\tau}(x) = \tau |x|$ for $x \geq 0$ -and $\phi_{\tau}(x) = (1 - \tau) |x|$ for $x < 0$. - -This form is general as it can accommodate both symmetric and asymmetric -quantile levels.
If the quantile levels are symmetric, then we can alternatively -express the WIS as a collection of central prediction intervals -($\ell_{\alpha}, u_{\alpha}$) parametrized by the exclusion probability -$\alpha$: - -$$ -WIS(F, Y) = \sum_{\alpha} \{ (u_{\alpha} - \ell_{\alpha}) + 2 \cdot \text{dist}(Y, [\ell_{\alpha}, u_{\alpha}]) \} -$$ -where $\text{dist}(a,S)$ is the smallest distance between point $a$ and an -element of set $S$. - -While we implement the former representation, we mention this form because it -shows that the score can be decomposed into the addition of a sharpness -component (first term in the summand) and an under/overprediction component -(second term in the summand). This alternative representation is useful because, -from it, we more easily see the major limitation of the WIS, which is that the -score tends to prioritize sharpness (how wide the interval is) relative to -coverage (whether the interval contains the truth). - -Now, we write a simple function for the first representation of the score that -is compatible with the latest version of `epipredict` (adapted from the -corresponding function in -[smoothmpf-epipredict](https://github.com/dajmcdon/smoothmpf-epipredict)). The -inputs for it are the actual and predicted values and the quantile levels. - -```{r} -wis_dist_quantile <- function(actual, values, quantile_levels) { - 2 * mean(pmax( - quantile_levels * (actual - values), - (1 - quantile_levels) * (values - actual), - na.rm = TRUE - )) -} -``` - -Next, we apply the `wis_dist_quantile` function to get a WIS score for each -state on each target date. We then compute the mean WIS for each ahead value -over all of the states. The results for each of the smooth and baseline -forecasters are shown in a similar style line plot as we chose for MAE: - -```{r} -smooth_preds_wis_df <- smooth_preds %>% - left_join(tedf_sub, by = c("geo_value", "target_date")) %>% - rowwise() %>% - mutate(wis = wis_dist_quantile( - actual, c(`0.1`, `0.25`, `0.5`, `0.75`, `0.9`), - several_quantiles - )) %>% - group_by(ahead) %>% - summarise(mean = mean(wis)) %>% - mutate(type = "smooth") - -baseline_preds_wis_df <- baseline_preds %>% - left_join(tedf_sub, by = c("geo_value", "target_date")) %>% - rowwise() %>% - mutate(wis = wis_dist_quantile( - actual, c(`0.1`, `0.25`, `0.5`, `0.75`, `0.9`), - several_quantiles - )) %>% - group_by(ahead) %>% - summarise(mean = mean(wis)) %>% - mutate(type = "baseline") - -preds_wis_df <- bind_rows(smooth_preds_wis_df, baseline_preds_wis_df) - -ggplot(preds_wis_df, aes(ahead, mean, color = type)) + - geom_line() + - xlab("Ahead") + - ylab("Mean WIS") + - scale_color_manual(values = c("darkred", "#063970"), name = "") -``` - -The results are consistent with what we saw for MAE: The forecasts for the near -and distant future tend to be inaccurate for both models. The smooth quantile -regression model only slightly outperforms the baseline model.
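As a quick, hand-checkable illustration of `wis_dist_quantile()` (a sketch using the function defined above, not part of the original analysis), consider a single actual value of 10 with predicted quantiles 8, 9, 10, 11, 12 at levels 0.1, 0.25, 0.5, 0.75, 0.9:

```r
# Per-quantile penalties pmax(tau * (y - q), (1 - tau) * (q - y)) are
# 0.2, 0.25, 0, 0.25, 0.2, so the score is 2 * mean(...) = 2 * 0.18 = 0.36
wis_dist_quantile(
  actual = 10,
  values = c(8, 9, 10, 11, 12),
  quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
)
#> [1] 0.36
```

Widening the outer quantiles (say, to 6 and 14) raises the score to 0.52, the sharpness penalty at work.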
- -Though averaging the WIS score over location and time tends to be the primary -aggregation scheme used in evaluation and model comparisons (see, for example, -[McDonald et al., 2021](https://www.pnas.org/doi/full/10.1073/pnas.2111453118)), -we can also obtain a single numerical summary by averaging over the aheads, -times, and locations: - -```{r} -mean(baseline_preds_wis_df$mean) -mean(smooth_preds_wis_df$mean) -``` - -Overall, both perspectives agree that the smooth quantile regression model tends -to perform only slightly better than the baseline model in terms of average WIS, -illustrating the difficulty of this forecasting problem. - -# What we've learned in a nutshell - -Smooth quantile regression is used in multi-period forecasting for predicting -several horizons simultaneously with a single smooth curve. It operates under -the key assumption that the future of the response can be approximated well by a -smooth curve. - -# Attribution - -The information presented on smooth quantile regression is from -[Tuzhilina et al., 2022](https://arxiv.org/abs/2202.09723). diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd deleted file mode 100644 index 1e2a6949a..000000000 --- a/vignettes/arx-classifier.Rmd +++ /dev/null @@ -1,274 +0,0 @@ ---- -title: "Auto-regressive classifier" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{Auto-regressive classifier} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r, include = FALSE} -source("_common.R") -``` - -## Load required packages - -```{r, message = FALSE, warning = FALSE} -library(dplyr) -library(purrr) -library(ggplot2) -library(epipredict) -``` - -## Introducing the ARX classifier - -The `arx_classifier()` is an autoregressive classification model for `epi_df` -data that is used to predict a discrete class for each case under consideration. -It is a direct forecaster in that it estimates the classes at a specific horizon -or ahead value. - -To get a sense of how the `arx_classifier()` works, let's consider a simple -example with minimal inputs. For this, we will use the built-in -`covid_case_death_rates` that contains confirmed COVID-19 cases and deaths from -JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take -a subset of data for five states over June 4, 2021 to December 31, 2021. Our -objective is to predict whether the case rates are increasing when considering -the 0, 7 and 14 day case rates: - -```{r} -jhu <- covid_case_death_rates %>% - filter( - time_value >= "2021-06-04", - time_value <= "2021-12-31", - geo_value %in% c("ca", "fl", "tx", "ny", "nj") - ) - -out <- arx_classifier(jhu, outcome = "case_rate", predictors = "case_rate") - -out$predictions -``` - -The key takeaway from the predictions is that there are two prediction classes: -(-Inf, 0.25] and (0.25, Inf). This is because for our goal of classification -the classes must be discrete. The discretization of the real-valued outcome is -controlled by the `breaks` argument, which defaults to 0.25. Such breaks will be -automatically extended to cover the entire real line. For example, the default -break of 0.25 is silently extended to breaks = c(-Inf, .25, Inf) and, therefore, -results in two classes: [-Inf, 0.25] and (0.25, Inf). These two classes are -used to discretize the outcome. The conversion of the outcome to such classes is -handled internally. 
So if discrete classes already exist for the outcome in the -`epi_df`, then we recommend to code a classifier from scratch using the -`epi_workflow` framework for more control. - -The `trainer` is a `parsnip` model describing the type of estimation such that -`mode = "classification"` is enforced. The two typical trainers that are used -are `parsnip::logistic_reg()` for two classes or `parsnip::multinom_reg()` for -more than two classes. - -```{r} -workflows::extract_spec_parsnip(out$epi_workflow) -``` - -From the parsnip model specification, we can see that the trainer used is -logistic regression, which is expected for our binary outcome. More complicated -trainers like `parsnip::naive_Bayes()` or `parsnip::rand_forest()` may also be -used (however, we will stick to the basics in this gentle introduction to the -classifier). - -If you use the default trainer of logistic regression for binary classification -and you decide against using the default break of 0.25, then you should only -input one break so that there are two classification bins to properly -dichotomize the outcome. For example, let's set a break of 0.5 instead of -relying on the default of 0.25. We can do this by passing 0.5 to the `breaks` -argument in `arx_class_args_list()` as follows: - -```{r} -out_break_0.5 <- arx_classifier( - jhu, - outcome = "case_rate", - predictors = "case_rate", - args_list = arx_class_args_list( - breaks = 0.5 - ) -) - -out_break_0.5$predictions -``` -Indeed, we can observe that the two `.pred_class` are now (-Inf, 0.5] and (0.5, -Inf). See `help(arx_class_args_list)` for other available modifications. - -Additional arguments that may be supplied to `arx_class_args_list()` include the -expected `lags` and `ahead` arguments for an autoregressive-type model. These -have default values of 0, 7, and 14 days for the lags of the predictors and 7 -days ahead of the forecast date for predicting the outcome. There is also -`n_training` to indicate the upper bound for the number of training rows per -key. If you would like some practice with using this, then remove the filtering -command to obtain data within "2021-06-04" and "2021-12-31" and instead set -`n_training` to be the number of days between these two dates, inclusive of the -end points. The end results should be the same. In addition to `n_training`, -there are `forecast_date` and `target_date` to specify the date that the -forecast is created and intended, respectively. We will not dwell on such -arguments here as they are not unique to this classifier or absolutely essential -to understanding how it operates. The remaining arguments will be discussed -organically, as they are needed to serve our purposes. For information on any -remaining arguments that are not discussed here, please see the function -documentation for a complete list and their definitions. - -## Example of using the ARX classifier - -Now, to demonstrate the power and utility of this built-in arx classifier, we -will loosely adapt the classification example that was written from scratch in -`vignette("preprocessing-and-models")`. However, to keep things simple and not -merely a direct translation, we will only consider two prediction categories and -leave the extension to three as an exercise for the reader. - -To motivate this example, a major use of autoregressive classification models is -to predict upswings or downswings like in hotspot prediction models to -anticipate the direction of the outcome (see [McDonald, Bien, Green, Hu, et al. 
-(2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118) for more on -these). In our case, one simple question that such models can help answer is... -Do we expect that the future will have increased case rates or not relative to -the present? - -To answer this question, we can create a predictive model for upswings and -downswings of case rates rather than the raw case rates themselves. For this -situation, our target is to predict whether there is an increase in case rates -or not. Following -[McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118), -we look at the -relative change between $Y_{l,t}$ and $Y_{l, t+a}$, where the former is the case -rate at location $l$ at time $t$ and the latter is the rate for that location at -time $t+a$. Using these variables, we define a categorical response variable -with two classes - -$$\begin{align} -Z_{l,t} = \left\{\begin{matrix} -\text{up,} & \text{if } Y_{l,t}^\Delta > 0.25\\ -\text{not up,} & \text{otherwise} -\end{matrix}\right. -\end{align}$$ -where $Y_{l,t}^\Delta = (Y_{l, t} - Y_{l, t-7}) / Y_{l, t-7}$. If $Y_{l,t}^\Delta > 0.25$, meaning that the number of new cases over the week has increased by over 25\%, then $Z_{l,t}$ is up. This is the criterion for location $l$ to be a hotspot at time $t$. On the other hand, if $Y_{l,t}^\Delta \leq 0.25$, then $Z_{l,t}$ is categorized as not up, meaning that there has not been a >25\% increase in the new cases over the past week. - -The logistic regression model we use to predict this binary response can be -considered to be a simplification of the multinomial regression model presented -in `vignette("preprocessing-and-models")`: - -$$\begin{align} -\pi_{\text{up}}(x) &= Pr(Z_{l, t} = \text{up}|x) = \frac{e^{g_{\text{up}}(x)}}{1 + e^{g_{\text{up}}(x)}}, \\ -\pi_{\text{not up}}(x)&= Pr(Z_{l, t} = \text{not up}|x) = 1 - Pr(Z_{l, t} = \text{up}|x) = \frac{1}{1 + e^{g_{\text{up}}(x)}} -\end{align}$$ -where - -$$ -g_{\text{up}}(x) = \log\left ( \frac{\Pr(Z_{l, t} = \text{up} \vert x)}{\Pr(Z_{l, t} = \text{not up} \vert x)} \right ) = \beta_{10} + \beta_{11}Y_{l,t}^\Delta + \beta_{12}Y_{l,t-7}^\Delta + \beta_{13}Y_{l,t-14}^\Delta. -$$ - -Now then, we will operate on the same subset of the `covid_case_death_rates` -that we used in our above example. This time, we will use it to investigate -whether the number of newly reported cases over the past 7 days has increased by -at least 25% compared to the preceding week for our sample of states. - -Notice that by using the `arx_classifier()` function we've completely eliminated -the need to manually categorize the response variable and implement -pre-processing steps, which was necessary in -`vignette("preprocessing-and-models")`. - -```{r} -log_res <- arx_classifier( - jhu, - outcome = "case_rate", - predictors = "case_rate", - args_list = arx_class_args_list( - breaks = 0.25 / 7 # division by 7 gives weekly not daily - ) -) - -log_res$epi_workflow -``` - -Comparing the pre-processing steps for this to those in the other vignette, we -can see that they are not precisely the same, but they cover the same essentials -of transforming `case_rate` to the growth rate scale (`step_growth_rate()`), -lagging the predictors (`step_epi_lag()`), leading the response -(`step_epi_ahead()`), which are both constructed from the growth rates, and -constructing the binary classification response variable (`step_mutate()`).
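For readers who want to try the from-scratch route mentioned above, here is a rough sketch of a hand-built version of the same pre-processing (the step order follows the list in the preceding paragraph, but the generated column name used in `step_mutate()` is an illustrative guess; inspect `log_res$epi_workflow` for the names `arx_classifier()` actually creates):

```r
r <- epi_recipe(jhu) %>%
  # move case_rate onto the growth-rate scale
  step_growth_rate(case_rate, horizon = 7) %>%
  # lag the growth-rate predictor by 0, 7, and 14 days
  step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) %>%
  # lead the growth-rate outcome by the forecast horizon
  step_epi_ahead(starts_with("gr_"), ahead = 7, role = "outcome") %>%
  # discretize the outcome into the two classes (column name is a guess)
  step_mutate(
    outcome_class = cut(ahead_7_gr_7_rel_change_case_rate,
                        breaks = c(-Inf, 0.25 / 7, Inf)),
    role = "outcome"
  ) %>%
  step_epi_naomit()
```

Pairing such a recipe with `parsnip::logistic_reg()` inside an `epi_workflow()` reproduces, in spirit, what the canned classifier assembles automatically.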
- -On this topic, it is important to understand that we are not actually concerned -about the case values themselves. Rather, we are concerned with whether the quantity -of cases in the future is a lot larger than that in the present. For this -reason, the outcome does not remain as cases, but rather it is transformed by -using either growth rates (as the predictors and outcome in our example are) or -lagged differences. While the latter is closer to the requirements for the -[2022-23 CDC Flusight Hospitalization Experimental Target](https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md), -and it is conceptually easy to understand because it is simply the change of the -value for the horizon, it is not the default. The default is `growth_rate`. One -reason for this choice is because the growth rate is on a rate scale, not on the -absolute scale, so it fosters comparability across locations without any -conscious effort (on the other hand, when using the `lag_difference` one would -need to take care to operate on rates per 100k and not raw counts). We utilize -`epiprocess::growth_rate()` to create the outcome using some of the additional -arguments. One important argument for the growth rate calculation is the -`method`. Only `rel_change` for relative change should be used as the method -because the test data is the only data that is accessible and the other methods -require access to the training data. - -The other optional arguments for controlling the growth rate calculation (that -can be inputted as `additional_gr_args`) can be found in the documentation for -`epiprocess::growth_rate()` and the related -`vignette("growth_rate", package = "epiprocess")`. - -### Visualizing the results - -To visualize the prediction classes across the states for the target date, we -can plot our results as a heatmap. However, if we were to plot the results for -only one target date, like our 7-day ahead predictions, then that would be a -pretty sad heatmap (which would look more like a bar chart than a heatmap)... So -instead of doing that, let's get predictions for several aheads and plot a -heatmap across the target dates. To get the predictions across several ahead -values, we will use the `map()` function in the same way that we did in other -vignettes: - -```{r} -multi_log_res <- map(1:40, ~ arx_classifier( - jhu, - outcome = "case_rate", - predictors = "case_rate", - args_list = arx_class_args_list( - breaks = 0.25 / 7, # division by 7 gives weekly not daily - ahead = .x - ) -)$predictions) %>% list_rbind() -``` - -We can plot the heatmap of the results over the aheads to see if there's -anything novel or interesting to take away: - -```{r} -ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) + - geom_tile() + - ylab("State") + - xlab("Target date") + - scale_fill_brewer(palette = "Set1") -``` - -While there is a bit of variability near the end, we can clearly see that -there are upswings for all states starting from the beginning of January 2022, -which we can recall was when there was a massive spike in cases for many states. -So our results seem to align well with what actually happened at the beginning -of January 2022. - -## A brief reflection - -The most noticeable benefit of using the `arx_classifier()` function is the -simplification and reduction of the manual implementation of the classifier from -about 30 down to 3 lines.
However, as we noted before, the trade-off for -simplicity is control over the precise pre-processing, post-processing, and -additional features embedded in the coding of a classifier. So the good thing is -that `epipredict` provides both: a built-in `arx_classifier()` and the means to -implement your own classifier from scratch by using the `epi_workflow` -framework. And which you choose will depend on the circumstances. Our advice is -to start with using the built-in classifier for ostensibly simple projects and -begin to implement your own when the modelling project takes a complicated turn. -To get some practice on coding up a classifier by hand, consider translating -this binary classification model example to an `epi_workflow`, akin to that in -`vignette("preprocessing-and-models")`. diff --git a/vignettes/backtesting.Rmd b/vignettes/backtesting.Rmd index 70d18fa30..6121a0499 100644 --- a/vignettes/backtesting.Rmd +++ b/vignettes/backtesting.Rmd @@ -8,10 +8,35 @@ vignette: > --- ```{r, include = FALSE} -source("_common.R") +source(here::here("vignettes/_common.R")) ``` +Backtesting is a crucial step in the development of forecasting models. It +involves testing the model on historical time periods to see how well it generalizes to new +data. + +In the context of +epidemiological forecasting, to do backtesting accurately, we need to account +for the fact that the data available at _the time of the forecast_ would have been +different from the data available at the time of the _backtest_. +This is because +new data is constantly being collected and added to the dataset, and old data is potentially revised. +Training and making +predictions only on finalized data can lead to overly optimistic estimates of accuracy +(see, for example, [McDonald et al. +(2021)](https://www.pnas.org/content/118/51/e2111453118/) and the references +therein). + +In the `{epiprocess}` package, we provide the function `epix_slide()` to help conveniently perform version-faithful forecasting by only using the data as +it would have been available at the forecast reference time. +In this vignette, we will demonstrate how to use `epix_slide()` to backtest an +auto-regressive forecaster constructed using `arx_forecaster()` on historical +COVID-19 case data from the US. + +# Getting case data from US states into an `epi_archive` + ```{r pkgs, message=FALSE} +# Setup library(epipredict) library(epiprocess) library(epidatr) @@ -23,51 +48,16 @@ library(magrittr) library(purrr) ``` -# Accurately backtesting forecasters - -Backtesting is a crucial step in the development of forecasting models. It -involves testing the model on historical data to see how well it performs. This -is important because it allows us to see how well the model generalizes to new -data and to identify any potential issues with the model. In the context of -epidemiological forecasting, to do backtesting accurately, we need to account -for the fact that the data available at the time of the forecast would have been -different from the data available at the time of the backtest. This is because -new data is constantly being collected and added to the dataset, which can -affect the accuracy of the forecast. - -For this reason, it is important to use version-aware forecasting, where the -model is trained on data that would have been available at the time of the -forecast.
This ensures that the model is tested on data that is as close as -possible to what would have been available in real-time; training and making -predictions on finalized data can lead to an overly optimistic sense of accuracy -(see, for example, [McDonald et al. -(2021)](https://www.pnas.org/content/118/51/e2111453118/) and the references -therein). - -In the `{epiprocess}` package, we provide `epix_slide()`, a function that allows -a convenient way to perform version-aware forecasting by only using the data as -it would have been available at forecast reference time. In -`vignette("epi_archive", package = "epiprocess")`, we introduced the concept of -an `epi_archive` and we demonstrated how to use `epix_slide()` to forecast the -future using a simple quantile regression model. In this vignette, we will -demonstrate how to use `epix_slide()` to backtest an auto-regressive forecaster -on historical COVID-19 case data from the US and Canada. Instead of building a -forecaster from scratch as we did in the previous vignette, we will use the -`arx_forecaster()` function from the `{epipredict}` package. - -## Getting case data from US states into an `epi_archive` - -First, we download the version history (i.e., archive) of the percentage of -doctor's visits with CLI (COVID-like illness) computed from medical insurance -claims and the number of new confirmed COVID-19 cases per 100,000 population -(daily) for 6 states from the COVIDcast API (as used in the `epiprocess` -vignette mentioned above). +First, we create an `epi_archive()` to store the version history of the +percentage of doctor's visits with CLI (COVID-like illness) computed from +medical insurance claims and the number of new confirmed COVID-19 cases per +100,000 population (daily) for 4 states: ```{r grab-epi-data} # Select the `percent_cli` column from the data archive -doctor_visits <- archive_cases_dv_subset$DT %>% - select(geo_value, time_value, version, percent_cli) %>% - drop_na(percent_cli) %>% +doctor_visits <- archive_cases_dv_subset$DT |> + select(geo_value, time_value, version, percent_cli) |> + tidyr::drop_na(percent_cli) |> as_epi_archive(compactify = TRUE) ``` @@ -84,384 +74,362 @@ doctor_visits <- pub_covidcast( geo_values = "ca,fl,ny,tx", time_values = epirange(20200601, 20211201), issues = epirange(20200601, 20211201) -) %>% - rename(version = issue, percent_cli = value) %>% +) |> + # The version date column is called `issue` in the Epidata API. Rename it. + rename(version = issue, percent_cli = value) |> as_epi_archive(compactify = TRUE) ``` -## Backtesting a simple autoregressive forecaster - -One of the most common use cases of `epiprocess::epi_archive()` object -is for accurate model backtesting. - -In this section we will: +In the interest of computational speed, we limit the dataset to 4 states and +2020-2021, but the full archive can be used in the same way and has performed +well in the past. -- develop a simple autoregressive forecaster that predicts the next value of the -signal based on the current and past values of the signal itself, and -- demonstrate how to slide this forecaster over the `epi_archive` object to -produce forecasts at a few dates, using version-unaware and -aware -computations, -- compare the two approaches. +We choose this dataset in particular partly because it is revision-heavy; for +example, here is a plot that compares monthly snapshots of the data.
-To start, let's use a simple autoregressive forecaster to predict the percentage -of doctor's hospital visits with CLI (COVID-like illness) (`percent_cli`) in the -future (we choose this target because of the dataset's pattern of substantial -revisions; forecasting doctor's visits is an unusual forecasting target -otherwise). While some AR models output single point forecasts, we will use -quantile regression to produce a point prediction along with a 90\% uncertainty -band, represented by predictive quantiles at the 5\% and 95\% levels (lower -and upper endpoints of the uncertainty band).
+ +Code for plotting +```{r plot_revision_example, warning = FALSE} +geo_choose <- "ca" +forecast_dates <- seq(from = as.Date("2020-08-01"), to = as.Date("2021-11-01"), by = "1 month") +percent_cli_data <- bind_rows( + # Snapshotted data for the version-faithful forecasts + map( + forecast_dates, + ~ doctor_visits |> + epix_as_of(.x) |> + mutate(version = .x) + ) |> + bind_rows() |> + mutate(version_faithful = "Version faithful"), + # Latest data for the version-un-faithful forecasts + doctor_visits |> + epix_as_of(doctor_visits$versions_end) |> + mutate(version_faithful = "Version un-faithful") +) - +p0 <- + ggplot(data = percent_cli_data |> filter(geo_value == geo_choose)) + + geom_vline(aes(color = factor(version), xintercept = version), lty = 2) + + geom_line( + aes(x = time_value, y = percent_cli, color = factor(version)), + inherit.aes = FALSE, na.rm = TRUE + ) + + scale_x_date(breaks = "2 months", date_labels = "%b %Y") + + scale_y_continuous(expand = expansion(c(0, 0.05))) + + labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") + + theme(legend.position = "none") +``` +
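Before reading the plot, it may help to see the snapshot semantics in isolation; the following is a minimal sketch (an illustration only, assuming the `doctor_visits` archive built above):

```r
# epix_as_of() returns the data exactly as it looked on the requested version date
snapshot <- doctor_visits |> epix_as_of(as.Date("2021-03-01"))
# the snapshot records the date it was taken...
attr(snapshot, "metadata")$as_of
# ...and contains no time_values after that date (typically the last one is a
# few days earlier, due to reporting latency)
max(snapshot$time_value)
```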
-- we specify the predicted outcome to be the percentage of doctor's visits with - CLI (`percent_cli`), -- we use a linear regression model as the engine, -- the autoregressive features assume lags of 0, 7, and 14 days, -- we forecast 7 days ahead. +```{r plot_just_revisioning, echo = FALSE, warning = FALSE, message = FALSE} +p0 +``` -All these default settings and more can be seen by calling `arx_args_list()`: +The snapshots are taken on the first of each month, with the vertical dashed +line representing the issue date for the time series of the corresponding +color. +For example, the snapshot on March 1st, 2021 is aquamarine, and increases to +slightly over 10. +Every series is necessarily to the left of the snapshot date (since all known +values must happen before the snapshot is taken[^4]). +The grey line overlaying the various snapshots represents the "final +value", which is just the snapshot at the last version in the archive (the +`versions_end`). + +Comparing with the grey line tells us how much the value at the time of the +snapshot differs from what was eventually reported. +The drop in January 2021 in the snapshot on `2021-02-01` was initially reported +as much steeper than it eventually turned out to be, while in the period after +that the values were initially reported as higher than they actually were. + +Handling data latency is important in both real-time forecasting and retrospective +forecasting. +Looking at the very first snapshot, `2020-08-01` (the red dotted +vertical line), there is a noticeable gap between the forecast date and the end +of the red time-series to its left. +In fact, if we take a snapshot and get the last `time_value`, ```{r} -arx_args_list() +doctor_visits |> + epix_as_of(as.Date("2020-08-01")) |> + pull(time_value) |> + max() ``` -These can be modified as needed, by sending your desired arguments into -`arx_forecaster(args_list = arx_args_list())`. For now we will use the defaults. +the last day of data is the 25th, an entire week before `2020-08-01`. +This can require some effort to work around, especially if the latency is +variable; see `step_adjust_latency()` for some methods included in this package. +Much of that functionality is built into `arx_forecaster()` using the parameter +`adjust_latency`, which we will use below. -__Note__: We will use a __geo-pooled approach__, where we train the model on -data from all states and territories combined. This is because the data is quite -similar across states, and pooling the data can help improve the accuracy of the -forecasts, while also reducing the susceptibility of the model to noise. In the -interest of computational speed, we only use the 6 state dataset here, but the -full archive can be used in the same way and has performed well in the past. -Implementation-wise, geo-pooling is achieved by not using `group_by(geo_value)` -prior to `epix_slide()`. In other cases, grouping may be preferable, so we -leave it to the user to decide, but flag this modeling decision here. -Let's use the `epix_as_of()` method to generate a snapshot of the archive at the -last date, and then run the forecaster. +# Backtesting a simple autoregressive forecaster ```{r} -# Let's forecast 14 days prior to the last date in the archive, to compare. -forecast_date <- doctor_visits$versions_end - 14 -# The .versions argument selects only the last version in the archive and -# produces a forecast only on that date.
-forecasts <- doctor_visits %>% +One of the most common use cases of the `epiprocess::epi_archive()` object is for +accurate model backtesting. + +To start, let's use a simple autoregressive forecaster to predict `percent_cli`, the percentage +of doctor's hospital visits associated with COVID-like illness, 14 +days in the future. +For increased accuracy, we will use quantile regression. + +## Comparing a single day and ahead + +As a sanity check before we backtest the _entire_ dataset, let's +forecast a single day in the middle of the dataset. +We can do this by setting the `.versions` argument in `epix_slide()`: + +```{r single_version, warning = FALSE} +forecast_date <- as.Date("2021-04-06") +forecasts <- doctor_visits |> epix_slide( ~ arx_forecaster( .x, outcome = "percent_cli", predictors = "percent_cli", args_list = arx_args_list() - )$predictions %>% + )$predictions |> pivot_quantiles_wider(.pred_distn), .versions = forecast_date ) -# Join the forecasts with the latest data at the time of the forecast to -# compare. Since `percent_cli` data has a few days of lag, we use `tidyr::fill` to -# fill the missing values with the last observed value. -forecasts %>% +``` + +We need truth data to compare our forecast against. We can construct it by using `epix_as_of()` to snapshot +the archive at the last available date[^1]. + +_Note:_ We always want to compare our forecasts to actual (most recently reported) values because that is the outcome we care about. +`as_of` data is useful for understanding why we're getting the forecasts we're getting, but `as_of` values are not the real outcome. +Therefore, it's not meaningful to use them for evaluating the performance of a forecast. +Unfortunately, it's not uncommon for revisions to cause poor (final) performance of a forecaster that was decent at the time of the forecast. + +```{r compare_single_with_result} +forecasts |> inner_join( - doctor_visits %>% - epix_as_of(doctor_visits$versions_end) %>% - group_by(geo_value) %>% - fill(percent_cli), + doctor_visits |> + epix_as_of(doctor_visits$versions_end), by = c("geo_value", "target_date" = "time_value") - ) %>% + ) |> select(geo_value, forecast_date, .pred, `0.05`, `0.95`, percent_cli) ``` -The resulting epi_df now contains two new columns: `.pred` and `.pred_distn`, -corresponding to the point forecast (median) and the quantile distribution -containing our requested quantile forecasts (in this case, 0.05 and 0.95) -respectively. The forecasts fall within the prediction interval, so our +`.pred` corresponds to the point forecast (median), and `0.05` and `0.95` +correspond to the 5th and 95th quantiles. +The `percent_cli` truth data falls within the prediction intervals, so our implementation passes a simple validation. -Now let's go ahead and slide this forecaster in a version unaware way and a -version aware way. For the version unaware way, we need to snapshot the latest -version of the data, and then make a faux archive by setting `version = -time_value`. This has the effect of simulating a data set that receives the -final version updates every day. For the version aware way, we will simply use -the true `epi_archive` object. +## Comparing version faithful and version un-faithful forecasts + +Now let's compare the behavior of this forecaster, both properly considering data versioning +("version faithful") and ignoring data versions ("version un-faithful"). + +For the version un-faithful approach, we need to do some setup if we want to use `epix_slide` for backtesting.
+We want to simulate a data set that receives finalized updates every day, that is, a data set with no revisions. +To do this, we will snapshot the latest version of the data to create a synthetic data set, and convert it into an archive +where `version = time_value`[^2]. ```{r} -archive_cases_dv_subset_faux <- doctor_visits %>% - epix_as_of(doctor_visits$versions_end) %>% - mutate(version = time_value) %>% +archive_cases_dv_subset_faux <- doctor_visits |> + epix_as_of(doctor_visits$versions_end) |> + mutate(version = time_value) |> as_epi_archive() ``` -To reduce typing, we create the wrapper function `forecast_k_week_ahead()`. +For the version faithful approach, we will continue using the original `epi_archive` object containing all version updates. + +We will also create the helper function `forecast_wrapper()` to let us easily map across aheads. ```{r arx-kweek-preliminaries, warning = FALSE} -# Latest snapshot of data, and forecast dates -forecast_dates <- seq(from = as.Date("2020-08-01"), to = as.Date("2021-11-01"), by = "1 month") -aheads <- c(7, 14, 21, 28) - -# @param epi_archive The epi_archive object to forecast from -# @param ahead The number of days ahead to forecast -# @param outcome The outcome variable to forecast -# @param predictors The predictors to use in the model -# @param forecast_dates The dates to forecast on -# @param process_data A function to process the data before forecasting -forecast_k_week_ahead <- function( - epi_archive, - ahead = 7, - outcome = NULL, predictors = NULL, forecast_dates = NULL, process_data = identity) { - if (is.null(forecast_dates)) { - forecast_dates <- epi_archive$versions_end - } - if (is.null(outcome) || is.null(predictors)) { - stop("Please specify the outcome and predictors.") - } - epi_archive %>% - epix_slide( - ~ arx_forecaster( - process_data(.x), outcome, predictors, - args_list = arx_args_list(ahead = ahead) - )$predictions %>% - pivot_quantiles_wider(.pred_distn), - .before = 120, - .versions = forecast_dates - ) +forecast_wrapper <- function( + epi_data, aheads, outcome, predictors, + process_data = identity) { + map( + aheads, + \(ahead) { + arx_forecaster( + process_data(epi_data), outcome, predictors, + args_list = arx_args_list( + ahead = ahead, + lags = c(0:7, 14, 21), + adjust_latency = "extend_ahead" + ) + )$predictions |> + pivot_quantiles_wider(.pred_distn) + } + ) |> + bind_rows() } ``` -```{r} -# Generate the forecasts and bind them together -forecasts <- bind_rows( - map(aheads, ~ forecast_k_week_ahead( - archive_cases_dv_subset_faux, - ahead = .x, - outcome = "percent_cli", - predictors = "percent_cli", - forecast_dates = forecast_dates - ) %>% mutate(version_aware = FALSE)), - map(aheads, ~ forecast_k_week_ahead( - doctor_visits, - ahead = .x, - outcome = "percent_cli", - predictors = "percent_cli", - forecast_dates = forecast_dates - ) %>% mutate(version_aware = TRUE)) +_Note:_ In the helper function, we're using the parameter `adjust_latency`. +We need to use it because the most recently released data may still be several days old on any given forecast date (lag > 0); +`adjust_latency` will modify the forecaster to compensate[^5]. +See the function `step_adjust_latency()` for more details and examples. + +Now that we're set up, we can generate forecasts for both the version faithful and un-faithful +archives, and bind the results together. 
+```{r generate_forecasts, warning = FALSE} +forecast_dates <- seq( + from = as.Date("2020-09-01"), + to = as.Date("2021-11-01"), + by = "1 month" +) +aheads <- c(1, 7, 14, 21, 28) + +version_unfaithful <- archive_cases_dv_subset_faux |> + epix_slide( + ~ forecast_wrapper(.x, aheads, "percent_cli", "percent_cli"), + .before = 120, + .versions = forecast_dates + ) |> + mutate(version_faithful = "Version un-faithful") + +version_faithful <- doctor_visits |> + epix_slide( + ~ forecast_wrapper(.x, aheads, "percent_cli", "percent_cli"), + .before = 120, + .versions = forecast_dates + ) |> + mutate(version_faithful = "Version faithful") + +forecasts <- + bind_rows( + version_unfaithful, + version_faithful + ) ``` -Here, `arx_forecaster()` does all the heavy lifting. It creates leads of the -target (respecting time stamps and locations) along with lags of the features -(here, the response and doctors visits), estimates a forecasting model using the -specified engine, creates predictions, and non-parametric confidence bands. +`arx_forecaster()` does all the heavy lifting. +It creates and lags copies of the features (here, the response and doctor's visits), +creates and leads copies of the target while respecting timestamps and locations, fits a +forecasting model using the specified engine, makes predictions, and +constructs non-parametric confidence bands. -To see how the predictions compare, we plot them on top of the latest case -rates. Note that even though we've fitted the model on all states, we'll just -display the results for two states, California (CA) and Florida (FL), to get a -sense of the model performance while keeping the graphic simple. +To see how the version faithful and un-faithful predictions compare, let's plot them on top of the latest case +rates, using the same versioned plotting method as above. +Note that even though we fit the model on four states (California, Texas, Florida, and +New York), we'll just display the results for two states, California (CA) and Florida +(FL), to get a sense of the model performance while keeping the graphic simple.
Code for plotting -```{r} +```{r plot_ca_forecasts, warning = FALSE} geo_choose <- "ca" -forecasts_filtered <- forecasts %>% - filter(geo_value == geo_choose) %>% +forecasts_filtered <- forecasts |> + filter(geo_value == geo_choose) |> mutate(time_value = version) -percent_cli_data <- bind_rows( - # Snapshotted data for the version-aware forecasts - map( - forecast_dates, - ~ doctor_visits %>% - epix_as_of(.x) %>% - mutate(version = .x) - ) %>% - bind_rows() %>% - mutate(version_aware = TRUE), - # Latest data for the version-unaware forecasts - doctor_visits %>% - epix_as_of(doctor_visits$versions_end) %>% - mutate(version_aware = FALSE) -) %>% - filter(geo_value == geo_choose) - -p1 <- ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) + +# we need to add the ground truth data to the version faithful plot as well +plotting_data <- bind_rows( + percent_cli_data, + percent_cli_data %>% + filter(version_faithful == "Version un-faithful") %>% + mutate(version_faithful = "Version faithful") +) + +p1 <- # first plotting the forecasts as bands, lines and points + ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) + geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = factor(time_value)), alpha = 0.4) + geom_line(aes(y = .pred, color = factor(time_value)), linetype = 2L) + geom_point(aes(y = .pred, color = factor(time_value)), size = 0.75) + - geom_vline(data = percent_cli_data, aes(color = factor(version), xintercept = version), lty = 2) + + # the forecast date + geom_vline( + data = percent_cli_data |> filter(geo_value == geo_choose) |> select(-version_faithful), + aes(color = factor(version), xintercept = version), + lty = 2) + + # the underlying data geom_line( - data = percent_cli_data, + data = plotting_data |> filter(geo_value == geo_choose), aes(x = time_value, y = percent_cli, color = factor(version)), inherit.aes = FALSE, na.rm = TRUE ) + - facet_grid(version_aware ~ geo_value, scales = "free") + - scale_x_date(minor_breaks = "month", date_labels = "%b %y") + + facet_grid(version_faithful ~ geo_value, scales = "free") + + scale_x_date(breaks = "2 months", date_labels = "%b %Y") + scale_y_continuous(expand = expansion(c(0, 0.05))) + labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") + theme(legend.position = "none") ``` -```{r} +```{r plot_fl_forecasts, warning = FALSE} geo_choose <- "fl" -forecasts_filtered <- forecasts %>% - filter(geo_value == geo_choose) %>% +forecasts_filtered <- forecasts |> + filter(geo_value == geo_choose) |> mutate(time_value = version) -percent_cli_data <- bind_rows( - # Snapshotted data for the version-aware forecasts - map( - forecast_dates, - ~ doctor_visits %>% - epix_as_of(.x) %>% - mutate(version = .x) - ) %>% - bind_rows() %>% - mutate(version_aware = TRUE), - # Latest data for the version-unaware forecasts - doctor_visits %>% - epix_as_of(doctor_visits$versions_end) %>% - mutate(version_aware = FALSE) -) %>% - filter(geo_value == geo_choose) - -p2 <- ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) + + +p2 <- + ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) + geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = factor(time_value)), alpha = 0.4) + geom_line(aes(y = .pred, color = factor(time_value)), linetype = 2L) + geom_point(aes(y = .pred, color = factor(time_value)), size = 0.75) + - geom_vline(data = percent_cli_data, aes(color = factor(version), xintercept = version), lty = 2) + + geom_vline( + data = 
percent_cli_data |> filter(geo_value == geo_choose) |> select(-version_faithful), + aes(color = factor(version), xintercept = version), lty = 2 + ) + geom_line( - data = percent_cli_data, + data = plotting_data |> filter(geo_value == geo_choose), aes(x = time_value, y = percent_cli, color = factor(version)), inherit.aes = FALSE, na.rm = TRUE ) + - facet_grid(version_aware ~ geo_value, scales = "free") + - scale_x_date(minor_breaks = "month", date_labels = "%b %y") + + facet_grid(version_faithful ~ geo_value, scales = "free") + + scale_x_date(breaks = "2 months", date_labels = "%b %Y") + scale_y_continuous(expand = expansion(c(0, 0.05))) + labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") + theme(legend.position = "none") ```
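+
+Since the California and Florida chunks above differ only in `geo_choose`, the
+shared logic could instead be wrapped in a small helper. A sketch (not evaluated
+here; `plot_version_comparison()` is a hypothetical name, not an `{epipredict}`
+function):
+
+```{r plot_helper_sketch, eval = FALSE}
+plot_version_comparison <- function(geo_choose) {
+  forecasts_filtered <- forecasts |>
+    filter(geo_value == geo_choose) |>
+    mutate(time_value = version)
+  ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) +
+    geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = factor(time_value)), alpha = 0.4) +
+    geom_line(aes(y = .pred, color = factor(time_value)), linetype = 2L) +
+    geom_point(aes(y = .pred, color = factor(time_value)), size = 0.75) +
+    geom_vline(
+      data = percent_cli_data |> filter(geo_value == geo_choose) |> select(-version_faithful),
+      aes(color = factor(version), xintercept = version), lty = 2
+    ) +
+    geom_line(
+      data = plotting_data |> filter(geo_value == geo_choose),
+      aes(x = time_value, y = percent_cli, color = factor(version)),
+      inherit.aes = FALSE, na.rm = TRUE
+    ) +
+    facet_grid(version_faithful ~ geo_value, scales = "free") +
+    scale_x_date(breaks = "2 months", date_labels = "%b %Y") +
+    scale_y_continuous(expand = expansion(c(0, 0.05))) +
+    labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") +
+    theme(legend.position = "none")
+}
+p1 <- plot_version_comparison("ca")
+p2 <- plot_version_comparison("fl")
+```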
-```{r show-plot1, echo=FALSE}
+```{r show-plot1, warning = FALSE, echo=FALSE}
 p1
-p2
 ```

-For the two states of interest, neither approach produces amazingly accurate
-forecasts. However, the extent to which using versioned data can affect
-backtesting, scoring, and therefore model choice for production can be inferred
-from these plots.
+The version faithful and un-faithful forecasts look moderately similar except for the 1-day horizons
+(although neither approach produces amazingly accurate forecasts).

-### Example using case data from Canada
+In the version faithful case for California, the March 2021 forecast (turquoise)
+starts at a value just above 10, which lines up well with the reported values leading up to that forecast.
+The measured and forecasted trends are also concordant (both increasing moderately fast).

-
+Because the data for this time period was later revised downward, with a decreasing
+trend, the March 2021 forecast looks quite bad compared to the finalized data.

-Data and forecasts. Similar to the above.
-
-By leveraging the flexibility of `epiprocess`, we can apply the same techniques
-to data from other sources. Since some collaborators are in British Columbia,
-Canada, we'll do essentially the same thing for Canada as we did above.
-
-The [COVID-19 Canada Open Data Working Group](https://opencovid.ca/) collects
-daily time series data on COVID-19 cases, deaths, recoveries, testing and
-vaccinations at the health region and province levels. Data are collected from
-publicly available sources such as government datasets and news releases.
-Unfortunately, there is no simple versioned source, so we have created our own
-from the Github commit history.
-
-First, we load versioned case rates at the provincial level. After converting
-these to 7-day averages (due to highly variable provincial reporting
-mismatches), we then convert the data to an `epi_archive` object, and extract
-the latest version from it. Finally, we run the same forcasting exercise as for
-the American data, but here we compare the forecasts produced from using simple
-linear regression with those from using boosted regression trees.
-
-```{r get-can-fc, warning = FALSE}
-aheads <- c(7, 14, 21, 28)
-canada_archive <- can_prov_cases
-canada_archive_faux <- epix_as_of(canada_archive, canada_archive$versions_end) %>%
-  mutate(version = time_value) %>%
-  as_epi_archive()
-# This function will add the 7-day average of the case rate to the data
-# before forecasting.
-smooth_cases <- function(epi_df) {
-  epi_df %>%
-    group_by(geo_value) %>%
-    epi_slide_mean("case_rate", .window_size = 7, na.rm = TRUE, .suffix = "_{.n}dav")
-}
-forecast_dates <- seq.Date(
-  from = min(canada_archive$DT$version),
-  to = max(canada_archive$DT$version),
-  by = "1 month"
-)
+The equivalent version un-faithful forecast starts at a value of 5, which is in line
+with the finalized data but would have been out of place compared to the versioned data.

-# Generate the forecasts, and bind them together
-canada_forecasts <- bind_rows(
-  map(
-    aheads,
-    ~ forecast_k_week_ahead(
-      canada_archive_faux,
-      ahead = .x,
-      outcome = "case_rate_7dav",
-      predictors = "case_rate_7dav",
-      forecast_dates = forecast_dates,
-      process_data = smooth_cases
-    ) %>% mutate(version_aware = FALSE)
-  ),
-  map(
-    aheads,
-    ~ forecast_k_week_ahead(
-      canada_archive,
-      ahead = .x,
-      outcome = "case_rate_7dav",
-      predictors = "case_rate_7dav",
-      forecast_dates = forecast_dates,
-      process_data = smooth_cases
-    ) %>% mutate(version_aware = TRUE)
-  )
-)
+```{r show-plot2, warning = FALSE, echo=FALSE}
+p2
 ```

-The figures below shows the results for a single province.
+Now let's look at Florida.
+In the version faithful case, the three late-2021 forecasts (purples and pinks) starting in September predict very low values, near 0.
+The trend leading up to each forecast shows a substantial decrease, so these forecasts seem appropriate and we would expect them to score fairly well on various performance metrics when compared to the versioned data.
-```{r plot-can-fc-lr, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 12}
-geo_choose <- "Alberta"
-forecasts_filtered <- canada_forecasts %>%
-  filter(geo_value == geo_choose) %>%
-  mutate(time_value = version)
-case_rate_data <- bind_rows(
-  # Snapshotted data for the version-aware forecasts
-  map(
-    forecast_dates,
-    ~ canada_archive %>%
-      epix_as_of(.x) %>%
-      smooth_cases() %>%
-      mutate(case_rate = case_rate_7dav, version = .x)
-  ) %>%
-    bind_rows() %>%
-    mutate(version_aware = TRUE),
-  # Latest data for the version-unaware forecasts
-  canada_archive %>%
-    epix_as_of(doctor_visits$versions_end) %>%
-    smooth_cases() %>%
-    mutate(case_rate = case_rate_7dav, version_aware = FALSE)
-) %>%
-  filter(geo_value == geo_choose)
-
-ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) +
-  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = factor(time_value)), alpha = 0.4) +
-  geom_line(aes(y = .pred, color = factor(time_value)), linetype = 2L) +
-  geom_point(aes(y = .pred, color = factor(time_value)), size = 0.75) +
-  geom_vline(data = case_rate_data, aes(color = factor(version), xintercept = version), lty = 2) +
-  geom_line(
-    data = case_rate_data,
-    aes(x = time_value, y = case_rate, color = factor(version)),
-    inherit.aes = FALSE, na.rm = TRUE
-  ) +
-  facet_grid(version_aware ~ geo_value, scales = "free") +
-  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
-  scale_y_continuous(expand = expansion(c(0, 0.05))) +
-  labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") +
-  theme(legend.position = "none")
-```
+However, in hindsight, we know that early versions of the data systematically
+under-reported COVID-related doctor visits, such that these forecasts don't actually
+perform well compared to _finalized_ data.
+In this example, the version faithful forecasts predicted values at or near 0, while the finalized data shows values in the 5-10 range.
+As a result, the version un-faithful forecasts for these same dates are quite a bit higher, and would perform well when scored using the finalized data and poorly with the versioned data.

-
+In general, the longer ago a forecast was made, the worse its performance is compared to finalized data.
+Finalized data accumulates revisions over time that make it deviate more and more from the non-finalized data a model was trained on.
+Forecasts _trained_ on finalized data will of course appear to perform better when _scored_ on finalized data, but will have unknown performance on the non-finalized data we need to use if we want timely predictions.
+
+Without using data that would have been available on the actual forecast date,
+you have little insight into what level of performance you
+can expect in practice.
+
+Good performance of a version un-faithful model is a mirage; it is only achievable if the training data has no revisions.
+If a data source has any revisions, that level of performance is unachievable when making forecasts in real time.
+
+
+[^1]: For forecasting a single day like this, we could have actually just used
+  `doctor_visits |> epix_as_of(forecast_date)` to get the relevant snapshot, and then fed that into `arx_forecaster()` as we did in the [landing
+page](../index.html#motivating-example).
+
+
+[^2]: Generally we advise against this; the only times to consider faking
+  versioning like this are when you're back-testing data with no versions available
+  at all, or when you're doing an explicit comparison like this. If you have no
+  versions you should assume performance is worse than what the test would
+  otherwise suggest.
+
+[^4]: Until we have a time machine.
+
+[^5]: In this case by adjusting the length of the ahead so that it is actually
+  forecasting from the last day of data (e.g. for 2 day latent data and a true
+  ahead of 5, the `extended_ahead` would actually be 7)

diff --git a/vignettes/custom_epiworkflows.Rmd b/vignettes/custom_epiworkflows.Rmd
new file mode 100644
index 000000000..8e23981a7
--- /dev/null
+++ b/vignettes/custom_epiworkflows.Rmd
@@ -0,0 +1,629 @@
+---
+title: "Custom Epiworkflows"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Custom Epiworkflows}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+source(here::here("vignettes/_common.R"))
+```
+
+```{r setup, message=FALSE, include = FALSE}
+library(dplyr)
+library(parsnip)
+library(workflows)
+library(recipes)
+library(epipredict)
+library(epiprocess)
+library(ggplot2)
+library(rlang) # for %@%
+forecast_date <- as.Date("2021-08-01")
+used_locations <- c("ca", "ma", "ny", "tx")
+library(epidatr)
+```
+
+If you want to do custom data preprocessing or fit a model that isn't included in the canned workflows, you'll need to write a custom `epi_workflow()`.
+An `epi_workflow()` is a sub-class of a `workflows::workflow()` from the
+`{workflows}` package designed to handle panel data specifically.
+
+To understand how to work with custom `epi_workflow()`s, let's recreate and then
+modify the `four_week_ahead` example from the [landing
+page](../index.html#motivating-example).
+Let's first remind ourselves how to use a simple canned workflow:
+
+```{r make-four-forecasts, warning=FALSE}
+training_data <- covid_case_death_rates |>
+  filter(time_value <= forecast_date, geo_value %in% used_locations)
+four_week_ahead <- arx_forecaster(
+  training_data,
+  outcome = "death_rate",
+  predictors = c("case_rate", "death_rate"),
+  args_list = arx_args_list(
+    lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
+    ahead = 4 * 7,
+    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
+  )
+)
+four_week_ahead$epi_workflow
+```
+
+# Anatomy of an `epi_workflow`
+
+An `epi_workflow()` is an extension of a `workflows::workflow()` that is specially designed to handle panel
+data, and to apply custom post-processing steps to the output of a model.
+All `epi_workflows`, including simple and canned workflows, consist of 3 components: a preprocessor, a trainer, and a postprocessor.
+
+### Preprocessor
+
+A preprocessor (also called a recipe) transforms the data before model training and prediction.
+Transformations can include converting counts to rates, applying a running average
+to columns, or [any of the `step`s found in `{recipes}`](https://recipes.tidymodels.org/reference/index.html).
+
+All workflows must include a preprocessor.
+The most basic preprocessor just assigns roles to columns, telling the model in the next step which to use as predictors or the outcome.
+
+However, preprocessors can do much more.
+You can think of a preprocessor as a more flexible `formula` that you would pass to `lm()`: `y ~ x1 + log(x2) + lag(x1, 5)`.
+The simple model above internally runs 6 of these steps, such as creating lagged predictor columns.
+
+In general, there are 2 broad classes of transformation that `{recipes}` `step`s handle:
+
+- Operations that are applied to both training and test data without using stored information.
+  Examples include taking the log of a variable, leading or lagging columns,
+  filtering out rows, handling dummy variables, calculating growth rates,
+  etc.
+- Operations that rely on stored information (parameters fit during training) to modify train and test data.
+  Examples include centering by the mean, and normalizing the variance (whitening).
+
+We differentiate between these types of transformations because the second type can result in information leakage if not done properly.
+Information leakage or data leakage happens when a system has access to information that would not have been available at prediction time and could change our evaluation of the model's real-world performance.
+
+In the case of centering, we need to store the mean of the predictor from
+the training data and use that value on the prediction data, rather than
+using the mean of the test predictor for centering or including test data in the mean calculation.
+
+A major benefit of `{recipes}` is that it prevents information leakage.
+However, the _main_ mechanism we rely on to prevent data leakage is proper
+[backtesting](backtesting.html).
+
+### Trainer
+
+A trainer (also called a model or engine) fits a `{parsnip}` model on data, and outputs a fitted model object.
+Examples include linear regression, quantile regression, or [any `{parsnip}` engine](https://www.tidymodels.org/find/parsnip/).
+The `{parsnip}` front-end abstracts away the differences in interface between a wide collection of statistical models.
+
+All workflows must include a model.
+
+### Postprocessor
+
+Postprocessors are unique to `{epipredict}`.
+A postprocessor (also known as frosting) modifies and formats the prediction after a model has been fit.
+
+The postprocessor is _optional_.
+It only needs to be included in a workflow if you need to process the model output.
+
+Each operation within a postprocessor is called a "layer" (functions are named `layer_*`), and the stack of layers is known as `frosting()`,
+continuing the metaphor of baking a cake established in `{recipes}`.
+Some example operations include:
+
+- generating quantiles from purely point-prediction models
+- reverting transformations done in prior steps, such as converting from rates back to counts
+- thresholding forecasts to remove negative values
+- generally adapting the format of the prediction to a downstream use.
+
+# Recreating `four_week_ahead` in an `epi_workflow()`
+
+To understand how to create custom workflows, let's first recreate the simple canned `arx_forecaster()` from scratch.
+
+We'll think through the following sub-steps:
+
+1. Define the `epi_recipe()`, which contains the preprocessing steps
+2. Define the `frosting()`, which contains the post-processing layers
+3. Combine these with a trainer such as `quantile_reg()` into an
+   `epi_workflow()`
+4. `fit()` the workflow on the training data
+5. Grab the right prediction data using `get_test_data()` and use the fitted
+   workflow to generate a prediction
+
+## Define the `epi_recipe()`
+
+The steps found in `four_week_ahead` look like:
+
+```{r inspect_fwa_steps, warning=FALSE}
+hardhat::extract_recipe(four_week_ahead$epi_workflow)
+```
+
+There are 6 steps we will need to recreate.
+Note that all steps in the extracted recipe are marked as already having been
+`Trained`. For steps such as `recipes::step_BoxCox()` that have parameters that change their behavior, this means that their
+parameters have already been calculated based on the training data set.
+
+Let's create an `epi_recipe()` to hold the 6 steps:
+
+```{r make_recipe}
+filtered_data <- covid_case_death_rates |>
+  filter(time_value <= forecast_date, geo_value %in% used_locations)
+four_week_recipe <- epi_recipe(
+  filtered_data,
+  reference_date = (filtered_data %@% metadata)$as_of
+)
+```
+
+The data set passed to `epi_recipe()` isn't required to be the actual
+data set on which you are going to train the model.
+However, it should have the same columns and the same metadata (such as `as_of`
+and `other_keys`); it is typically easiest just to use the training data itself.
+
+This means that you can use the same workflow for multiple data sets as long as the format remains the same.
+This might be useful if you continue to get updates to a data set over time and you want to train a new instance of the same model.
+
+Then we can append each `step` using pipes. In principle, the order matters, though for this
+recipe only `step_epi_naomit()` and `step_training_window()` depend on the steps
+before them.
+The other steps can be thought of as setting parameters that help specify later processing and computation.
+
+```{r make_steps}
+four_week_recipe <- four_week_recipe |>
+  step_epi_lag(case_rate, lag = c(0, 1, 2, 3, 7, 14)) |>
+  step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
+  step_epi_ahead(death_rate, ahead = 4 * 7) |>
+  step_epi_naomit() |>
+  step_training_window()
+```
+
+Note we said before that `four_week_ahead` contained 6 steps.
+We've only added _5_ top-level steps here because `step_epi_naomit()` is +actually a wrapper around adding two `step_naomit()`s, one for +`all_predictors()` and one for `all_outcomes()`. +The `step_naomit()`s differ in their treatment of the data at predict time. + +`step_epi_lag()` and `step_epi_ahead()` both accept ["tidy" syntax](https://dplyr.tidyverse.org/reference/select.html) so processing can be applied to multiple columns at once. +For example, if we wanted to use the same lags for both `case_rate` and `death_rate`, we could +specify them in a single step, like `step_epi_lag(ends_with("rate"), lag = c(0, 7, 14))`. + +In general, `{recipes}` `step`s assign roles (such as `predictor`, or `outcome`, +see the [Roles vignette for +details](https://recipes.tidymodels.org/articles/Roles.html)) to columns either +by adding new columns or adjusting existing +ones. +`step_epi_lag()`, for example, creates a new column for each lag with the name +`lag_x_column_name` and labels them each with the `predictor` role. +`step_epi_ahead()` creates `ahead_x_column_name` columns and labels each with +the `outcome` role. + +In general, to inspect the 'prepared' steps, we can run `prep()`, which fits any +parameters used in the recipe, calculates new columns, and assigns roles[^4]. +For example, we can use `prep()` to make sure that we are training on the +correct columns: + +```{r prep_recipe} +prepped <- four_week_recipe |> prep(training_data) +prepped$term_info |> print(n = 14) +``` + +`bake()` applies a prepared recipe to a (potentially new) dataset to create the dataset as handed to the `epi_workflow()`. +We can inspect newly-created columns by running `bake()` on the recipe so far: + +```{r bake_recipe} +four_week_recipe |> + prep(training_data) |> + bake(training_data) +``` + +This is also useful for debugging malfunctioning pipelines. +You can run `prep()` and `bake()` on a new recipe containing a subset of `step`s -- all `step`s from the beginning up to the one that is misbehaving -- from the full, original recipe. +This will return an evaluation of the `recipe` up to that point so that you can see the data that the misbehaving `step` is being applied to. +It also allows you to see the exact data that a later `{parsnip}` model is trained on. + +## Define the `frosting()` + +The post-processing `frosting` layers[^1] found in `four_week_ahead` look like: + +```{r inspect_fwa_layers, warning=FALSE} +epipredict::extract_frosting(four_week_ahead$epi_workflow) +``` + +_Note_: since `frosting` is unique to this package, we've defined a custom function `extract_frosting()` to inspect these steps. + +Using the detailed information in the output above, +we can recreate the layers similar to how we defined the +`recipe` `step`s[^2]: + +```{r make_frosting} +four_week_layers <- frosting() |> + layer_predict() |> + layer_residual_quantiles(quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)) |> + layer_add_forecast_date() |> + layer_add_target_date() |> + layer_threshold() +``` + +`layer_predict()` needs to be included in every postprocessor to actually train the model. + +Most layers work with any engine or `step`s. +There are a couple of layers, however, that depend on whether the engine predicts quantiles or point estimates. + +The following layers are only supported by point estimate engines, such as +`linear_reg()`: + +- `layer_residual_quantiles()`: the preferred method of generating quantiles for + models that don't generate quantiles themselves. 
+ This function uses the error residuals of the engine to calculate quantiles. + This will work for most `{parsnip}` engines. +- `layer_predictive_distn()`: alternate method of generating quantiles using + an approximate parametric distribution. This will work for linear regression + specifically. + +On the other hand, the following layers are only supported by engines that +output quantiles, such as `quantile_reg()`: + +- `layer_quantile_distn()`: adds the specified quantiles. + If the user-requested quantile levels differ from the ones actually fit, they will be interpolated and/or + extrapolated. +- `layer_point_from_distn()`: this adds the middle quantile (median) as a point estimate, + and, if used, should be included after `layer_quantile_distn()`. + +## Fitting an `epi_workflow()` + +Now that we have a recipe and some layers, we can assemble the workflow. +This is as simple as passing the component preprocessor, model, and postprocessor into `epi_workflow()`. + +```{r workflow_building} +four_week_workflow <- epi_workflow( + four_week_recipe, + linear_reg(), + four_week_layers +) +``` + +After fitting it, we will have recreated `four_week_ahead$epi_workflow`. + +```{r workflow_fitting} +fit_workflow <- four_week_workflow |> fit(training_data) +``` + +Running `fit()` calculates all preprocessor-required parameters, and trains the model on the data passed in `fit()`. +However, it does not generate any predictions; predictions need to be created in a separate step. + +## Predicting + +To make a prediction, it helps to narrow the data set down to the relevant observations using `get_test_data()`. +We can still generate predictions without doing this first, but it will predict on _every_ day in the data-set, and not just on the `reference_date`. + +```{r grab_data} +relevant_data <- get_test_data( + four_week_recipe, + training_data +) +``` + +In this example, we're creating `relevant_data` from `training_data`, but the data set we want predictions for could be entirely new data, unrelated to the one we used when building the workflow. + +With a trained workflow and data in hand, we can actually make our predictions: + +```{r workflow_pred} +fit_workflow |> predict(relevant_data) +``` + +Note that if we simply plug the full `training_data` into `predict()` we will still get +predictions: + +```{r workflow_pred_training} +fit_workflow |> predict(training_data) +``` + +The resulting tibble is 800 rows long, however. +Passing the non-subsetted data set produces forecasts for not just the requested `reference_date`, but for every +day in the data set that has sufficient data to produce a prediction. +To narrow this down, we could filter to rows where the `time_value` matches the `forecast_date`: + +```{r workflow_pred_training_filter} +fit_workflow |> + predict(training_data) |> + filter(time_value == forecast_date) +``` + +This can be useful as a workaround when `get_test_data()` fails to pull enough +data to produce a forecast. +This is generally a problem when the recipe (preprocessor) is sufficiently complicated, and `get_test_data()` can't determine precisely what data is required. +The forecasts generated with `filter` and `get_test_data` are identical. + +# Extending `four_week_ahead` + +Now that we know how to create `four_week_ahead` from scratch, we can start modifying the workflow to get custom behavior. + +There are many ways we could modify `four_week_ahead`. 
We might consider:
+
+- Converting from rates to counts
+- Including a growth rate estimate as a predictor
+- Including a time component as a predictor -- useful if we
+expect there to be a strong seasonal component to the outcome
+- Scaling by a factor
+
+We will demo a couple of these modifications below.
+
+## Growth rate
+
+Let's say we're interested in including growth rate as a predictor in our model because
+we think it may improve our forecast.
+We can easily create a new growth rate column as a step in the `epi_recipe()`.
+
+```{r growth_rate_recipe}
+growth_rate_recipe <- epi_recipe(
+  covid_case_death_rates |>
+    filter(time_value <= forecast_date, geo_value %in% used_locations)
+) |>
+  step_epi_lag(case_rate, lag = c(0, 1, 2, 3, 7, 14)) |>
+  step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
+  step_epi_ahead(death_rate, ahead = 4 * 7) |>
+  step_epi_naomit() |>
+  # Calculate growth rate from death rate column.
+  step_growth_rate(death_rate) |>
+  step_training_window()
+```
+
+Inspecting the newly added column:
+
+```{r growth_rate_print}
+growth_rate_recipe |>
+  prep(training_data) |>
+  bake(training_data) |>
+  select(
+    geo_value, time_value, case_rate,
+    death_rate, gr_7_rel_change_death_rate
+  ) |>
+  arrange(geo_value, time_value) |>
+  tail()
+```
+
+And the role:
+
+```{r growth_rate_roles}
+prepped <- growth_rate_recipe |>
+  prep(training_data)
+prepped$term_info |> filter(grepl("gr", variable))
+```
+
+Let's say we want to use `quantile_reg()` as the model.
+Because `quantile_reg()` outputs quantiles only, we need to change our `frosting` to convert the predicted quantile distribution into quantiles and point predictions.
+To do that, we need to switch out `layer_residual_quantiles()` (used for converting the point-plus-residuals output of engines such as `linear_reg()` into quantiles) for `layer_quantile_distn()` and `layer_point_from_distn()`:
+
+```{r layer_and_fit}
+growth_rate_layers <- frosting() |>
+  layer_predict() |>
+  layer_quantile_distn(
+    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
+  ) |>
+  layer_point_from_distn() |>
+  layer_add_forecast_date() |>
+  layer_add_target_date() |>
+  layer_threshold()
+
+growth_rate_workflow <- epi_workflow(
+  growth_rate_recipe,
+  quantile_reg(quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)),
+  growth_rate_layers
+)
+
+relevant_data <- get_test_data(
+  growth_rate_recipe,
+  training_data
+)
+gr_fit_workflow <- growth_rate_workflow |> fit(training_data)
+gr_predictions <- gr_fit_workflow |>
+  predict(relevant_data) |>
+  filter(time_value == forecast_date)
+```
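+
+Before plotting, it can be useful to sanity-check the quantile forecasts by
+unpacking them into long format. A quick sketch (assuming the quantiles land in
+the default `.pred_distn` column created by `layer_quantile_distn()`):
+
+```{r gr_quantiles_long}
+# one row per (location, quantile level) pair
+gr_predictions |>
+  pivot_quantiles_longer(.pred_distn)
+```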
+ Plot + +We'll reuse some code from the landing page to plot the result. + +```{r plotting} +forecast_date_label <- + tibble( + geo_value = rep(used_locations, 2), + .response_name = c(rep("case_rate", 4), rep("death_rate", 4)), + dates = rep(forecast_date - 7 * 2, 2 * length(used_locations)), + heights = c(rep(150, 4), rep(0.30, 4)) + ) + +result_plot <- autoplot( + object = gr_fit_workflow, + predictions = gr_predictions, + plot_data = covid_case_death_rates |> + filter(geo_value %in% used_locations, time_value > "2021-07-01") +) + + geom_vline(aes(xintercept = forecast_date)) + + geom_text( + data = forecast_date_label |> filter(.response_name == "death_rate"), + aes(x = dates, label = "forecast\ndate", y = heights), + size = 3, hjust = "right" + ) + + scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +``` +
+
+```{r, echo=FALSE}
+result_plot
+```
+
+## Population scaling
+
+Suppose we want to modify our predictions to apply to counts, rather than rates.
+To do that, we can adjust _just_ the `frosting` to perform post-processing on our existing rates forecaster.
+Since rates are calculated as counts per 100,000 people, we will convert back to counts by multiplying rates by the factor $\frac{\text{regional population}}{100{,}000}$.
+
+```{r rate_scale}
+count_layers <-
+  frosting() |>
+  layer_predict() |>
+  layer_residual_quantiles(quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)) |>
+  layer_population_scaling(
+    .pred,
+    .pred_distn,
+    # `df` contains scaling values for all regions; in this case it is the state populations
+    df = epidatasets::state_census,
+    df_pop_col = "pop",
+    create_new = FALSE,
+    # `rate_rescaling` gives the denominator of the existing rate predictions
+    rate_rescaling = 1e5,
+    by = c("geo_value" = "abbr")
+  ) |>
+  layer_add_forecast_date() |>
+  layer_add_target_date() |>
+  layer_threshold()
+
+# building the new workflow
+count_workflow <- epi_workflow(
+  four_week_recipe,
+  linear_reg(),
+  count_layers
+)
+count_pred_data <- get_test_data(four_week_recipe, training_data)
+count_predictions <- count_workflow |>
+  fit(training_data) |>
+  predict(count_pred_data)
+
+count_predictions
+```
+
+# Custom classifier workflow
+
+Let's work through an example of a more complicated kind of pipeline you can build using
+the `{epipredict}` framework.
+This is a hotspot prediction model, which predicts whether case rates are increasing (`up`), decreasing (`down`), or flat
+(`flat`).
+The model comes from a paper by McDonald, Bien, Green, Hu, et al.[^3], and roughly
+serves as an extension of `arx_classifier()`.
+
+First, we need to add a factor version of `geo_value`, so that it can be used as a feature.
+
+```{r training_factor}
+training_data <-
+  covid_case_death_rates |>
+  filter(time_value <= forecast_date, geo_value %in% used_locations) |>
+  mutate(geo_value_factor = as.factor(geo_value))
+```
+
+Then we put together the recipe, using a combination of base `{recipes}`
+functions such as `add_role()` and `step_dummy()`, and `{epipredict}` functions
+such as `step_growth_rate()`.
+
+```{r class_recipe}
+classifier_recipe <- epi_recipe(training_data) |>
+  # Label `time_value` as predictor and do no other processing
+  add_role(time_value, new_role = "predictor") |>
+  # Create indicator (dummy) variables for `geo_value_factor` and label each resulting column as a predictor
+  step_dummy(geo_value_factor) |>
+  # Create and lag `case_rate` growth rate
+  step_growth_rate(case_rate, role = "none", prefix = "gr_") |>
+  step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) |>
+  step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") |>
+  # Divide growth rate into 3 bins, and label as outcome variable
+  # Note `recipes::step_cut()` has a bug that prevents us from using it here
+  step_mutate(
+    response = cut(
+      ahead_7_gr_7_rel_change_case_rate,
+      # Define bin thresholds.
+      # Divide by 7 to create weekly values.
+      breaks = c(-Inf, -0.2, 0.25, Inf) / 7,
+      labels = c("down", "flat", "up")
+    ),
+    role = "outcome"
+  ) |>
+  # Drop unused columns based on role assignments. This is not strictly
+  # necessary, as columns with roles unused in the model will be ignored anyway.
+  step_rm(has_role("none"), has_role("raw")) |>
+  step_epi_naomit()
+```
+
+This adds as predictors:
+
+- `time_value` as a continuous variable (via `add_role()`)
+- `geo_value` as a set of indicator variables (via `step_dummy()` and the previous `as.factor()`)
+- growth rate of case rate, both at prediction time (no lag), and lagged by one and two weeks
+
+The outcome variable is created by composing several steps together. `step_epi_ahead()`
+creates a column with the growth rate one week into the future, and
+`step_mutate()` turns that column into a factor with 3 possible values,
+
+$$
+ Z_{\ell, t}=
+ \begin{cases}
+ \text{up}, & \text{if}\ Y^{\Delta}_{\ell, t} > 0.25 \\
+ \text{down}, & \text{if}\ Y^{\Delta}_{\ell, t} < -0.20 \\
+ \text{flat}, & \text{otherwise}
+ \end{cases}
+$$
+
+where $Y^{\Delta}_{\ell, t}$ is the growth rate at location $\ell$ and time $t$.
+`up` means that the `case_rate` has increased by at least 25%, while `down`
+means it has decreased by at least 20%.
+
+Note that in both `step_growth_rate()` and `step_epi_ahead()` we explicitly assign the role
+`none`. This is because those columns are used as intermediaries to create
+predictor and outcome columns.
+Afterwards, `step_rm()` drops the temporary columns, along with the original `role = "raw"` columns
+`death_rate` and `case_rate`. Both `geo_value_factor` and `time_value` are retained
+because their roles have been reassigned.
+
+
+To fit a classification model like this, we will need to use a `{parsnip}` model
+that has `mode = "classification"`.
+The simplest example of a `{parsnip}` classification-mode model is `multinom_reg()`.
+The needed layers are more or less the same as the `linear_reg()` regression layers, with the addition that we need to remove some `NA` values:
+
+```{r, warning=FALSE}
+frost <- frosting() |>
+  layer_naomit(starts_with(".pred")) |>
+  layer_add_forecast_date() |>
+  layer_add_target_date() |>
+  layer_threshold()
+```
+
+```{r, warning=FALSE}
+wf <- epi_workflow(
+  classifier_recipe,
+  multinom_reg(),
+  frost
+) |>
+  fit(training_data)
+
+forecast(wf)
+```
+
+And comparing the result with the actual growth rates at that point in time,
+
+```{r growth_rate_results}
+growth_rates <- covid_case_death_rates |>
+  filter(geo_value %in% used_locations) |>
+  group_by(geo_value) |>
+  mutate(
+    # Multiply by 7 to estimate weekly equivalents
+    case_gr = growth_rate(x = time_value, y = case_rate) * 7
+  ) |>
+  ungroup()
+
+growth_rates |> filter(time_value == "2021-08-01")
+```
+
+we see that they're all significantly higher than 25% per week (36%-62%),
+which matches the classification model's predictions.
+
+
+See the [tooling book](https://cmu-delphi.github.io/delphi-tooling-book/preprocessing-and-models.html) for a more in-depth discussion of this example.
+
+
+[^1]: Think of baking a cake, where adding the frosting is the last step in the
+  process of actually baking.
+
+[^2]: Note that the frosting doesn't require any information about the training
+  data, since the output of the model only depends on the model used.
+
+[^3]: McDonald, Bien, Green, Hu, et al. “Can auxiliary indicators improve
+  COVID-19 forecasting and hotspot prediction?” Proceedings of the National
+  Academy of Sciences 118.51 (2021): e2111453118. doi:10.1073/pnas.2111453118
+
+[^4]: Note that `prep()` and `bake()` are standard `{recipes}` functions, so any discussion of them there applies just as well here. See, for example, the [guide to creating a new step](https://www.tidymodels.org/learn/develop/recipes/#create-the-prep-method).
diff --git a/vignettes/epipredict.Rmd b/vignettes/epipredict.Rmd
index ce0a7e38e..d1600663d 100644
--- a/vignettes/epipredict.Rmd
+++ b/vignettes/epipredict.Rmd
@@ -1,470 +1,565 @@
 ---
-title: "Get started with epipredict"
+title: "Get started with `epipredict`"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Get started with epipredict}
+  %\VignetteIndexEntry{Get started with `epipredict`}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
+# Introduction
+
 ```{r, include = FALSE}
-source("_common.R")
+source(here::here("vignettes/_common.R"))
 ```
 
-```{r setup, message=FALSE}
+```{r setup, message=FALSE, include = FALSE}
 library(dplyr)
 library(parsnip)
 library(workflows)
 library(recipes)
+library(epidatasets)
 library(epipredict)
+library(epiprocess)
+library(ggplot2)
+library(purrr)
+forecast_date <- as.Date("2021-08-01")
+used_locations <- c("ca", "ma", "ny", "tx")
+library(epidatr)
 ```
 
+At a high level, the goal of `{epipredict}` is to make it easy to run simple machine
+learning and statistical forecasters for epidemiological data.
+To do this, we have extended the [tidymodels](https://www.tidymodels.org/)
+framework to handle the case of panel time-series data.
 
-# Goals for the package
-
-At a high level, our goal with `{epipredict}` is to make running simple Machine
-Learning / Statistical forecasters for epidemiology easy. However, this package
-is extremely extensible, and that is part of its utility. Our hope is that it is
-easy for users with epi training and some statistics to fit baseline models
-while still allowing those with more nuanced statistical understanding to create
-complicated specializations using the same framework.
-
-Serving both populations is the main motivation for our efforts, but at the same
-time, we have tried hard to make it useful.
-
-
-## Baseline models
+Our hope is that it is easy for users with epidemiological training and some statistical knowledge to
+fit baseline models, while also allowing those with more nuanced statistical
+understanding to create complex custom models using the same framework.
+Towards that end, `{epipredict}` provides two main classes of tools:
 
-We provide a set of basic, easy-to-use forecasters that work out of the box. You
-should be able to do a reasonably limited amount of customization on them. Any
-serious customization happens with the framework discussed below).
+## Canned forecasters
 
-For the basic forecasters, we provide:
+A set of basic, easy-to-use "canned" forecasters that work out of the box.
+We currently provide the following basic forecasters:
 
-* Baseline flat-line forecaster
-* Autoregressive forecaster
-* Autoregressive classifier
-
-All the forcasters we provide are built on our framework. So we will use these
-basic models to illustrate its flexibility.
+ * _Flatline forecaster_: predicts the most recently seen value as the median,
+   with increasingly wide quantiles.
+ * _Climatological forecaster_: predicts the median and quantiles based on the historical values around the same date in previous years.
+ * _Autoregressive forecaster_: fits a model (e.g. linear regression) on
+   lagged data to predict quantiles for continuous values.
+ * _Autoregressive classifier_: fits a model (e.g. logistic regression) on
+   lagged data to predict a binned version of the growth rate.
+ * _CDC FluSight flatline forecaster_: a variant of the flatline forecaster that is
+   used as a baseline in the CDC's [FluSight forecasting competition](https://www.cdc.gov/flu-forecasting/about/index.html).

 ## Forecasting framework

-Our framework for creating custom forecasters views the prediction task as a set
-of modular components. There are four types of components:
-
-1. Preprocessor: make transformations to the data before model training
-2. Trainer: train a model on data, resulting in a fitted model object
-3. Predictor: make predictions, using a fitted model object and processed test data
-4. Postprocessor: manipulate or transform the predictions before returning
-
-Users familiar with [`{tidymodels}`](https://www.tidymodels.org) and especially
-the [`{workflows}`](https://workflows.tidymodels.org) package will notice a lot
-of overlap. This is by design, and is in fact a feature. The truth is that
-`{epipredict}` is a wrapper around much that is contained in these packages.
-Therefore, if you want something from this -verse, it should "just work" (we
-hope).
-
-The reason for the overlap is that `{workflows}` *already implements* the first
-three steps. And it does this very well. However, it is missing the
-postprocessing stage and currently has no plans for such an implementation. And
-this feature is important. The baseline forecaster we provide *requires*
-postprocessing. Anything more complicated needs this as well.
-
-The second omission from `{tidymodels}` is support for panel data. Besides
-epidemiological data, economics, psychology, sociology, and many other areas
-frequently deal with data of this type. So the framework of behind
-`{epipredict}` implements this. In principle, this has nothing to do with
-epidemiology, and one could simply use this package as a solution for the
-missing functionality in `{tidymodels}`. Again, this should "just work".
-
-All of the *panel data* functionality is implemented through the `epi_df` data
-type in the companion [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/)
-package. There is much more to see there, but for the moment, it's enough to
-look at a simple one:
-
-```{r epidf}
-jhu <- covid_case_death_rates
-jhu
+A framework for creating custom forecasters out of modular components, from
+which the canned forecasters were created. There are three types of
+components:
+
+ * _Preprocessor_: transform the data before model training, such as converting
+   counts to rates, creating smoothed columns, or [any `{recipes}`
+   `step`](https://recipes.tidymodels.org/reference/index.html)
+ * _Trainer_: train a model on data, resulting in a fitted model object.
+   Examples include linear regression, quantile regression, or [any `{parsnip}`
+   engine](https://www.tidymodels.org/find/parsnip/).
+ * _Postprocessor_: unique to `{epipredict}`; used to transform the
+   predictions after the model has been fit, such as
+   - generating quantiles from purely point-prediction models,
+   - reverting operations done in the `step`s, such as converting from
+     rates back to counts,
+   - generally adapting the format of the prediction to its eventual use.
+
+The rest of this "Get Started" vignette will focus on using and modifying the canned forecasters.
+Check out the [Custom Epiworkflows vignette](custom_epiworkflows) for examples of using the forecaster
+framework to make more complex, custom forecasters.
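+
+As a quick preview of that vignette, the three components assemble directly into
+a workflow. A minimal sketch (the particular steps and layers here are
+illustrative, not a recommended model):
+
+```{r components_sketch}
+# preprocessor: lag the predictor, lead the outcome, drop incomplete rows
+# trainer: ordinary linear regression
+# postprocessor: generate predictions, then threshold them at zero
+preview_workflow <- epi_workflow(
+  epi_recipe(covid_case_death_rates) |>
+    step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
+    step_epi_ahead(death_rate, ahead = 7) |>
+    step_epi_naomit(),
+  linear_reg(),
+  frosting() |>
+    layer_predict() |>
+    layer_threshold()
+)
+```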
+ +If you are interested in time series in a non-panel data context, you may also +want to look at `{timetk}` and `{modeltime}` for some related techniques. + +For a more in-depth treatment with some practical applications, see also the +[Forecasting Book](https://cmu-delphi.github.io/delphi-tooling-book/). + +# Panel forecasting basics +## Example data + +The forecasting methods in this package are designed to work with panel time +series data in `epi_df` format as made available in the `{epiprocess}` +package. +An `epi_df` is a collection of one or more time-series indexed by one or more +categorical variables. +The [`{epidatasets}`](https://cmu-delphi.github.io/epidatasets/) package makes several +pre-compiled example datasets available. +Let's look at an example `epi_df`: + +```{r data_ex} +covid_case_death_rates ``` -This data is built into the package and contains the measured variables -`case_rate` and `death_rate` for COVID-19 at the daily level for each US state -for the year 2021. The "panel" part is because we have repeated measurements -across a number of locations. - -The `epi_df` encodes the time stamp as `time_value` and the `key` as -`geo_value`. While these 2 names are required, the values don't need to actually -represent such objects. Additional `key`'s are also supported (like age group, -ethnicity, taxonomy, etc.). - -The `epi_df` also contains some metadata that describes the keys as well as the -vintage of the data. It's possible that data collected at different times for -the *same set* of `geo_value`'s and `time_value`'s could actually be different. -For more details, see -[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/articles/epiprocess.html). - -## Why doesn't this package already exist? - -As described above: +This dataset uses a single key, `geo_value`, and two separate +time series, `case_rate` and `death_rate`. +The keys are represented in "long" format, with separate columns for the key and +the value, while separate time series are represented in "wide" format with each +time series stored in a separate column. -* Parts actually DO exist. There's a universe called `{tidymodels}`. It handles -preprocessing, training, and prediction, bound together, through a package called -`{workflows}`. We built `{epipredict}` on top of that setup. In this way, you CAN -use almost everything they provide. +`{epiprocess}` is designed to handle data that always has a geographic key, and +potentially other key values, such as age, ethnicity, or other demographic +information. +For example, `grad_employ_subset` from `{epidatasets}` also has both `age_group` +and `edu_qual` as additional keys: -* However, `{workflows}` doesn't do postprocessing. And nothing in the -verse -handles _panel data_. - -* The tidy-team doesn't have plans to do either of these things. (We checked). - -* There are two packages that do _time series_ built on `{tidymodels}`, but it's -"basic" time series: 1-step AR models, exponential smoothing, STL decomposition, -etc.[^2] Our group has not prioritized these sorts of models for epidemic -forecasting, but one could also integrate these methods into our framework. +```{r extra_keys} +grad_employ_subset +``` -[^2]: These are [`{timetk}`](https://business-science.github.io/timetk/index.html) -and [`{modeltime}`](https://business-science.github.io/timetk/index.html). 
There -are *lots* of useful methods there than can be used to do fairly complex machine -learning methodology, though not directly for panel data and not for direct -prediction of future targets. +See `{epiprocess}` for [more details on the `epi_df` format](https://cmu-delphi.github.io/epiprocess/articles/epi_df.html). -# Show me the basics +Panel time series are ubiquitous in epidemiology, but are also common in +economics, psychology, sociology, and many other areas. +While this package was designed with epidemiology in mind, many of the +techniques are more broadly applicable. -We start with the `jhu` data displayed above. One of the "canned" forecasters we -provide is an autoregressive forecaster with (or without) covariates that -*directly* trains on the response. This is in contrast to a typical "iterative" -AR model that trains to predict one-step-ahead, and then plugs in the -predictions to "leverage up" to longer horizons. +## Customizing `arx_forecaster()` +Let's expand on the basic example presented on the [landing +page](../index.html#motivating-example), starting with adjusting some parameters in +`arx_forecaster()`. -We'll estimate the model jointly across all locations using only the most -recent 30 days. +The `trainer` argument allows us to set the fitting engine. We can use either one of the +included engines, such as `quantile_reg()`, or one of the relevant [parsnip +models](https://www.tidymodels.org/find/parsnip/): -```{r demo-workflow} -jhu <- jhu %>% filter(time_value >= max(time_value) - 30) -out <- arx_forecaster( - jhu, +```{r make-forecasts, warning=FALSE} +two_week_ahead <- arx_forecaster( + covid_case_death_rates |> filter(time_value <= forecast_date), outcome = "death_rate", - predictors = c("case_rate", "death_rate") + trainer = quantile_reg(), + predictors = c("death_rate"), + args_list = arx_args_list( + lags = list(c(0, 7, 14)), + ahead = 14 + ) ) +hardhat::extract_fit_engine(two_week_ahead$epi_workflow) ``` -The `out` object has two components: - - 1. The predictions which is just another `epi_df`. It contains the predictions for -each location along with additional columns. By default, these are a 90% -predictive interval, the `forecast_date` (the date on which the forecast was -putatively made) and the `target_date` (the date for which the forecast is being -made). - ```{r} -out$predictions - ``` - 2. A list object of class `epi_workflow`. This object encapsulates all the -instructions necessary to create the prediction. More details on this below. - ```{r} -out$epi_workflow - ``` - -By default, the forecaster predicts the outcome (`death_rate`) 1-week ahead, -using 3 lags of each predictor (`case_rate` and `death_rate`) at 0 (today), 1 -week back and 2 weeks back. The predictors and outcome can be changed directly. -The rest of the defaults are encapsulated into a list of arguments. This list is -produced by `arx_args_list()`. - -## Simple adjustments - -Basic adjustments can be made through the `args_list`. - -```{r kill-warnings, echo=FALSE} -knitr::opts_chunk$set(warning = FALSE, message = FALSE) -``` +The default trainer is `parsnip::linear_reg()`, which generates quantiles after +the fact in the post-processing layers, rather than as part of the model. +While this does work, it is generally preferable to use `quantile_reg()`, as the +quantiles generated in post-processing can be poorly behaved. 
+`quantile_reg()`, on the other hand, directly estimates a different linear model
+for each quantile, reflected in the several different columns for `tau` above.

-```{r differential-lags}
-out2week <- arx_forecaster(
-  jhu,
+Because of the flexibility of `{parsnip}`, there are a whole host of models
+available to us[^5]; as an example, we could have just as easily substituted a
+non-linear random forest model from `{ranger}`:
+
+```{r rand_forest_ex, warning=FALSE}
+two_week_ahead <- arx_forecaster(
+  covid_case_death_rates |> filter(time_value <= forecast_date),
   outcome = "death_rate",
-  predictors = c("case_rate", "death_rate"),
+  trainer = rand_forest(mode = "regression"),
+  predictors = c("death_rate"),
   args_list = arx_args_list(
-    lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
+    lags = list(c(0, 7, 14)),
     ahead = 14
   )
 )
 ```

-Here, we've used different lags on the `case_rate` and are now predicting 2
-weeks ahead. This example also illustrates a major difficulty with the
-"iterative" versions of AR models. This model doesn't produce forecasts for
-`case_rate`, and so, would not have data to "plug in" for the necessary
-lags.[^1]
-
-[^1]: An obvious fix is to instead use a VAR and predict both, but this would
-likely increase the variance of the model, and therefore, may lead to less
-accurate forecasts for the variable of interest.
-
-Another property of the basic model is the predictive interval. We describe this
-in more detail in a different vignette, but it is easy to request multiple
-quantiles.
+Any other customization is routed through `arx_args_list()`; for example, if we
+wanted to increase the number of quantiles fit:

-```{r differential-levels}
-out_q <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
+```{r make-quantile-levels-forecasts, warning=FALSE}
+two_week_ahead <- arx_forecaster(
+  covid_case_death_rates |>
+    filter(time_value <= forecast_date, geo_value %in% used_locations),
+  outcome = "death_rate",
+  trainer = quantile_reg(),
+  predictors = c("death_rate"),
   args_list = arx_args_list(
-    quantile_levels = c(.01, .025, 1:19 / 20, .975, .99)
+    lags = list(c(0, 7, 14)),
+    ahead = 14,
+    ############ changing quantile_levels ############
+    quantile_levels = c(0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95)
+    ##################################################
   )
 )
+hardhat::extract_fit_engine(two_week_ahead$epi_workflow)
 ```

-The column `.pred_dstn` in the `predictions` object is actually a "distribution"
-here parameterized by its quantiles. For this default forecaster, these are
-created using the quantiles of the residuals of the predictive model (possibly
-symmetrized). Here, we used 23 quantiles, but one can grab a particular
-quantile,
-
-```{r q1}
-round(head(quantile(out_q$predictions$.pred_distn, p = .4)), 3)
+See the function documentation for `arx_args_list()` for more examples of the modifications available.
+If you want to make further modifications, you will need a custom
+workflow; see the [Custom Epiworkflows vignette](custom_epiworkflows) for details.
+
+## Generating multiple aheads
+We often want to generate a trajectory
+of forecasts over a range of dates, rather than for a single day.
+We can do this with `arx_forecaster()` by looping over aheads.
+For example, to predict every day over a 4-week time period:
+
+```{r aheads-loop}
+all_canned_results <- lapply(
+  seq(0, 28),
+  \(days_ahead) {
+    arx_forecaster(
+      covid_case_death_rates |>
+        filter(time_value <= forecast_date, geo_value %in% used_locations),
+      outcome = "death_rate",
+      predictors = c("case_rate", "death_rate"),
+      trainer = quantile_reg(),
+      args_list = arx_args_list(
+        lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
+        ahead = days_ahead
+      )
+    )
+  }
+)
+# pull out the workflow and the predictions to be able to
+# effectively use autoplot
+workflow <- all_canned_results[[1]]$epi_workflow
+results <- purrr::map_df(all_canned_results, "predictions")
+autoplot(
+  object = workflow,
+  predictions = results,
+  plot_data = covid_case_death_rates |>
+    filter(geo_value %in% used_locations, time_value > "2021-07-01")
+)
 ```

-or extract the entire distribution into a "long" `epi_df` with `quantile_levels`
-being the probability and `values` being the value associated to that quantile.
-
-```{r q2}
-out_q$predictions %>%
-  pivot_quantiles_longer(.pred_distn)
+## Other canned forecasters
+### `flatline_forecaster()`
+The simplest model we provide is the `flatline_forecaster()`, which predicts a
+flat line (with quantiles generated from the residuals using
+`layer_residual_quantiles()`).
+For example, on the same dataset as above:
+```{r make-flatline-forecast, warning=FALSE}
+all_flatlines <- lapply(
+  seq(0, 28),
+  \(days_ahead) {
+    flatline_forecaster(
+      covid_case_death_rates |>
+        filter(time_value <= forecast_date, geo_value %in% used_locations),
+      outcome = "death_rate",
+      args_list = flatline_args_list(
+        ahead = days_ahead
+      )
+    )
+  }
+)
+# same plotting code as in the arx multi-ahead case
+workflow <- all_flatlines[[1]]$epi_workflow
+results <- purrr::map_df(all_flatlines, "predictions")
+autoplot(
+  object = workflow,
+  predictions = results,
+  plot_data = covid_case_death_rates |> filter(geo_value %in% used_locations, time_value > "2021-07-01")
+)
 ```

-Additional simple adjustments to the basic forecaster can be made using the
-function:
-
-```{r, eval = FALSE}
-arx_args_list(
-  lags = c(0L, 7L, 14L), ahead = 7L, n_training = Inf,
-  forecast_date = NULL, target_date = NULL,
-  quantile_levels = c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95),
-  symmetrize = TRUE, nonneg = TRUE, quantile_by_key = character(0L),
-  nafill_buffer = Inf
 )
 ```

+### `cdc_baseline_forecaster()`
+
+This is a different method of generating a flatline forecast, used as a baseline
+for [the CDC COVID-19 Forecasting Hub](https://covid19forecasthub.org).
+
+```{r make-cdc-forecast, warning=FALSE}
+all_cdc_flatline <-
+  cdc_baseline_forecaster(
+    covid_case_death_rates |>
+      filter(time_value <= forecast_date, geo_value %in% used_locations),
+    outcome = "death_rate",
+    args_list = cdc_baseline_args_list(
+      aheads = 1:28,
+      data_frequency = "1 day"
+    )
+  )
+# same plotting code as in the arx multi-ahead case
+workflow <- all_cdc_flatline$epi_workflow
+results <- all_cdc_flatline$predictions
+autoplot(
+  object = workflow,
+  predictions = results,
+  plot_data = covid_case_death_rates |> filter(geo_value %in% used_locations, time_value > "2021-07-01")
 )
 ```

-## Changing the engine
-
-So far, our forecasts have been produced using simple linear regression. But
-this is not the only way to estimate such a model. The `trainer` argument
-determines the type of model we want. This takes a
-[`{parsnip}`](https://parsnip.tidymodels.org) model.
The default is linear -regression, but we could instead use a random forest with the `{ranger}` -package: - -```{r ranger, warning = FALSE} -out_rf <- arx_forecaster( - jhu, +`cdc_baseline_forecaster()` and `flatline_forecaster()` generate medians in the same way, +but `cdc_baseline_forecaster()`'s quantiles are generated using +`layer_cdc_flatline_quantiles()` instead of `layer_residual_quantiles()`. +Both quantile-generating methods use the residuals to compute quantiles, but +`layer_cdc_flatline_quantiles()` extrapolates the quantiles by repeatedly +sampling the initial quantiles to generate the next set. +This results in much smoother quantiles, but ones that only capture the +one-ahead uncertainty. + +### `climatological_forecaster()` +The `climatological_forecaster()` is a different kind of baseline. It produces a +point forecast and quantiles based on the historical values for a given time of +year, rather than extrapolating from recent values. +For example, on the same dataset as above: +```{r make-climatological-forecast, warning=FALSE} +all_climate <- climatological_forecaster( + covid_case_death_rates_extended |> + filter(time_value <= forecast_date, geo_value %in% used_locations), outcome = "death_rate", - predictors = c("case_rate", "death_rate"), - trainer = rand_forest(mode = "regression") + args_list = climate_args_list( + forecast_horizon = seq(0, 28), + window_size = 14, + time_type = "day", + forecast_date = forecast_date + ) +) +workflow <- all_climate$epi_workflow +results <- all_climate$predictions +autoplot( + object = workflow, + predictions = results, + plot_data = covid_case_death_rates_extended |> filter(geo_value %in% used_locations, time_value > "2021-07-01") ) ``` -Or boosted regression trees with `{xgboost}`: +Note that to have enough training data for this method, we're using +`covid_case_death_rates_extended`, which starts in March 2020, rather than +`covid_case_death_rates`, which starts in December. +Without at least a year's worth of historical data, it is impossible to do a +climatological model. +Even with one year of data, as we have here, the resulting forecasts are unreliable. -```{r xgboost, warning = FALSE} -out_gb <- arx_forecaster( - jhu, - outcome = "death_rate", - predictors = c("case_rate", "death_rate"), - trainer = boost_tree(mode = "regression", trees = 20) -) -``` +One feature of the climatological baseline is that it forecasts multiple aheads +simultaneously. +This is possible for `arx_forecaster()`, but only using `trainer = +smooth_quantile_reg()`, which is built to handle multiple aheads simultaneously. -Or quantile regression, using our custom forecasting engine `quantile_reg()`: +### `arx_classifier()` -```{r quantreg, warning = FALSE} -out_qr <- arx_forecaster( - jhu, +Unlike the other canned forecasters, `arx_classifier` predicts binned growth rate. +The forecaster converts the raw outcome variable into a growth rate, which it then bins and predicts, using bin thresholds provided by the user. 
+For example, on the same dataset and `forecast_date` as above, this model outputs:
+
+```{r discrete-rt}
+classifier <- arx_classifier(
+  covid_case_death_rates |>
+    filter(geo_value %in% used_locations, time_value < forecast_date),
   outcome = "death_rate",
-  predictors = c("case_rate", "death_rate"),
-  trainer = quantile_reg()
+  predictors = c("death_rate", "case_rate"),
+  trainer = multinom_reg(),
+  args_list = arx_class_args_list(
+    lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
+    ahead = 2 * 7,
+    breaks = c(-0.01, 0.01, 0.1)
+  )
 )
+classifier$predictions
 ```

-FWIW, this last case (using quantile regression), is not far from what the
-Delphi production forecast team used for its Covid forecasts over the past few
-years.
-
-## Inner workings
-
-Underneath the hood, this forecaster creates (and returns) an `epi_workflow`.
-Essentially, this is a big S3 object that wraps up the 4 modular steps
-(preprocessing - postprocessing) described above.
-
-### Preprocessing
-
-Preprocessing is accomplished through a `recipe` (imagine baking a cake) as
-provided in the [`{recipes}`](https://recipes.tidymodels.org) package.
-We've made a few modifications (to handle
-panel data) as well as added some additional options. The recipe gives a
-specification of how to handle training data. Think of it like a fancified
-`formula` that you would pass to `lm()`: `y ~ x1 + log(x2)`. In general,
-there are 2 extensions to the `formula` that `{recipes}` handles:
-
-  1. Doing transformations of both training and test data that can always be
-  applied. These are things like taking the log of a variable, leading or
-  lagging, filtering out rows, handling dummy variables, etc.
-  2. Using statistics from the training data to eventually process test data.
-  This is a major benefit of `{recipes}`. It prevents what the tidy team calls
-  "data leakage". A simple example is centering a predictor by its mean. We
-  need to store the mean of the predictor from the training data and use that
-  value on the test data rather than accidentally calculating the mean of
-  the test predictor for centering.
-
-A recipe is processed in 2 steps, first it is "prepped". This calculates and
-stores any intermediate statistics necessary for use on the test data.
-Then it is "baked"
-resulting in training data ready for passing into a statistical model (like `lm`).
-
-We have introduced an `epi_recipe`. It's just a `recipe` that knows how to handle
-the `time_value`, `geo_value`, and any additional keys so that these are available
-when necessary.
-
-The `epi_recipe` from `out_gb` can be extracted from the result:
-
-```{r}
-extract_recipe(out_gb$epi_workflow)
+The number and size of the growth rate categories are controlled by `breaks`, which define the
+bin boundaries.
+
+In this example, the custom `breaks` passed to `arx_class_args_list()` correspond to 4 bins:
+`(-∞, -0.01]`, `(-0.01, 0.01]`, `(0.01, 0.1]`, and `(0.1, ∞)`.
+The bins can be interpreted as: the outcome variable is decreasing, approximately stable, slightly increasing, or increasing quickly.
+
+The returned `predictions` tibble assigns each state to one of the growth rate bins.
+In this case, the classifier expects the growth rate for all 4 of the states to fall into the same category,
+`(-0.01, 0.01]`.
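+
+As a quick check on the model specification, we can pull the `{parsnip}` model
+spec back out of the fitted workflow (a small sketch; for this example it
+should display the multinomial regression we requested):
+
+```{r show-classifier-spec}
+workflows::extract_spec_parsnip(classifier$epi_workflow)
+```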
+
+To see how this model performed, let's compare to the actual growth rates for the `target_date`, as computed using
+`{epiprocess}`:
+
+```{r growth_rate_results}
+growth_rates <- covid_case_death_rates |>
+  filter(geo_value %in% used_locations) |>
+  group_by(geo_value) |>
+  mutate(
+    deaths_gr = growth_rate(x = time_value, y = death_rate)
+  ) |>
+  ungroup()
+growth_rates |> filter(time_value == "2021-08-14")
 ```

-The "Inputs" are the original `epi_df` and the "roles" that these are assigned.
-None of these are predictors or outcomes. Those will be created
-by the recipe when it is prepped. The "Operations" are the sequence of
-instructions to create the cake (baked training data).
-Here we create lagged predictors, lead the outcome, and then remove `NA`s.
-Some models like `lm` internally handle `NA`s, but not everything does, so we
-deal with them explicitly. The code to do this (inside the forecaster) is
-
-```{r}
-er <- epi_recipe(jhu) %>%
-  step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%
-  step_epi_ahead(death_rate, ahead = 7) %>%
-  step_epi_naomit()
-```
+Unfortunately, this forecast was not particularly accurate: the actual growth rates fell well outside the predicted category, with California (real growth rate `-1.39`) not remotely inside the interval `(-0.01, 0.01]`.

-While `{recipes}` provides a function `step_lag()`, it assumes that the data
-have no breaks in the sequence of `time_values`. This is a bit dangerous, so
-we avoid that behaviour. Our `lag/ahead` functions also appropriately adjust the
-amount of data to avoid accidentally dropping recent predictors from the test
-data.
-
-### The model specification
-
-Users with familiarity with the `{parsnip}` package will have no trouble here.
-Basically, `{parsnip}` unifies the function signature across statistical models.
-For example, `lm()` "likes" to work with formulas, but `glmnet::glmnet()` uses
-`x` and `y` for predictors and response. `{parsnip}` is agnostic. Both of these
-do "linear regression". Above we switched from `lm()` to `xgboost()` without
-any issue despite the fact that these functions couldn't be more different.
-
-```{r, eval = FALSE}
-lm(formula, data, subset, weights, na.action,
-  method = "qr",
-  model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
-  contrasts = NULL, offset, ...
-)
-xgboost(
-  data = NULL, label = NULL, missing = NA, weight = NULL,
-  params = list(), nrounds, verbose = 1, print_every_n = 1L,
-  early_stopping_rounds = NULL, maximize = NULL, save_period = NULL,
-  save_name = "xgboost.model", xgb_model = NULL, callbacks = list(),
-  ...
-)
-```

-`{epipredict}` provides a few engines/modules (the flatline forecaster and
-quantile regression), but you should be able to use any available models
-listed [here](https://www.tidymodels.org/find/parsnip/).

+## Fitting multi-key panel data

+If multiple keys are set in the `epi_df` as `other_keys`,
+`arx_forecaster` will automatically group by those keys, in addition to the required geographic key.
+For example, predicting the number of graduates in each of the categories in `grad_employ_subset` from above:

-To estimate (fit) a preprocessed model, one calls `fit()` on the `epi_workflow`.
+```{r multi_key_forecast, warning=FALSE}
+# only fitting a subset, otherwise there are ~550 distinct pairs, which is bad for plotting
+edu_quals <- c("Undergraduate degree", "Professional degree")
+geo_values <- c("Quebec", "British Columbia")

-```{r}
-ewf <- epi_workflow(er, linear_reg()) %>% fit(jhu)
-```
+grad_employ <- grad_employ_subset |>
+  filter(time_value < 2017) |>
+  filter(edu_qual %in% edu_quals, geo_value %in% geo_values)

-### Postprocessing
+grad_employ

-To stretch the metaphor of preparing a cake to its natural limits, we have
-created postprocessing functionality called "frosting". Much like the recipe,
-each postprocessing operation is a "layer" and we "slather" these onto our
-baked cake. To fix ideas, below is the postprocessing `frosting` for
-`arx_forecaster()`
+grad_forecast <- arx_forecaster(
+  grad_employ,
+  outcome = "num_graduates",
+  predictors = c("num_graduates"),
+  args_list = arx_args_list(
+    lags = list(c(0, 1, 2)),
+    ahead = 1
+  )
+)
+# and plotting
+autoplot(
+  grad_forecast$epi_workflow,
+  grad_forecast$predictions,
+  grad_employ
+)
 ```

-```{r}
-extract_frosting(out_q$epi_workflow)
+The 8 graphs represent all combinations of the `geo_values` (`"Quebec"` and `"British Columbia"`), `edu_quals` (`"Undergraduate degree"` and `"Professional degree"`), and age brackets (`"15 to 34 years"` and `"35 to 64 years"`).
+
+## Fitting a non-geo-pooled model
+
+The methods shown so far fit a single model across all geographic regions.
+This is called "geo-pooling".
+To fit a non-geo-pooled model, which fits each geography separately, we either
+need a multi-level engine (which `{parsnip}` doesn't currently support) or a
+loop over geographies.
+Here, we're using `purrr::map` to perform the loop.
+
+```{r fit_non_geo_pooled, warning=FALSE}
+geo_values <- covid_case_death_rates |>
+  pull(geo_value) |>
+  unique()
+
+all_fits <-
+  purrr::map(geo_values, \(geo) {
+    covid_case_death_rates |>
+      filter(
+        geo_value == geo,
+        time_value <= forecast_date
+      ) |>
+      arx_forecaster(
+        outcome = "death_rate",
+        trainer = linear_reg(),
+        predictors = c("death_rate"),
+        args_list = arx_args_list(
+          lags = list(c(0, 7, 14)),
+          ahead = 14
+        )
+      )
+  })
+purrr::map_df(all_fits, ~ .x$predictions)
 ```

-Here we have 5 layers of frosting. The first generates the forecasts from the test data.
-The second uses quantiles of the residuals to create distributional
-forecasts. The next two add columns for the date the forecast was made and the
-date for which it is intended to occur. Because we are predicting rates, they
-should be non-negative, so the last layer thresholds both predicted values and
-intervals at 0. The code to do this (inside the forecaster) is
-
-```{r}
-f <- frosting() %>%
-  layer_predict() %>%
-  layer_residual_quantiles(
-    quantile_levels = c(.01, .025, 1:19 / 20, .975, .99),
-    symmetrize = TRUE
-  ) %>%
-  layer_add_forecast_date() %>%
-  layer_add_target_date() %>%
-  layer_threshold(starts_with(".pred"))
+Fitting a separate model for each geography is 56 times slower[^7] than geo-pooling, and each model is fit on far less data.
+If a dataset contains relatively few observations for each geography, fitting a geo-pooled model is likely to produce better, more stable results.
+However, geo-pooling can only be used if values are comparable in meaning and scale across geographies or can be made comparable, for example by normalization.
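+
+As a sketch of one such normalization, we could center and scale each
+geography's series with ordinary `{dplyr}` verbs before fitting a geo-pooled
+model (the resulting forecasts would then be on the standardized scale, so
+this is illustrative rather than a drop-in step):
+
+```{r normalization-sketch, eval=FALSE}
+# center and scale death_rate within each geography so that locations
+# with very different baseline levels become comparable
+normalized <- covid_case_death_rates |>
+  group_by(geo_value) |>
+  mutate(death_rate = (death_rate - mean(death_rate)) / sd(death_rate)) |>
+  ungroup()
+```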
+
+If we wanted to build a geo-aware model, such as a linear regression with a different intercept for each geography, we would need to build a [custom workflow](custom_epiworkflows) with geography as a factor.
+
+# Anatomy of a canned forecaster
+
+## Code object
+
+Let's dissect the forecaster we trained back on the [landing
+page](../index.html#motivating-example):
+
+```{r make-four-forecasts, warning=FALSE}
+four_week_ahead <- arx_forecaster(
+  covid_case_death_rates |> filter(time_value <= forecast_date),
+  outcome = "death_rate",
+  predictors = c("case_rate", "death_rate"),
+  args_list = arx_args_list(
+    lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
+    ahead = 4 * 7,
+    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
+  )
+)
 ```

-At predict time, we add this object onto the `epi_workflow` and call `forecast()`
+`four_week_ahead` has two components: an `epi_workflow`, and a table of
+`predictions`.
+The table of predictions is a simple tibble,

-```{r, warning=FALSE}
-ewf %>%
-  add_frosting(f) %>%
-  forecast()
+```{r show_predictions}
+four_week_ahead$predictions
 ```

-The above `get_test_data()` function examines the recipe and ensures that enough
-test data is available to create the necessary lags and produce a prediction
-for the desired future time point (after the end of the training data). This mimics
-what would happen if `jhu` contained the most recent available historical data and
-we wanted to actually predict the future. We could have instead used any test data
-that contained the necessary predictors.
+where `.pred` gives the point/median prediction, and `.pred_distn` is a
+`hardhat::quantile_pred()` object representing a distribution through various
+quantile levels.
+The `[6]` in the name refers to the number of quantiles that have been
+explicitly created[^4].
+By default, `.pred_distn` covers a 90% prediction interval, reporting the 5% and 95% quantiles.
+
+The `epi_workflow` is a significantly more complicated object, extending a
+`workflows::workflow()` to include post-processing steps:
+```{r show_workflow}
+four_week_ahead$epi_workflow
+```

-## Conclusion
+An `epi_workflow()` consists of 3 parts:
+
+- `Preprocessor`: a collection of steps that transform the data to be ready for
+  modelling. Steps can be custom, as are those included in this package,
+  or [be defined in `{recipes}`](https://recipes.tidymodels.org/reference/index.html).
+  `four_week_ahead` has 5 steps; you can inspect them more closely by
+  running `hardhat::extract_recipe(four_week_ahead$epi_workflow)`.[^6]
+- `Model`: the actual model that does the fitting, given by a
+  `parsnip::model_spec`. `four_week_ahead` uses the default of
+  `parsnip::linear_reg()`, which is a `{parsnip}` wrapper for
+  `stats::lm()`. You can inspect the model more closely by running
+  `hardhat::extract_fit_parsnip(four_week_ahead$epi_workflow)`.
+- `Postprocessor`: a collection of layers to be applied to the resulting
+  forecast. Layers are internal to this package. `four_week_ahead` happens to
+  have 5 of these as well. You can inspect the layers more closely by running
+  `epipredict::extract_layers(four_week_ahead$epi_workflow)`. All three
+  inspection calls are gathered in the sketch below.
+
+See the [Custom Epiworkflows vignette](custom_epiworkflows) for recreating and then
+extending `four_week_ahead` using the custom forecaster framework.
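+
+For convenience, here are the three inspection calls named above, gathered into
+a single block (`extract_fit_parsnip()` is the generic `{hardhat}` extractor
+for the fitted model):
+
+```{r inspect-components, eval=FALSE}
+hardhat::extract_recipe(four_week_ahead$epi_workflow) # preprocessor
+hardhat::extract_fit_parsnip(four_week_ahead$epi_workflow) # fitted model
+epipredict::extract_layers(four_week_ahead$epi_workflow) # post-processor
+```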
+
+## Mathematical description
+
+Let's look at the mathematical details of the model, using a minimal version of
+`four_week_ahead`:
+
+```{r four_week_again}
+four_week_small <- arx_forecaster(
+  covid_case_death_rates |> filter(time_value <= forecast_date),
+  outcome = "death_rate",
+  predictors = c("case_rate", "death_rate"),
+  args_list = arx_args_list(
+    lags = list(c(0, 7, 14), c(0, 7, 14)),
+    ahead = 4 * 7,
+    quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)
+  )
+)
+hardhat::extract_fit_engine(four_week_small$epi_workflow)
+```

-Internally, we provide some simple functions to create reasonable forecasts.
-But ideally, a user could create their own forecasters by building up the
-components we provide. In other vignettes, we try to walk through some of these
-customizations.
+If $d_t$ is the death rate on day $t$ and $c_t$ is the case rate, then the model
+we're fitting is:

-To illustrate everything above, here is (roughly) the code for the
-`flatline_forecaster()` applied to the `case_rate`.
+$$
+d_{t+28} = a_0 + a_1 d_t + a_2 d_{t-7} + a_3 d_{t-14} + a_4 c_t + a_5 c_{t-7} + a_6 c_{t-14}.
+$$

-```{r}
-r <- epi_recipe(jhu) %>%
-  step_epi_ahead(case_rate, ahead = 7, skip = TRUE) %>%
-  update_role(case_rate, new_role = "predictor") %>%
-  add_role(all_of(key_colnames(jhu)), new_role = "predictor")
+For example, $a_1$ is the coefficient on `lag_0_death_rate` above, with a value of `r hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_0_death_rate"]`,
+while $a_5$, the coefficient on `lag_7_case_rate`, is `r hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_7_case_rate"]`.

-f <- frosting() %>%
-  layer_predict() %>%
-  layer_residual_quantiles() %>%
-  layer_add_forecast_date() %>%
-  layer_add_target_date() %>%
-  layer_threshold(starts_with(".pred"))
+The training data for fitting this linear model is constructed within the `arx_forecaster()` function by shifting a series
+of columns by the appropriate amounts, based on the requested `lags`.
+Each row containing no `NA` values is then used as a training observation to fit the coefficients $a_0, \ldots, a_6$.

-eng <- linear_reg() %>% set_engine("flatline")
-wf <- epi_workflow(r, eng, f) %>% fit(jhu)
-preds <- forecast(wf)
-```
+[^4]: in the case of a `{parsnip}` engine which doesn't explicitly predict
+    quantiles, these quantiles are created using `layer_residual_quantiles()`,
+    which infers the quantiles from the residuals of the fit.

-All that really differs from the `arx_forecaster()` is the `recipe`, the
-test data, and the engine. The `frosting` is identical, as is the fitting
-and predicting procedure.
+[^5]: in the case of `arx_forecaster`, this is any model with
+    `mode = "regression"` from [this
+    list](https://www.tidymodels.org/find/parsnip/).
-```{r} -preds -``` +[^6]: alternatively, for an unfit version of the preprocessor, you can call + `hardhat::extract_preprocessor(four_week_ahead$epi_workflow)` +[^7]: the number of geographies diff --git a/vignettes/panel-data.Rmd b/vignettes/panel-data.Rmd index e99057897..eeb0fb7ad 100644 --- a/vignettes/panel-data.Rmd +++ b/vignettes/panel-data.Rmd @@ -8,7 +8,7 @@ vignette: > --- ```{r, include = FALSE} -source("_common.R") +source(here::here("vignettes/_common.R")) ``` ```{r libraries, warning=FALSE, message=FALSE} @@ -234,9 +234,9 @@ summary(extract_fit_engine(wf_linreg)) This output tells us the coefficients of the fitted model; for instance, the estimated intercept is $\widehat{\alpha}_0 =$ -`r round(coef(extract_fit_engine(wf_linreg))[1], 3)` and the coefficient for +`r round(coef(hardhat::extract_fit_engine(wf_linreg))[1], 3)` and the coefficient for $y_{tijk}$ is -$\widehat\alpha_1 =$ `r round(coef(extract_fit_engine(wf_linreg))[2], 3)`. +$\widehat\alpha_1 =$ `r round(coef(hardhat::extract_fit_engine(wf_linreg))[2], 3)`. The summary also tells us that all estimated coefficients are significantly different from zero. Extracting the 95% confidence intervals for the coefficients also leads us to diff --git a/vignettes/preprocessing-and-models.Rmd b/vignettes/preprocessing-and-models.Rmd deleted file mode 100644 index 6bff45611..000000000 --- a/vignettes/preprocessing-and-models.Rmd +++ /dev/null @@ -1,622 +0,0 @@ ---- -title: Examples of Preprocessing and Models -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{Examples of Preprocessing and Models} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r, include = FALSE} -source("_common.R") -``` - -## Introduction - -The `epipredict` package utilizes the `tidymodels` framework, namely -[`{recipes}`](https://recipes.tidymodels.org/) for -[dplyr](https://dplyr.tidyverse.org/)-like pipeable sequences -of feature engineering and [`{parsnip}`](https://parsnip.tidymodels.org/) for a -unified interface to a range of models. - -`epipredict` has additional customized feature engineering and preprocessing -steps, such as `step_epi_lag()`, `step_population_scaling()`, -`step_epi_naomit()`. They can be used along with -steps from the `{recipes}` package for more feature engineering. - -In this vignette, we will illustrate some examples of how to use `epipredict` -with `recipes` and `parsnip` for different purposes of epidemiological forecasting. -We will focus on basic autoregressive models, in which COVID cases and -deaths in the near future are predicted using a linear combination of cases and -deaths in the near past. - -The remaining vignette will be split into three sections. The first section, we -will use a Poisson regression to predict death counts. In the second section, -we will use a linear regression to predict death rates. Last but not least, we -will create a classification model for hotspot predictions. - -```{r, warning=FALSE, message=FALSE} -library(tidyr) -library(dplyr) -library(epipredict) -library(recipes) -library(workflows) -library(poissonreg) -``` - -## Poisson Regression - -During COVID-19, the US Center for Disease Control and Prevention (CDC) collected -models -and forecasts to characterize the state of an outbreak and its course. They use -it to inform public health decision makers on potential consequences of -deploying control measures. 
- -One of the outcomes that the CDC forecasts is -[death counts from COVID-19](https://www.cdc.gov/coronavirus/2019-ncov/science/forecasting/forecasting-us.html). -Although there are many state-of-the-art models, we choose to use Poisson -regression, the textbook example for modeling count data, as an illustration -for using the `epipredict` package with other existing tidymodels packages. - -The `counts_subset` dataset is available in the [`epidatasets` package](https://cmu-delphi.github.io/epidatasets/)), -and contains the number of confirmed cases and deaths from June 4, 2021 to Dec -31, 2021 in some U.S. states. It can be loaded with: - -```{r poisson-reg-data} -x <- epidatasets::counts_subset -``` - -The data can also be fetched from the Delphi API with the following query: -```{r, eval = FALSE} -library(epidatr) - -d <- as.Date("2024-03-20") - -x <- pub_covidcast( - source = "jhu-csse", - signals = "confirmed_incidence_num", - time_type = "day", - geo_type = "state", - time_values = epirange(20210604, 20211231), - geo_values = "ca,fl,tx,ny,nj", - as_of = d -) %>% - select(geo_value, time_value, cases = value) - -y <- pub_covidcast( - source = "jhu-csse", - signals = "deaths_incidence_num", - time_type = "day", - geo_type = "state", - time_values = epirange(20210604, 20211231), - geo_values = "ca,fl,tx,ny,nj", - as_of = d -) %>% - select(geo_value, time_value, deaths = value) - -x <- full_join(x, y, by = c("geo_value", "time_value")) %>% - as_epi_df(as_of = d) -``` - -We wish to predict the 7-day ahead death counts with lagged cases and deaths. -Furthermore, we will let each state be a dummy variable. Using differential -intercept coefficients, we can allow for an intercept shift between states. - -The model takes the form -\begin{aligned} -\log\left( \mu_{t+7} \right) &= \beta_0 + \delta_1 s_{\text{state}_1} + -\delta_2 s_{\text{state}_2} + \cdots + \nonumber \\ -&\quad\beta_1 \text{deaths}_{t} + -\beta_2 \text{deaths}_{t-7} + \beta_3 \text{cases}_{t} + -\beta_4 \text{cases}_{t-7}, -\end{aligned} -where $\mu_{t+7} = \mathbb{E}(y_{t+7})$, and $y_{t+7}$ is assumed to follow a -Poisson distribution with mean $\mu_{t+7}$; $s_{\text{state}}$ are dummy -variables for each state and take values of either 0 or 1. - -Preprocessing steps will be performed to prepare the -data for model fitting. But before diving into them, it will be helpful to -understand what `roles` are in the `recipes` framework. - ---- - -#### Aside on `recipes` - -`recipes` can assign one or more roles to each column in the data. The roles -are not restricted to a predefined set; they can be anything. -For most conventional situations, they are typically “predictor” and/or -"outcome". Additional roles enable targeted `step_*()` operations on specific -variables or groups of variables. - -In our case, the role `predictor` is given to explanatory variables on the -right-hand side of the model (in the equation above). -The role `outcome` is the response variable -that we wish to predict. `geo_value` and `time_value` are predefined roles -that are unique to the `epipredict` package. Since we work with `epi_df` -objects, all datasets should have `geo_value` and `time_value` passed through -automatically with these two roles assigned to the appropriate columns in the data. - -The `recipes` package also allows [manual alterations of roles](https://recipes.tidymodels.org/reference/roles.html) -in bulk. There are a few handy functions that can be used together to help us -manipulate variable roles easily. 
- -> `update_role()` alters an existing role in the recipe or assigns an initial role -> to variables that do not yet have a declared role. -> -> `add_role()` adds an additional role to variables that already have a role in -> the recipe, without overwriting old roles. -> -> `remove_role()` eliminates a single existing role in the recipe. - -#### End aside - ---- - -Notice in the following preprocessing steps, we used `add_role()` on -`geo_value_factor` since, currently, the default role for it is `raw`, but -we would like to reuse this variable as `predictor`s. - -```{r} -counts_subset <- counts_subset %>% - mutate(geo_value_factor = as.factor(geo_value)) %>% - as_epi_df() - -epi_recipe(counts_subset) - -r <- epi_recipe(counts_subset) %>% - add_role(geo_value_factor, new_role = "predictor") %>% - step_dummy(geo_value_factor) %>% - ## Occasionally, data reporting errors / corrections result in negative - ## cases / deaths - step_mutate(cases = pmax(cases, 0), deaths = pmax(deaths, 0)) %>% - step_epi_lag(cases, deaths, lag = c(0, 7)) %>% - step_epi_ahead(deaths, ahead = 7, role = "outcome") %>% - step_epi_naomit() -``` - -After specifying the preprocessing steps, we will use the `parsnip` package for -modeling and producing the prediction for death count, 7 days after the -latest available date in the dataset. - -```{r} -latest <- get_test_data(r, counts_subset) - -wf <- epi_workflow(r, parsnip::poisson_reg()) %>% - fit(counts_subset) - -predict(wf, latest) %>% filter(!is.na(.pred)) -``` - -Note that the `time_value` corresponds to the last available date in the -training set, **NOT** to the target date of the forecast -(`r max(latest$time_value) + 7`). - -Let's take a look at the fit: - -```{r} -extract_fit_engine(wf) -``` - -Up to now, we've used the Poisson regression to model count data. Poisson -regression can also be used to model rate data, such as case rates or death -rates, by incorporating offset terms in the model. - -To model death rates, the Poisson regression would be expressed as: -\begin{aligned} -\log\left( \mu_{t+7} \right) &= \log(\text{population}) + -\beta_0 + \delta_1 s_{\text{state}_1} + -\delta_2 s_{\text{state}_2} + \cdots + \nonumber \\ -&\quad\beta_1 \text{deaths}_{t} + -\beta_2 \text{deaths}_{t-7} + \beta_3 \text{cases}_{t} + -\beta_4 \text{cases}_{t-7} -\end{aligned} -where $\log(\text{population})$ is the log of the state population that was -used to scale the count data on the left-hand side of the equation. This offset -is simply a predictor with coefficient fixed at 1 rather than estimated. - -There are several ways to model rate data given count and population data. -First, in the `parsnip` framework, we could specify the formula in `fit()`. -However, by doing so we lose the ability to use the `recipes` framework to -create new variables since variables that do not exist in the -original dataset (such as, here, the lags and leads) cannot be called directly in `fit()`. - -Alternatively, `step_population_scaling()` and `layer_population_scaling()` -in the `epipredict` package can perform the population scaling if we provide the -population data, which we will illustrate in the next section. - -## Linear Regression - -For COVID-19, the CDC required submission of case and death count predictions. -However, the Delphi Group preferred to train on rate data instead, because it -puts different locations on a similar scale (eliminating the need for location-specific intercepts). 
-We can use a liner regression to predict the death -rates and use state population data to scale the rates to counts.[^pois] We will do so -using `layer_population_scaling()` from the `epipredict` package. - -[^pois]: We could continue with the Poisson model, but we'll switch to the Gaussian likelihood just for simplicity. - -Additionally, when forecasts are submitted, prediction intervals should be -provided along with the point estimates. This can be obtained via postprocessing -using -`layer_residual_quantiles()`. It is worth pointing out, however, that -`layer_residual_quantiles()` should be used before population scaling or else -the transformation will make the results uninterpretable. - -We wish, now, to predict the 7-day ahead death counts with lagged case rates and -death rates, along with some extra behaviourial predictors. Namely, we will use -survey data from -[COVID-19 Trends and Impact Survey](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/fb-survey.html#behavior-indicators). - -The survey data provides the estimated percentage of people who wore a mask for -most or all of the time while in public in the past 7 days and the estimated -percentage of respondents who reported that all or most people they encountered -in public in the past 7 days maintained a distance of at least 6 feet. - -State-wise population data from the 2019 U.S. Census will be used in -`layer_population_scaling()`. - -Both datasets are available in the [`epidatasets` package](https://cmu-delphi.github.io/epidatasets/)), -and can be loaded with: - -```{r} -behav_ind <- epidatasets::ctis_covid_behaviours -pop_dat <- epidatasets::state_census %>% select(abbr, pop) -``` - -The data can also be fetched from the Delphi API with the following query: -```{r, eval = FALSE} -library(epidatr) - -d <- as.Date("2024-03-20") - -behav_ind_mask <- pub_covidcast( - source = "fb-survey", - signals = "smoothed_wwearing_mask_7d", - time_type = "day", - geo_type = "state", - time_values = epirange(20210604, 20211231), - geo_values = "ca,fl,tx,ny,nj", - as_of = d -) %>% - select(geo_value, time_value, masking = value) - -behav_ind_distancing <- pub_covidcast( - source = "fb-survey", - signals = "smoothed_wothers_distanced_public", - time_type = "day", - geo_type = "state", - time_values = epirange(20210604, 20211231), - geo_values = "ca,fl,tx,ny,nj", - as_of = d -) %>% - select(geo_value, time_value, distancing = value) - -behav_ind <- behav_ind_mask %>% - full_join(behav_ind_distancing, by = c("geo_value", "time_value")) %>% - as_epi_df(as_of = d) - -pop_dat <- state_census %>% select(abbr, pop) -``` - -Rather than using raw mask-wearing / social-distancing metrics, for the sake -of illustration, we'll convert both into categorical predictors. - -```{r, echo=FALSE, message=FALSE,fig.align='center', fig.width=6, fig.height=4} -library(ggplot2) -behav_ind %>% - pivot_longer(masking:distancing) %>% - ggplot(aes(value, fill = geo_value)) + - geom_density(alpha = 0.5) + - scale_fill_brewer(palette = "Set1", name = "") + - theme_bw() + - scale_x_continuous(expand = c(0, 0)) + - scale_y_continuous(expand = expansion(c(0, .05))) + - facet_wrap(~name, scales = "free") + - theme(legend.position = "bottom") -``` - -We will take a subset of death rate and case rate data from the built-in dataset -`covid_case_death_rates`. 
- -```{r} -jhu <- filter( - covid_case_death_rates, - time_value >= "2021-06-04", - time_value <= "2021-12-31", - geo_value %in% c("ca", "fl", "tx", "ny", "nj") -) -``` - -Preprocessing steps will again rely on functions from the `epipredict` package as well -as the `recipes` package. -There are also many functions in the `recipes` package that allow for -[scalar transformations](https://recipes.tidymodels.org/reference/#step-functions-individual-transformations), -such as log transformations and data centering. In our case, we will -center the numerical predictors to allow for a more meaningful interpretation of the -intercept. - -```{r} -jhu <- jhu %>% - mutate(geo_value_factor = as.factor(geo_value)) %>% - left_join(behav_ind, by = c("geo_value", "time_value")) %>% - as_epi_df() - -r <- epi_recipe(jhu) %>% - add_role(geo_value_factor, new_role = "predictor") %>% - step_dummy(geo_value_factor) %>% - step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>% - step_mutate( - masking = cut_number(masking, 5), - distancing = cut_number(distancing, 5) - ) %>% - step_epi_ahead(death_rate, ahead = 7, role = "outcome") %>% - step_center(contains("lag"), role = "predictor") %>% - step_epi_naomit() -``` - -As a sanity check we can examine the structure of the training data: - -```{r, warning = FALSE} -glimpse(slice_sample(bake(prep(r, jhu), jhu), n = 6)) -``` - -Before directly predicting the results, we need to add postprocessing layers to -obtain the death counts instead of death rates. Note that the rates used so -far are "per 100K people" rather than "per person". We'll also use quantile -regression with the `quantile_reg` engine rather than ordinary least squares -to create median predictions and a 90% prediction interval. - -```{r, warning=FALSE} -f <- frosting() %>% - layer_predict() %>% - layer_add_target_date("2022-01-07") %>% - layer_threshold(.pred, lower = 0) %>% - layer_quantile_distn() %>% - layer_naomit(.pred) %>% - layer_population_scaling( - .pred, .pred_distn, - df = pop_dat, - rate_rescaling = 1e5, - by = c("geo_value" = "abbr"), - df_pop_col = "pop" - ) - -wf <- epi_workflow(r, quantile_reg()) %>% - fit(jhu) %>% - add_frosting(f) - -p <- forecast(wf) -p -``` - -The columns marked `*_scaled` have been rescaled to the correct units, in this -case `deaths` rather than deaths per 100K people (these remain in `.pred`). - -To look at the prediction intervals: - -```{r} -p %>% - select(geo_value, target_date, .pred_scaled, .pred_distn_scaled) %>% - pivot_quantiles_wider(.pred_distn_scaled) -``` - -Last but not least, let's take a look at the regression fit and check the -coefficients: - -```{r, echo =FALSE} -extract_fit_engine(wf) -``` - -## Classification - -Sometimes it is preferable to create a predictive model for surges or upswings -rather than for raw values. In this case, -the target is to predict if the future will have increased case rates (denoted `up`), -decreased case rates (`down`), or flat case rates (`flat`) relative to the current -level. Such models may be -referred to as "hotspot prediction models". We will follow the analysis -in [McDonald, Bien, Green, Hu, et al.](#references) but extend the application -to predict three categories instead of two. - -Hotspot prediction uses a categorical outcome variable defined in terms of the -relative change of $Y_{\ell, t+a}$ compared to $Y_{\ell, t}$. -Where $Y_{\ell, t}$ denotes the case rates in location $\ell$ at time $t$. 
-We define the response variables as follows: - -$$ - Z_{\ell, t}= - \begin{cases} - \text{up}, & \text{if}\ Y^{\Delta}_{\ell, t} > 0.25 \\ - \text{down}, & \text{if}\ Y^{\Delta}_{\ell, t} < -0.20\\ - \text{flat}, & \text{otherwise} - \end{cases} -$$ - -where $Y^{\Delta}_{\ell, t} = (Y_{\ell, t}- Y_{\ell, t-7})\ /\ (Y_{\ell, t-7})$. -We say location $\ell$ is a hotspot at time $t$ when $Z_{\ell,t}$ is -`up`, meaning the number of newly reported cases over the past 7 days has -increased by at least 25% compared to the preceding week. When $Z_{\ell,t}$ -is categorized as `down`, it suggests that there has been at least a 20% -decrease in newly reported cases over the past 7 days (a 20% decrease is the -inverse of a 25% increase). Otherwise, we will -consider the trend to be `flat`. - -The expression of the multinomial regression we will use is as follows: - -$$ -\pi_{j}(x) = \text{Pr}(Z_{\ell,t} = j|x) = \frac{e^{g_j(x)}}{1 + \sum_{k=1}^{2}e^{g_k(x)} } -$$ - -where $j$ is either down, flat, or up - -\begin{aligned} -g_{\text{down}}(x) &= 0.\\ -g_{\text{flat}}(x) &= \log\left(\frac{Pr(Z_{\ell,t}=\text{flat}\mid x)}{Pr(Z_{\ell,t}=\text{down}\mid x)}\right) = -\beta_{10} + \beta_{11} t + \delta_{10} s_{\text{state_1}} + -\delta_{11} s_{\text{state_2}} + \cdots \nonumber \\ -&\quad + \beta_{12} Y^{\Delta}_{\ell, t} + -\beta_{13} Y^{\Delta}_{\ell, t-7} + \beta_{14} Y^{\Delta}_{\ell, t-14}\\ -g_{\text{up}}(x) &= \log\left(\frac{Pr(Z_{\ell,t}=\text{up}\mid x)}{Pr(Z_{\ell,t}=\text{down} \mid x)}\right) = -\beta_{20} + \beta_{21}t + \delta_{20} s_{\text{state_1}} + -\delta_{21} s_{\text{state}\_2} + \cdots \nonumber \\ -&\quad + \beta_{22} Y^{\Delta}_{\ell, t} + -\beta_{23} Y^{\Delta}_{\ell, t-7} + \beta_{24} Y^{\Delta}_{\ell, t-14} -\end{aligned} - -Preprocessing steps are similar to the previous models with an additional step -of categorizing the response variables. Again, we will use a subset of death -rate and case rate data from our built-in dataset -`covid_case_death_rates`. - -```{r} -jhu <- covid_case_death_rates %>% - dplyr::filter( - time_value >= "2021-06-04", - time_value <= "2021-12-31", - geo_value %in% c("ca", "fl", "tx", "ny", "nj") - ) %>% - mutate(geo_value_factor = as.factor(geo_value)) - -r <- epi_recipe(jhu) %>% - add_role(time_value, new_role = "predictor") %>% - step_dummy(geo_value_factor) %>% - step_growth_rate(case_rate, role = "none", prefix = "gr_") %>% - step_epi_lag(starts_with("gr_"), lag = c(0, 7, 14)) %>% - step_epi_ahead(starts_with("gr_"), ahead = 7, role = "none") %>% - # note recipes::step_cut() has a bug in it, or we could use that here - step_mutate( - response = cut( - ahead_7_gr_7_rel_change_case_rate, - breaks = c(-Inf, -0.2, 0.25, Inf) / 7, # division gives weekly not daily - labels = c("down", "flat", "up") - ), - role = "outcome" - ) %>% - step_rm(has_role("none"), has_role("raw")) %>% - step_epi_naomit() -``` - -We will fit the multinomial regression and examine the predictions: - -```{r, warning=FALSE} -wf <- epi_workflow(r, multinom_reg()) %>% - fit(jhu) - -forecast(wf) %>% filter(!is.na(.pred_class)) -``` - -We can also look at the estimated coefficients and model summary information: - -```{r} -extract_fit_engine(wf) -``` - -One could also use a formula in `epi_recipe()` to achieve the same results as -above. However, only one of `add_formula()`, `add_recipe()`, or -`workflow_variables()` can be specified. 
For the purpose of demonstrating -`add_formula` rather than `add_recipe`, we will `prep` and `bake` our recipe to -return a `data.frame` that could be used for model fitting. - -```{r} -b <- bake(prep(r, jhu), jhu) - -epi_workflow() %>% - add_formula( - response ~ geo_value + time_value + lag_0_gr_7_rel_change_case_rate + - lag_7_gr_7_rel_change_case_rate + lag_14_gr_7_rel_change_case_rate - ) %>% - add_model(parsnip::multinom_reg()) %>% - fit(data = b) -``` - -## Benefits of Lagging and Leading in `epipredict` - -The `step_epi_ahead` and `step_epi_lag` functions in the `epipredict` package -is handy for creating correct lags and leads for future predictions. - -Let's start with a simple dataset and preprocessing: - -```{r} -ex <- filter( - covid_case_death_rates, - time_value >= "2021-12-01", - time_value <= "2021-12-31", - geo_value == "ca" -) - -dim(ex) -``` - -We want to predict death rates on `r max(ex$time_value) + 7`, which is 7 days ahead of the -latest available date in our dataset. - -We will compare two methods of trying to create lags and leads: - -```{r} -p1 <- epi_recipe(ex) %>% - step_epi_lag(case_rate, lag = c(0, 7, 14)) %>% - step_epi_lag(death_rate, lag = c(0, 7, 14)) %>% - step_epi_ahead(death_rate, ahead = 7, role = "outcome") %>% - step_epi_naomit() %>% - prep(ex) - -b1 <- bake(p1, ex) -b1 - - -p2 <- epi_recipe(ex) %>% - step_mutate( - lag0case_rate = lag(case_rate, 0), - lag7case_rate = lag(case_rate, 7), - lag14case_rate = lag(case_rate, 14), - lag0death_rate = lag(death_rate, 0), - lag7death_rate = lag(death_rate, 7), - lag14death_rate = lag(death_rate, 14), - ahead7death_rate = lead(death_rate, 7) - ) %>% - step_epi_naomit() %>% - prep(ex) - -b2 <- bake(p2, ex) -b2 -``` - -Notice the difference in number of rows `b1` and `b2` returns. This is because -the second version, the one that doesn't use `step_epi_ahead` and `step_epi_lag`, -has omitted dates compared to the one that used the `epipredict` functions. - -```{r} -dates_used_in_training1 <- b1 %>% - select(-ahead_7_death_rate) %>% - na.omit() %>% - pull(time_value) -dates_used_in_training1 - -dates_used_in_training2 <- b2 %>% - select(-ahead7death_rate) %>% - na.omit() %>% - pull(time_value) -dates_used_in_training2 -``` - -The model that is trained based on the `{recipes}` functions will predict 7 days ahead from -`r max(dates_used_in_training2)` -instead of 7 days ahead from `r max(dates_used_in_training1)`. - -## References - -McDonald, Bien, Green, Hu, et al. "Can auxiliary indicators improve COVID-19 -forecasting and hotspot prediction?." Proceedings of the National Academy of -Sciences 118.51 (2021): e2111453118. [doi:10.1073/pnas.2111453118](https://doi.org/10.1073/pnas.2111453118) - -## Attribution - -This object contains a modified part of the -[COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) -as [republished in the COVIDcast Epidata API.](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html) - -This data set is licensed under the terms of the -[Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/) -by the Johns Hopkins -University on behalf of its Center for Systems Science in Engineering. Copyright -Johns Hopkins University 2020. 
diff --git a/vignettes/update.Rmd b/vignettes/update.Rmd index 02fe71be7..fbda6be37 100644 --- a/vignettes/update.Rmd +++ b/vignettes/update.Rmd @@ -8,7 +8,7 @@ vignette: > --- ```{r, include = FALSE} -source("_common.R") +source(here::here("vignettes/_common.R")) ``` ```{r setup, message=FALSE}