diff --git a/_freeze/epipredict/execute-results/html.json b/_freeze/epipredict/execute-results/html.json index e6a72f4..ec3390c 100644 --- a/_freeze/epipredict/execute-results/html.json +++ b/_freeze/epipredict/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "9c99451eb5966f60d81e919fe85842d1", + "hash": "ab2a744465ae1b001b0f3eb1f7e7f4fa", "result": { - "markdown": "# Overview\n\n\n\n\n\nAt a high level, our goal with `{epipredict}` is to make running simple machine learning / statistical forecasters for epidemiology easy. However, this package is extremely extensible, and that is part of its utility. Our hope is that it is easy for users with epidemiology training and some statistics to fit baseline models while still allowing those with more nuanced statistical understanding to create complicated specializations using the same framework.\n\nServing both populations is the main motivation for our efforts, but at the same time, we have tried hard to make it useful.\n\n\n## Baseline models\n\nWe provide a set of basic, easy-to-use forecasters that work out of the box. \nYou should be able to do a reasonably limited amount of customization on them. Any serious customization happens with the framework discussed below).\n\nFor the basic forecasters, we provide: \n \n* Flatline (basic) forecaster \n* Autoregressive forecaster\n* Autoregressive classifier\n* Smooth AR forecaster\n\nAll the forcasters we provide are built on our framework. So we will use these basic models to illustrate its flexibility.\n\n## Forecasting framework\n\nAt its core, `{epipredict}` is a **framework** for creating custom forecasters.\nBy that we mean that we view the process of creating custom forecasters as\na collection of modular components. All of them should be easy to swap out\nor exchange for others, and massive variety should be available by fairly \nsimple modifications through the addition of steps or layers. \nThere are four types of components:\n \n1. Preprocessor: make transformations to the data before model training\n2. Trainer: train a model on data, resulting in a fitted model object\n3. Predictor: make predictions, using a fitted model object and processed test data\n4. Postprocessor: manipulate or transform the predictions before returning\n \nUsers familiar with [`{tidymodels}`](https://www.tidymodels.org) and especially \nthe [`{workflows}`](https://workflows.tidymodels.org) package will notice a lot \nof overlap. This is by design, and is in fact a feature. The truth is that\n`{epipredict}` is a wrapper around much that is contained in these packages.\nTherefore, if you want something from this -verse, it should \"just work\" (we hope).\n\nThe reason for the overlap is that `{workflows}` _already implements_ the first \nthree steps. And it does this very well. However, it is missing the \npostprocessing stage and currently has no plans for such an implementation. \nAnd this feature is important. The baseline forecaster we provide _requires_\npostprocessing. Anything more complicated (which is nearly everything) \nneeds this as well.\n\nThe second omission from `{tidymodels}` is support for panel data. Besides\nepidemiological data, economics, psychology, sociology, and many other areas\nfrequently deal with data of this type. So the framework of behind `{epipredict}`\nimplements this. In principle, this has nothing to do with epidemiology, and \none could simply use this package as a solution for the missing functionality in\n`{tidymodels}`. 
Again, this should \"just work\" (we hope).\n\nAll of the _panel data_ functionality is implemented through the `epi_df` data type\ndescribed in the previous part. If you have different panel data, just force it\ninto an `epi_df` as described in @sec-additional-keys.\n\n## Why doesn't this package already exist?\n\n- Parts of it actually DO exist. There's a universe called `tidymodels`. It \nhandles pre-processing, training, and prediction, bound together, through a \npackage called workflows. We built `epipredict` on top of that setup. In this \nway, you CAN use almost everything they provide.\n- However, workflows doesn't do post-processing to the extent envisioned here.\nAnd nothing in `tidymodels` handles panel data.\n- The tidy-team doesn't have plans to do either of these things. (We checked).\n- There are two packages that do time series built on `tidymodels`, but it's \n\"basic\" time series: 1-step AR models, exponential smoothing, STL decomposition,\netc.[^1] \n\n[^1]: Our group has not prioritized these sorts of models for epidemic \nforecasting, but one could also integrate these methods into our framework.\n\n\n## Show me the basics\n\nFor now, we'll just demonstrate one of the \"canned\" forecasters we provide: an autoregressive forecaster with (or without) covariates that _directly_ trains on the response. This is in contrast to a typical \"iterative\" AR model that trains to predict one-step-ahead, and then plugs in the predictions to \"leverage up\" to longer horizons. You saw this function in @sec-local-forecaster, but now we'll explain\nthe arguments a bit more thoroughly. Below, you'll see how to make a number of modifications to this\nforecaster, but understanding the inner workings, and **why** you would want\nsomething like this (as well as how to do elaborate customizations) \nwill be the subject of the rest of this book. \n\nWe'll use some of the same data we've examined earlier and estimate a model jointly across all locations using only the most recent 30 days of data (available\nin the built-in data frame).\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/demo-workflow_c1aed8396eb1b411c4ab154ecd15e222'}\n\n```{.r .cell-code}\njhu <- case_death_rate_subset %>%\n filter(time_value >= max(time_value) - 30)\n\nout <- arx_forecaster(\n jhu,\n outcome = \"death_rate\",\n predictors = c(\"case_rate\", \"death_rate\")\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n#> Warning: The forecast_date is less than the most recent update date of the\n#> data.forecast_date = 2021-12-31 while data is from 2022-05-31.\n```\n:::\n:::\n\n\nThis call produces a warning, which we'll ignore for now. But essentially, it's telling us that our data comes from May 2022 but we're trying to do a forecast for January 2022. The result is likely not an accurate measure of real-time forecast performance, because the data have been revised over time. 
\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/unnamed-chunk-2_7b68ec6f4741ca9ebb25c7a13be54061'}\n\n```{.r .cell-code}\nout\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> ══ A basic forecaster of type ARX Forecaster ════════════════════════════════\n#> \n#> This forecaster was fit on 2023-06-19 20:31:07\n#> \n#> Training data was an `epi_df` with\n#> • Geography: state,\n#> • Time type: day,\n#> • Using data up-to-date as of: 2022-05-31 12:08:25.\n#> \n#> ── Predictions ──────────────────────────────────────────────────────────────\n#> \n#> A total of 56 predictions are available for\n#> • 56 unique geographic regions,\n#> • At forecast dates: 2021-12-31,\n#> • For target dates: 2022-01-07.\n```\n:::\n:::\n\n\nPrinting the S3 object provides a bunch of summary information describing the \noriginal training data used to estimate the model as well as some information\nof what the predictions are for. It contains three main components:\n \n1. Metadata about the training data and when the forecast was created\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/unnamed-chunk-3_2d86a3bbe62cc3a5e7b6c3e02059257d'}\n\n```{.r .cell-code}\nstr(out$metadata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> List of 2\n#> $ training :List of 3\n#> ..$ geo_type : chr \"state\"\n#> ..$ time_type: chr \"day\"\n#> ..$ as_of : POSIXct[1:1], format: \"2022-05-31 12:08:25\"\n#> $ forecast_created: POSIXct[1:1], format: \"2023-06-19 20:31:07\"\n```\n:::\n:::\n\n2. The predictions in a tibble. The columns give the predictions for each location along with additional columns. By default, these are a 90% predictive interval, the `forecast_date` (the date on which the forecast was putatively made) and the `target_date` (the date for which the forecast is being made).\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/unnamed-chunk-4_75eac823f860d76a7dd42bdea9e94ee1'}\n\n```{.r .cell-code}\nout$predictions\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 56 × 5\n#> geo_value .pred .pred_distn forecast_date target_date\n#> \n#> 1 ak 0.355 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 2 al 0.325 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 3 ar 0.496 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 4 as 0.0836 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 5 az 0.614 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 6 ca 0.327 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> # ℹ 50 more rows\n```\n:::\n:::\n\n3. An S3 object of class `epi_workflow`. This object encapsulates all the instructions necessary to create the prediction. 
More details on this below.\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/unnamed-chunk-5_f9e3ce37b5ca3e9f59bd1977f249b01f'}\n\n```{.r .cell-code}\nout$epi_workflow\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Epi Workflow [trained] ═══════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: linear_reg()\n#> Postprocessor: Frosting\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 6 Recipe Steps\n#> \n#> • step_epi_lag()\n#> • step_epi_lag()\n#> • step_epi_ahead()\n#> • step_naomit()\n#> • step_naomit()\n#> • step_training_window()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> \n#> Call:\n#> stats::lm(formula = ..y ~ ., data = data)\n#> \n#> Coefficients:\n#> (Intercept) lag_0_case_rate lag_7_case_rate lag_14_case_rate \n#> 0.0829475 0.0009830 0.0027035 -0.0005651 \n#> lag_0_death_rate lag_7_death_rate lag_14_death_rate \n#> 0.2466110 0.1964921 0.0752998 \n#> \n#> ── Postprocessor ────────────────────────────────────────────────────────────\n#> 5 Frosting Layers\n#> \n#> • layer_predict()\n#> • layer_residual_quantiles()\n#> • layer_add_forecast_date()\n#> • layer_add_target_date()\n#> • layer_threshold()\n```\n:::\n:::\n\n\nBy default, the forecaster predicts the outcome (`death_rate`) 1-week ahead, \nusing 3 lags of each predictor (`case_rate` and `death_rate`) at 0 (today), \n1 week back and 2 weeks back. The predictors and outcome can be changed \ndirectly. The rest of the defaults are encapsulated into a list of arguments. \nThis list is produced by `arx_args_list()`.\n\n## Simple adjustments\n\nBasic adjustments can be made through the `args_list`.\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/differential-lags_6b8588f2553c80d191c85d09836876f9'}\n\n```{.r .cell-code}\nout2week <- arx_forecaster(\n epi_data = jhu,\n outcome = \"death_rate\",\n predictors = c(\"case_rate\", \"death_rate\"),\n args_list = arx_args_list(\n lags = list(case_rate = c(0, 1, 2, 3, 7, 14), death_rate = c(0, 7, 14)),\n ahead = 14\n )\n)\n```\n:::\n\n\nHere, we've used different lags on the `case_rate` and are now predicting 2 \nweeks ahead. Note that `lags` and `aheads` are in the same units as the \n`time_value` of the `epi_df` used for training (same as the `epi_slide()` \narguments discussed in @sec-sliding). This example also illustrates\na major difficulty with the \"iterative\" versions of AR models. This model \ndoesn't produce forecasts for `case_rate`, and so, would not have data to \n\"plug in\" for the necessary lags.[^2]\n\n[^2]: An obvious fix is to instead use a VAR and predict both, but this would \nlikely increase the variance of the model, and therefore, may lead to less \naccurate forecasts for the variable of interest.\n\n\nAnother property of the basic model is the predictive interval. We describe this in more detail in a coming chapter, but it is easy to request multiple quantiles.\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/differential-levels_a9da683d7e7fef5e4cb6288ad9899809'}\n\n```{.r .cell-code}\nout_q <- arx_forecaster(jhu, \"death_rate\", c(\"case_rate\", \"death_rate\"),\n args_list = arx_args_list(\n levels = c(.01, .025, seq(.05, .95, by = .05), .975, .99)\n )\n)\n```\n:::\n\n\nThe column `.pred_dstn` in the `predictions` object is actually a \"distribution\" here parameterized by its quantiles. 
For this default forecaster, these are created using the quantiles of the residuals of the predictive model (possibly symmetrized). Here, we used 23 quantiles, but one can grab a particular quantile,\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/q1_5df8261d92421ead6dd2d77e0a127517'}\n\n```{.r .cell-code}\nhead(quantile(out_q$predictions$.pred_distn, p = .4))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> 40% 40% 40% 40% 40% 40% \n#> 0.30277798 0.27213225 0.44345734 0.03120647 0.56121844 0.27492711\n```\n:::\n:::\n\n\nor extract the entire distribution into a \"long\" `epi_df` with `tau` being the probability and `q` being the value associated to that quantile.\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/q2_439cb3bc49eb03d8b4c34070ac5ba21d'}\n\n```{.r .cell-code}\nout_q$predictions %>%\n # first create a \"nested\" list-column\n mutate(.pred_distn = nested_quantiles(.pred_distn)) %>%\n unnest(.pred_distn) # then unnest it\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 1,288 × 6\n#> geo_value .pred q tau forecast_date target_date\n#> \n#> 1 ak 0.355 0 0.01 2021-12-31 2022-01-07 \n#> 2 ak 0.355 0 0.025 2021-12-31 2022-01-07 \n#> 3 ak 0.355 0.0371 0.05 2021-12-31 2022-01-07 \n#> 4 ak 0.355 0.123 0.1 2021-12-31 2022-01-07 \n#> 5 ak 0.355 0.174 0.15 2021-12-31 2022-01-07 \n#> 6 ak 0.355 0.211 0.2 2021-12-31 2022-01-07 \n#> # ℹ 1,282 more rows\n```\n:::\n:::\n\n\nAdditional simple adjustments to the basic forecaster can be made using the function:\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/unnamed-chunk-6_07ecdd97f1e2f61c667029fbe5d0d406'}\n\n```{.r .cell-code}\narx_args_list(\n lags = c(0L, 7L, 14L), ahead = 7L, n_training = Inf,\n forecast_date = NULL, target_date = NULL, levels = c(0.05, 0.95),\n symmetrize = TRUE, nonneg = TRUE, quantile_by_key = \"geo_value\"\n)\n```\n:::\n\n\n## Changing the engine\n\nSo far, our forecasts have been produced using simple linear regression. But this is not the only way to estimate such a model.\nThe `trainer` argument determines the type of model we want. \nThis takes a [`{parsnip}`](https://parsnip.tidymodels.org) model. 
The default is linear regression, but we could instead use a random forest with the `{ranger}` package:\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/ranger_165b974f4c4580092d3398b1d2bee018'}\n\n```{.r .cell-code}\nout_rf <- arx_forecaster(\n jhu, \"death_rate\", c(\"case_rate\", \"death_rate\"),\n rand_forest(mode = \"regression\")\n)\n```\n:::\n\n\nOr boosted regression trees with `{xgboost}`:\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/xgboost_59219c946dd5689d23feb924a64c64be'}\n\n```{.r .cell-code}\nout_gb <- arx_forecaster(\n jhu, \"death_rate\", c(\"case_rate\", \"death_rate\"),\n boost_tree(mode = \"regression\", trees = 20)\n)\n```\n:::\n\n\nOr quantile regression, using our custom forecasting engine `quantile_reg()`:\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/quantreg_4e2d216ee037905d2417be2d184f5664'}\n\n```{.r .cell-code}\nout_gb <- arx_forecaster(\n jhu, \"death_rate\", c(\"case_rate\", \"death_rate\"),\n quantile_reg()\n)\n```\n:::\n\n\nFWIW, this last case (using quantile regression), is not far from what the Delphi production forecast team used for its Covid forecasts over the past few years.\n\n", + "markdown": "# Overview\n\n\n\n\n\nAt a high level, our goal with `{epipredict}` is to make running simple machine learning / statistical forecasters for epidemiology easy. However, this package is extremely extensible, and that is part of its utility. Our hope is that it is easy for users with epidemiology training and some statistics to fit baseline models while still allowing those with more nuanced statistical understanding to create complicated specializations using the same framework.\n\nServing both populations is the main motivation for our efforts, but at the same time, we have tried hard to make the package useful in practice.\n\n\n## Baseline models\n\nWe provide a set of basic, easy-to-use forecasters that work out of the box. \nYou should be able to do a limited amount of customization on them. Any serious customization happens within the framework discussed below.\n\nFor the basic forecasters, we provide: \n \n* Flatline (basic) forecaster \n* Autoregressive forecaster\n* Autoregressive classifier\n* Smooth AR forecaster\n\nAll the forecasters we provide are built on our framework, so we will use these basic models to illustrate its flexibility.\n\n## Forecasting framework\n\nAt its core, `{epipredict}` is a **framework** for creating custom forecasters.\nBy that we mean that we view the process of creating custom forecasters as\na collection of modular components. All of them should be easy to swap out\nor exchange for others, and massive variety should be available by fairly \nsimple modifications through the addition of steps or layers. \nThere are four types of components:\n \n1. Preprocessor: make transformations to the data before model training\n2. Trainer: train a model on data, resulting in a fitted model object\n3. Predictor: make predictions, using a fitted model object and processed test data\n4. Postprocessor: manipulate or transform the predictions before returning\n \nUsers familiar with [`{tidymodels}`](https://www.tidymodels.org) and especially \nthe [`{workflows}`](https://workflows.tidymodels.org) package will notice a lot \nof overlap. This is by design, and is in fact a feature. 
The truth is that\n`{epipredict}` is a wrapper around much that is contained in these packages.\nTherefore, if you want something from this -verse, it should \"just work\" (we hope).\n\nThe reason for the overlap is that `{workflows}` _already implements_ the first \nthree steps. And it does this very well. However, it is missing the \npostprocessing stage and currently has no plans for such an implementation. \nAnd this feature is important. The baseline forecaster we provide _requires_\npostprocessing. Anything more complicated (which is nearly everything) \nneeds this as well.\n\nThe second omission from `{tidymodels}` is support for panel data. Besides\nepidemiological data, economics, psychology, sociology, and many other areas\nfrequently deal with data of this type. So the framework behind `{epipredict}`\nimplements this. In principle, this has nothing to do with epidemiology, and \none could simply use this package as a solution for the missing functionality in\n`{tidymodels}`. Again, this should \"just work\" (we hope).\n\nAll of the _panel data_ functionality is implemented through the `epi_df` data type\ndescribed in the previous part. If you have different panel data, just force it\ninto an `epi_df` as described in @sec-additional-keys.\n\n## Why doesn't this package already exist?\n\n- Parts of it actually DO exist. There's a universe called `tidymodels`. It \nhandles pre-processing, training, and prediction, bound together, through a \npackage called workflows. We built `epipredict` on top of that setup. In this \nway, you CAN use almost everything they provide.\n- However, workflows doesn't do post-processing to the extent envisioned here.\nAnd nothing in `tidymodels` handles panel data.\n- The tidy-team doesn't have plans to do either of these things. (We checked).\n- There are two packages that do time series built on `tidymodels`, but it's \n\"basic\" time series: 1-step AR models, exponential smoothing, STL decomposition,\netc.[^1] \n\n[^1]: Our group has not prioritized these sorts of models for epidemic \nforecasting, but one could also integrate these methods into our framework.\n\n\n## Show me the basics\n\nFor now, we'll just demonstrate one of the \"canned\" forecasters we provide: an autoregressive forecaster with (or without) covariates that _directly_ trains on the response. This is in contrast to a typical \"iterative\" AR model that trains to predict one-step-ahead, and then plugs in the predictions to \"leverage up\" to longer horizons. You saw this function in @sec-local-forecaster, but now we'll explain\nthe arguments a bit more thoroughly. Below, you'll see how to make a number of modifications to this\nforecaster, but understanding the inner workings, and **why** you would want\nsomething like this (as well as how to do elaborate customizations) \nwill be the subject of the rest of this book. 
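\n\nBefore diving into the example, note the packages these chapters lean on. This is a minimal setup sketch (we assume `{epipredict}` is installed; the tidyverse packages are only needed for the data manipulation shown below):\n\n```r\nlibrary(epipredict) # canned forecasters and the forecasting framework\nlibrary(dplyr) # filter(), mutate(), and friends\nlibrary(tidyr) # unnest(), used when unpacking quantile columns\n```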
\n\nWe'll use some of the same data we've examined earlier and estimate a model jointly across all locations using only the most recent 30 days of data (available\nin the built-in data frame).\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/demo-workflow_c1aed8396eb1b411c4ab154ecd15e222'}\n\n```{.r .cell-code}\njhu <- case_death_rate_subset %>%\n filter(time_value >= max(time_value) - 30)\n\nout <- arx_forecaster(\n jhu,\n outcome = \"death_rate\",\n predictors = c(\"case_rate\", \"death_rate\")\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n#> Warning: The forecast_date is less than the most recent update date of the\n#> data.forecast_date = 2021-12-31 while data is from 2022-05-31.\n```\n:::\n:::\n\n\nThis call produces a warning, which we'll ignore for now. But essentially, it's telling us that our data comes from May 2022 but we're trying to do a forecast for January 2022. The result is likely not an accurate measure of real-time forecast performance, because the data have been revised over time. \n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/unnamed-chunk-2_7b68ec6f4741ca9ebb25c7a13be54061'}\n\n```{.r .cell-code}\nout\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> ══ A basic forecaster of type ARX Forecaster ════════════════════════════════\n#> \n#> This forecaster was fit on 2023-07-10 02:02:45\n#> \n#> Training data was an `epi_df` with\n#> • Geography: state,\n#> • Time type: day,\n#> • Using data up-to-date as of: 2022-05-31 12:08:25.\n#> \n#> ── Predictions ──────────────────────────────────────────────────────────────\n#> \n#> A total of 56 predictions are available for\n#> • 56 unique geographic regions,\n#> • At forecast dates: 2021-12-31,\n#> • For target dates: 2022-01-07.\n```\n:::\n:::\n\n\nPrinting the S3 object provides a bunch of summary information describing the \noriginal training data used to estimate the model as well as some information\nabout what the predictions are for. It contains three main components:\n \n1. Metadata about the training data and when the forecast was created\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/unnamed-chunk-3_2d86a3bbe62cc3a5e7b6c3e02059257d'}\n\n```{.r .cell-code}\nstr(out$metadata)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> List of 2\n#> $ training :List of 3\n#> ..$ geo_type : chr \"state\"\n#> ..$ time_type: chr \"day\"\n#> ..$ as_of : POSIXct[1:1], format: \"2022-05-31 12:08:25\"\n#> $ forecast_created: POSIXct[1:1], format: \"2023-07-10 02:02:45\"\n```\n:::\n:::\n\n2. The predictions in a tibble. The columns give the predictions for each location along with additional columns. By default, these are a 90% predictive interval, the `forecast_date` (the date on which the forecast was putatively made) and the `target_date` (the date for which the forecast is being made).\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/unnamed-chunk-4_75eac823f860d76a7dd42bdea9e94ee1'}\n\n```{.r .cell-code}\nout$predictions\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 56 × 5\n#> geo_value .pred .pred_distn forecast_date target_date\n#> \n#> 1 ak 0.355 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 2 al 0.325 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 3 ar 0.496 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 4 as 0.0836 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 5 az 0.614 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 6 ca 0.327 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> # ℹ 50 more rows\n```\n:::\n:::\n\n3. An S3 object of class `epi_workflow`. 
This object encapsulates all the instructions necessary to create the prediction. More details on this below.\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/unnamed-chunk-5_f9e3ce37b5ca3e9f59bd1977f249b01f'}\n\n```{.r .cell-code}\nout$epi_workflow\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Epi Workflow [trained] ═══════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: linear_reg()\n#> Postprocessor: Frosting\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 6 Recipe Steps\n#> \n#> • step_epi_lag()\n#> • step_epi_lag()\n#> • step_epi_ahead()\n#> • step_naomit()\n#> • step_naomit()\n#> • step_training_window()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> \n#> Call:\n#> stats::lm(formula = ..y ~ ., data = data)\n#> \n#> Coefficients:\n#> (Intercept) lag_0_case_rate lag_7_case_rate lag_14_case_rate \n#> 0.0829475 0.0009830 0.0027035 -0.0005651 \n#> lag_0_death_rate lag_7_death_rate lag_14_death_rate \n#> 0.2466110 0.1964921 0.0752998 \n#> \n#> ── Postprocessor ────────────────────────────────────────────────────────────\n#> 5 Frosting Layers\n#> \n#> • layer_predict()\n#> • layer_residual_quantiles()\n#> • layer_add_forecast_date()\n#> • layer_add_target_date()\n#> • layer_threshold()\n```\n:::\n:::\n\n\nBy default, the forecaster predicts the outcome (`death_rate`) 1-week ahead, \nusing 3 lags of each predictor (`case_rate` and `death_rate`) at 0 (today), \n1 week back and 2 weeks back. The predictors and outcome can be changed \ndirectly. The rest of the defaults are encapsulated into a list of arguments. \nThis list is produced by `arx_args_list()`.\n\n## Simple adjustments\n\nBasic adjustments can be made through the `args_list`.\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/differential-lags_6b8588f2553c80d191c85d09836876f9'}\n\n```{.r .cell-code}\nout2week <- arx_forecaster(\n epi_data = jhu,\n outcome = \"death_rate\",\n predictors = c(\"case_rate\", \"death_rate\"),\n args_list = arx_args_list(\n lags = list(case_rate = c(0, 1, 2, 3, 7, 14), death_rate = c(0, 7, 14)),\n ahead = 14\n )\n)\n```\n:::\n\n\nHere, we've used different lags on the `case_rate` and are now predicting 2 \nweeks ahead. Note that `lags` and `ahead` are in the same units as the \n`time_value` of the `epi_df` used for training (same as the `epi_slide()` \narguments discussed in @sec-sliding). This example also illustrates\na major difficulty with the \"iterative\" versions of AR models. This model \ndoesn't produce forecasts for `case_rate`, and so, would not have data to \n\"plug in\" for the necessary lags.[^2]\n\n[^2]: An obvious fix is to instead use a VAR and predict both, but this would \nlikely increase the variance of the model, and therefore, may lead to less \naccurate forecasts for the variable of interest.\n\n\nAnother property of the basic model is the predictive interval. We describe this in more detail in a coming chapter, but it is easy to request multiple quantiles.\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/differential-levels_a9da683d7e7fef5e4cb6288ad9899809'}\n\n```{.r .cell-code}\nout_q <- arx_forecaster(jhu, \"death_rate\", c(\"case_rate\", \"death_rate\"),\n args_list = arx_args_list(\n levels = c(.01, .025, seq(.05, .95, by = .05), .975, .99)\n )\n)\n```\n:::\n\n\nThe column `.pred_distn` in the `predictions` object is actually a \"distribution\" here parameterized by its quantiles. 
For this default forecaster, these are created using the quantiles of the residuals of the predictive model (possibly symmetrized). Here, we used 23 quantiles, but one can grab a particular quantile,\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/q1_5df8261d92421ead6dd2d77e0a127517'}\n\n```{.r .cell-code}\nhead(quantile(out_q$predictions$.pred_distn, p = .4))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> 40% 40% 40% 40% 40% 40% \n#> 0.30277798 0.27213225 0.44345734 0.03120647 0.56121844 0.27492711\n```\n:::\n:::\n\n\nor extract the entire distribution into a \"long\" `epi_df` with `tau` being the probability and `q` being the value associated with that quantile.\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/q2_439cb3bc49eb03d8b4c34070ac5ba21d'}\n\n```{.r .cell-code}\nout_q$predictions %>%\n # first create a \"nested\" list-column\n mutate(.pred_distn = nested_quantiles(.pred_distn)) %>%\n unnest(.pred_distn) # then unnest it\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 1,288 × 6\n#> geo_value .pred q tau forecast_date target_date\n#> \n#> 1 ak 0.355 0 0.01 2021-12-31 2022-01-07 \n#> 2 ak 0.355 0 0.025 2021-12-31 2022-01-07 \n#> 3 ak 0.355 0.0371 0.05 2021-12-31 2022-01-07 \n#> 4 ak 0.355 0.123 0.1 2021-12-31 2022-01-07 \n#> 5 ak 0.355 0.174 0.15 2021-12-31 2022-01-07 \n#> 6 ak 0.355 0.211 0.2 2021-12-31 2022-01-07 \n#> # ℹ 1,282 more rows\n```\n:::\n:::\n\n\nAdditional simple adjustments to the basic forecaster can be made through the arguments of `arx_args_list()`, whose defaults are:\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/unnamed-chunk-6_07ecdd97f1e2f61c667029fbe5d0d406'}\n\n```{.r .cell-code}\narx_args_list(\n lags = c(0L, 7L, 14L), ahead = 7L, n_training = Inf,\n forecast_date = NULL, target_date = NULL, levels = c(0.05, 0.95),\n symmetrize = TRUE, nonneg = TRUE, quantile_by_key = \"geo_value\"\n)\n```\n:::\n\n\n## Changing the engine\n\nSo far, our forecasts have been produced using simple linear regression. But this is not the only way to estimate such a model.\nThe `trainer` argument determines the type of model we want. \nThis takes a [`{parsnip}`](https://parsnip.tidymodels.org) model. 
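\n\nAs a reminder, a parsnip model is just a specification: a model type, a mode, and a computational engine. Nothing is fit until the forecaster uses it. Here is a minimal sketch of the specification equivalent to the default trainer (the printout above shows `linear_reg()` fit via `stats::lm()`):\n\n```r\nlibrary(parsnip)\n# linear regression, fit with stats::lm(): the default trainer\nlinear_reg() %>% set_engine(\"lm\")\n```\n\n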
The default is linear regression, but we could instead use a random forest with the `{ranger}` package:\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/ranger_165b974f4c4580092d3398b1d2bee018'}\n\n```{.r .cell-code}\nout_rf <- arx_forecaster(\n jhu, \"death_rate\", c(\"case_rate\", \"death_rate\"),\n rand_forest(mode = \"regression\")\n)\n```\n:::\n\n\nOr boosted regression trees with `{xgboost}`:\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/xgboost_59219c946dd5689d23feb924a64c64be'}\n\n```{.r .cell-code}\nout_gb <- arx_forecaster(\n jhu, \"death_rate\", c(\"case_rate\", \"death_rate\"),\n boost_tree(mode = \"regression\", trees = 20)\n)\n```\n:::\n\n\nOr quantile regression, using our custom forecasting engine `quantile_reg()`:\n\n\n::: {.cell layout-align=\"center\" hash='epipredict_cache/html/quantreg_8040389d329f60e5a8ff9eabea6c01ff'}\n\n```{.r .cell-code}\nout_qr <- arx_forecaster(\n jhu, \"death_rate\", c(\"case_rate\", \"death_rate\"),\n quantile_reg()\n)\n```\n:::\n\n\nFWIW, this last case (using quantile regression), is not far from what the Delphi production forecast team used for its Covid forecasts over the past few years.\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/flatline-forecaster/execute-results/html.json b/_freeze/flatline-forecaster/execute-results/html.json index b3c1406..07e4be9 100644 --- a/_freeze/flatline-forecaster/execute-results/html.json +++ b/_freeze/flatline-forecaster/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "eb1205f9e5fddf491d80688b96439477", + "hash": "139703e42fc15d16af77dfd4968da99c", "result": { - "markdown": "# Introducing the flatline forecaster\n\nThe flatline forecaster is a very simple forecasting model intended for `epi_df` data, where the most recent observation is used as the forecast for any future date. In other words, the last observation is propagated forward. Hence, a flat line phenomenon is observed for the point predictions. The predictive intervals are produced from the quantiles of the residuals of such a forecast over all of the training data. By default, these intervals will be obtained separately for each combination of keys (`geo_value` and any additional keys) in the `epi_df`. Thus, the output is a data frame of point (and optionally interval) forecasts at a single unique horizon (`ahead`) for each unique combination of key variables. This forecaster is comparable to the baseline used by the [COVID Forecast Hub](https://covid19forecasthub.org).\n\n## Example of using the flatline forecaster\n\n\n::: {.cell hash='flatline-forecaster_cache/html/unnamed-chunk-1_7ecbc96792f1278d5e72283951cc2098'}\n\n:::\n\n\n\nWe will continue to use the `case_death_rate_subset` dataset that comes with the\n`epipredict` package. In brief, this is a subset of the JHU daily COVID-19 cases\nand deaths by state. 
While this dataset ranges from Dec 31, 2020 to Dec 31, \n2021, we will only consider a small subset at the end of that range to keep our\nexample relatively simple.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-2_bdb1903df82234e00ed17b2089a9dcf7'}\n\n```{.r .cell-code}\njhu <- case_death_rate_subset %>%\n dplyr::filter(time_value >= as.Date(\"2021-09-01\"))\n\njhu\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> An `epi_df` object, 6,832 x 4 with metadata:\n#> * geo_type = state\n#> * time_type = day\n#> * as_of = 2022-05-31 12:08:25.791826\n#> \n#> # A tibble: 6,832 × 4\n#> geo_value time_value case_rate death_rate\n#> * \n#> 1 ak 2021-09-01 75.3 0.198\n#> 2 al 2021-09-01 113. 0.845\n#> 3 ar 2021-09-01 68.5 0.919\n#> 4 as 2021-09-01 0 0 \n#> 5 az 2021-09-01 48.8 0.414\n#> 6 ca 2021-09-01 38.4 0.246\n#> # ℹ 6,826 more rows\n```\n:::\n:::\n\n\n### The basic mechanics of the flatline forecaster\n\nThe simplest way to create and train a flatline forecaster to predict the d\neath rate one week into the future, is to input the `epi_df` and the name of \nthe column from it that we want to predict in the `flatline_forecaster` function.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-3_d9bff13811c7ad711033cc306bce0068'}\n\n```{.r .cell-code}\none_week_ahead <- flatline_forecaster(jhu, outcome = \"death_rate\")\none_week_ahead\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> ══ A basic forecaster of type flatline ══════════════════════════════════════\n#> \n#> This forecaster was fit on 2023-06-19 20:31:13\n#> \n#> Training data was an `epi_df` with\n#> • Geography: state,\n#> • Time type: day,\n#> • Using data up-to-date as of: 2022-05-31 12:08:25.\n#> \n#> ── Predictions ──────────────────────────────────────────────────────────────\n#> \n#> A total of 56 predictions are available for\n#> • 56 unique geographic regions,\n#> • At forecast dates: 2021-12-31,\n#> • For target dates: 2022-01-07.\n```\n:::\n:::\n\n\nThe result is both a fitted model object which could be used any time in the \nfuture to create different forecasts, as well as a set of predicted values and\nprediction intervals for each location 7 days after the last available time\nvalue in the data, which is Dec 31, 2021. Note that 7 days is the default\nnumber of time steps ahead of the forecast date in which forecasts should be\nproduced. To change this, you must change the value of the `ahead` parameter\nin the list of additional arguments `flatline_args_list()`. 
Let's change this\nto 5 days to get some practice.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-4_013bf1041937f63c3b495c7ed5b211b9'}\n\n```{.r .cell-code}\nfive_days_ahead <- flatline_forecaster(\n jhu,\n outcome = \"death_rate\",\n flatline_args_list(ahead = 5L)\n)\n\nfive_days_ahead\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> ══ A basic forecaster of type flatline ══════════════════════════════════════\n#> \n#> This forecaster was fit on 2023-06-19 20:31:14\n#> \n#> Training data was an `epi_df` with\n#> • Geography: state,\n#> • Time type: day,\n#> • Using data up-to-date as of: 2022-05-31 12:08:25.\n#> \n#> ── Predictions ──────────────────────────────────────────────────────────────\n#> \n#> A total of 56 predictions are available for\n#> • 56 unique geographic regions,\n#> • At forecast dates: 2021-12-31,\n#> • For target dates: 2022-01-05.\n```\n:::\n:::\n\n\nWe could also specify that we want a 80% predictive interval by changing the \nlevels. The default 0.05 and 0.95 levels/quantiles give us 90% predictive \ninterval.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-5_04fc200f272f42d862a3c5693cc610b6'}\n\n```{.r .cell-code}\nfive_days_ahead <- flatline_forecaster(\n jhu,\n outcome = \"death_rate\",\n flatline_args_list(ahead = 5L, levels = c(0.1, 0.9))\n)\n\nfive_days_ahead\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> ══ A basic forecaster of type flatline ══════════════════════════════════════\n#> \n#> This forecaster was fit on 2023-06-19 20:31:14\n#> \n#> Training data was an `epi_df` with\n#> • Geography: state,\n#> • Time type: day,\n#> • Using data up-to-date as of: 2022-05-31 12:08:25.\n#> \n#> ── Predictions ──────────────────────────────────────────────────────────────\n#> \n#> A total of 56 predictions are available for\n#> • 56 unique geographic regions,\n#> • At forecast dates: 2021-12-31,\n#> • For target dates: 2022-01-05.\n```\n:::\n:::\n\n\nTo see the other arguments that you may modify, please see `?flatline_args_list()`. For now, we will move on to looking at the workflow.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-6_23d12864d38c6b2fb9458907db32530f'}\n\n```{.r .cell-code}\nfive_days_ahead$epi_workflow\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Epi Workflow [trained] ═══════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: linear_reg()\n#> Postprocessor: Frosting\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 2 Recipe Steps\n#> \n#> • step_epi_ahead()\n#> • step_training_window()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> Flatline forecaster\n#> \n#> Predictions produced by geo_value resulting in 56 total forecasts.\n#> A total of 7112 residuals are available from the training set.\n#> \n#> ── Postprocessor ────────────────────────────────────────────────────────────\n#> 5 Frosting Layers\n#> \n#> • layer_predict()\n#> • layer_residual_quantiles()\n#> • layer_add_forecast_date()\n#> • layer_add_target_date()\n#> • layer_threshold()\n```\n:::\n:::\n\n\nThe fitted model here was based on minimal pre-processing of the data, \nestimating a flatline model, and then post-processing the results to be \nmeaningful for epidemiological tasks. 
To look deeper into the pre-processing, \nmodel and processing parts individually, you may use the `$` operator after `epi_workflow`. For example, let's examine the pre-processing part in more detail.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-7_7ebad54744c08badf9d3dac19334894f'}\n\n```{.r .cell-code}\nlibrary(workflows)\nextract_preprocessor(five_days_ahead$epi_workflow)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-8_3b5f80e746b0846bc6204aa82a69e29e'}\n\n```\n#> \n#> ── Recipe ───────────────────────────────────────────────────────────────────\n#> \n#> ── Inputs\n#> Number of variables by role\n#> predictor: 3\n#> geo_value: 1\n#> raw: 1\n#> time_value: 1\n#> \n#> ── Operations\n#> • Leading: death_rate by 5\n#> • # of recent observations per key limited to:: Inf\n```\n:::\n\n\n\nUnder Operations, we can see that the pre-processing operations were to lead the\ndeath rate by 5 days (`step_epi_ahead()`) and that the \\# of recent observations\nused in the training window were not limited (in `step_training_window()` as\n`n_training = Inf` in `flatline_args_list()`). You should also see the\nmolded/pre-processed training data.\n\nFor symmetry, let's have a look at the post-processing.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-9_0808824fe21d98fc697b5f19bc27352c'}\n\n```{.r .cell-code}\nextract_frosting(five_days_ahead$epi_workflow)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-10_72d5eb3c2909ae3ae55f4bacd445ad68'}\n\n```\n#> \n#> ── Frosting ─────────────────────────────────────────────────────────────────\n#> \n#> ── Layers\n#> • Creating predictions: \"\"\n#> • Resampling residuals for predictive quantiles: \"\" levels 0.1,\n#> 0.9\n#> • Adding forecast date: \"2021-12-31\"\n#> • Adding target date: \"2022-01-05\"\n#> • Thresholding predictions: dplyr::starts_with(\".pred\") to ]0, Inf)\n```\n:::\n\n\n\nThe post-processing operations in the order the that were performed were to create the predictions and the predictive intervals, add the forecast and target dates and bound the predictions at zero.\n\nWe can also easily examine the predictions themselves.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-11_926c2f170768c33a5e9312e52477a59e'}\n\n```{.r .cell-code}\nfive_days_ahead$predictions\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 56 × 5\n#> geo_value .pred .pred_distn forecast_date target_date\n#> \n#> 1 ak 0.0395 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> 2 al 0.107 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> 3 ar 0.490 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> 4 as 0 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> 5 az 0.608 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> 6 ca 0.142 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> # ℹ 50 more rows\n```\n:::\n:::\n\n\nThe results above show a distributional forecast produced using data through the end of 2021 for the January 5, 2022. 
A prediction for the death rate per 100K inhabitants along with a 95% predictive interval is available for every state (`geo_value`).\n\nThe figure below displays the prediction and prediction interval for three sample states: Arizona, New York, and Florida.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-12_338012827629211189d9c8b338a8fa27'}\n\n```{.r .cell-code code-fold=\"true\"}\nsamp_geos <- c(\"az\", \"ny\", \"fl\")\n\nhist <- jhu %>%\n filter(geo_value %in% samp_geos)\n\npreds <- five_days_ahead$predictions %>%\n filter(geo_value %in% samp_geos) %>%\n mutate(q = nested_quantiles(.pred_distn)) %>%\n unnest(q) %>%\n pivot_wider(names_from = tau, values_from = q)\n\nggplot(hist, aes(color = geo_value)) +\n geom_line(aes(time_value, death_rate)) +\n theme_bw() +\n geom_errorbar(data = preds, aes(x = target_date, ymin = `0.1`, ymax = `0.9`)) +\n geom_point(data = preds, aes(target_date, .pred)) +\n geom_vline(data = preds, aes(xintercept = forecast_date)) +\n scale_colour_viridis_d(name = \"\") +\n scale_x_date(date_labels = \"%b %Y\", date_breaks = \"1 month\") +\n facet_grid(geo_value ~ ., scales = \"free_y\") +\n theme(legend.position = \"none\") +\n labs(x = \"\", y = \"Incident deaths per 100K\\n inhabitants\")\n```\n\n::: {.cell-output-display}\n![](flatline-forecaster_files/figure-html/unnamed-chunk-12-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nThe vertical black line is the forecast date. Here the forecast seems pretty reasonable based on the past observations shown. In cases where the recent past is highly predictive of the near future, a simple flatline forecast may be respectable, but in more complex situations where there is more uncertainty of what's to come, the flatline forecaster may be best relegated to being a baseline model and nothing more.\n\nTake for example what happens when we consider a wider range of target dates. That is, we will now predict for several different horizons or `ahead` values - in our case, 5 to 25 days ahead, inclusive. Since the flatline forecaster function forecasts at a single unique `ahead` value, we can use the `map()` function from `purrr` to apply the forecaster to each ahead value we want to use. Then, we row bind the list of results.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-13_7cca912cb9c507cdcc24b5736638e0bd'}\n\n```{.r .cell-code}\nout_df <- map(1:28, ~ flatline_forecaster(\n epi_data = jhu,\n outcome = \"death_rate\",\n args_list = flatline_args_list(ahead = .x)\n)$predictions) %>%\n list_rbind()\n```\n:::\n\n\nThen, we proceed as we did before. 
The only difference from before is that we're using `out_df` where we had `five_days_ahead$predictions`.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-14_1548e56fb2990349e1650fd2101d3bf7'}\n\n```{.r .cell-code code-fold=\"true\"}\npreds <- out_df %>%\n filter(geo_value %in% samp_geos) %>%\n mutate(q = nested_quantiles(.pred_distn)) %>%\n unnest(q) %>%\n pivot_wider(names_from = tau, values_from = q)\n\nggplot(hist) +\n geom_line(aes(time_value, death_rate)) +\n geom_ribbon(\n data = preds,\n aes(x = target_date, ymin = `0.05`, ymax = `0.95`, fill = geo_value)\n ) +\n geom_point(data = preds, aes(target_date, .pred, colour = geo_value)) +\n geom_vline(data = preds, aes(xintercept = forecast_date)) +\n scale_colour_viridis_d() +\n scale_fill_viridis_d(alpha = .4) +\n scale_x_date(date_labels = \"%b %Y\", date_breaks = \"1 month\") +\n scale_y_continuous(expand = expansion(c(0, .05))) +\n facet_grid(geo_value ~ ., scales = \"free_y\") +\n labs(x = \"\", y = \"Incident deaths per 100K\\n inhabitants\") +\n theme(legend.position = \"none\")\n```\n\n::: {.cell-output-display}\n![](flatline-forecaster_files/figure-html/unnamed-chunk-14-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nNow, you can really see the flat line trend in the predictions. And you may also observe that as we get further away from the forecast date, the more unnerving using a flatline prediction becomes. It feels increasingly unnatural.\n\nSo naturally the choice of forecaster relates to the time frame being considered. In general, using a flatline forecaster makes more sense for short-term forecasts than for long-term forecasts and for periods of great stability than in less stable times. Realistically, periods of great stability are rare. Moreover, in our model of choice we want to take into account more information about the past than just what happened at the most recent time point. So simple forecasters like the flatline forecaster don't cut it as actual contenders in many real-life situations. However, they are not useless, just used for a different purpose. A simple model is often used to compare a more complex model to, which is why you may have seen such a model used as a baseline in the [COVID Forecast Hub](https://covid19forecasthub.org). The following [blog post](https://delphi.cmu.edu/blog/2021/09/30/on-the-predictability-of-covid-19/#ensemble-forecast-performance) from Delphi explores the Hub's ensemble accuracy relative to such a baseline model.\n\n## What we've learned in a nutshell\n\nThough the flatline forecaster is a very basic model with limited customization, it is about as steady and predictable as a model can get. So it provides a good reference or baseline to compare more complicated models to.\n", + "markdown": "# Introducing the flatline forecaster\n\nThe flatline forecaster is a very simple forecasting model intended for `epi_df` data, where the most recent observation is used as the forecast for any future date. In other words, the last observation is propagated forward. Hence, a flat line phenomenon is observed for the point predictions. The prediction intervals are produced from the quantiles of the residuals of such a forecast over all of the training data. By default, these intervals will be obtained separately for each combination of keys (`geo_value` and any additional keys) in the `epi_df`. 
Thus, the output is a data frame of point (and optionally interval) forecasts at a single unique horizon (`ahead`) for each unique combination of key variables. This forecaster is comparable to the baseline used by the [COVID Forecast Hub](https://covid19forecasthub.org).\n\n## Example of using the flatline forecaster\n\n\n::: {.cell}\n\n:::\n\n\n\nWe will continue to use the `case_death_rate_subset` dataset that comes with the\n`epipredict` package. In brief, this is a subset of the JHU daily COVID-19 cases\nand deaths by state. While this dataset ranges from Dec 31, 2020 to Dec 31, \n2021, we will only consider a small subset at the end of that range to keep our\nexample relatively simple.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-2_bdb1903df82234e00ed17b2089a9dcf7'}\n\n```{.r .cell-code}\njhu <- case_death_rate_subset %>%\n dplyr::filter(time_value >= as.Date(\"2021-09-01\"))\n\njhu\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> An `epi_df` object, 6,832 x 4 with metadata:\n#> * geo_type = state\n#> * time_type = day\n#> * as_of = 2022-05-31 12:08:25.791826\n#> \n#> # A tibble: 6,832 × 4\n#> geo_value time_value case_rate death_rate\n#> * \n#> 1 ak 2021-09-01 75.3 0.198\n#> 2 al 2021-09-01 113. 0.845\n#> 3 ar 2021-09-01 68.5 0.919\n#> 4 as 2021-09-01 0 0 \n#> 5 az 2021-09-01 48.8 0.414\n#> 6 ca 2021-09-01 38.4 0.246\n#> # ℹ 6,826 more rows\n```\n:::\n:::\n\n\n### The basic mechanics of the flatline forecaster\n\nSuppose that our goal is to predict death rates one week ahead of the last date available for each state. Mathematically, on day $t$, we want to predict the death rate $y$ that is $h$ days ahead at many locations $j$. So for each location, we'll predict\n\n$$\n\hat{y}_{j, t+h} = y_{j, t}\n$$\n\nwhere $t$ is 2021-12-31, $h$ is 7 days, and $j$ is the state in our example.\n\nNow, the simplest way to create and train a flatline forecaster to predict the death rate one week into the future is to input the `epi_df` and the name of the column from it that we want to predict in the `flatline_forecaster` function. \n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-3_94c07db78256abd967ba464944619d54'}\n\n```{.r .cell-code}\none_week_ahead <- flatline_forecaster(jhu,\n outcome = \"death_rate\"\n)\none_week_ahead\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> ══ A basic forecaster of type flatline ══════════════════════════════════════\n#> \n#> This forecaster was fit on 2023-07-10 03:03:21\n#> \n#> Training data was an `epi_df` with\n#> • Geography: state,\n#> • Time type: day,\n#> • Using data up-to-date as of: 2022-05-31 12:08:25.\n#> \n#> ── Predictions ──────────────────────────────────────────────────────────────\n#> \n#> A total of 56 predictions are available for\n#> • 56 unique geographic regions,\n#> • At forecast dates: 2021-12-31,\n#> • For target dates: 2022-01-07.\n```\n:::\n:::\n\n\nThe result is an S3 object, which contains three components: metadata, a tibble of predictions, \nand an S3 object of class `epi_workflow`. There are a few important things to note here.\nFirst, the fitted model object contained in the `epi_workflow`\nobject could be used any time in the future to create different forecasts. \nNext, the tibble of predicted values and prediction intervals is for each location 7 days \nafter the last available time value in the data, which is Dec 31, 2021. 
Note that 7 days is the default\nnumber of time steps ahead of the forecast date for which forecasts should be\nproduced. To change this, you must change the value of the `ahead` parameter\nin the list of additional arguments `flatline_args_list()`. Let's change this\nto 5 days to get some practice.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-4_013bf1041937f63c3b495c7ed5b211b9'}\n\n```{.r .cell-code}\nfive_days_ahead <- flatline_forecaster(\n jhu,\n outcome = \"death_rate\",\n flatline_args_list(ahead = 5L)\n)\n\nfive_days_ahead\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> ══ A basic forecaster of type flatline ══════════════════════════════════════\n#> \n#> This forecaster was fit on 2023-07-10 01:35:27\n#> \n#> Training data was an `epi_df` with\n#> • Geography: state,\n#> • Time type: day,\n#> • Using data up-to-date as of: 2022-05-31 12:08:25.\n#> \n#> ── Predictions ──────────────────────────────────────────────────────────────\n#> \n#> A total of 56 predictions are available for\n#> • 56 unique geographic regions,\n#> • At forecast dates: 2021-12-31,\n#> • For target dates: 2022-01-05.\n```\n:::\n:::\n\n\nWe could also specify that we want an 80% prediction interval by changing the \nlevels. The default 0.05 and 0.95 levels/quantiles give us a 90% prediction \ninterval.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-5_4203e0f1079b4e8ff638baa408843211'}\n\n```{.r .cell-code}\nfive_days_ahead <- flatline_forecaster(\n jhu,\n outcome = \"death_rate\",\n flatline_args_list(\n ahead = 5L,\n levels = c(0.1, 0.9)\n )\n)\n\nfive_days_ahead\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> ══ A basic forecaster of type flatline ══════════════════════════════════════\n#> \n#> This forecaster was fit on 2023-07-10 03:03:21\n#> \n#> Training data was an `epi_df` with\n#> • Geography: state,\n#> • Time type: day,\n#> • Using data up-to-date as of: 2022-05-31 12:08:25.\n#> \n#> ── Predictions ──────────────────────────────────────────────────────────────\n#> \n#> A total of 56 predictions are available for\n#> • 56 unique geographic regions,\n#> • At forecast dates: 2021-12-31,\n#> • For target dates: 2022-01-05.\n```\n:::\n:::\n\n\nTo see the other arguments that you may modify, please see `?flatline_args_list()`. 
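\n\nFor instance, the adjustments we have seen so far can be combined in one call. This is a sketch using only arguments that appear in this chapter (`ahead`, `levels`, and `n_training`); everything else keeps its default:\n\n```r\nflatline_args_list(\n ahead = 5L, # forecast horizon, in the units of time_value\n levels = c(0.1, 0.9), # quantile levels for an 80% prediction interval\n n_training = Inf # no limit on the training window\n)\n```\n\n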
For now, we will move on to looking at the workflow.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-6_23d12864d38c6b2fb9458907db32530f'}\n\n```{.r .cell-code}\nfive_days_ahead$epi_workflow\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Epi Workflow [trained] ═══════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: linear_reg()\n#> Postprocessor: Frosting\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 2 Recipe Steps\n#> \n#> • step_epi_ahead()\n#> • step_training_window()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> Flatline forecaster\n#> \n#> Predictions produced by geo_value resulting in 56 total forecasts.\n#> A total of 7112 residuals are available from the training set.\n#> \n#> ── Postprocessor ────────────────────────────────────────────────────────────\n#> 5 Frosting Layers\n#> \n#> • layer_predict()\n#> • layer_residual_quantiles()\n#> • layer_add_forecast_date()\n#> • layer_add_target_date()\n#> • layer_threshold()\n```\n:::\n:::\n\n\nThe fitted model here was based on minimal pre-processing of the data, \nestimating a flatline model, and then post-processing the results to be \nmeaningful for epidemiological tasks. To look deeper into the pre-processing \nor post-processing parts individually, you can use the `extract_preprocessor()`\nor `extract_frosting()` functions on the `epi_workflow`, respectively. \nFor example, let's examine the pre-processing part in more detail.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-7_7ebad54744c08badf9d3dac19334894f'}\n\n```{.r .cell-code}\nlibrary(workflows)\nextract_preprocessor(five_days_ahead$epi_workflow)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-8_3b5f80e746b0846bc6204aa82a69e29e'}\n\n```\n#> \n#> ── Recipe ───────────────────────────────────────────────────────────────────\n#> \n#> ── Inputs\n#> Number of variables by role\n#> predictor: 3\n#> geo_value: 1\n#> raw: 1\n#> time_value: 1\n#> \n#> ── Operations\n#> • Leading: death_rate by 5\n#> • # of recent observations per key limited to:: Inf\n```\n:::\n\n\nUnder Operations, we can see that the pre-processing operations were to lead the\ndeath rate by 5 days (`step_epi_ahead()`) and that the \\# of recent observations\nused in the training window were not limited (in `step_training_window()` as\n`n_training = Inf` in `flatline_args_list()`).\n\nFor symmetry, let's have a look at the post-processing.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-9_0808824fe21d98fc697b5f19bc27352c'}\n\n```{.r .cell-code}\nextract_frosting(five_days_ahead$epi_workflow)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-10_72d5eb3c2909ae3ae55f4bacd445ad68'}\n\n```\n#> \n#> ── Frosting ─────────────────────────────────────────────────────────────────\n#> \n#> ── Layers\n#> • Creating predictions: \"\"\n#> • Resampling residuals for predictive quantiles: \"\" levels 0.1,\n#> 0.9\n#> • Adding forecast date: \"2021-12-31\"\n#> • Adding target date: \"2022-01-05\"\n#> • Thresholding predictions: dplyr::starts_with(\".pred\") to ]0, Inf)\n```\n:::\n\n\nThe post-processing operations, in the order that they were performed, were to create the predictions and the prediction intervals, add the forecast and target dates, and bound the predictions at 
zero.\n\nWe can also easily examine the predictions themselves.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-11_926c2f170768c33a5e9312e52477a59e'}\n\n```{.r .cell-code}\nfive_days_ahead$predictions\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 56 × 5\n#> geo_value .pred .pred_distn forecast_date target_date\n#> \n#> 1 ak 0.0395 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> 2 al 0.107 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> 3 ar 0.490 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> 4 as 0 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> 5 az 0.608 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> 6 ca 0.142 [0.1, 0.9] 2021-12-31 2022-01-05 \n#> # ℹ 50 more rows\n```\n:::\n:::\n\n\nThe results above show a distributional forecast produced using data through the end of 2021 for January 5, 2022. A prediction for the death rate per 100K inhabitants, along with an 80% prediction interval, is available for every state (`geo_value`).\n\nThe figure below displays the prediction and prediction interval for three sample states: Arizona, Florida, and New York.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-12_338012827629211189d9c8b338a8fa27'}\n\n```{.r .cell-code code-fold=\"true\"}\nsamp_geos <- c(\"az\", \"ny\", \"fl\")\n\nhist <- jhu %>%\n filter(geo_value %in% samp_geos)\n\npreds <- five_days_ahead$predictions %>%\n filter(geo_value %in% samp_geos) %>%\n mutate(q = nested_quantiles(.pred_distn)) %>%\n unnest(q) %>%\n pivot_wider(names_from = tau, values_from = q)\n\nggplot(hist, aes(color = geo_value)) +\n geom_line(aes(time_value, death_rate)) +\n theme_bw() +\n geom_errorbar(data = preds, aes(x = target_date, ymin = `0.1`, ymax = `0.9`)) +\n geom_point(data = preds, aes(target_date, .pred)) +\n geom_vline(data = preds, aes(xintercept = forecast_date)) +\n scale_colour_viridis_d(name = \"\") +\n scale_x_date(date_labels = \"%b %Y\", date_breaks = \"1 month\") +\n facet_grid(geo_value ~ ., scales = \"free_y\") +\n theme(legend.position = \"none\") +\n labs(x = \"\", y = \"Incident deaths per 100K\\n inhabitants\")\n```\n\n::: {.cell-output-display}\n![](flatline-forecaster_files/figure-html/unnamed-chunk-12-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nThe vertical black line is the forecast date. Here the forecast seems pretty reasonable based on the past observations shown. In cases where the recent past is highly predictive of the near future, a simple flatline forecast may be respectable, but in more complex situations where there is more uncertainty about what's to come, the flatline forecaster may be best relegated to being a baseline model and nothing more.\n\nTake for example what happens when we consider a wider range of target dates. That is, we will now predict for several different horizons or `ahead` values - in our case, 1 to 28 days ahead, inclusive. Since the flatline forecaster function forecasts at a single unique `ahead` value, we can use the `map()` function from `purrr` to apply the forecaster to each ahead value we want to use, and then row-bind the list of results.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-13_7cca912cb9c507cdcc24b5736638e0bd'}\n\n```{.r .cell-code}\nout_df <- map(1:28, ~ flatline_forecaster(\n epi_data = jhu,\n outcome = \"death_rate\",\n args_list = flatline_args_list(ahead = .x)\n)$predictions) %>%\n list_rbind()\n```\n:::\n\n\nThen we proceed as we did before. 
The only difference from before is that we're using `out_df` where we had `five_days_ahead$predictions`.\n\n\n::: {.cell layout-align=\"center\" hash='flatline-forecaster_cache/html/unnamed-chunk-14_1548e56fb2990349e1650fd2101d3bf7'}\n\n```{.r .cell-code code-fold=\"true\"}\npreds <- out_df %>%\n filter(geo_value %in% samp_geos) %>%\n mutate(q = nested_quantiles(.pred_distn)) %>%\n unnest(q) %>%\n pivot_wider(names_from = tau, values_from = q)\n\nggplot(hist) +\n geom_line(aes(time_value, death_rate)) +\n geom_ribbon(\n data = preds,\n aes(x = target_date, ymin = `0.05`, ymax = `0.95`, fill = geo_value)\n ) +\n geom_point(data = preds, aes(target_date, .pred, colour = geo_value)) +\n geom_vline(data = preds, aes(xintercept = forecast_date)) +\n scale_colour_viridis_d() +\n scale_fill_viridis_d(alpha = .4) +\n scale_x_date(date_labels = \"%b %Y\", date_breaks = \"1 month\") +\n scale_y_continuous(expand = expansion(c(0, .05))) +\n facet_grid(geo_value ~ ., scales = \"free_y\") +\n labs(x = \"\", y = \"Incident deaths per 100K\\n inhabitants\") +\n theme(legend.position = \"none\")\n```\n\n::: {.cell-output-display}\n![](flatline-forecaster_files/figure-html/unnamed-chunk-14-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nNow you can really see the flat line trend in the predictions. You may also observe that the further we get from the forecast date, the more unnerving a flatline prediction becomes; it feels increasingly unnatural.\n\nSo naturally the choice of forecaster relates to the time frame being considered. In general, using a flatline forecaster makes more sense for short-term forecasts than for long-term forecasts, and in periods of great stability than in less stable times. Realistically, periods of great stability are rare. Moreover, in our model of choice we want to take into account more information about the past than just what happened at the most recent time point. So simple forecasters like the flatline forecaster don't cut it as actual contenders in many real-life situations. However, they are not useless, just used for a different purpose. A simple model often serves as a point of comparison for a more complex one, which is why you may have seen such a model used as a baseline in the [COVID Forecast Hub](https://covid19forecasthub.org). The following [blog post](https://delphi.cmu.edu/blog/2021/09/30/on-the-predictability-of-covid-19/#ensemble-forecast-performance) from Delphi explores the Hub's ensemble accuracy relative to such a baseline model.\n\n## What we've learned in a nutshell\n\nThough the flatline forecaster is a very basic model with limited customization, it is about as steady and predictable as a model can get. So it provides a good reference or baseline to compare more complicated models to.\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/forecast-framework/execute-results/html.json b/_freeze/forecast-framework/execute-results/html.json index 808585b..1cf616b 100644 --- a/_freeze/forecast-framework/execute-results/html.json +++ b/_freeze/forecast-framework/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "6093bfe321dbf60920ca2db6eb89d5bd", + "hash": "1357b3716fcd8deb2a9f766430d18664", "result": { - "markdown": "# Inner workings of the framework\n\n\n\n\n\nUnderneath the hood, the `arx_forecaster()` (and all our canned\nforecasters) creates (and returns) an `epi_workflow`. 
\nEssentially, this is a big S3 object that wraps up the 4 modular steps \n(preprocessing - postprocessing) described in the last chapter.\n\n1. Preprocessor: make transformations to the data before model training\n2. Trainer: train a model on data, resulting in a fitted model object\n3. Predictor: make predictions, using a fitted model object and processed test data\n4. Postprocessor: manipulate or transform the predictions before returning\n\nLet's investigate how these interact with `{tidymodels}` and why it's important\nto think of forecasting this way. To have something to play with, we'll continue\nto examine the data and an estimated canned forecaster.\n\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/demo-workflow_38ac956904953873a24c7e4dd0648ab5'}\n\n```{.r .cell-code}\njhu <- case_death_rate_subset %>%\n filter(time_value >= max(time_value) - 30)\n\nout_gb <- arx_forecaster(\n jhu, \"death_rate\", c(\"case_rate\", \"death_rate\"),\n boost_tree(mode = \"regression\", trees = 20)\n)\n```\n:::\n\n\n## Preprocessing\n\nPreprocessing is accomplished through a `recipe` (imagine baking a cake) as \nprovided in the [`{recipes}`](https://recipes.tidymodels.org) package. \nWe've made a few modifications (to handle\npanel data) as well as added some additional options. The recipe gives a\nspecification of how to handle training data. Think of it like a fancified\n`formula` that you would pass to `lm()`: `y ~ x1 + log(x2)`. In general, \nthere are 2 extensions to the `formula` that `{recipes}` handles: \n\n 1. Doing transformations of both training and test data that can always be \n applied. These are things like taking the log of a variable, leading or \n lagging, filtering out rows, handling dummy variables, etc.\n 2. Using statistics from the training data to eventually process test data. \n This is a major benefit of `{recipes}`. It prevents what the tidy team calls\n \"data leakage\". A simple example is centering a predictor by its mean. We\n need to store the mean of the predictor from the training data and use that\n value on the test data rather than accidentally calculating the mean of\n the test predictor for centering.\n \nA recipe is processed in 2 steps: first it is \"prepped\". This calculates and\nstores any intermediate statistics necessary for use on the test data. \nThen it is \"baked\",\nresulting in training data ready for passing into a statistical model (like `lm`).\n\nWe have introduced an `epi_recipe`. 
It's just a `recipe` that knows how to handle\nthe `time_value`, `geo_value`, and any additional keys so that these are available\nwhen necessary.\n\nThe `epi_recipe` from `out_gb` can be extracted from the result:\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-2_701136b2c4657141ea38354f5aad0130'}\n\n```{.r .cell-code}\nlibrary(workflows)\nlibrary(recipes)\nextract_recipe(out_gb$epi_workflow)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-3_003b8015fdb7de99adc7e1683848927f'}\n\n```\n#> \n#> ── Recipe ─────────────────────────────────────────────────────────\n#> \n#> ── Inputs\n#> Number of variables by role\n#> raw: 2\n#> geo_value: 1\n#> time_value: 1\n#> \n#> ── Training information\n#> Training data contained 1736 data points and no incomplete rows.\n#> \n#> ── Operations\n#> • Lagging: case_rate by 0, 7, 14 | Trained\n#> • Lagging: death_rate by 0, 7, 14 | Trained\n#> • Leading: death_rate by 7 | Trained\n#> • Removing rows with NA values in: lag_0_case_rate, ... | Trained\n#> • Removing rows with NA values in: ahead_7_death_rate | Trained\n#> • # of recent observations per key limited to:: Inf | Trained\n```\n:::\n\n\n\nThe \"Inputs\" are the original `epi_df` and the \"roles\" that these are assigned.\nNone of these are predictors or outcomes. Those will be created \nby the recipe when it is prepped. The \"Operations\" are the sequence of \ninstructions to create the cake (baked training data).\nHere we create lagged predictors, lead the outcome, and then remove `NA`s.\nSome models like `lm` internally handle `NA`s, but not everything does, so we\ndeal with them explicitly. The code to do this (inside the forecaster) is\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-4_7b1a5f279cb0216eb98e81324ade71aa'}\n\n```{.r .cell-code}\ner <- epi_recipe(jhu) %>%\n step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%\n step_epi_ahead(death_rate, ahead = 7) %>%\n step_epi_naomit()\n```\n:::\n\n\nWhile `{recipes}` provides a function `step_lag()`, it assumes that the data\nhave no breaks in the sequence of `time_values`. This is a bit dangerous, so\nwe avoid that behaviour. Our `lag/ahead` functions also appropriately adjust the\namount of data to avoid accidentally dropping recent predictors from the test\ndata.\n\n## The model specification\n\nUsers familiar with the `{parsnip}` package will have no trouble here.\nBasically, `{parsnip}` unifies the function signature across statistical models.\nFor example, `lm()` \"likes\" to work with formulas, but `glmnet::glmnet()` uses\n`x` and `y` for predictors and response. `{parsnip}` is agnostic. Both of these\ndo \"linear regression\". 
Above we switched from `lm()` to `xgboost()` without \nany issue despite the fact that these functions couldn't be more different.\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-5_904b72e53b7ebec817c2ff42657fface'}\n\n```{.r .cell-code}\nlm(\n formula, data, subset, weights, na.action,\n method = \"qr\",\n model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,\n contrasts = NULL, offset, ...\n)\n\nxgboost(\n data = NULL, label = NULL, missing = NA, weight = NULL,\n params = list(), nrounds, verbose = 1, print_every_n = 1L,\n early_stopping_rounds = NULL, maximize = NULL, save_period = NULL,\n save_name = \"xgboost.model\", xgb_model = NULL, callbacks = list(),\n ...\n)\n```\n:::\n\n\n`{epipredict}` provides a few engines/modules like `flatline()` and \n`quantile_reg()` to power the `flatline_forecaster()` and provide quantile \nregression, but you should be able to use almost any of the available models\nlisted [here](https://www.tidymodels.org/find/parsnip/).\n\n\nTo estimate (fit) a preprocessed model, one calls `fit()` on the `epi_workflow`.\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-6_aa2a47d336b8de9c69394816f02e101e'}\n\n```{.r .cell-code}\newf <- epi_workflow(er, linear_reg()) %>% fit(jhu)\n```\n:::\n\n\n## Predicting and Postprocessing (bound together)\n\nTo stretch the metaphor of preparing a cake to its natural limits, we have\ncreated postprocessing functionality called \"frosting\". Much like the recipe,\neach postprocessing operation is a \"layer\" and we \"slather\" these onto our \nbaked cake. To fix ideas, below is the postprocessing `frosting` for \n`arx_forecaster()`.\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-7_de6af11afc0b108fae8b915da6125069'}\n\n```{.r .cell-code}\nextract_frosting(out_gb$epi_workflow)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-8_9b4f4402e9fc39b22a2aca0aeca7a049'}\n\n```\n#> \n#> ── Frosting ─────────────────────────────────────────────────────────────────\n#> \n#> ── Layers\n#> • Creating predictions: \"\"\n#> • Resampling residuals for predictive quantiles: \"\" levels 0.05,\n#> 0.95\n#> • Adding forecast date: \"2021-12-31\"\n#> • Adding target date: \"2022-01-07\"\n#> • Thresholding predictions: dplyr::starts_with(\".pred\") to ]0, Inf)\n```\n:::\n\n\n\nHere we have 5 layers of frosting. The first generates the forecasts from the test data.\nThe second uses quantiles of the residuals to create distributional\nforecasts. The next two add columns for the date the forecast was made and the\ndate for which it is intended to occur. Because we are predicting rates, they \nshould be non-negative, so the last layer thresholds both predicted values and\nintervals at 0. 
The code to do this (inside the forecaster) is\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-9_9d652e9268c146341e4fb33091fd643d'}\n\n```{.r .cell-code}\nf <- frosting() %>%\n layer_predict() %>%\n layer_residual_quantiles(\n probs = c(.01, .025, seq(.05, .95, by = .05), .975, .99),\n symmetrize = TRUE\n ) %>%\n layer_add_forecast_date() %>%\n layer_add_target_date() %>%\n layer_threshold(starts_with(\".pred\"))\n```\n:::\n\n\nAt predict time, we add this object onto the `epi_workflow` and call `predict()`.\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-10_1af407921661ddcfded119d9726e1a59'}\n\n```{.r .cell-code}\ntest_data <- get_test_data(er, jhu)\newf %>%\n add_frosting(f) %>%\n predict(test_data)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> An `epi_df` object, 56 x 6 with metadata:\n#> * geo_type = state\n#> * time_type = day\n#> * as_of = 2022-05-31 12:08:25.791826\n#> \n#> # A tibble: 56 × 6\n#> geo_value time_value .pred .pred_distn forecast_date target_date\n#> * \n#> 1 ak 2021-12-31 0.355 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> 2 al 2021-12-31 0.325 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> 3 ar 2021-12-31 0.496 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> 4 as 2021-12-31 0.0836 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> 5 az 2021-12-31 0.614 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> 6 ca 2021-12-31 0.327 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> # ℹ 50 more rows\n```\n:::\n:::\n\n\nThe above `get_test_data()` function examines the recipe and ensures that enough\ntest data is available to create the necessary lags and produce a prediction\nfor the desired future time point (after the end of the training data). This mimics\nwhat would happen if `jhu` contained the most recent available historical data and\nwe wanted to actually predict the future. We could have instead used any test data\nthat contained the necessary predictors.\n\n:::{.callout-note}\nIn the predictions above, you'll see a `time_value` column. That's because we \ncould use **any training data**. We happened to use training data corresponding\nto the most recent available, and its lags. But we could have instead used\nlast week's, or we could use the data that arrives next year, or we could use multiple\n`time_values` for multiple locations. This is completely allowed, though not\nnecessarily what you expect.\n\nIn production forecasting, you'd probably reestimate the model and produce new\npredictions whenever new data arrives. This is exactly what all the canned \nforecasters we provide do. So those strip out the `time_value` column.\n\nBut the next most likely procedure would be\nto feed your previously estimated model (without refitting) the new data.\nTo do this, you'd just call `get_test_data()` on that new data. And the \n`time_value` would still be the same as your `forecast_date`.\n\nGetting many forecasts (multiple `time_values`) for each location is not\nexactly a typical desire in this context. But it's also not unheard of, so\nit is possible (and analogous to standard, non-time series forecasting). \n:::\n\n\n## Conclusion\n\nInternally, we provide some canned forecaster functions to create reasonable forecasts. \nBut ideally, a user could create their own forecasters by building up the \ncomponents we provide. In other chapters, we try to walk through some of these\ncustomizations. 
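\n\nFor example, swapping the trainer is a one-line change. Here is a minimal sketch that reuses the recipe `er` from above and assumes the default arguments of `quantile_reg()` (the quantile regression engine mentioned earlier); the object name is purely illustrative.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Sketch: same preprocessing as before, but a quantile regression trainer\nqr_wf <- epi_workflow(er, quantile_reg()) %>% fit(jhu)\n```\n:::\n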
\n\nTo illustrate everything above, here is (roughly) the code for the \n`arx_forecaster()` to predict the death rate, 1 week ahead:\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-11_d45e3276831a1b40877b2251297fbb9d'}\n\n```{.r .cell-code}\nr <- epi_recipe(jhu) %>%\n step_epi_ahead(death_rate, ahead = 7) %>%\n step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%\n step_epi_naomit()\n\nlatest <- get_test_data(r, jhu)\n\nf <- frosting() %>%\n layer_predict() %>%\n layer_residual_quantiles() %>%\n layer_add_forecast_date() %>%\n layer_add_target_date() %>%\n layer_threshold(starts_with(\".pred\"))\n\neng <- linear_reg()\nwf <- epi_workflow(r, eng, f) %>% fit(jhu)\npreds <- predict(wf, latest)\n```\n:::\n\nThe code for `arx_forecaster()` simply generalizes this, passing along arguments as needed.\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-12_b7f75d610d9c4f0ced30040e9aa3a481'}\n\n```{.r .cell-code}\npreds\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> An `epi_df` object, 56 x 6 with metadata:\n#> * geo_type = state\n#> * time_type = day\n#> * as_of = 2022-05-31 12:08:25.791826\n#> \n#> # A tibble: 56 × 6\n#> geo_value time_value .pred .pred_distn forecast_date target_date\n#> * \n#> 1 ak 2021-12-31 36.4 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 2 al 2021-12-31 89.9 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 3 ar 2021-12-31 82.6 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 4 as 2021-12-31 0 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 5 az 2021-12-31 58.3 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 6 ca 2021-12-31 84.4 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> # ℹ 50 more rows\n```\n:::\n:::\n", + "markdown": "# Inner workings of the framework\n\n\n\n\n\nUnderneath the hood, the `arx_forecaster()` (and all our canned\nforecasters) creates and returns an `epi_workflow`. \nEssentially, this is a big S3 object that wraps up the 4 modular steps \n(preprocessing - postprocessing) described in the last chapter.\n\n1. Preprocessor: make transformations to the data before model training\n2. Trainer: train a model on data, resulting in a fitted model object\n3. Predictor: make predictions, using a fitted model object and processed test data\n4. Postprocessor: manipulate or transform the predictions before returning\n\nLet's investigate how these interact with `{tidymodels}` and why it's important\nto think of forecasting this way. To have something to play with, we'll continue\nto examine the data and an estimated canned forecaster.\n\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/demo-workflow_38ac956904953873a24c7e4dd0648ab5'}\n\n```{.r .cell-code}\njhu <- case_death_rate_subset %>%\n filter(time_value >= max(time_value) - 30)\n\nout_gb <- arx_forecaster(\n jhu, \"death_rate\", c(\"case_rate\", \"death_rate\"),\n boost_tree(mode = \"regression\", trees = 20)\n)\n```\n:::\n\n\n## Preprocessing\n\nPreprocessing is accomplished through a `recipe` (imagine baking a cake) as \nprovided in the [`{recipes}`](https://recipes.tidymodels.org) package. \nWe've made a few modifications (to handle\npanel data) as well as added some additional options. The recipe gives a\nspecification of how to handle training data. Think of it like a fancified\n`formula` that you would pass to `lm()`: `y ~ x1 + log(x2)`. In general, \nthere are 2 extensions to the `formula` that `{recipes}` handles: \n\n 1. Doing transformations of both training and test data that can always be \n applied. 
These are things like taking the log of a variable, leading or \n lagging, filtering out rows, handling dummy variables, etc.\n 2. Using statistics from the training data to eventually process test data. \n This is a major benefit of `{recipes}`. It prevents what the tidy team calls\n \"data leakage\". A simple example is centering a predictor by its mean. We\n need to store the mean of the predictor from the training data and use that\n value on the test data rather than accidentally calculating the mean of\n the test predictor for centering.\n \nA recipe is processed in 2 steps: first it is \"prepped\". This calculates and\nstores any intermediate statistics necessary for use on the test data. \nThen it is \"baked\",\nresulting in training data ready for passing into a statistical model (like `lm`).\n\nWe have introduced an `epi_recipe`. It's just a `recipe` that knows how to handle\nthe `time_value`, `geo_value`, and any additional keys so that these are available\nwhen necessary.\n\nThe `epi_recipe` from `out_gb` can be extracted from the result:\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-2_701136b2c4657141ea38354f5aad0130'}\n\n```{.r .cell-code}\nlibrary(workflows)\nlibrary(recipes)\nextract_recipe(out_gb$epi_workflow)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-3_003b8015fdb7de99adc7e1683848927f'}\n\n```\n#> \n#> ── Recipe ───────────────────────────────────────────────────────────────────\n#> \n#> ── Inputs\n#> Number of variables by role\n#> raw: 2\n#> geo_value: 1\n#> time_value: 1\n#> \n#> ── Training information\n#> Training data contained 1736 data points and no incomplete rows.\n#> \n#> ── Operations\n#> • Lagging: case_rate by 0, 7, 14 | Trained\n#> • Lagging: death_rate by 0, 7, 14 | Trained\n#> • Leading: death_rate by 7 | Trained\n#> • Removing rows with NA values in: lag_0_case_rate, ... | Trained\n#> • Removing rows with NA values in: ahead_7_death_rate | Trained\n#> • # of recent observations per key limited to:: Inf | Trained\n```\n:::\n\n\n\nThe \"Inputs\" are the original `epi_df` and the \"roles\" that these are assigned.\nNone of these are predictors or outcomes. Those will be created \nby the recipe when it is prepped. The \"Operations\" are the sequence of \ninstructions to create the cake (baked training data).\nHere we create lagged predictors, lead the outcome, and then remove `NA`s.\nSome models like `lm` internally handle `NA`s, but not everything does, so we\ndeal with them explicitly. The code to do this (inside the forecaster) is\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-4_7b1a5f279cb0216eb98e81324ade71aa'}\n\n```{.r .cell-code}\ner <- epi_recipe(jhu) %>%\n step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%\n step_epi_ahead(death_rate, ahead = 7) %>%\n step_epi_naomit()\n```\n:::\n\n\nWhile `{recipes}` provides a function `step_lag()`, it assumes that the data\nhave no breaks in the sequence of `time_values`. This is a bit dangerous, so\nwe avoid that behaviour. 
Our `lag/ahead` functions also appropriately adjust the\namount of data to avoid accidentally dropping recent predictors from the test\ndata.\n\n## The model specification\n\nUsers familiar with the `{parsnip}` package will have no trouble here.\nBasically, `{parsnip}` unifies the function signature across statistical models.\nFor example, `lm()` \"likes\" to work with formulas, but `glmnet::glmnet()` uses\n`x` and `y` for predictors and response. `{parsnip}` is agnostic. Both of these\ndo \"linear regression\". Above we switched from `lm()` to `xgboost()` without \nany issue despite the fact that these functions couldn't be more different.\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-5_904b72e53b7ebec817c2ff42657fface'}\n\n```{.r .cell-code}\nlm(\n formula, data, subset, weights, na.action,\n method = \"qr\",\n model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,\n contrasts = NULL, offset, ...\n)\n\nxgboost(\n data = NULL, label = NULL, missing = NA, weight = NULL,\n params = list(), nrounds, verbose = 1, print_every_n = 1L,\n early_stopping_rounds = NULL, maximize = NULL, save_period = NULL,\n save_name = \"xgboost.model\", xgb_model = NULL, callbacks = list(),\n ...\n)\n```\n:::\n\n\n`{epipredict}` provides a few engines/modules like `flatline()` and \n`quantile_reg()` to power the `flatline_forecaster()` and provide quantile \nregression, but you should be able to use almost any of the available models\nlisted [here](https://www.tidymodels.org/find/parsnip/).\n\n\nTo estimate (fit) a preprocessed model, one calls `fit()` on the `epi_workflow`.\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-6_aa2a47d336b8de9c69394816f02e101e'}\n\n```{.r .cell-code}\newf <- epi_workflow(er, linear_reg()) %>% fit(jhu)\n```\n:::\n\n\n## Predicting and Postprocessing (bound together)\n\nTo stretch the metaphor of preparing a cake to its natural limits, we have\ncreated postprocessing functionality called \"frosting\". Much like the recipe,\neach postprocessing operation is a \"layer\" and we \"slather\" these onto our \nbaked cake. To fix ideas, below is the postprocessing `frosting` for \n`arx_forecaster()`.\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-7_de6af11afc0b108fae8b915da6125069'}\n\n```{.r .cell-code}\nextract_frosting(out_gb$epi_workflow)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-8_9b4f4402e9fc39b22a2aca0aeca7a049'}\n\n```\n#> \n#> ── Frosting ─────────────────────────────────────────────────────────────────\n#> \n#> ── Layers\n#> • Creating predictions: \"\"\n#> • Resampling residuals for predictive quantiles: \"\" levels 0.05,\n#> 0.95\n#> • Adding forecast date: \"2021-12-31\"\n#> • Adding target date: \"2022-01-07\"\n#> • Thresholding predictions: dplyr::starts_with(\".pred\") to ]0, Inf)\n```\n:::\n\n\n\nHere we have 5 layers of frosting. The first generates the forecasts from the test data.\nThe second uses quantiles of the residuals to create distributional\nforecasts. The next two add columns for the date the forecast was made and the\ndate for which it is intended to occur. Because we are predicting rates, they \nshould be non-negative, so the last layer thresholds both predicted values and\nintervals at 0. 
The code to do this (inside the forecaster) is\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-9_9d652e9268c146341e4fb33091fd643d'}\n\n```{.r .cell-code}\nf <- frosting() %>%\n layer_predict() %>%\n layer_residual_quantiles(\n probs = c(.01, .025, seq(.05, .95, by = .05), .975, .99),\n symmetrize = TRUE\n ) %>%\n layer_add_forecast_date() %>%\n layer_add_target_date() %>%\n layer_threshold(starts_with(\".pred\"))\n```\n:::\n\n\nAt predict time, we add this object onto the `epi_workflow` and call `predict()`.\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-10_1af407921661ddcfded119d9726e1a59'}\n\n```{.r .cell-code}\ntest_data <- get_test_data(er, jhu)\newf %>%\n add_frosting(f) %>%\n predict(test_data)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> An `epi_df` object, 56 x 6 with metadata:\n#> * geo_type = state\n#> * time_type = day\n#> * as_of = 2022-05-31 12:08:25.791826\n#> \n#> # A tibble: 56 × 6\n#> geo_value time_value .pred .pred_distn forecast_date target_date\n#> * \n#> 1 ak 2021-12-31 0.355 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> 2 al 2021-12-31 0.325 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> 3 ar 2021-12-31 0.496 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> 4 as 2021-12-31 0.0836 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> 5 az 2021-12-31 0.614 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> 6 ca 2021-12-31 0.327 [0.01, 0.99] 2021-12-31 2022-01-07 \n#> # ℹ 50 more rows\n```\n:::\n:::\n\n\nThe above `get_test_data()` function examines the recipe and ensures that enough\ntest data is available to create the necessary lags and produce a prediction\nfor the desired future time point (after the end of the training data). This mimics\nwhat would happen if `jhu` contained the most recent available historical data and\nwe wanted to actually predict the future. We could have instead used any test data\nthat contained the necessary predictors.\n\n:::{.callout-note}\nIn the predictions above, you'll see a `time_value` column. That's because we \ncould use **any training data**. We happened to use training data corresponding\nto the most recent available, and its lags. But we could have instead used\nlast week's, or we could use the data that arrives next year, or we could use multiple\n`time_values` for multiple locations. This is completely allowed, though not\nnecessarily what you expect.\n\nIn production forecasting, you'd probably reestimate the model and produce new\npredictions whenever new data arrives. This is exactly what all the canned \nforecasters we provide do. So those strip out the `time_value` column.\n\nBut the next most likely procedure would be\nto feed your previously estimated model (without refitting) the new data.\nTo do this, you'd just call `get_test_data()` on that new data. And the \n`time_value` would still be the same as your `forecast_date`.\n\nGetting many forecasts (multiple `time_values`) for each location is not\nexactly a typical desire in this context. But it's also not unheard of, so\nit is possible (and analogous to standard, non-time series forecasting). \n:::\n\n\n## Conclusion\n\nInternally, we provide some canned forecaster functions to create reasonable forecasts. \nBut ideally, a user could create their own forecasters by building up the \ncomponents we provide. In other chapters, we try to walk through some of these\ncustomizations. 
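\n\nFor example, swapping the trainer is a one-line change. Here is a minimal sketch that reuses the recipe `er` from above and assumes the default arguments of `quantile_reg()` (the quantile regression engine mentioned earlier); the object name is purely illustrative.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Sketch: same preprocessing as before, but a quantile regression trainer\nqr_wf <- epi_workflow(er, quantile_reg()) %>% fit(jhu)\n```\n:::\n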
\n\nTo illustrate everything above, here is (roughly) the code for the \n`arx_forecaster()` to predict the death rate, 1 week ahead:\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-11_d45e3276831a1b40877b2251297fbb9d'}\n\n```{.r .cell-code}\nr <- epi_recipe(jhu) %>%\n step_epi_ahead(death_rate, ahead = 7) %>%\n step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%\n step_epi_naomit()\n\nlatest <- get_test_data(r, jhu)\n\nf <- frosting() %>%\n layer_predict() %>%\n layer_residual_quantiles() %>%\n layer_add_forecast_date() %>%\n layer_add_target_date() %>%\n layer_threshold(starts_with(\".pred\"))\n\neng <- linear_reg()\nwf <- epi_workflow(r, eng, f) %>% fit(jhu)\npreds <- predict(wf, latest)\n```\n:::\n\nThe code for `arx_forecaster()` simply generalizes this, passing along arguments as needed.\n\n\n::: {.cell layout-align=\"center\" hash='forecast-framework_cache/html/unnamed-chunk-12_b7f75d610d9c4f0ced30040e9aa3a481'}\n\n```{.r .cell-code}\npreds\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> An `epi_df` object, 56 x 6 with metadata:\n#> * geo_type = state\n#> * time_type = day\n#> * as_of = 2022-05-31 12:08:25.791826\n#> \n#> # A tibble: 56 × 6\n#> geo_value time_value .pred .pred_distn forecast_date target_date\n#> * \n#> 1 ak 2021-12-31 0.355 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 2 al 2021-12-31 0.325 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 3 ar 2021-12-31 0.496 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 4 as 2021-12-31 0.0836 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 5 az 2021-12-31 0.614 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> 6 ca 2021-12-31 0.327 [0.05, 0.95] 2021-12-31 2022-01-07 \n#> # ℹ 50 more rows\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/tidymodels-intro/execute-results/html.json b/_freeze/tidymodels-intro/execute-results/html.json index 779760d..76d203c 100644 --- a/_freeze/tidymodels-intro/execute-results/html.json +++ b/_freeze/tidymodels-intro/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "3cb64a626abfcc9bc16433ba4801ec08", + "hash": "2636eb3d9741e7f0c6d008d2fbf90142", "result": { - "markdown": "\n# Introduction to Tidymodels\n\n\n::: {.cell hash='tidymodels-intro_cache/html/unnamed-chunk-1_d402ade014dce906a00ea52c5e7637ed'}\n\n:::\n\n\nR contains a universe of packages that each have their own unique interfaces (functions and argument names) and return types. For instance, simple linear regression in R is traditionally performed using `lm()` from the stats package, but there's also the option to use `glm`, `glmnet` or other packages. Similarly for random forest - a user has the option to use `ranger`, `randomForest`, or `xgboost` amongst other options. Having such a bevy of options is great, but it also adds complications to the modelling process.\n\nIf only there were a unifying interface available to help simplify and streamline the modelling process. This is the purpose of `tidymodels`, which provides a unified interface for modeling that abides by the [tidy philosophy](https://tidyverse.tidyverse.org/articles/paper.html#:~:text=Its%20primary%20goal%20is%20to,easier%20to%20learn%20the%20next) and that fits nicely into the tidyverse. 
From pre-processing to model training to prediction to validation, `tidymodels` provides the necessary tools to perform many modelling tasks.\n\nIt is important to understand that the `tidymodels` packages do not aim to implement the algorithms themselves, rather they provide the interface to bring together many disparate approaches under the same roof. And as a result of this, model fitting tasks are easier to carry out. In the grand scheme of things, here's where `tidymodels` tends to fit into a data analysis project.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-2_9ebbde4ead853f4c75fd41c8da8aac8a'}\n::: {.cell-output-display}\n![](img/tidymodels_packages.png){fig-align='center' width=90%}\n:::\n:::\n\n\nNow, modelling can be broken down into several sub-tasks, and `tidymodels` recognizes this by providing different packages for different tasks. So `tidymodels` can be considered a metapackage - when you load `tidymodels`, several packages are in fact loaded including `rsample`, `recipes`, `parsnip` and `yardstick`. Each of these packages has its own role to play in the modelling process.\n\n- `rsample` is intended for sampling and subsetting tasks (such as splitting the data into test and train sets)\n- `recipes` allows the user to easily and neatly record the steps to take in data pre-processing\n- `parsnip` provides a common interface for model training to help standardize the interface for model fitting and output\n- `yardstick` gives access to model performance measures\n\nThe following diagram shows where each package comes into play in a general workflow for modelling using `tidymodels`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-3_3e419eaced9bcacb0667e00ae55cb6b5'}\n::: {.cell-output-display}\n![](img/tidymodels_model_substeps.png){fig-align='center' width=90%}\n:::\n:::\n\n\n## An example using the penguins dataset\n\nWe will now explore the `tidymodels` functions using the `penguins` dataset that we introduced and used in [Regression in Tidymodels](LINK%20TO%20VIGNETTE).\n\n### Load packages\n\nNote that `tidymodels` automatically loads some very useful `tidyverse` packages for us, including fan favourites like `dplyr` and `ggplot2`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-4_2264c225b8ea3011e2081f83c730b65e'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n```\n:::\n\n\n### Simplify dataset\n\nTo keep the focus on learning how to use `tidymodels`, we will work with a simplified version of the dataset in which we will only use the complete cases/rows in the `penguins` dataset\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-5_67af100b153dab40f26460d41ac7fb15'}\n\n```{.r .cell-code}\npenguins <- penguins %>%\n filter(complete.cases(.))\n\nhead(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 6 × 7\n#> species island bill_length_mm bill_depth_mm flipper_length_mm\n#> \n#> 1 Adelie Torgersen 39.1 18.7 181\n#> 2 Adelie Torgersen 39.5 17.4 186\n#> 3 Adelie Torgersen 40.3 18 195\n#> 4 Adelie Torgersen 36.7 19.3 193\n#> 5 Adelie Torgersen 39.3 20.6 190\n#> 6 Adelie Torgersen 38.9 17.8 181\n#> # ℹ 2 more variables: body_mass_g , sex \n```\n:::\n:::\n\n\nand we will only use the `species`, `bill_length_mm`, `bill_depth_mm`, and `flipper_length_mm` variables.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-6_ef5ed1a8ec1402118f9763f6a2bf242e'}\n\n```{.r 
.cell-code}\npenguins <- penguins %>%\n select(c(species, bill_length_mm, bill_depth_mm, flipper_length_mm))\n```\n:::\n\n\n### Data sampling\n\nAfter fitting a model, make sure it is a good one; that is, don't forget to test how the model performs. For this reason, it is customary to split data into distinct training and test sets at the onset. The training data is used to fit the model and the test data is used to assess model performance.\n\nThe `initial_split()` function from the `rsample` package is what we will use to split our dataset into a training and test set. The function by default uses 3/4 of the data for training and reserves the remaining 1/4 for testing. Use the `prop` argument to change the proportion used for training. Note that this function gives an `rsplit` object and not a data frame, and the output of the object shows the number of rows used for testing, training, and the grand total.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-7_9b76d434319551c35f834ca8a8a0e7b1'}\n\n```{.r .cell-code}\nset.seed(123) # For reproducibility, as when we split the data below\npenguins_split <- initial_split(penguins, prop = 0.7)\npenguins_split\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> <Training/Testing/Total>\n#> <233/100/333>\n```\n:::\n:::\n\n\nTo see what observations were used for training and testing, use the `training()` and `testing()` functions respectively.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-8_965d451cfc0d5363da62984b5411a2e3'}\n\n```{.r .cell-code}\npenguins_split %>%\n training() %>%\n glimpse()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> Rows: 233\n#> Columns: 4\n#> $ species Gentoo, Adelie, Gentoo, Chinstrap, Adelie, Chinst…\n#> $ bill_length_mm 59.6, 34.4, 45.2, 49.0, 41.4, 51.0, 44.9, 51.1, 5…\n#> $ bill_depth_mm 17.0, 18.4, 15.8, 19.5, 18.5, 18.8, 13.8, 16.5, 1…\n#> $ flipper_length_mm 230, 184, 215, 210, 202, 203, 212, 225, 210, 211,…\n```\n:::\n:::\n\n\nNow, we'll create a data frame for each of the training and test sets:\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-9_4dbd72c658687ca6c5c9dfc6962b1cbc'}\n\n```{.r .cell-code}\ntrain_data <- training(penguins_split)\ntest_data <- testing(penguins_split)\n```\n:::\n\n\n### Pre-processing\n\nThe main goal of this step is to use data transformations to make the data suitable for modeling. Most transformations that one requires for standard data analysis tasks can be achieved by `dplyr`, or another `tidyverse` package.\n\n#### The pre-processing interface\n\nBefore training the model, a recipe can be used to carry out the pre-processing required by the model.\n\nThe `recipe()` function has two main arguments: a formula (with the same format as when doing \\[LINK TO VIGNETTE\\]) and a data argument, which is usually the training set as that is the data used to create the model. Hence, we have `data = train_data` here.\n\nIn our example, suppose that our goal is to predict penguin species from bill length, bill depth and flipper length; then our recipe function would look as follows:\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-10_77ac862fee16d19160e6cf03adee22dc'}\n\n```{.r .cell-code}\nrecipe(species ~ ., data = train_data)\n```\n:::\n\n\nThe point of a recipe is to be a more general purpose formula. A number of packages are not formula-based. The ever-popular `glmnet()` function is one example because it takes in matrices for the x and y variables instead of a formula. So a recipe is useful because you can use a package like `glmnet` by following the same standard formula-based recipe format and simply specify later on, in the modelling stage, that you would like to use `glmnet`.
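\n\nTo make the contrast concrete, here is a sketch of `glmnet`'s matrix interface (this assumes the `{glmnet}` package is installed; the object names and the manual column selection are purely illustrative):\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Sketch: glmnet wants a numeric matrix x and a response y, not a formula\nx <- as.matrix(train_data[, c(\"bill_length_mm\", \"bill_depth_mm\", \"flipper_length_mm\")])\ny <- train_data$species\nfit_glmnet <- glmnet::glmnet(x, y, family = \"multinomial\")\n```\n:::\n\n\nWith a recipe, none of this manual wrangling is needed; the engine choice is simply deferred to the modelling stage.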
\n\nNow, after saying that you are making a recipe by way of the `recipe()` function, simply specify the transformations that you want to apply to your data in the necessary steps. Each data transformation is one step, and all of the available pre-processing transformations have the prefix `step_`. Now, while there are many step functions available ([here's](https://recipes.tidymodels.org/reference/index.html) a list), we will only use the following three in our example.\n\n- `step_corr()` to remove variables which have large absolute correlations with others\n\n- `step_center()` to normalize numeric data to have a mean of zero\n\n- `step_scale()` to normalize numeric data to have a standard deviation of one\n\nOne of the advantages of having these pre-processing steps is that they help to simplify concepts that are difficult or a pain to enforce in coding. For example, centering could be a nuisance to implement from scratch because we would first have to calculate statistics (variable averages) from the training data and then use them on both the training and the test data. Note that the centering statistics should be computed from the training data, not the test data, to avoid data leakage (contaminating the evaluation by using statistics calculated from the test data). In a recipe, the estimation of the variable means using the training data and the application of these to center new data sets is done automatically, under the hood, and so spares the coder from having to manually implement it. The situation is similar for scaling numeric data (`step_scale()`).\n\nAnother useful feature of the `tidymodels` pre-processing interface is that each step can be applied to one specified variable, a group of variables, or all variables. The `all_predictors()` and `all_outcomes()` functions are particularly convenient to help minimize the amount of typing you need to do. For instance, if you wanted to apply `step_center()` to only the predictor variables, simply type `step_center(all_predictors())` instead of listing out each and every predictor in the step function.\n\nNow, let's try this all out on our example.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-11_007b318e8247cae5fee6fa765ca79333'}\n\n```{.r .cell-code}\npenguins_recipe <- recipe(species ~ ., data = train_data) %>%\n step_corr(all_predictors()) %>%\n step_center(all_predictors(), -all_outcomes()) %>%\n step_scale(all_predictors(), -all_outcomes())\n```\n:::\n\n\nTo summarize, we obtained a recipe object, `penguins_recipe`, by putting the `recipe()` and step functions together on our training data that we had ready to go from sampling.\n\nNow, to get the recipe details, simply call the recipe object. The operations section details what pre-processing steps we've applied to the data. 
Notice that the steps shown here are in the order that they were input into the recipe and they specify the variables used in each step.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-12_a73027254138e5220218d014c0571de8'}\n\n```{.r .cell-code}\npenguins_recipe\n```\n:::\n\n\n### Model Training\n\nRecall that in R, the same type of model could be fit using several different packages, and each such package typically has its own style of interface. Two popular packages to fit random forest models are `ranger` and `randomForest`. One way that their interfaces differ is in the parameter name for the number of trees - `ranger()` has the parameter `num.trees`, whereas `randomForest` has the parameter `ntree`. Such differences do not make it simple to run the model in the other package.\n\n`Tidymodels` created a single interface that supports the usage of both models. Moreover, this general interface supports an even wider range of functions that perform random forest. The key part that points to the function and package to be used is the engine.\n\nLet's see how this works in practice. In the below example, we'll use the general `rand_forest()` function from `tidymodels`. In there, we can specify the number of trees by using the `trees` argument. Then, in `set_engine()` we specify that we want to use ranger's version of random forest. Notice this follows the model specification format introduced in the \\[Regression in Tidymodels\\] chapter.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-13_2ce2f34d915184be525e50766273bfe5'}\n\n```{.r .cell-code}\npenguins_ranger <- rand_forest(trees = 100, mode = \"classification\") %>%\n set_engine(\"ranger\")\n```\n:::\n\n\nNow, if we wanted to use a different package's version of random forest, we could easily do that by simply swapping out the engine. To try this out, let's use `randomForest` instead of `ranger`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-14_75b18395aafd9d4afdc081db7a05fb63'}\n\n```{.r .cell-code}\npenguins_rf <- rand_forest(trees = 100, mode = \"classification\") %>%\n set_engine(\"randomForest\")\n```\n:::\n\n\nFor the remainder of this tutorial, we'll stick with using `ranger` for simplicity. At this stage, we're ready to pre-process and model. The first of these tasks is to apply our recipe before we train and test our model, which means we must\n\n1. Process the recipe using the training set.\n\n2. Use the recipe on the training set to get the finalized predictor set.\n\n3. Use the recipe on the test set to get the finalized test set.\n\nA workflow can be used to pair model and processing tasks together. When different recipes are needed for different models, this is very useful so that you don't have to keep track of separate model and recipe objects in your workspace. 
Hence, training and testing different workflows becomes easier.\n\nFor our example, we'll use tidymodels' `workflows` package to pair our model and our recipe together.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-15_d496ee9edf143e935990b725debaaec7'}\n\n```{.r .cell-code}\npenguins_wflow <- workflow() %>%\n add_model(penguins_ranger) %>%\n add_recipe(penguins_recipe)\n\npenguins_wflow\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Workflow ═════════════════════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: rand_forest()\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 3 Recipe Steps\n#> \n#> • step_corr()\n#> • step_center()\n#> • step_scale()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> Random Forest Model Specification (classification)\n#> \n#> Main Arguments:\n#> trees = 100\n#> \n#> Computational engine: ranger\n```\n:::\n:::\n\n\nAfter that, we're ready to fit the model to our training data. The `fit()` function is what we will use to prepare the recipe and train the model on the finalized predictors.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-16_7c12180c3e71f078841b3a8df9ca54dc'}\n\n```{.r .cell-code}\npenguins_fit <- penguins_wflow %>% fit(data = train_data)\n```\n:::\n\n\nThe resulting object contains both the recipe and fitted model. To extract the model, use the helper function `extract_fit_parsnip()`, and to extract the recipe object, use `extract_recipe()`. We extract the model object below.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-17_adf67ad4f74f01b4a2e09821534fca75'}\n\n```{.r .cell-code}\nextract_fit_parsnip(penguins_fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) \n#> \n#> Type: Probability estimation \n#> Number of trees: 100 \n#> Sample size: 233 \n#> Number of independent variables: 3 \n#> Mtry: 1 \n#> Target node size: 10 \n#> Variable importance mode: none \n#> Splitrule: gini \n#> OOB prediction error (Brier s.): 0.02954337\n```\n:::\n:::\n\n\nOne important thing to notice is that if we wanted to use the `randomForest` model instead of the `ranger` model, all we'd need to do is replace the engine in the model specification; the rest of the code remains the same. We shall leave it to the reader to try this on their own and marvel at the beauty of having such a unifying interface.\n\n### Use a trained workflow to predict\n\nUp to this point we have\n\n1. Built the model (`penguins_ranger`),\n\n2. Created a pre-processing recipe (`penguins_recipe`),\n\n3. Paired the model and recipe (`penguins_wflow`), and\n\n4. Trained our workflow using `fit()`.\n\nSo the next step is to use the trained workflow, `penguins_fit`, to predict with the test data. 
This is easily done with a call to `predict()`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-18_48e696c9d90b3c6d7c8251337e6b4991'}\n\n```{.r .cell-code}\npredict(penguins_fit, test_data)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 100 × 1\n#> .pred_class\n#> \n#> 1 Adelie \n#> 2 Adelie \n#> 3 Adelie \n#> 4 Chinstrap \n#> 5 Adelie \n#> 6 Adelie \n#> # ℹ 94 more rows\n```\n:::\n:::\n\n\nIf you want to obtain a probability for each predicted value, simply set `type = \"prob\"` in `predict()`. This will yield a tibble with one column per outcome type, giving the predicted probability that each observation belongs to each type of outcome.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-19_af9e880b31504b55f0de2461d26ce93e'}\n\n```{.r .cell-code}\npenguins_predp <- penguins_fit %>%\n predict(test_data, type = \"prob\")\n```\n:::\n\n\nTo add the predicted values as a new column on the test data, you can use the `bind_cols()` function from `dplyr`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-20_c0637410ff018df353bfded24b041a1a'}\n\n```{.r .cell-code}\nbind_cols(test_data, penguins_predp) %>%\n head() # View first six rows of output\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 6 × 7\n#> species bill_length_mm bill_depth_mm flipper_length_mm .pred_Adelie\n#> \n#> 1 Adelie 39.5 17.4 186 0.910\n#> 2 Adelie 40.3 18 195 0.960\n#> 3 Adelie 38.7 19 195 0.964\n#> 4 Adelie 46 21.5 194 0.286\n#> 5 Adelie 35.9 19.2 189 0.997\n#> 6 Adelie 38.2 18.1 185 1 \n#> # ℹ 2 more variables: .pred_Chinstrap , .pred_Gentoo \n```\n:::\n:::\n\n\nAlternatively, we can use the `augment()` function to obtain the predicted probabilities and add them to the test data in a one-liner.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-21_4969b7fc4d20b43b37f5df734074efec'}\n\n```{.r .cell-code}\npenguins_aug <- augment(penguins_fit, test_data)\n\npenguins_aug\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 100 × 8\n#> species bill_length_mm bill_depth_mm flipper_length_mm .pred_class\n#> \n#> 1 Adelie 39.5 17.4 186 Adelie \n#> 2 Adelie 40.3 18 195 Adelie \n#> 3 Adelie 38.7 19 195 Adelie \n#> 4 Adelie 46 21.5 194 Chinstrap \n#> 5 Adelie 35.9 19.2 189 Adelie \n#> 6 Adelie 38.2 18.1 185 Adelie \n#> # ℹ 94 more rows\n#> # ℹ 3 more variables: .pred_Adelie , .pred_Chinstrap , …\n```\n:::\n:::\n\n\nWe can see from the first couple of rows shown that our model predicted the species correctly to be Adelie (in the `.pred_class` column) because the `.pred_Adelie` probabilities are by far the largest of the three predicted probabilities for each row. So while we can informally say that our model is doing well for predicting, how can we formally assess this? We would like to calculate a metric (well, probably more than one) to tell us how well our model predicts the species of penguins.\n\n### Model Validation\n\nThe `metrics()` function from the `yardstick` package helps to assess model performance. As suggested by its name, it will output some metrics, and as an added bonus, these will be automatically selected for the type of model that you construct. The input for this function is a tibble that contains the actual values and the predicted values. 
This way we can compare how close the model estimates are to the truth. To serve this purpose, we can use `penguins_aug`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-22_b8252ed2ea48acc351d82910705fa3ba'}\n\n```{.r .cell-code}\npenguins_aug %>%\n metrics(truth = species, .pred_Adelie:.pred_Gentoo, estimate = .pred_class)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 4 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy multiclass 0.97 \n#> 2 kap multiclass 0.954\n#> 3 mn_log_loss multiclass 0.123\n#> 4 roc_auc hand_till 0.993\n```\n:::\n:::\n\n\nLet's briefly go through the metrics that were generated. Accuracy is simply the proportion of values that are predicted correctly, while kappa is similar to accuracy, but is normalized by the accuracy that would be expected by chance (you can think of it as a measure that compares observed accuracy to expected accuracy from random chance alone). For our example, both the accuracy and kappa value estimates are extremely high (near to the upper limit of 1) and similar in value, indicating that our model performs very well for prediction on the test data. Log loss is a measure of the performance of a classification model, and a perfect model has a log loss of 0, so our model performs pretty well in that respect. Finally, `roc_auc` is the area under the ROC curve, and we'll explain this very shortly so stay tuned (for now, just note that a value close to 1, like we have, is the goal). All in all, our model fares very well.\n\nSince it is often not enough to rely purely on one-number summaries of model performance, we'll also look to graphical, curve-based metrics. We'll walk through producing the classic ROC curve, which is computed using `roc_curve()` and `roc_auc()` from `yardstick`.\n\nTo get ourselves an ROC curve, we need to input the actual values and the columns of predicted class probabilities into `roc_curve()`. We finish off by piping into the `autoplot()` function.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-23_4f33465326b36ba9f605f8144e95ba64'}\n\n```{.r .cell-code}\npenguins_aug %>%\n roc_curve(truth = species, .pred_Adelie:.pred_Gentoo) %>%\n autoplot()\n```\n\n::: {.cell-output-display}\n![](tidymodels-intro_files/figure-html/unnamed-chunk-23-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nNotice that the x-axis displays 1 - specificity, which is otherwise known as the false positive rate. So on this plot, we can visualize the trade-off between the false positive (1 - specificity) and the true positive (sensitivity) rates. Since the best classification would be where all positives are correctly classified as positive (sensitivity = 1), and no negatives are incorrectly classified as positive (1 - specificity = 0), curves closer to the top left corner (and, hence, an area under the curve of about 1) are what we're hoping for.\n\nSo, we can see that the curves for each of our species are looking pretty close to perfection (save for Adelie, which still does very well). 
To estimate the area under the curves, we can use `roc_auc` (or look to the summary of our metrics above for this very value).\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-24_2f383612e8ab94d7d60b43f6df1b72f3'}\n\n```{.r .cell-code}\npenguins_aug %>%\n roc_auc(truth = species, .pred_Adelie:.pred_Gentoo)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 1 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 roc_auc hand_till 0.993\n```\n:::\n:::\n\n\nAs expected, the estimated area is very close to 1, indicating near-perfect discrimination.\n\nThe `yardstick` package also offers other standard tools for model assessment like a confusion matrix, from which we can inspect the counts of correct classifications and misclassifications.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-25_976809de4dcd6ea5a1da1608bf941c16'}\n\n```{.r .cell-code}\npenguins_aug %>%\n conf_mat(truth = species, estimate = .pred_class)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> Truth\n#> Prediction Adelie Chinstrap Gentoo\n#> Adelie 40 0 0\n#> Chinstrap 3 24 0\n#> Gentoo 0 0 33\n```\n:::\n:::\n\n\nWe could even combine this with `autoplot()` to get a nice heatmap visualization.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-26_96de13131862791cf00497058a2f1a03'}\n\n```{.r .cell-code}\npenguins_aug %>%\n conf_mat(truth = species, estimate = .pred_class) %>%\n autoplot(\"heatmap\")\n```\n\n::: {.cell-output-display}\n![](tidymodels-intro_files/figure-html/unnamed-chunk-26-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nThe diagonal shows the counts of correct predictions for each species, while the off-diagonal shows the counts of model misclassifications. As the metrics have indicated, our model performed magnificently on the test set as there were only three misclassifications of Adelie penguins as Chinstrap.\n\n## Concluding remarks\n\nIn this vignette, we introduced `tidymodels` and illustrated how its packages work together by way of example. Since this was an elementary example, use it as a starting point and explore what more can be done with this wonderful set of packages. And yet, however wonderful they are, you may have already noticed that there are limitations like the glaring lack of a set of post-processing tools to refine the results. We fill this gap for epidemiological modelling with [frosting](https://cmu-delphi.github.io/epipredict/reference/add_frosting.html). This will be formally introduced in a later chapter, so stay tuned!\\\n\\\n🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧\n\n## Attribution\n\n\nThis Chapter was largely adapted from [A Gentle Introduction to Tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/) as well as [Tidymodels - Getting Started](https://www.tidymodels.org/start/recipes/) and [Tidymodels](https://wec.wur.nl/dse/24-tidymodels.html). The diagrams are from [A Gentle Introduction to Tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/) and based on [R for Data Science](https://r4ds.had.co.nz/explore-intro.html).\n\n🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧\n", + "markdown": "\n# Introduction to Tidymodels\n\n\n::: {.cell}\n\n:::\n\n\nR contains a universe of packages that each have their own unique interfaces (functions and argument names) and return types. 
For instance, simple linear regression in R is traditionally performed using `lm()` from the stats package, but there's also the option to use `glm`, `glmnet` or other packages. Similarly, for random forest, a user has the option to use `ranger`, `randomForest`, or `xgboost`, amongst other options. Having such a bevy of options is great, but it also adds complications to the modelling process.\n\nIf only there were a unifying interface available to help simplify and streamline the modelling process. This is the purpose of `tidymodels`, which provides a unified interface for modeling that abides by the [tidy philosophy](https://tidyverse.tidyverse.org/articles/paper.html#:~:text=Its%20primary%20goal%20is%20to,easier%20to%20learn%20the%20next) and that fits nicely into the tidyverse. From pre-processing to model training to prediction to validation, `tidymodels` provides the necessary tools to perform many modelling tasks.\n\nIt is important to understand that the `tidymodels` packages do not aim to implement the algorithms themselves; rather, they provide the interface to bring together many disparate approaches under the same roof. And as a result of this, model fitting tasks are easier to carry out. In the grand scheme of things, here's where `tidymodels` tends to fit into a data analysis project.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-2_9ebbde4ead853f4c75fd41c8da8aac8a'}\n::: {.cell-output-display}\n![](img/tidymodels_packages.png){fig-align='center' width=90%}\n:::\n:::\n\n\nNow, modelling can be broken down into several sub-tasks, and `tidymodels` recognizes this by providing different packages for different tasks. So `tidymodels` can be considered a metapackage - when you load `tidymodels`, several packages are in fact loaded including `rsample`, `recipes`, `parsnip` and `yardstick`. 
Each of these packages has its own role to play in the modelling process.\n\n- `rsample` is intended for sampling and subsetting tasks (such as splitting the data into test and train sets)\n- `recipes` allows the user to easily and neatly record the steps to take in data pre-processing\n- `parsnip` provides a common interface for model training to help standardize the interface for model fitting and output\n- `yardstick` gives access to model performance measures\n\nThe following diagram shows where each package comes into play in a general workflow for modelling using `tidymodels`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-3_3e419eaced9bcacb0667e00ae55cb6b5'}\n::: {.cell-output-display}\n![](img/tidymodels_model_substeps.png){fig-align='center' width=90%}\n:::\n:::\n\n\n## An example using the penguins dataset\n\nWe will now explore the `tidymodels` functions using the `penguins` dataset that we introduced and used in [Regression in Tidymodels](tidymodels-regression.qmd#sec-slr-intro).\n\n### Load packages\n\nNote that `tidymodels` automatically loads some very useful `tidyverse` packages for us, including fan favourites like `dplyr` and `ggplot2`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-4_2264c225b8ea3011e2081f83c730b65e'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n```\n:::\n\n\n### Simplify dataset\n\nTo keep the focus on learning how to use `tidymodels`, we will only retain the complete cases/rows in the `penguins` dataset\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-5_67af100b153dab40f26460d41ac7fb15'}\n\n```{.r .cell-code}\npenguins <- penguins %>%\n filter(complete.cases(.))\n\nhead(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 6 × 7\n#> species island bill_length_mm bill_depth_mm flipper_length_mm\n#> \n#> 1 Adelie Torgersen 39.1 18.7 181\n#> 2 Adelie Torgersen 39.5 17.4 186\n#> 3 Adelie Torgersen 40.3 18 195\n#> 4 Adelie Torgersen 36.7 19.3 193\n#> 5 Adelie Torgersen 39.3 20.6 190\n#> 6 Adelie Torgersen 38.9 17.8 181\n#> # ℹ 2 more variables: body_mass_g , sex \n```\n:::\n:::\n\n\nand we will only use the `species`, `bill_length_mm`, `bill_depth_mm`, and `flipper_length_mm` variables.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-6_ef5ed1a8ec1402118f9763f6a2bf242e'}\n\n```{.r .cell-code}\npenguins <- penguins %>%\n select(c(species, bill_length_mm, bill_depth_mm, flipper_length_mm))\n```\n:::\n\n\n### Data sampling\n\nAfter fitting a model, we must make sure it is a good model. That is, don't forget to test how the model performs. For this reason, it is customary to split data into distinct training and test sets at the outset. The training data is used to fit the model and the test data is used to assess model performance.\n\nThe `initial_split()` function from the `rsample` package is what we will use to split our dataset into a training and test set. The function by default uses 3/4 of the data for training and reserves the remaining 1/4 for testing. Use the `prop` argument to change the proportion used for training. 
Note that this function gives an `rsplit` object, not a data frame, and that printing the object shows the number of rows used for training, testing, and in total.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-7_9b76d434319551c35f834ca8a8a0e7b1'}\n\n```{.r .cell-code}\nset.seed(123) # For reproducibility of the random split below\npenguins_split <- initial_split(penguins, prop = 0.7)\npenguins_split\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> <233/100/333>\n```\n:::\n:::\n\n\nTo see what observations were used for training and testing, use the `training()` and `testing()` functions respectively.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-8_965d451cfc0d5363da62984b5411a2e3'}\n\n```{.r .cell-code}\npenguins_split %>%\n training() %>%\n glimpse()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> Rows: 233\n#> Columns: 4\n#> $ species Gentoo, Adelie, Gentoo, Chinstrap, Adelie, Chinst…\n#> $ bill_length_mm 59.6, 34.4, 45.2, 49.0, 41.4, 51.0, 44.9, 51.1, 5…\n#> $ bill_depth_mm 17.0, 18.4, 15.8, 19.5, 18.5, 18.8, 13.8, 16.5, 1…\n#> $ flipper_length_mm 230, 184, 215, 210, 202, 203, 212, 225, 210, 211,…\n```\n:::\n:::\n\n\nNow, we'll create a data frame for each of the training and test set:\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-9_4dbd72c658687ca6c5c9dfc6962b1cbc'}\n\n```{.r .cell-code}\ntrain_data <- training(penguins_split)\ntest_data <- testing(penguins_split)\n```\n:::\n\n\n### Pre-processing\n\nThe main goal of this step is to use data transformations to make the data suitable for modeling. Most transformations that one requires for standard data analysis tasks can be achieved by `dplyr`, or another `tidyverse` package.\n\n#### The pre-processing interface\n\nBefore training the model, a recipe can be used to carry out the pre-processing required by the model.\n\nThe `recipe()` function has two main arguments: a formula (with the same format as when doing [regression in tidymodels](tidymodels-regression.qmd#sec-slr-intro)) and a data argument, which is usually the training set as that is the data used to create the model. Hence, we have `data = train_data` here.\n\nIn our example, suppose that our goal is to predict penguin species from bill length, bill depth and flipper length; then our recipe function would look as follows:\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-10_77ac862fee16d19160e6cf03adee22dc'}\n\n```{.r .cell-code}\nrecipe(species ~ ., data = train_data)\n```\n:::\n\n\nThe point of a recipe is to act as a more general-purpose formula. A number of packages are not formula-based. The ever-popular `glmnet()` function is one example because it takes in matrices for the x and y variables instead of a formula. So a recipe is useful because you can use a package like `glmnet` by following the same standard formula-based recipe format and simply specify later on in the modelling stage that you would like to use `glmnet`.\n\n
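As a sketch of what that later specification might look like (the `multinom_reg()` model and the penalty value here are illustrative assumptions for our multiclass outcome, not something we use below):\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# The same formula-based recipe could later be paired with a\n# matrix-based engine like glmnet via a parsnip model specification\nglmnet_spec <- multinom_reg(penalty = 0.1) %>%\n  set_engine(\"glmnet\")\n```\n:::\n\n\n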
Now, after saying that you are making a recipe by way of the `recipe()` function, simply specify the transformations that you want to apply to your data in the necessary steps. Each data transformation is one step, and all of the available pre-processing transformations have the prefix of `step_`. Now, while there are many step functions available ([here's](https://recipes.tidymodels.org/reference/index.html) a list), we will only use the following three in our example.\n\n- `step_corr()` to remove variables which have large absolute correlations with others\n\n- `step_center()` to normalize numeric data to have a mean of zero\n\n- `step_scale()` to normalize numeric data to have a standard deviation of one\n\nOne of the advantages of having these pre-processing steps is that they help to simplify the process of applying training statistics to test data. For example, centering could be a nuisance to implement from scratch because we would first have to calculate statistics (variable averages) from the training data and then use them on both the training and test data. \n\n:::{.callout-note}\nNote that the statistics used for centering should be estimated from the training data rather than the test data, to avoid data leakage - contamination of the modelling process with information from the test data. \n:::\n\nIn a recipe, estimation of the variable means using the training data and the application of these to center new data is done automatically, under the hood, and so spares the coder from having to manually implement it. The situation is similar for scaling numeric data.\n\nAnother feature of the `tidymodels` pre-processing interface is that each step can be applied to one specified variable, a group of variables, or all variables. The `all_predictors()` and `all_outcomes()` functions are particularly convenient to help minimize the amount of typing you need to do. For instance, if you want to apply `step_center()` to only the predictor variables, simply type `step_center(all_predictors())` instead of listing out each and every predictor in the step function.\n\nNow, let's try this out on our example.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-11_92139d5ab910bdcdc9610f778722e874'}\n\n```{.r .cell-code}\npenguins_recipe <- recipe(species ~ ., data = train_data) %>%\n step_corr(all_predictors()) %>%\n step_center(all_predictors()) %>%\n step_scale(all_predictors())\n```\n:::\n\n\nTo summarize, we obtained a recipe object, `penguins_recipe`, by putting the `recipe()` and step functions together on our training data that we had ready to go from sampling.\n\nNow, to get the recipe details, simply call the recipe object. The operations section details what pre-processing steps we've applied to the data. Notice that the steps shown here are in the order that they were input into the recipe and they each indicate the variables that were used.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-12_a73027254138e5220218d014c0571de8'}\n\n```{.r .cell-code}\npenguins_recipe\n```\n:::\n\n\n
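To see the "training statistics applied to new data" behaviour from above in action, here is a minimal sketch using the `prep()` and `bake()` functions from `recipes` (a workflow, introduced below, normally does this for you):\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Estimate the centering/scaling statistics from the training data...\nprepped <- prep(penguins_recipe, training = train_data)\n\n# ...then apply those same statistics to any data, new or old\nbake(prepped, new_data = test_data)\n```\n:::\n\n\n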
### Model Training\n\nRecall that in R, the same type of model could be fit using several different packages, and each such package typically has its own style of interface. Two popular packages to fit random forest models are `ranger` and `randomForest`. One way that their interfaces differ is in the parameter name for the number of trees - `ranger()` has the parameter `num.trees`, whereas `randomForest()` has the parameter `ntree`. Such differences make it cumbersome to move a model from one package to another.\n\n`Tidymodels` created a single interface that supports the usage of both models. Moreover, this general interface supports an even wider range of functions that perform random forest. The key part that points to the function and package to be used is the engine.\n\nLet's see how this works in practice. In the example below, we'll use the general `rand_forest()` function from `tidymodels`. In there, we can specify the number of trees by using the `trees` argument. Then, in `set_engine()` we specify that we want to use ranger's version of random forest. Notice this follows the model specification format introduced in the [Regression in Tidymodels](tidymodels-regression.qmd#sec-slr-intro) chapter.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-13_2ce2f34d915184be525e50766273bfe5'}\n\n```{.r .cell-code}\npenguins_ranger <- rand_forest(trees = 100, mode = \"classification\") %>%\n set_engine(\"ranger\")\n```\n:::\n\n\nNow, if we wanted to use a different package's version of random forest, we could easily do that by simply swapping out the engine. To try this out, let's use `randomForest` instead of `ranger`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-14_75b18395aafd9d4afdc081db7a05fb63'}\n\n```{.r .cell-code}\npenguins_rf <- rand_forest(trees = 100, mode = \"classification\") %>%\n set_engine(\"randomForest\")\n```\n:::\n\n\nFor the remainder of this tutorial, we'll stick with using `ranger` for simplicity. And at this stage, we're ready to pre-process and model. A workflow can be used to pair model and processing tasks together. When different recipes are needed for different models, this is very useful so that you don't have to keep track of separate model and recipe objects in your workspace. So training and testing different workflows becomes easier.\n\nFor our example, we'll employ the `workflows` package to pair our model and our recipe together.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-15_d496ee9edf143e935990b725debaaec7'}\n\n```{.r .cell-code}\npenguins_wflow <- workflow() %>%\n add_model(penguins_ranger) %>%\n add_recipe(penguins_recipe)\n\npenguins_wflow\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Workflow ═════════════════════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: rand_forest()\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 3 Recipe Steps\n#> \n#> • step_corr()\n#> • step_center()\n#> • step_scale()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> Random Forest Model Specification (classification)\n#> \n#> Main Arguments:\n#> trees = 100\n#> \n#> Computational engine: ranger\n```\n:::\n:::\n\n\nAfter that, we're ready to fit the model to our training data. The `fit()` function is what we will use to prepare the recipe and train the model from the finalized predictors.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-16_7c12180c3e71f078841b3a8df9ca54dc'}\n\n```{.r .cell-code}\npenguins_fit <- penguins_wflow %>% fit(data = train_data)\n```\n:::\n\n\nThe resulting object contains both the recipe and fitted model. To extract the model, use the helper function of `extract_fit_parsnip()`, and to extract the recipe object, use `extract_recipe()`.\n\n
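For instance, a quick peek at the prepped recipe contained in the fitted workflow:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Inspect the recipe (with its estimated statistics) from the fit\nextract_recipe(penguins_fit)\n```\n:::\n\n\n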
We extract the model object below.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-17_adf67ad4f74f01b4a2e09821534fca75'}\n\n```{.r .cell-code}\nextract_fit_parsnip(penguins_fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) \n#> \n#> Type: Probability estimation \n#> Number of trees: 100 \n#> Sample size: 233 \n#> Number of independent variables: 3 \n#> Mtry: 1 \n#> Target node size: 10 \n#> Variable importance mode: none \n#> Splitrule: gini \n#> OOB prediction error (Brier s.): 0.02943022\n```\n:::\n:::\n\n\nOne important thing to notice is that if we wanted to use the `randomForest` model instead of the `ranger` model, all we'd need to do is replace the engine in the model specification; the rest of the code remains the same. We shall leave it to the reader to try this on their own and marvel at the beauty of having such a unifying interface.\n\n
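In sketch form, that swap could be as simple as updating the model specification in the existing workflow (here using the `update_model()` helper from `workflows` and the `penguins_rf` specification defined above) and re-fitting:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Hypothetical re-fit with the randomForest engine; everything else\n# (recipe, fitting, prediction) stays exactly the same\npenguins_fit_rf <- penguins_wflow %>%\n  update_model(penguins_rf) %>%\n  fit(data = train_data)\n```\n:::\n\n\n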
### Use a trained workflow to predict\n\nUp to this point we have\n\n1. Built the model (`penguins_ranger`),\n\n2. Created a pre-processing recipe (`penguins_recipe`),\n\n3. Paired the model and recipe (`penguins_wflow`), and\n\n4. Trained our workflow using `fit()`.\n\nSo the next step is to use the trained workflow, `penguins_fit`, to predict with the test data. This is easily done with a call to `predict()`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-18_48e696c9d90b3c6d7c8251337e6b4991'}\n\n```{.r .cell-code}\npredict(penguins_fit, test_data)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 100 × 1\n#> .pred_class\n#> \n#> 1 Adelie \n#> 2 Adelie \n#> 3 Adelie \n#> 4 Chinstrap \n#> 5 Adelie \n#> 6 Adelie \n#> # ℹ 94 more rows\n```\n:::\n:::\n\n\nIf you want to obtain a probability for each predicted value, simply set `type = \"prob\"` in `predict()`. This will yield a tibble with one column per outcome class, each giving the predicted probability that an observation belongs to that class.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-19_af9e880b31504b55f0de2461d26ce93e'}\n\n```{.r .cell-code}\npenguins_predp <- penguins_fit %>%\n predict(test_data, type = \"prob\")\n```\n:::\n\n\nTo add the predicted values as a new column on the test data, you can use the `bind_cols()` function from `dplyr`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-20_c0637410ff018df353bfded24b041a1a'}\n\n```{.r .cell-code}\nbind_cols(test_data, penguins_predp) %>%\n head() # View first six rows of output\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 6 × 7\n#> species bill_length_mm bill_depth_mm flipper_length_mm .pred_Adelie\n#> \n#> 1 Adelie 39.5 17.4 186 0.910\n#> 2 Adelie 40.3 18 195 0.967\n#> 3 Adelie 38.7 19 195 0.964\n#> 4 Adelie 46 21.5 194 0.286\n#> 5 Adelie 35.9 19.2 189 0.997\n#> 6 Adelie 38.2 18.1 185 1 \n#> # ℹ 2 more variables: .pred_Chinstrap , .pred_Gentoo \n```\n:::\n:::\n\n\nAlternatively, we can use the `augment()` function to obtain the predicted probabilities and add them to the test data in a one-liner.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-21_4969b7fc4d20b43b37f5df734074efec'}\n\n```{.r .cell-code}\npenguins_aug <- augment(penguins_fit, test_data)\n\npenguins_aug\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 100 × 8\n#> species bill_length_mm bill_depth_mm flipper_length_mm .pred_class\n#> \n#> 1 Adelie 39.5 17.4 186 Adelie \n#> 2 Adelie 40.3 18 195 Adelie \n#> 3 Adelie 38.7 19 195 Adelie \n#> 4 Adelie 46 21.5 194 Chinstrap \n#> 5 Adelie 35.9 19.2 189 Adelie \n#> 6 Adelie 38.2 18.1 185 Adelie \n#> # ℹ 94 more rows\n#> # ℹ 3 more variables: .pred_Adelie , .pred_Chinstrap , …\n```\n:::\n:::\n\n\nWe can see from the first couple of rows shown that our model predicted the species correctly to be Adelie (in the `.pred_class` column) because the `.pred_Adelie` probabilities are by far the largest of the three predicted probabilities for each row. So while we can informally say that our model is doing well for predicting, how can we formally assess this? We would like to calculate a metric (well, probably more than one) to tell us how well our model predicts the species of penguins.\n\n### Model Validation\n\nThe `metrics()` function from the `yardstick` package helps to assess model performance. As suggested by its name, it will output some metrics, and as an added bonus, these will be automatically selected for the type of model that you construct. The input for this function is a tibble that contains the actual values and the predicted values. This way we can compare how close the model estimates are to the truth. To serve this purpose, we can use `penguins_aug`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-22_b8252ed2ea48acc351d82910705fa3ba'}\n\n```{.r .cell-code}\npenguins_aug %>%\n metrics(truth = species, .pred_Adelie:.pred_Gentoo, estimate = .pred_class)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 4 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy multiclass 0.97 \n#> 2 kap multiclass 0.954\n#> 3 mn_log_loss multiclass 0.124\n#> 4 roc_auc hand_till 0.993\n```\n:::\n:::\n\n\n
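As an aside, if you would rather compute a hand-picked set of metrics, `yardstick` lets you bundle them with `metric_set()`. A small sketch (the two class metrics chosen here are just for illustration):\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Build a custom metric bundle; these are hard-class metrics, so only\n# the estimate column is needed\nclass_metrics <- metric_set(accuracy, kap)\n\npenguins_aug %>%\n  class_metrics(truth = species, estimate = .pred_class)\n```\n:::\n\n\n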
Let's briefly go through the metrics that were generated. Accuracy is simply the proportion of values that are predicted correctly, while kappa is similar to accuracy, but is normalized by the accuracy that would be expected by chance (you can think of it as a measure that compares observed accuracy to expected accuracy from random chance alone). For our example, both the accuracy and kappa value estimates are extremely high (near to the upper limit of 1) and similar in value, indicating that our model performs very well for prediction on the test data. Log loss is a measure of the performance of a classification model and a perfect model has a log loss of 0, so our model performs pretty well in that respect. Finally, `roc_auc` is the area under the ROC curve and we'll explain this very shortly so stay tuned (for now, just note that a value close to 1, like we have, is the goal). All in all, our model fares very well.\n\nSince it is often not enough to rely purely on one-number summaries of model performance, we'll also look to graphical, curve-based metrics. We'll walk through producing the classic ROC curve, which is computed using `roc_curve()` and `roc_auc()` from `yardstick`.\n\nTo get ourselves an ROC curve, we need to input the actual values and the columns of predicted class probabilities into `roc_curve()`. We finish off by piping into the `autoplot()` function.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-23_4f33465326b36ba9f605f8144e95ba64'}\n\n```{.r .cell-code}\npenguins_aug %>%\n roc_curve(truth = species, .pred_Adelie:.pred_Gentoo) %>%\n autoplot()\n```\n\n::: {.cell-output-display}\n![](tidymodels-intro_files/figure-html/unnamed-chunk-23-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nNotice that the x-axis displays 1 - specificity, which is otherwise known as the false positive rate. So on this plot, we can visualize the trade-off between the false positive (1 - specificity) and the true positive (sensitivity) rates. Since the best classification would be where all positives are correctly classified as positive (sensitivity = 1), and no negatives are incorrectly classified as positive (specificity = 1), curves closer to the top left corner (and, hence, an area under the curve of about 1) are what we're hoping for.\n\nSo, we can see that the curves for each of our species are looking pretty close to perfection (save for Adelie, which still does very well). 
To estimate the area under the curves, we can use `roc_auc` (or look to the summary of our metrics above for this very value).\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-24_2f383612e8ab94d7d60b43f6df1b72f3'}\n\n```{.r .cell-code}\npenguins_aug %>%\n roc_auc(truth = species, .pred_Adelie:.pred_Gentoo)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 1 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 roc_auc hand_till 0.993\n```\n:::\n:::\n\n\nAs expected, the estimated area is very close to 1, indicating near-perfect discrimination.\n\nThe `yardstick` package also offers other standard tools for model assessment like a confusion matrix, from which we can inspect the counts of correct classifications and misclassifications.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-25_976809de4dcd6ea5a1da1608bf941c16'}\n\n```{.r .cell-code}\npenguins_aug %>%\n conf_mat(truth = species, estimate = .pred_class)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> Truth\n#> Prediction Adelie Chinstrap Gentoo\n#> Adelie 40 0 0\n#> Chinstrap 3 24 0\n#> Gentoo 0 0 33\n```\n:::\n:::\n\n\nWe could even combine this with `autoplot()` to get a nice heatmap visualization.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-intro_cache/html/unnamed-chunk-26_96de13131862791cf00497058a2f1a03'}\n\n```{.r .cell-code}\npenguins_aug %>%\n conf_mat(truth = species, estimate = .pred_class) %>%\n autoplot(\"heatmap\")\n```\n\n::: {.cell-output-display}\n![](tidymodels-intro_files/figure-html/unnamed-chunk-26-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nThe diagonal shows the counts of correct predictions for each species, while the off-diagonal shows the counts of model misclassifications. As the metrics have indicated, our model performed magnificently on the test set as there were only three misclassifications of Adelie penguins as Chinstrap.\n\n\n🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧\n\n## Attribution\n\n\nThis Chapter was largely adapted from [A Gentle Introduction to Tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/) as well as [Tidymodels - Getting Started](https://www.tidymodels.org/start/recipes/) and [Tidymodels](https://wec.wur.nl/dse/24-tidymodels.html). The diagrams are from [A Gentle Introduction to Tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/) and based on [R for Data Science](https://r4ds.had.co.nz/explore-intro.html).\n\n🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/tidymodels-regression/execute-results/html.json b/_freeze/tidymodels-regression/execute-results/html.json index c2f8b4e..f11ca0c 100644 --- a/_freeze/tidymodels-regression/execute-results/html.json +++ b/_freeze/tidymodels-regression/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "11af347186b163fee8f2e7de58f8f919", + "hash": "01419910222b503b86a097b57dccee23", "result": { - "markdown": "# Regression in Tidymodels\n\n\n::: {.cell hash='tidymodels-regression_cache/html/unnamed-chunk-1_5f143964b13de5b28ac37a5c751a7e1f'}\n\n:::\n\n\nThis vignette is a gentle introduction to performing simple and multiple linear regression using `tidymodels`. Model fitting will be done using [parsnip](https://www.tidymodels.org/start/models/), which provides a unifying interface for model fitting and the resulting output. 
This means that parsnip provides a single interface with standardized argument names for each class of models so that you don't have to directly deal with the different interfaces for different functions that aim to do the same thing (like linear regression). See [here](https://www.tidymodels.org/find/parsnip/) for a list of models that `parsnip` currently supports.\n\n## Libraries\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-2_f1b70260de9274ceee0de0930c458b3d'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(broom)\nlibrary(performance)\n```\n:::\n\n\n## Simple linear regression\n\nThe key steps to perform linear regression in `tidymodels` are to first specify the model type and then to specify the model form and the data to be used to construct it.\n\nTo illustrate, we shall look to the `penguins` dataset from `tidymodels`' `modeldata` package. This dataset contains measurements for 344 penguins from three islands in Palmer Archipelago, Antarctica, and includes information on their species, island home, size (flipper length, body mass, bill dimensions), and sex.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-3_473e0ea4ec6d47826afc1c168bb38198'}\n::: {.cell-output-display}\n![](img/palmer_penguin_species.png){fig-align='center' width=75%}\n:::\n:::\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-4_aa7c3f7c67794e4868c4a9be11dcebfd'}\n\n```{.r .cell-code}\n# Let's inspect the data\nhead(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 6 × 7\n#> species island bill_length_mm bill_depth_mm flipper_length_mm\n#> \n#> 1 Adelie Torgersen 39.1 18.7 181\n#> 2 Adelie Torgersen 39.5 17.4 186\n#> 3 Adelie Torgersen 40.3 18 195\n#> 4 Adelie Torgersen NA NA NA\n#> 5 Adelie Torgersen 36.7 19.3 193\n#> 6 Adelie Torgersen 39.3 20.6 190\n#> # ℹ 2 more variables: body_mass_g , sex \n```\n:::\n:::\n\n\nOne thing you may have spotted is that there's missing data in this dataset in the fourth row. For simplicity, we will only work with the complete cases. This reduces the number of rows in our dataset to 333.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-5_6c5cb7769e50f22fea43f07d1fca5e94'}\n\n```{.r .cell-code}\npenguins <- penguins %>%\n filter(complete.cases(.))\n\nhead(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 6 × 7\n#> species island bill_length_mm bill_depth_mm flipper_length_mm\n#> \n#> 1 Adelie Torgersen 39.1 18.7 181\n#> 2 Adelie Torgersen 39.5 17.4 186\n#> 3 Adelie Torgersen 40.3 18 195\n#> 4 Adelie Torgersen 36.7 19.3 193\n#> 5 Adelie Torgersen 39.3 20.6 190\n#> 6 Adelie Torgersen 38.9 17.8 181\n#> # ℹ 2 more variables: body_mass_g , sex \n```\n:::\n:::\n\n\nMuch better! We will now build a simple linear regression model to model bill length as a function of bill depth.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-6_9752f79826c0bd2c0727f0d373aa20ff'}\n::: {.cell-output-display}\n![](img/bill_length_depth.png){fig-align='center' width=60%}\n:::\n:::\n\n\nIn `parsnip`, the model specification is broken down into small functions such as `set_mode()` and `set_engine()` to make the interface more flexible and readable. The general structure is to first specify a mode (regression or classification) and then an engine to indicate what software (or implementation of the algorithm) will be used to fit the model. 
For our purposes, the mode is `regression` and the engine is `lm` for ordinary least squares. You may note that setting the mode is unnecessary for linear regression, but we include it here as it is a good practice.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-7_d894b718918a6c425b9e0fff7ad8299b'}\n\n```{.r .cell-code}\nlm_spec <- linear_reg() %>%\n set_mode(\"regression\") %>%\n set_engine(\"lm\")\n```\n:::\n\n\nThe above specification does not actually carry out the regression; rather, it just states what we would like to do.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-8_dbdab6875a55b8152227310053c8cadd'}\n\n```{.r .cell-code}\nlm_spec\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> Linear Regression Model Specification (regression)\n#> \n#> Computational engine: lm\n```\n:::\n:::\n\n\nOnce we have such a blueprint, we may fit a model by inputting data and a formula. Recall that in R, a formula takes the form `y ~ x` where `y` is the response and `x` is the predictor variable. For our example, where the response is bill length and the predictor is bill depth, we would write the formula as `bill_length_mm ~ bill_depth_mm`. \n\n::: {.callout-note}\nUnlike with standard R `formula()` objects, the names used in this formula must \nbe identical to the variable names in the dataset. No processing functions\nare allowed (processing is handled by the `recipe()`).\n:::\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-9_4435e7400b6dd465d61c248470b8ec32'}\n\n```{.r .cell-code}\nlm_fit <- lm_spec %>%\n fit(bill_length_mm ~ bill_depth_mm, data = penguins)\n\nlm_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ bill_depth_mm, data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm \n#> 54.8909 -0.6349\n```\n:::\n:::\n\n\nThe resulting `parsnip` object includes basic information about the fit such as the model coefficients. To access the underlying fit object, we could use the standard `lm_fit$fit` or `purrr`'s `pluck()` function.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-10_c44cea2b4ccca541a75e2249a6991015'}\n\n```{.r .cell-code}\nlm_fit %>%\n pluck(\"fit\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ bill_depth_mm, data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm \n#> 54.8909 -0.6349\n```\n:::\n:::\n\n\nTo get additional information about the fit (such as standard errors and goodness-of-fit statistics), we can get a summary of the model fit as follows:\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-11_170fec411bf1e01a30f67a906726bc21'}\n\n```{.r .cell-code}\nlm_fit %>%\n pluck(\"fit\") %>%\n summary()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ bill_depth_mm, data = data)\n#> \n#> Residuals:\n#> Min 1Q Median 3Q Max \n#> -12.9498 -3.9530 -0.3657 3.7327 15.5025 \n#> \n#> Coefficients:\n#> Estimate Std. Error t value Pr(>|t|) \n#> (Intercept) 54.8909 2.5673 21.380 < 2e-16 ***\n#> bill_depth_mm -0.6349 0.1486 -4.273 2.53e-05 ***\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n#> \n#> Residual standard error: 5.332 on 331 degrees of freedom\n#> Multiple R-squared: 0.05227,\tAdjusted R-squared: 0.04941 \n#> F-statistic: 18.26 on 1 and 331 DF, p-value: 2.528e-05\n```\n:::\n:::\n\n\nTo get a tidy summary of the model parameter estimates, simply use the `tidy()` function from the [broom](https://broom.tidymodels.org/) package on the model fit. To extract model statistics, `glance()` can be used.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-12_3c6cda9ae2eea12f6c255b8bf2cd5061'}\n\n```{.r .cell-code}\ntidy(lm_fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 2 × 5\n#> term estimate std.error statistic p.value\n#> \n#> 1 (Intercept) 54.9 2.57 21.4 2.54e-64\n#> 2 bill_depth_mm -0.635 0.149 -4.27 2.53e- 5\n```\n:::\n\n```{.r .cell-code}\nglance(lm_fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 1 × 12\n#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC\n#> \n#> 1 0.0523 0.0494 5.33 18.3 0.0000253 1 -1029. 2064. 2075.\n#> # ℹ 3 more variables: deviance , df.residual , nobs \n```\n:::\n:::\n\n\nNow, to make predictions, we simply use `predict()` on the parsnip model object. In there, we must specify the dataset we want to predict on in the `new_data` argument. Note that this may be a different dataset than we used for fitting the model, but this input data must include all predictor variables that were used to fit the model.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-13_0370f690b7e270998042a729bdbf587f'}\n\n```{.r .cell-code}\npredict(lm_fit, new_data = penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 333 × 1\n#> .pred\n#> \n#> 1 43.0\n#> 2 43.8\n#> 3 43.5\n#> 4 42.6\n#> 5 41.8\n#> 6 43.6\n#> # ℹ 327 more rows\n```\n:::\n:::\n\n\nFor parsnip models, the predictions are always returned in a tibble.\n\nTo specify the type of prediction made, modify the `type` argument. If we set `type = \"conf_int\"`, we get a 95% confidence interval.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-14_ef093136394a01ed5d663b2ea837c983'}\n\n```{.r .cell-code}\npredict(lm_fit, new_data = penguins, type = \"conf_int\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 333 × 2\n#> .pred_lower .pred_upper\n#> \n#> 1 42.3 43.7\n#> 2 43.3 44.4\n#> 3 42.8 44.1\n#> 4 41.8 43.5\n#> 5 40.7 43.0\n#> 6 43.0 44.2\n#> # ℹ 327 more rows\n```\n:::\n:::\n\n\n
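If you need a different interval level, `predict()` also takes a `level` argument for interval types (a quick sketch; the default is 0.95):\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Request a 90% confidence interval instead of the default 95%\npredict(lm_fit, new_data = penguins, type = \"conf_int\", level = 0.90)\n```\n:::\n\n\n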
To evaluate model predictive performance, it is logical to compare each of the observed and predicted values. To see these values side-by-side, we simply bind the two vectors of interest.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-15_e464c86a9e8050cfa6b8e3174a99d1ee'}\n\n```{.r .cell-code}\nbind_cols(\n predict(lm_fit, new_data = penguins),\n penguins\n) %>%\n select(bill_length_mm, .pred)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 333 × 2\n#> bill_length_mm .pred\n#> \n#> 1 39.1 43.0\n#> 2 39.5 43.8\n#> 3 40.3 43.5\n#> 4 36.7 42.6\n#> 5 39.3 41.8\n#> 6 38.9 43.6\n#> # ℹ 327 more rows\n```\n:::\n:::\n\n\nA simpler way to do this is to use the nifty `augment()` function.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-16_008a078253bb995f610dd7929c7c4874'}\n\n```{.r .cell-code}\naugment(lm_fit, new_data = penguins) %>%\n select(bill_length_mm, .pred)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 333 × 2\n#> bill_length_mm .pred\n#> \n#> 1 39.1 43.0\n#> 2 39.5 43.8\n#> 3 40.3 43.5\n#> 4 36.7 42.6\n#> 5 39.3 41.8\n#> 6 38.9 43.6\n#> # ℹ 327 more rows\n```\n:::\n:::\n\n\n## Multiple linear regression\n\nThe only difference between fitting a multiple linear regression model and a simple linear regression model lies in the formula. For multiple linear regression, the predictors are specified in the formula expression, separated by `+`. For example, if we have a response variable `y` and three predictors, `x1, x2,` and `x3`, we would write the formula as `y ~ x1 + x2 + x3`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-17_77a1ab952443165abb55d9b9fdae419e'}\n\n```{.r .cell-code}\nlm_fit2 <- lm_spec %>% fit(\n formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm + body_mass_g,\n data = penguins\n)\nlm_fit2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm + \n#> body_mass_g, data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm flipper_length_mm body_mass_g \n#> -2.571e+01 6.131e-01 2.872e-01 3.472e-04\n```\n:::\n:::\n\n\nEverything else proceeds much the same as before, 
such as obtaining parameter estimates\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-18_2f633c83ebc971b730b1364d8a51e402'}\n\n```{.r .cell-code}\ntidy(lm_fit2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 4 × 5\n#> term estimate std.error statistic p.value\n#> \n#> 1 (Intercept) -25.7 6.72 -3.83 1.55e- 4\n#> 2 bill_depth_mm 0.613 0.138 4.43 1.26e- 5\n#> 3 flipper_length_mm 0.287 0.0351 8.18 6.28e-15\n#> 4 body_mass_g 0.000347 0.000566 0.614 5.40e- 1\n```\n:::\n:::\n\n\nas well as predicting new values.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-19_bade4f9eb41832a1bfdb0efeae87f940'}\n\n```{.r .cell-code}\npredict(lm_fit2, new_data = penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 333 × 1\n#> .pred\n#> \n#> 1 39.0\n#> 2 39.7\n#> 3 42.5\n#> 4 42.8\n#> 5 42.8\n#> 6 38.4\n#> # ℹ 327 more rows\n```\n:::\n:::\n\n\nIf you would like to use all variables aside from your response as predictors, a shortcut is to use the formula form `y ~ .`\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-20_4a1fb1bf15b09cf0cb3b73cb1f3a7c72'}\n\n```{.r .cell-code}\nlm_fit3 <- lm_spec %>% fit(bill_length_mm ~ ., data = penguins)\nlm_fit3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ ., data = data)\n#> \n#> Coefficients:\n#> (Intercept) speciesChinstrap speciesGentoo islandDream \n#> 15.343291 9.835502 6.117675 -0.503815 \n#> islandTorgersen bill_depth_mm flipper_length_mm body_mass_g \n#> -0.127431 0.300670 0.069257 0.001081 \n#> sexmale \n#> 2.047859\n```\n:::\n:::\n\n\n## Checking model assumptions\n\nAfter fitting a model, it is good to check whether the assumptions of linear regression are met. For this, we will use the `performance` package, in particular the `check_model()` function to produce several helpful plots we may use to check the assumptions for our first multiple linear regression model.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-21_9f0e50621ab07bb69b919f860e7a0ba9'}\n\n```{.r .cell-code}\nlm_fit2 %>%\n extract_fit_engine() %>%\n check_model()\n```\n\n::: {.cell-output-display}\n![](tidymodels-regression_files/figure-html/unnamed-chunk-21-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nNotice that on each plot it says what we should expect to see if the model assumption is met.\n\nWe shall now briefly walk you through what each plot means.\n\nThe first two plots help us to examine the linearity of the errors versus the fitted values. Ideally, we want the trend of these errors to be relatively flat and horizontal. The third plot is for checking homogeneity of the variance, where we want the points to be roughly the same distance from the line as this indicates similar dispersion. The fourth plot helps us to see if there are high leverage points - points that exert strong influence over the model fit. As a result, these can have a great effect on the model predictions. So the removal of such points or modifications to the model may be necessary to deal with them. The fifth plot helps us to discern collinearity, which is when predictors are highly correlated. 
Since the model assumes that its predictors are independent, high correlation between them can throw off simple regression models: it inflates the standard errors of the coefficient estimates, and the estimates themselves become sensitive to which predictors are included in the model. The last plot enables us to check the normality of residuals. If the distribution of the model error is non-normal, then that suggests a linear model may not be appropriate. For a QQ plot, we want the points to fall along a straight diagonal line.\n\nFor our example, we observe that there's a pretty high correlation between `body_mass_g` and `flipper_length_mm` (a variance inflation factor not quite in the red zone of 10 and above, but close enough for concern). That is indicative of multicollinearity between them. Intuitively, it makes sense for the body mass and flipper length variables - we'd expect that as one increases, so should the other.\n\n
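To see those variance inflation factors as numbers rather than a plot, the `performance` package also provides `check_collinearity()`; a quick sketch:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# VIFs for the predictors in the multiple regression model\nlm_fit2 %>%\n  extract_fit_engine() %>%\n  check_collinearity()\n```\n:::\n\n\n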
We can take a closer look at the correlation by whipping up a correlation matrix with base R's `cor()` function. Since for collinearity we're usually only interested in the numerical predictors, we'll only include the four numeric variables.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-22_49987b72aff741a32f5dfc50558d4948'}\n\n```{.r .cell-code}\npenguins_corr <- penguins %>%\n select(body_mass_g, ends_with(\"_mm\")) %>%\n cor()\npenguins_corr\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> body_mass_g bill_length_mm bill_depth_mm flipper_length_mm\n#> body_mass_g 1.0000000 0.5894511 -0.4720157 0.8729789\n#> bill_length_mm 0.5894511 1.0000000 -0.2286256 0.6530956\n#> bill_depth_mm -0.4720157 -0.2286256 1.0000000 -0.5777917\n#> flipper_length_mm 0.8729789 0.6530956 -0.5777917 1.0000000\n```\n:::\n:::\n\n\nIndeed `body_mass_g` and `flipper_length_mm` are highly positively correlated. To deal with this problem, we'll re-fit the model without `body_mass_g`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-23_306a0e9cc19412e08433bd04830dd697'}\n\n```{.r .cell-code}\nlm_fit3 <- lm_spec %>% fit(\n formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm,\n data = penguins\n)\nlm_fit3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm, \n#> data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm flipper_length_mm \n#> -27.9762 0.6200 0.3052\n```\n:::\n:::\n\n\nand then check again to see whether the assumptions are met.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-24_6508d43288ce1be6d4caff814eec9b9f'}\n\n```{.r .cell-code}\nlm_fit3 %>%\n extract_fit_engine() %>%\n check_model()\n```\n\n::: {.cell-output-display}\n![](tidymodels-regression_files/figure-html/unnamed-chunk-24-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nOverall, the plots look pretty good. For details on how to interpret each of these plots and more details about model assumptions please see [here](https://easystats.github.io/see/articles/performance.html) and [here](https://rdrr.io/cran/performance/man/check_model.html).\n\n## Interaction terms\n\nIn general, the syntax to add an interaction term to a formula is as follows:\n\n- `x:y` denotes an interaction term between `x` and `y`.\n- `x*y` denotes the interaction between `x` and `y` as well as their main effects; that is, it expands to `x + y + x:y`.\n\nIt is important to note that this syntax is not compatible with all engines. Thus, we shall explain how to bypass this issue by adding an interaction term in a recipe later on. For now, let's start simple by adding an interaction term between `species` and `bill_depth_mm`, which allows for a species-specific slope.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-25_be09e39914328ca51bbe6fb2207c82b1'}\n\n```{.r .cell-code}\nlm_fit4 <- lm_spec %>% fit(\n formula = bill_length_mm ~ species * bill_depth_mm,\n data = penguins\n)\nlm_fit4\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ species * bill_depth_mm, \n#> data = data)\n#> \n#> Coefficients:\n#> (Intercept) speciesChinstrap \n#> 23.3668 -9.9389 \n#> speciesGentoo bill_depth_mm \n#> -6.6966 0.8425 \n#> speciesChinstrap:bill_depth_mm speciesGentoo:bill_depth_mm \n#> 1.0796 1.2178\n```\n:::\n:::\n\n\nUsing recipes, the interaction term is specified by using `step_interact()`. Then we construct a workflow object, where we add the linear regression model specification and recipe. Finally, we fit the model as we did for a `parsnip` model. Note that the workflow object does not need the variables that were specified in the recipe to be specified again.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-26_5d6b5266d0b373c844a38e0a080436d0'}\n\n```{.r .cell-code}\nrec_spec_interact <- recipe(\n formula = bill_length_mm ~ species + bill_depth_mm,\n data = penguins\n) %>%\n step_interact(~ species:bill_depth_mm)\n\nlm_wf_interact <- workflow() %>%\n add_model(lm_spec) %>%\n add_recipe(rec_spec_interact)\n\nlm_wf_interact %>% fit(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Workflow [trained] ═══════════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: linear_reg()\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 1 Recipe Step\n#> \n#> • step_interact()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> \n#> Call:\n#> stats::lm(formula = ..y ~ ., data = data)\n#> \n#> Coefficients:\n#> (Intercept) speciesChinstrap \n#> 23.3668 -9.9389 \n#> speciesGentoo bill_depth_mm \n#> -6.6966 0.8425 \n#> speciesChinstrap_x_bill_depth_mm speciesGentoo_x_bill_depth_mm \n#> 1.0796 1.2178\n```\n:::\n:::\n\n\nNotice the variable name for the interaction term is not the same as it is in base R (which is simply of the form `x:y`). In `step_interact()`, the default separator between the variable names is `_x_`. You can change this default by specifying the `sep` argument in the function.\n\n
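For instance, this hypothetical variant of the recipe step would name the interaction column with a `:` separator instead:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n# Same interaction step, but keep base R's \":\" naming convention\nrecipe(\n  formula = bill_length_mm ~ species + bill_depth_mm,\n  data = penguins\n) %>%\n  step_interact(~ species:bill_depth_mm, sep = \":\")\n```\n:::\n\n\n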
To read more about formula syntax, see [?formula](https://rdrr.io/r/stats/formula.html).\n\n## Non-linear transformations of the predictors\n\nSimilar to how we were able to add an interaction term using recipes, we can also perform a transformation as a pre-processing step. The function used for this is `step_mutate()` (which acts like `dplyr`'s `mutate`).\n\nNote that, in general, if you are specifying a recipe, aim to keep as much of the pre-processing in your recipe specification as possible. This helps to ensure that the transformation will be applied to new data consistently.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-27_b7d9eb46155f4d760ade7a95ee6ba5d7'}\n\n```{.r .cell-code}\nrec_spec_pow2 <- recipe(bill_length_mm ~ bill_depth_mm, data = penguins) %>%\n step_mutate(bill_depth_mm2 = bill_depth_mm^2)\n\nlm_wf_pow2 <- workflow() %>%\n add_model(lm_spec) %>%\n add_recipe(rec_spec_pow2)\n\nlm_wf_pow2 %>% fit(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Workflow [trained] ═══════════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: linear_reg()\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 1 Recipe Step\n#> \n#> • step_mutate()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> \n#> Call:\n#> stats::lm(formula = ..y ~ ., data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm bill_depth_mm2 \n#> 95.2558 -5.4431 0.1413\n```\n:::\n:::\n\n\nThere are many transformations already built into recipes such as `step_log()`. So, for basic transformations, there's often no need to make your own transformation from scratch. See [here](https://recipes.tidymodels.org/reference/#section-step-functions-individual-transformations) for a comprehensive list of the transformations that are offered in recipes.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-28_53a92b75fcd1ad219c04c0fcc94425be'}\n\n```{.r .cell-code}\nrec_spec_log <- recipe(bill_length_mm ~ bill_depth_mm, data = penguins) %>%\n step_log(bill_depth_mm) # transforms the var in-place, keeps its name\n\nlm_wf_log <- workflow() %>%\n add_model(lm_spec) %>%\n add_recipe(rec_spec_log)\n\nlm_wf_log %>% fit(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Workflow [trained] ═══════════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: linear_reg()\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 1 Recipe Step\n#> \n#> • step_log()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> \n#> Call:\n#> stats::lm(formula = ..y ~ ., data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm \n#> 74.95 -10.91\n```\n:::\n:::\n\n\n\\\n\\\n🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧\n\n## Attribution\n\nThis Chapter was largely adapted from [Chapter 3 of ISLR tidymodels labs](https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/03-linear-regression.html). Checking linear regression assumptions using the performance package is based on [this article](https://easystats.github.io/performance/reference/check_model.html) and [this blog post](https://www.r-bloggers.com/2021/07/easystats-quickly-investigate-model-performance/) on investigating model performance. The artwork used is by [Allison Horst](https://twitter.com/allison_horst).\n\n🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧\n", + "markdown": "# Regression in Tidymodels\n\n\n::: {.cell}\n\n:::\n\n\nThis vignette is a gentle introduction to performing simple and multiple linear regression using `tidymodels`. Model fitting will be done using [parsnip](https://www.tidymodels.org/start/models/), which provides a unifying interface for model fitting and the resulting output. 
This means that parsnip provides a single interface with standardized argument names for each class of models so that you don't have to directly deal with the different interfaces for different functions that aim to do the same thing (like linear regression). See [here](https://www.tidymodels.org/find/parsnip/) for a list of models that `parsnip` currently supports.\n\n## Libraries\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-2_f1b70260de9274ceee0de0930c458b3d'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(broom)\nlibrary(performance)\n```\n:::\n\n\n## Simple linear regression {#sec-slr-intro}\n\nThe key steps to perform linear regression in `tidymodels` are to first specify the model type and then to specify the model form and the data to be used to construct it.\n\nTo illustrate, we shall look to the `penguins` dataset from `tidymodels`' `modeldata` package. This dataset contains measurements for 344 penguins from three islands in Palmer Archipelago, Antarctica, and includes information on their species, island home, size (flipper length, body mass, bill dimensions), and sex.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-3_473e0ea4ec6d47826afc1c168bb38198'}\n::: {.cell-output-display}\n![](img/palmer_penguin_species.png){fig-align='center' width=75%}\n:::\n:::\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-4_aa7c3f7c67794e4868c4a9be11dcebfd'}\n\n```{.r .cell-code}\n# Let's inspect the data\nhead(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 6 × 7\n#> species island bill_length_mm bill_depth_mm flipper_length_mm\n#> \n#> 1 Adelie Torgersen 39.1 18.7 181\n#> 2 Adelie Torgersen 39.5 17.4 186\n#> 3 Adelie Torgersen 40.3 18 195\n#> 4 Adelie Torgersen NA NA NA\n#> 5 Adelie Torgersen 36.7 19.3 193\n#> 6 Adelie Torgersen 39.3 20.6 190\n#> # ℹ 2 more variables: body_mass_g , sex \n```\n:::\n:::\n\n\nOne thing you may have spotted is that there's missing data in this dataset in the fourth row. For simplicity, we will only work with the complete cases. This reduces the number of rows in our dataset to 333.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-5_6c5cb7769e50f22fea43f07d1fca5e94'}\n\n```{.r .cell-code}\npenguins <- penguins %>%\n filter(complete.cases(.))\n\nhead(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 6 × 7\n#> species island bill_length_mm bill_depth_mm flipper_length_mm\n#> \n#> 1 Adelie Torgersen 39.1 18.7 181\n#> 2 Adelie Torgersen 39.5 17.4 186\n#> 3 Adelie Torgersen 40.3 18 195\n#> 4 Adelie Torgersen 36.7 19.3 193\n#> 5 Adelie Torgersen 39.3 20.6 190\n#> 6 Adelie Torgersen 38.9 17.8 181\n#> # ℹ 2 more variables: body_mass_g , sex \n```\n:::\n:::\n\n\nMuch better! We will now build a simple linear regression model to model bill length as a function of bill depth.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-6_9752f79826c0bd2c0727f0d373aa20ff'}\n::: {.cell-output-display}\n![](img/bill_length_depth.png){fig-align='center' width=60%}\n:::\n:::\n\n\nIn `parsnip`, the model specification is broken down into small functions such as `set_mode()` and `set_engine()` to make the interface more flexible and readable. 
The general structure is to first specify a mode (regression or classification) and then an engine to indicate what software (or implementation of the algorithm) will be used to fit the model. For our purposes, the mode is `regression` and the engine is `lm` for ordinary least squares. You may note that setting the mode is unnecessary for linear regression, but we include it here as it is good practice.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-7_d894b718918a6c425b9e0fff7ad8299b'}\n\n```{.r .cell-code}\nlm_spec <- linear_reg() %>%\n set_mode(\"regression\") %>%\n set_engine(\"lm\")\n```\n:::\n\n\nThe above specification does not actually carry out the regression; rather, it just states what we would like to do.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-8_dbdab6875a55b8152227310053c8cadd'}\n\n```{.r .cell-code}\nlm_spec\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> Linear Regression Model Specification (regression)\n#> \n#> Computational engine: lm\n```\n:::\n:::\n\n\nOnce we have such a blueprint, we may fit a model by inputting data and a formula. Recall that in R, a formula takes the form `y ~ x` where `y` is the response and `x` is the predictor variable. For our example, where the response is bill length and the predictor is bill depth, we would write the formula as `bill_length_mm ~ bill_depth_mm`. \n\n::: {.callout-note}\nUnlike with standard R `formula()` objects, the names used in this formula must \nbe identical to the variable names in the dataset. No processing functions\nare allowed (processing is handled by the `recipe()`).\n:::\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-9_4435e7400b6dd465d61c248470b8ec32'}\n\n```{.r .cell-code}\nlm_fit <- lm_spec %>%\n fit(bill_length_mm ~ bill_depth_mm, data = penguins)\n\nlm_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ bill_depth_mm, data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm \n#> 54.8909 -0.6349\n```\n:::\n:::\n\n\nThe resulting `parsnip` object includes basic information about the fit, such as the model coefficients. To access the underlying fit object, we could use the standard `lm_fit$fit` or `purrr`'s `pluck()` function.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-10_c44cea2b4ccca541a75e2249a6991015'}\n\n```{.r .cell-code}\nlm_fit %>%\n pluck(\"fit\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ bill_depth_mm, data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm \n#> 54.8909 -0.6349\n```\n:::\n:::\n\n\nTo get additional information about the fit (such as standard errors and goodness-of-fit statistics), we can get a summary of the model fit as follows:\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-11_170fec411bf1e01a30f67a906726bc21'}\n\n```{.r .cell-code}\nlm_fit %>%\n pluck(\"fit\") %>%\n summary()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ bill_depth_mm, data = data)\n#> \n#> Residuals:\n#> Min 1Q Median 3Q Max \n#> -12.9498 -3.9530 -0.3657 3.7327 15.5025 \n#> \n#> Coefficients:\n#> Estimate Std. Error t value Pr(>|t|) \n#> (Intercept) 54.8909 2.5673 21.380 < 2e-16 ***\n#> bill_depth_mm -0.6349 0.1486 -4.273 2.53e-05 ***\n#> ---\n#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n#> \n#> Residual standard error: 5.332 on 331 degrees of freedom\n#> Multiple R-squared: 0.05227,\tAdjusted R-squared: 0.04941 \n#> F-statistic: 18.26 on 1 and 331 DF, p-value: 2.528e-05\n```\n:::\n:::\n\n\nTo get a tidy summary of the model parameter estimates, simply use the `tidy()` function from the [broom](https://broom.tidymodels.org/) package on the model fit. To extract model statistics, `glance()` can be used.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-12_3c6cda9ae2eea12f6c255b8bf2cd5061'}\n\n```{.r .cell-code}\ntidy(lm_fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 2 × 5\n#> term estimate std.error statistic p.value\n#> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 (Intercept) 54.9 2.57 21.4 2.54e-64\n#> 2 bill_depth_mm -0.635 0.149 -4.27 2.53e- 5\n```\n:::\n\n```{.r .cell-code}\nglance(lm_fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 1 × 12\n#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC\n#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 0.0523 0.0494 5.33 18.3 0.0000253 1 -1029. 2064. 2075.\n#> # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>\n```\n:::\n:::\n\n
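Because `tidy()` forwards extra arguments to the underlying `broom` method, we can also request confidence intervals for the coefficients. A small sketch (this assumes the `lm` engine used here, whose `broom` method accepts `conf.int` and `conf.level`):\n\n```r\n# hypothetical: add 95% confidence intervals to the coefficient table\ntidy(lm_fit, conf.int = TRUE, conf.level = 0.95)\n```\n\n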
Now, to make predictions, we simply use `predict()` on the parsnip model object. There, we must specify the dataset we want to predict on in the `new_data` argument. Note that this may be a different dataset than the one we used for fitting the model, but this input data must include all of the predictor variables that were used to fit the model.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-13_0370f690b7e270998042a729bdbf587f'}\n\n```{.r .cell-code}\npredict(lm_fit, new_data = penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 333 × 1\n#> .pred\n#> <dbl>\n#> 1 43.0\n#> 2 43.8\n#> 3 43.5\n#> 4 42.6\n#> 5 41.8\n#> 6 43.6\n#> # ℹ 327 more rows\n```\n:::\n:::\n\n\nFor parsnip models, the predictions are always returned in a tibble.\n\nTo specify the type of prediction made, modify the `type` argument. If we set `type = \"conf_int\"`, we get a 95% confidence interval.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-14_ef093136394a01ed5d663b2ea837c983'}\n\n```{.r .cell-code}\npredict(lm_fit, new_data = penguins, type = \"conf_int\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 333 × 2\n#> .pred_lower .pred_upper\n#> <dbl> <dbl>\n#> 1 42.3 43.7\n#> 2 43.3 44.4\n#> 3 42.8 44.1\n#> 4 41.8 43.5\n#> 5 40.7 43.0\n#> 6 43.0 44.2\n#> # ℹ 327 more rows\n```\n:::\n:::\n\n
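A confidence interval describes the uncertainty in the estimated mean response. If we instead want an interval for where a new observation is likely to fall, the `lm` engine also supports prediction intervals. A sketch (assuming `lm_fit` from above):\n\n```r\n# hypothetical: 95% prediction intervals for individual penguins\npredict(lm_fit, new_data = penguins, type = \"pred_int\")\n```\n\n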
To evaluate model predictive performance, it is logical to compare the observed and predicted values. To see these values side-by-side, we simply bind the two vectors of interest.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-15_e464c86a9e8050cfa6b8e3174a99d1ee'}\n\n```{.r .cell-code}\nbind_cols(\n predict(lm_fit, new_data = penguins),\n penguins\n) %>%\n select(bill_length_mm, .pred)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 333 × 2\n#> bill_length_mm .pred\n#> <dbl> <dbl>\n#> 1 39.1 43.0\n#> 2 39.5 43.8\n#> 3 40.3 43.5\n#> 4 36.7 42.6\n#> 5 39.3 41.8\n#> 6 38.9 43.6\n#> # ℹ 327 more rows\n```\n:::\n:::\n\n\nA simpler way to do this is to use the nifty `augment()` function.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-16_008a078253bb995f610dd7929c7c4874'}\n\n```{.r .cell-code}\naugment(lm_fit, new_data = penguins) %>%\n select(bill_length_mm, .pred)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 333 × 2\n#> bill_length_mm .pred\n#> <dbl> <dbl>\n#> 1 39.1 43.0\n#> 2 39.5 43.8\n#> 3 40.3 43.5\n#> 4 36.7 42.6\n#> 5 39.3 41.8\n#> 6 38.9 43.6\n#> # ℹ 327 more rows\n```\n:::\n:::\n\n\n## Multiple linear regression\n\nThe only difference between fitting a multiple linear regression model and a simple linear regression model lies in the formula. For multiple linear regression, the predictors are specified in the formula expression, separated by `+`. For example, if we have a response variable `y` and three predictors, `x1`, `x2`, and `x3`, we would write the formula as `y ~ x1 + x2 + x3`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-17_77a1ab952443165abb55d9b9fdae419e'}\n\n```{.r .cell-code}\nlm_fit2 <- lm_spec %>% fit(\n formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm + body_mass_g,\n data = penguins\n)\nlm_fit2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm + \n#> body_mass_g, data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm flipper_length_mm body_mass_g \n#> -2.571e+01 6.131e-01 2.872e-01 3.472e-04\n```\n:::\n:::\n\n\nEverything else proceeds much the same as before.
Such as obtaining parameter estimates\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-18_2f633c83ebc971b730b1364d8a51e402'}\n\n```{.r .cell-code}\ntidy(lm_fit2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 4 × 5\n#> term estimate std.error statistic p.value\n#> \n#> 1 (Intercept) -25.7 6.72 -3.83 1.55e- 4\n#> 2 bill_depth_mm 0.613 0.138 4.43 1.26e- 5\n#> 3 flipper_length_mm 0.287 0.0351 8.18 6.28e-15\n#> 4 body_mass_g 0.000347 0.000566 0.614 5.40e- 1\n```\n:::\n:::\n\n\nas well as predicting new values.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-19_bade4f9eb41832a1bfdb0efeae87f940'}\n\n```{.r .cell-code}\npredict(lm_fit2, new_data = penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> # A tibble: 333 × 1\n#> .pred\n#> \n#> 1 39.0\n#> 2 39.7\n#> 3 42.5\n#> 4 42.8\n#> 5 42.8\n#> 6 38.4\n#> # ℹ 327 more rows\n```\n:::\n:::\n\n\nIf you would like to use all variables aside from your response as predictors, a shortcut is to use the formula form `y ~ .`\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-20_4a1fb1bf15b09cf0cb3b73cb1f3a7c72'}\n\n```{.r .cell-code}\nlm_fit3 <- lm_spec %>% fit(bill_length_mm ~ ., data = penguins)\nlm_fit3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ ., data = data)\n#> \n#> Coefficients:\n#> (Intercept) speciesChinstrap speciesGentoo islandDream \n#> 15.343291 9.835502 6.117675 -0.503815 \n#> islandTorgersen bill_depth_mm flipper_length_mm body_mass_g \n#> -0.127431 0.300670 0.069257 0.001081 \n#> sexmale \n#> 2.047859\n```\n:::\n:::\n\n\n## Checking model assumptions\n\nAfter fitting a model, it is good to check whether the assumptions of linear regression are met. For this, we will use the `performance` package, in particular the `check_model()` function to produce several helpful plots we may use to check the assumptions for our first multiple linear regression model.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-21_9f0e50621ab07bb69b919f860e7a0ba9'}\n\n```{.r .cell-code}\nlm_fit2 %>%\n extract_fit_engine() %>%\n check_model()\n```\n\n::: {.cell-output-display}\n![](tidymodels-regression_files/figure-html/unnamed-chunk-21-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nNotice that on each plot it says what we should expect to see if the model assumption is met.\n\nWe shall now briefly walk you through what each plot means.\n\nThe first two plots help us to examine the linearity of the errors versus the fitted values. Ideally, we want this error to be relatively flat and horizontal. The third plot is for checking homogeneity of the variance, where we want the points to be roughly the same distance from the line as this indicates similar dispersion. The fourth plot helps us to see if there are high leverage points - points that have command or influence over the model fit. As a result, these can have a great effect on the model predictions. So the removal of such points or modifications to the model may be necessary to deal with them. The fifth plot helps us to discern collinearity, which is when predictors are highly correlated. 
Since independent variables should be independent, this can throw off simple regression models (in the standard error of coefficient estimates and the estimates themselves, which would likely be sensitive to changes in the predictors that are included in the model). The last plot enables us to check the normality of residuals. If the distribution of the model error is non-normal, then that suggests a linear model may not be appropriate. For a QQ plot, we want the points to fall along a straight diagonal line.\n\nFor our example, we observe that there's a pretty high correlation between `body_mass_g` and `flipper_length_mm` (not quite in the red-zone of 10 and above, but close enough for concern). That is indicative of multicollinearity between them. Intuitively, it makes sense for the body mass and flipper length variables - we'd expect that as one increases, so should the other.\n\nWe can take a closer look at the correlation by whipping up a correlation matrix using base R's `cor()` function. Since for collinearity we're usually only interested in the numerical predictors, we'll only include the four numeric variables.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-22_49987b72aff741a32f5dfc50558d4948'}\n\n```{.r .cell-code}\npenguins_corr <- penguins %>%\n select(body_mass_g, ends_with(\"_mm\")) %>%\n cor()\npenguins_corr\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> body_mass_g bill_length_mm bill_depth_mm flipper_length_mm\n#> body_mass_g 1.0000000 0.5894511 -0.4720157 0.8729789\n#> bill_length_mm 0.5894511 1.0000000 -0.2286256 0.6530956\n#> bill_depth_mm -0.4720157 -0.2286256 1.0000000 -0.5777917\n#> flipper_length_mm 0.8729789 0.6530956 -0.5777917 1.0000000\n```\n:::\n:::\n\n\nIndeed `body_mass_g` and `flipper_length_mm` are highly positively correlated. To deal with this problem, we'll re-fit the model without `body_mass_g`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-23_306a0e9cc19412e08433bd04830dd697'}\n\n```{.r .cell-code}\nlm_fit3 <- lm_spec %>% fit(\n formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm,\n data = penguins\n)\nlm_fit3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm, \n#> data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm flipper_length_mm \n#> -27.9762 0.6200 0.3052\n```\n:::\n:::\n\n\nand then check again to see whether the assumptions are met.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-24_6508d43288ce1be6d4caff814eec9b9f'}\n\n```{.r .cell-code}\nlm_fit3 %>%\n extract_fit_engine() %>%\n check_model()\n```\n\n::: {.cell-output-display}\n![](tidymodels-regression_files/figure-html/unnamed-chunk-24-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\nOverall, the plots look pretty good.
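If you prefer a numeric check over eyeballing the plot, the `{performance}` package also exposes the collinearity diagnostics directly. A small sketch (assuming the re-fitted `lm_fit3` from above):\n\n```r\n# hypothetical: variance inflation factors for the re-fitted model\nlm_fit3 %>%\n extract_fit_engine() %>%\n check_collinearity()\n```\n\n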
For details on how to interpret each of these plots and more details about model assumptions, please see [here](https://easystats.github.io/see/articles/performance.html) and [here](https://rdrr.io/cran/performance/man/check_model.html).\n\n## Interaction terms\n\nIn general, the syntax to add an interaction term to a formula is as follows:\n\n- `x:y` denotes an interaction term between `x` and `y`.\n- `x*y` denotes the interaction between `x` and `y` as well as each of `x` and `y` individually; that is, it expands to `x + y + x:y`.\n\nLet's start simple by adding an interaction term between `species` and `bill_depth_mm`, which allows for a species-specific slope.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-25_be09e39914328ca51bbe6fb2207c82b1'}\n\n```{.r .cell-code}\nlm_fit4 <- lm_spec %>% fit(\n formula = bill_length_mm ~ species * bill_depth_mm,\n data = penguins\n)\nlm_fit4\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ species * bill_depth_mm, \n#> data = data)\n#> \n#> Coefficients:\n#> (Intercept) speciesChinstrap \n#> 23.3668 -9.9389 \n#> speciesGentoo bill_depth_mm \n#> -6.6966 0.8425 \n#> speciesChinstrap:bill_depth_mm speciesGentoo:bill_depth_mm \n#> 1.0796 1.2178\n```\n:::\n:::\n\n\nIt is important to note that this formula-based syntax is not compatible with all engines. Thus, we'll now go over how to bypass this problem by adding an interaction term in a recipe.\n\nUsing `recipes`, the interaction term is specified by using `step_interact()`. Then we construct a workflow object, where we add the linear regression model specification and recipe. Finally, we fit the model as we did for a `parsnip` model. Note that the workflow object does not need the variables that were specified in the recipe to be specified again.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-26_5d6b5266d0b373c844a38e0a080436d0'}\n\n```{.r .cell-code}\nrec_spec_interact <- recipe(\n formula = bill_length_mm ~ species + bill_depth_mm,\n data = penguins\n) %>%\n step_interact(~ species:bill_depth_mm)\n\nlm_wf_interact <- workflow() %>%\n add_model(lm_spec) %>%\n add_recipe(rec_spec_interact)\n\nlm_wf_interact %>% fit(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Workflow [trained] ═══════════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: linear_reg()\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 1 Recipe Step\n#> \n#> • step_interact()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> \n#> Call:\n#> stats::lm(formula = ..y ~ ., data = data)\n#> \n#> Coefficients:\n#> (Intercept) speciesChinstrap \n#> 23.3668 -9.9389 \n#> speciesGentoo bill_depth_mm \n#> -6.6966 0.8425 \n#> speciesChinstrap_x_bill_depth_mm speciesGentoo_x_bill_depth_mm \n#> 1.0796 1.2178\n```\n:::\n:::\n\n\nNotice the variable name for the interaction term is not the same as it is in base R (which is simply of the form `x:y`). In `step_interact()`, the default separator between the variable names is `_x_`. You can change this default by specifying the `sep` argument in the function.\n\n
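For instance, to mimic base R's `x:y` naming, a sketch along these lines should work (`sep` is the only change relative to the recipe above):\n\n```r\n# hypothetical: use \":\" instead of the default \"_x_\" in the interaction name\nrec_spec_colon <- recipe(\n formula = bill_length_mm ~ species + bill_depth_mm,\n data = penguins\n) %>%\n step_interact(~ species:bill_depth_mm, sep = \":\")\n```\n\n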
To read more about formula syntax, see [?formula](https://rdrr.io/r/stats/formula.html).\n\n## Non-linear transformations of the predictors\n\nWhile in base R you can add a non-linear transformation to a formula like so,\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-27_d6f3ef00aa25dfe37695c80be53724c9'}\n\n```{.r .cell-code}\nlm_fit5 <- lm_spec %>%\n # let's square bill_depth_mm\n fit(bill_length_mm ~ bill_depth_mm + I(bill_depth_mm^2), data = penguins)\n\nlm_fit5\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> parsnip model object\n#> \n#> \n#> Call:\n#> stats::lm(formula = bill_length_mm ~ bill_depth_mm + I(bill_depth_mm^2), \n#> data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm I(bill_depth_mm^2) \n#> 95.2558 -5.4431 0.1413\n```\n:::\n:::\n\n\nyou cannot do this in a formula in a recipe. \n\nFortunately, we can deal with this in a similar way as we did when adding an interaction term to a recipe: we can add a non-linear transformation as a pre-processing step. The function used for this is `step_mutate()` (which acts like `dplyr`'s `mutate()`). Thus, we can perform the above square transformation in a recipe as follows:\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-28_c5cb928b34a8c3cc7f26b4a39e90f142'}\n\n```{.r .cell-code}\nrec_spec_pow2 <- recipe(bill_length_mm ~ bill_depth_mm, data = penguins) %>%\n step_mutate(bill_depth_mm2 = bill_depth_mm^2)\n\nlm_wf_pow2 <- workflow() %>%\n add_model(lm_spec) %>%\n add_recipe(rec_spec_pow2)\n\nlm_wf_pow2 %>% fit(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Workflow [trained] ═══════════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: linear_reg()\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 1 Recipe Step\n#> \n#> • step_mutate()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> \n#> Call:\n#> stats::lm(formula = ..y ~ ., data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm bill_depth_mm2 \n#> 95.2558 -5.4431 0.1413\n```\n:::\n:::\n\n
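If you'd rather not square the predictor by hand, `{recipes}` also ships a dedicated step for polynomial features. A sketch (note that `step_poly()` builds an orthogonal polynomial basis by default, so its coefficients will not match the raw-power fit above even though the fitted values agree):\n\n```r\n# hypothetical: a degree-2 polynomial basis for bill_depth_mm\nrec_spec_poly <- recipe(bill_length_mm ~ bill_depth_mm, data = penguins) %>%\n step_poly(bill_depth_mm, degree = 2)\n```\n\n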
There are many transformations already built into `recipes` such as `step_log()`. So, for basic transformations, there's often no need to make your own transformation from scratch. See [here](https://recipes.tidymodels.org/reference/#section-step-functions-individual-transformations) for a comprehensive list of the transformations that are offered in `recipes`.\n\n\n::: {.cell layout-align=\"center\" hash='tidymodels-regression_cache/html/unnamed-chunk-29_4d2689e4b3d012702e59b4977aebe950'}\n\n```{.r .cell-code}\nrec_spec_log <- recipe(bill_length_mm ~ bill_depth_mm, data = penguins) %>%\n step_log(bill_depth_mm) # transforms the var in-place, keeps its name\n\nlm_wf_log <- workflow() %>%\n add_model(lm_spec) %>%\n add_recipe(rec_spec_log)\n\nlm_wf_log %>% fit(penguins)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n#> ══ Workflow [trained] ═══════════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: linear_reg()\n#> \n#> ── Preprocessor ─────────────────────────────────────────────────────────────\n#> 1 Recipe Step\n#> \n#> • step_log()\n#> \n#> ── Model ────────────────────────────────────────────────────────────────────\n#> \n#> Call:\n#> stats::lm(formula = ..y ~ ., data = data)\n#> \n#> Coefficients:\n#> (Intercept) bill_depth_mm \n#> 74.95 -10.91\n```\n:::\n:::\n\n\n\\\n\\\n🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧\n\n## Attribution\n\nThis Chapter was largely adapted from [Chapter 3 of ISLR tidymodels labs](https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/03-linear-regression.html). Checking linear regression assumptions using the performance package is based on [this article](https://easystats.github.io/performance/reference/check_model.html) and [this blog post](https://www.r-bloggers.com/2021/07/easystats-quickly-investigate-model-performance/) on investigating model performance. The artwork used is by [Allison Horst](https://twitter.com/allison_horst).\n\n🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧\n",
    "supporting": [],
    "filters": [
      "rmarkdown/pagebreak.lua"
diff --git a/_freeze/tidymodels-regression/figure-html/unnamed-chunk-21-1.svg b/_freeze/tidymodels-regression/figure-html/unnamed-chunk-21-1.svg
index 903b3c5..d9ac0d5 100644
--- a/_freeze/tidymodels-regression/figure-html/unnamed-chunk-21-1.svg
+++ b/_freeze/tidymodels-regression/figure-html/unnamed-chunk-21-1.svg
[SVG hunks omitted: regenerated coordinates for the `check_model()` diagnostic figure of `lm_fit2`]
diff --git a/_freeze/tidymodels-regression/figure-html/unnamed-chunk-24-1.svg b/_freeze/tidymodels-regression/figure-html/unnamed-chunk-24-1.svg
index 1e852ef..bb0bd1b 100644
--- a/_freeze/tidymodels-regression/figure-html/unnamed-chunk-24-1.svg
+++ b/_freeze/tidymodels-regression/figure-html/unnamed-chunk-24-1.svg
[SVG hunks omitted: regenerated coordinates for the `check_model()` diagnostic figure of `lm_fit3`]
diff --git a/epipredict.qmd b/epipredict.qmd
index 6c6b9b5..5f0c9ed 100644
--- a/epipredict.qmd
+++ b/epipredict.qmd
@@ -13,7 +13,7 @@ Serving both populations is the main motivation for our efforts, but at the same
 ## Baseline models
 
 We provide a set of basic, easy-to-use forecasters that work out of the box. 
-You should be able to do a reasonably limited amount of customization on them. Any serious customization happens with the framework discussed below).
+You should be able to do a limited amount of customization on them. Any serious customization happens with the framework discussed below.
 
 For the basic forecasters, we provide: 
 
@@ -214,7 +214,7 @@ out_gb <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
 Or quantile regression, using our custom forecasting engine `quantile_reg()`:
 
 ```{r quantreg, warning = FALSE}
-out_gb <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
+out_qr <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
                          quantile_reg())
 ```
 
diff --git a/flatline-forecaster.qmd b/flatline-forecaster.qmd
index 4b76353..10c7ed8 100644
--- a/flatline-forecaster.qmd
+++ b/flatline-forecaster.qmd
@@ -1,6 +1,6 @@
 # Introducing the flatline forecaster
 
-The flatline forecaster is a very simple forecasting model intended for `epi_df` data, where the most recent observation is used as the forecast for any future date. In other words, the last observation is propagated forward. Hence, a flat line phenomenon is observed for the point predictions. The predictive intervals are produced from the quantiles of the residuals of such a forecast over all of the training data. By default, these intervals will be obtained separately for each combination of keys (`geo_value` and any additional keys) in the `epi_df`. Thus, the output is a data frame of point (and optionally interval) forecasts at a single unique horizon (`ahead`) for each unique combination of key variables. This forecaster is comparable to the baseline used by the [COVID Forecast Hub](https://covid19forecasthub.org).
+The flatline forecaster is a very simple forecasting model intended for `epi_df` data, where the most recent observation is used as the forecast for any future date. In other words, the last observation is propagated forward. Hence, a flat line phenomenon is observed for the point predictions. The prediction intervals are produced from the quantiles of the residuals of such a forecast over all of the training data.
By default, these intervals will be obtained separately for each combination of keys (`geo_value` and any additional keys) in the `epi_df`. Thus, the output is a data frame of point (and optionally interval) forecasts at a single unique horizon (`ahead`) for each unique combination of key variables. This forecaster is comparable to the baseline used by the [COVID Forecast Hub](https://covid19forecasthub.org).
 
 ## Example of using the flatline forecaster
 
@@ -27,19 +27,26 @@ jhu
 
 ### The basic mechanics of the flatline forecaster
 
-The simplest way to create and train a flatline forecaster to predict the d
-eath rate one week into the future, is to input the `epi_df` and the name of
-the column from it that we want to predict in the `flatline_forecaster` function.
+Suppose that our goal is to predict death rates one week ahead of the last date available for each state. Mathematically, on day $t$, we want to predict the death rate $y$ that is $h$ days ahead at each of many locations $j$. So for each location, we'll predict
+$$
+\hat{y}_{j, {t+h}} = y_{j, t}
+$$
+where $t$ is 2021-12-31, $h$ is 7 days, and $j$ is the state in our example.
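+
+Before reaching for the package, it may help to see how little machinery the point forecast needs. Here is a minimal `dplyr` sketch of the same idea (hypothetical code, not part of the chapter's pipeline; it assumes the `jhu` data from above):
+
+```{r}
+# flatline by hand: carry each state's last observed death_rate forward 7 days
+jhu %>%
+  group_by(geo_value) %>%
+  filter(time_value == max(time_value)) %>%
+  ungroup() %>%
+  mutate(target_date = time_value + 7) %>%
+  select(geo_value, forecast_date = time_value, target_date, .pred = death_rate)
+```
+
+What `flatline_forecaster()` adds on top of this is the prediction intervals and the bookkeeping described next.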
To look deeper into the pre-processing, -model and processing parts individually, you may use the `$` operator after `epi_workflow`. For example, let's examine the pre-processing part in more detail. +meaningful for epidemiological tasks. To look deeper into the pre-processing +or post-processing parts individually, you can use the `extract_preprocessor()` +or `extract_frosting()` functions on the `epi_workflow`, respectively. +For example, let’s examine the pre-processing part in more detail. ```{r} #| results: false @@ -94,12 +104,10 @@ extract_preprocessor(five_days_ahead$epi_workflow) extract_preprocessor(five_days_ahead$epi_workflow) ``` - Under Operations, we can see that the pre-processing operations were to lead the death rate by 5 days (`step_epi_ahead()`) and that the \# of recent observations used in the training window were not limited (in `step_training_window()` as -`n_training = Inf` in `flatline_args_list()`). You should also see the -molded/pre-processed training data. +`n_training = Inf` in `flatline_args_list()`). For symmetry, let's have a look at the post-processing. @@ -116,8 +124,7 @@ extract_frosting(five_days_ahead$epi_workflow) extract_frosting(five_days_ahead$epi_workflow) ``` - -The post-processing operations in the order the that were performed were to create the predictions and the predictive intervals, add the forecast and target dates and bound the predictions at zero. +The post-processing operations in the order the that were performed were to create the predictions and the prediction intervals, add the forecast and target dates and bound the predictions at zero. We can also easily examine the predictions themselves. @@ -125,9 +132,9 @@ We can also easily examine the predictions themselves. five_days_ahead$predictions ``` -The results above show a distributional forecast produced using data through the end of 2021 for the January 5, 2022. A prediction for the death rate per 100K inhabitants along with a 95% predictive interval is available for every state (`geo_value`). +The results above show a distributional forecast produced using data through the end of 2021 for the January 5, 2022. A prediction for the death rate per 100K inhabitants along with a 80% prediction interval is available for every state (`geo_value`). -The figure below displays the prediction and prediction interval for three sample states: Arizona, New York, and Florida. +The figure below displays the prediction and prediction interval for three sample states: Arizona, Florida, and New York. ```{r} #| fig-height: 5 @@ -158,7 +165,7 @@ ggplot(hist, aes(color = geo_value)) + The vertical black line is the forecast date. Here the forecast seems pretty reasonable based on the past observations shown. In cases where the recent past is highly predictive of the near future, a simple flatline forecast may be respectable, but in more complex situations where there is more uncertainty of what's to come, the flatline forecaster may be best relegated to being a baseline model and nothing more. -Take for example what happens when we consider a wider range of target dates. That is, we will now predict for several different horizons or `ahead` values - in our case, 5 to 25 days ahead, inclusive. Since the flatline forecaster function forecasts at a single unique `ahead` value, we can use the `map()` function from `purrr` to apply the forecaster to each ahead value we want to use. Then, we row bind the list of results. +Take for example what happens when we consider a wider range of target dates. 
That is, we will now predict for several different horizons or `ahead` values - in our case, 1 to 28 days ahead, inclusive. Since the flatline forecaster function forecasts at a single unique `ahead` value, we can use the `map()` function from `purrr` to apply the forecaster to each ahead value we want to use. And then we row bind the list of results. ```{r} out_df <- map(1:28, ~ flatline_forecaster( @@ -168,7 +175,7 @@ out_df <- map(1:28, ~ flatline_forecaster( list_rbind() ``` -Then, we proceed as we did before. The only difference from before is that we're using `out_df` where we had `five_days_ahead$predictions`. +Then we proceed as we did before. The only difference from before is that we're using `out_df` where we had `five_days_ahead$predictions`. ```{r} #| fig-height: 5 @@ -195,9 +202,9 @@ ggplot(hist) + theme(legend.position = "none") ``` -Now, you can really see the flat line trend in the predictions. And you may also observe that as we get further away from the forecast date, the more unnerving using a flatline prediction becomes. It feels increasingly unnatural. +Now you can really see the flat line trend in the predictions. And you may also observe that as we get further away from the forecast date, the more unnerving using a flatline prediction becomes. It feels increasingly unnatural. -So naturally the choice of forecaster relates to the time frame being considered. In general, using a flatline forecaster makes more sense for short-term forecasts than for long-term forecasts and for periods of great stability than in less stable times. Realistically, periods of great stability are rare. Moreover, in our model of choice we want to take into account more information about the past than just what happened at the most recent time point. So simple forecasters like the flatline forecaster don't cut it as actual contenders in many real-life situations. However, they are not useless, just used for a different purpose. A simple model is often used to compare a more complex model to, which is why you may have seen such a model used as a baseline in the [COVID Forecast Hub](https://covid19forecasthub.org). The following [blog post](https://delphi.cmu.edu/blog/2021/09/30/on-the-predictability-of-covid-19/#ensemble-forecast-performance) from Delphi explores the Hub's ensemble accuracy relative to such a baseline model. +So naturally the choice of forecaster relates to the time frame being considered. In general, using a flatline forecaster makes more sense for short-term forecasts than for long-term forecasts and for periods of great stability than in less stable times. Realistically, periods of great stability are rare. And moreover, in our model of choice we want to take into account more information about the past than just what happened at the most recent time point. So simple forecasters like the flatline forecaster don't cut it as actual contenders in many real-life situations. However, they are not useless, just used for a different purpose. A simple model is often used to compare a more complex model to, which is why you may have seen such a model used as a baseline in the [COVID Forecast Hub](https://covid19forecasthub.org). The following [blog post](https://delphi.cmu.edu/blog/2021/09/30/on-the-predictability-of-covid-19/#ensemble-forecast-performance) from Delphi explores the Hub's ensemble accuracy relative to such a baseline model. 
## What we've learned in a nutshell diff --git a/forecast-framework.qmd b/forecast-framework.qmd index 519d31d..92bb73c 100644 --- a/forecast-framework.qmd +++ b/forecast-framework.qmd @@ -6,7 +6,7 @@ source("_common.R") ``` Underneath the hood, the `arx_forecaster()` (and all our canned -forecasters) creates (and returns) an `epi_workflow`. +forecasters) creates and returns an `epi_workflow`. Essentially, this is a big S3 object that wraps up the 4 modular steps (preprocessing - postprocessing) described in the last chapter. @@ -17,7 +17,7 @@ Essentially, this is a big S3 object that wraps up the 4 modular steps Let's investigate how these interact with `{tidymodels}` and why it's important to think of forecasting this way. To have something to play with, we'll continue -to examine the data and an estimated canned corecaster. +to examine the data and an estimated canned forecaster. ```{r demo-workflow} diff --git a/tidymodels-intro.qmd b/tidymodels-intro.qmd index 24d9818..4498eb5 100644 --- a/tidymodels-intro.qmd +++ b/tidymodels-intro.qmd @@ -10,7 +10,7 @@ R contains a universe of packages that each have their own unique interfaces (fu If only there was a unifying interface available to help simplify and streamline the modelling process. This is the purpose of `tidymodels`, which provides a unified interface for modeling that abides by the [tidy philosphy](https://tidyverse.tidyverse.org/articles/paper.html#:~:text=Its%20primary%20goal%20is%20to,easier%20to%20learn%20the%20next) and that fits nicely into the tidyverse. From pre-processing to model training to prediction to validation, `tidymodels` provides the necessary tools to perform many modelling tasks. -It is important to understand that the `tidymodels` packages do not aim to implement the algorithms themseves, rather they provide the interface to bring together many disparate approaches under the same roof. And as a result of this, model fitting tasks are easier to carry out. In the grand scheme of things, here's where `tidymodels` tends to fit into a data analysis project. +It is important to understand that the `tidymodels` packages do not aim to implement the algorithms themselves, rather they provide the interface to bring together many disparate approaches under the same roof. And as a result of this, model fitting tasks are easier to carry out. In the grand scheme of things, here's where `tidymodels` tends to fit into a data analysis project. ```{r, echo = FALSE} knitr::include_graphics("img/tidymodels_packages.png") @@ -31,7 +31,7 @@ knitr::include_graphics("img/tidymodels_model_substeps.png") ## An example using the penguins dataset -We will now explore the `tidymodels` functions using the `penguins` dataset that we introduced and used in [Regression in Tidymodels](LINK%20TO%20VIGNETTE). +We will now explore the `tidymodels` functions using the `penguins` dataset that we introduced and used in [Regression in Tidymodels](tidymodels-regression.qmd#sec-slr-intro). 
### Load packages
 
@@ -43,7 +43,7 @@
 library(tidymodels)
 ```
 
 ### Simplify dataset
 
-To keep the focus on learning how to use `tidymodels`, we will work with a simplified version of the dataset in which we will only use the complete cases/rows in the `penguins` dataset
+To keep the focus on learning how to use `tidymodels`, we will only retain the complete cases/rows in the `penguins` dataset.
 
 ```{r}
 penguins <- penguins %>%
@@ -94,7 +94,7 @@ The main goal of this step is to use data transformations to make the data suita
 
 Before training the model, a recipe can be used to carry out the pre-processing required by the model.
 
-The `recipe()` has two main arguments: a formula (with the same format as when doing \[LINK TO VIGNETTE\]) and a data argument, which is usually the training set as that is the data used to create the model. Hence, we have `data = train_data` here.
+The `recipe()` has two main arguments: a formula (with the same format as when doing [regression in tidymodels](tidymodels-regression.qmd#sec-slr-intro)) and a data argument, which is usually the training set as that is the data used to create the model. Hence, we have `data = train_data` here.
 
 In our example, suppose that our goal is to predict penguin species from bill length, bill depth and flipper length, then our recipe function would look as follows:
 
@@ -112,22 +112,28 @@ Now, after saying that you are making a recipe by way of the `recipe()` function
 
 - `step_scale()` to normalize numeric data to have a standard deviation of one
 
-One of the advantages of having these pre-processing steps is that they help to simplify concepts that are difficult or a pain to enforce in coding. For example, centering could be a nuisance to implement from scratch because we would first have to calculate statistics (variable averages) from the training data and then use them on both the training and on the test data. Note that centering should not be done on the test data, rather on the training data to avoid data leakage (contamination of the test data by using statistics from the test data). In a recipe, the the estimation of the variable means using the training data and the application of these to center new data sets is done automatically, under the hood, and so spares the coder from having to manually implement it. The situation is similar for scaling numeric data (`step_scale()`).
+One of the advantages of having these pre-processing steps is that they help to simplify the process of applying training statistics to test data. For example, centering could be a nuisance to implement from scratch because we would first have to calculate statistics (variable averages) from the training data and then use them on both the training and test data.
 
-Another useful feature of the `tidymodels` pre-processing interface is that each step can be applied to one specified variable, a group of variables, or all variables. The `all_predictors()` and `all_outcomes()` functions are particularly convenient to help minimize the amount of typing you need to do. For instance, if you wanted to apply `step_center()` to only the predictor variables, simply type `step_center(all_predictors())` instead of listing out each and every predictor in the step function.
+:::{.callout-note}
+Note that centering should not be done on the test data, but rather on the training data to avoid data leakage - contamination of the test data by using statistics from the test data.
+:::
 
-Now, let's try this all out on our example.
+In a recipe, estimation of the variable means using the training data and the application of these to center new data is done automatically, under the hood, and so spares the coder from having to manually implement it. The situation is similar for scaling numeric data. + +Another feature of the `tidymodels` pre-processing interface is that each step can be applied to one specified variable, a group of variables, or all variables. The `all_predictors()` and `all_outcomes()` functions are particularly convenient to help minimize the amount of typing you need to do. For instance, if you want to apply `step_center()` to only the predictor variables, simply type `step_center(all_predictors())` instead of listing out each and every predictor in the step function. + +Now, let's try this out on our example. ```{r} penguins_recipe <- recipe(species ~ ., data = train_data) %>% step_corr(all_predictors()) %>% - step_center(all_predictors(), -all_outcomes()) %>% - step_scale(all_predictors(), -all_outcomes()) + step_center(all_predictors()) %>% + step_scale(all_predictors()) ``` To summarize, we obtained a recipe object, `penguins_recipe`, by putting the `recipe()` and step functions together on our training data that we had ready to go from sampling. -Now, to get the recipe details, simply call the recipe object. The operations section details what pre-processing steps we've applied to the data. Notice that the steps shown here are in the order that they were input into the recipe and they specify the variables used in each step. +Now, to get the recipe details, simply call the recipe object. The operations section details what pre-processing steps we've applied to the data. Notice that the steps shown here are in the order that they were input into the recipe and they each indicate the variables that were used. ```{r} penguins_recipe @@ -139,7 +145,7 @@ Recall that in R, the same type of model could be fit using several different pa `Tidymodels` created an single interface that supports the usage of both models. Moreover, this general interface supports an even wider range of functions that use perform random forest. The key part that points to the function and package to be used is the engine. -Let's see how this works in practice. In the below example, we'll use the general `rand_forest()` function from `tidymodels`. In there, we can specify the number of trees by using the `trees` argument. Then, in `set_engine()` we specify that we want to use ranger's version of random forest. Notice this follows the model specification format introduced in the \[Regression in Tidymodels\] chapter. +Let's see how this works in practice. In the below example, we'll use the general `rand_forest()` function from `tidymodels`. In there, we can specify the number of trees by using the `trees` argument. Then, in `set_engine()` we specify that we want to use ranger's version of random forest. Notice this follows the model specification format introduced in the [Regression in Tidymodels](tidymodels-regression.qmd#sec-slr-intro) chapter. ```{r} penguins_ranger <- rand_forest(trees = 100, mode = "classification") %>% @@ -153,17 +159,9 @@ penguins_rf <- rand_forest(trees = 100, mode = "classification") %>% set_engine("randomForest") ``` -For the remainder of this tutorial, we'll stick with using `ranger` for simplify. At this stage, we're ready to pre-process and model. The first task of those two is to apply our recipe before we train and test our model, in that we must - -1. Process the recipe using the training set. 
- -2. Use the recipe on the training set to get the finalized predictor set. - -3. Use the recipe on the predictor set to get the test set. - -A workflow can be used to pair model and processing tasks together. When different recipes are needed for different models, this is very useful so that you don't have to keep track of separate model and recipe objects in your workspace. Hence, training and testing different workflows becomes easier. +For the remainder of this tutorial, we'll stick with using `ranger` for simplicity. And at this stage, we're ready to pre-process and model. A workflow can be used to pair model and processing tasks together. When different recipes are needed for different models, this is very useful so that you don't have to keep track of separate model and recipe objects in your workspace. So training and testing different workflows becomes easier. -For our example, we'll try tidy model's workflows package to pair our model and our recipe together. +For our example, we'll employ the `workflows` package to pair our model and our recipe together. ```{r} penguins_wflow <- workflow() %>% @@ -173,7 +171,7 @@ penguins_wflow <- workflow() %>% penguins_wflow ``` -After that, we're ready to fit the model to our training data. The `fit()` function is what we will use to prepare the the recipe and train the model from the finalized predictors. +After that, we're ready to fit the model to our training data. The `fit()` function is what we will use to prepare the recipe and train the model from the finalized predictors. ```{r} penguins_fit <- penguins_wflow %>% fit(data = train_data) @@ -278,15 +276,12 @@ penguins_aug %>% The diagonal shows the counts of correct predictions for each species, while the off-diagonal shows the counts of model misclassifications. As the metrics have indicated, our model performed magnificently on the test set as there was only three misclassifications of Adelie penguins as Chinstrap. -## Concluding remarks -In this vignette, we introduced `tidymodels` and illustrated how to its packages work together by way of example. Since this was an elementary example, so use this as a starting point and explore what more can be done with this wonderful set of packages. And yet, however wonderful they are, you may have already noticed that there are limitations like the glaring lack of a set of post-processing tools to refine the results. We fill this gap for epidemiological modelling with [frosting](https://cmu-delphi.github.io/epipredict/reference/add_frosting.html). This will be formally introduced in a later chapter, so stay tuned!\ -\ -🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 +🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 ## Attribution This Chapter was largely adapted from [A Gentle Introduction to Tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/) as well as [Tidymodels - Getting Started](https://www.tidymodels.org/start/recipes/) and [Tidymodels](https://wec.wur.nl/dse/24-tidymodels.html). The diagrams are from [A Gentle Introduction to Tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/) and based on [R for Data Science](https://r4ds.had.co.nz/explore-intro.html). 
-🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 +🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 diff --git a/tidymodels-regression.qmd b/tidymodels-regression.qmd index f4b74f3..5bada78 100644 --- a/tidymodels-regression.qmd +++ b/tidymodels-regression.qmd @@ -15,7 +15,7 @@ library(broom) library(performance) ``` -## Simple linear regression +## Simple linear regression {#sec-slr-intro} The key steps to perform linear regression in `tidymodels` are to first specify the model type and then to specify the model form and the data to be used to construct it. @@ -174,7 +174,7 @@ Notice that on each plot it says what we should expect to see if the model assum We shall now briefly walk you through what each plot means. -The first two plots help us to examine the linearity of the errors versus the fitted values. Ideally, we want this error to be relatively flat and horizontal. The third plot is for checking homogeneity of the variance, where we want the points to be roughly the same distance from the line as this indicates similar dispersion. The fourth plot helps us to see if there are high leverage points - points that have command or influence over the model fit. As a result, these can have a great effect on the model predictions. So the removal of such points or modifications to the model may be necessary to deal with them. The fifth plot helps us to discern collinearity, which is when predictors are highly correlated. Since independent variables should be independent, this can throw off simple regression models (in standard error of coefficient estimates and the estimates themselves, which would likely be sensitive to changes in the predictors that are included in the model). The last plot enables us to check the normality of residuals. If the distribution of the model error is non-normal, then that suggests a linear model may not be appropriate. For a QQ plot, we want the points to fall along a straight diagonal line. +The first two plots help us to examine the linearity of the errors versus the fitted values. Ideally, we want this error to be relatively flat and horizontal. The third plot is for checking homogeneity of the variance, where we want the points to be roughly the same distance from the line as this indicates similar dispersion. The fourth plot helps us to see if there are high leverage points - points that have command or influence over the model fit. As a result, these can have a great effect on the model predictions. So the removal of such points or modifications to the model may be necessary to deal with them. The fifth plot helps us to discern collinearity, which is when predictors are highly correlated. Since independent variables should be independent, this can throw off simple regression models (in the standard error of coefficient estimates and the estimates themselves, which would likely be sensitive to changes in the predictors that are included in the model). The last plot enables us to check the normality of residuals. If the distribution of the model error is non-normal, then that suggests a linear model may not be appropriate. For a QQ plot, we want the points to fall along a straight diagonal line. For our example, we observe that there's a pretty high correlation between `body_mass_g` and `flipper_length_mm` (not quite in the red-zone of 10 and above, but close enough for concern). That is indicative of multicollinearity between them. 
Intuitively, it makes sense for the body mass and flipper length variables - we'd expect that as one increases, so should the other.
 
@@ -216,7 +216,7 @@ In general, the syntax to add an interaction term to a formula is as follows:
 
 - `x:y` denotes an interaction term between `x` and `y`.
 - `x*y` denotes the interaction between `x` and `y` as well as each of `x` and `y` individually; that is, it expands to `x + y + x:y`.
 
-It is important to note that this syntax is not compatible with all engines. Thus, we shall explain how to bypass this issue by adding an interaction term in a recipe later on. For now, let's start simple by adding an interaction term between `species` and `bill_length_mm`, which allows for a species-specific slope.
+Let's start simple by adding an interaction term between `species` and `bill_depth_mm`, which allows for a species-specific slope.
 
 ```{r}
 lm_fit4 <- lm_spec %>% fit(
@@ -226,7 +226,9 @@ lm_fit4 <- lm_spec %>% fit(
 lm_fit4
 ```
 
-Using recipes, the interaction term is specified by using `step_interact()`. Then we construct a workflow object, where we add the linear regression model specification and recipe. Finally, we fit the model as we did for a `parsnip` model. Note that the workflow object does not need the variables that were specified in the recipe to be specified again.
+It is important to note that this formula-based syntax is not compatible with all engines. Thus, we'll now go over how to bypass this problem by adding an interaction term in a recipe.
+
+Using `recipes`, the interaction term is specified by using `step_interact()`. Then we construct a workflow object, where we add the linear regression model specification and recipe. Finally, we fit the model as we did for a `parsnip` model. Note that the workflow object does not need the variables that were specified in the recipe to be specified again.
 
 ```{r}
 rec_spec_interact <- recipe(
@@ -248,9 +250,19 @@ To read more about formula syntax, see [?formula](https://rdrr.io/r/stats/formul
 
 ## Non-linear transformations of the predictors
 
-Similar to how we were able to add an interaction term using recipes, we can also perform a transformation as a pre-processing step. The function used for this is `step_mutate()` (which acts like `dplyr`'s `mutate`).
+While in base R, you can add a non-linear transformation to a formula like so,
+
+```{r}
+lm_fit5 <- lm_spec %>%
+  # let's square bill_depth_mm
+  fit(bill_length_mm ~ bill_depth_mm + I(bill_depth_mm^2), data = penguins)
+
+lm_fit5
+```
+
+you cannot do this in a formula in a recipe. 
 
-Note that, in general, if you are specifying a recipe aim to keep as much of the pre-processing in your recipe specification as possible. This helps to ensure that the transformation will be applied to new data consistently.
+Fortunately, we can deal with this in a similar way as we did to add an interaction term to a recipe. We can add a non-linear transformation as a pre-processing step. The function used for this is `step_mutate()` (which acts like `dplyr`'s `mutate`). Thus, we can perform the above square transformation in a recipe as follows:
 
 ```{r}
 rec_spec_pow2 <- recipe(bill_length_mm ~ bill_depth_mm, data = penguins) %>%
@@ -263,7 +275,7 @@ lm_wf_pow2 <- workflow() %>%
 
 lm_wf_pow2 %>% fit(penguins)
 ```
 
-There are many transformations already built into recipes such as `step_log()`. So, for basic transformations, there's often no need to make your own transformation from scratch.
See [here](https://recipes.tidymodels.org/reference/#section-step-functions-individual-transformations) for a comprehensive list of the transformations that are offered in recipes.
+There are many transformations already built into `recipes` such as `step_log()`. So, for basic transformations, there's often no need to make your own transformation from scratch. See [here](https://recipes.tidymodels.org/reference/#section-step-functions-individual-transformations) for a comprehensive list of the transformations that are offered in `recipes`.
 
 ```{r}
 rec_spec_log <- recipe(bill_length_mm ~ bill_depth_mm, data = penguins) %>%
   step_log(bill_depth_mm) # transforms the var in-place, keeps its name
 
 lm_wf_log <- workflow() %>%
   add_model(lm_spec) %>%
   add_recipe(rec_spec_log)
 
 lm_wf_log %>% fit(penguins)
 ```
 
@@ -283,6 +295,6 @@ lm_wf_log %>% fit(penguins)
 
 ## Attribution
 
-This Chapter was largely adapted from [Chapter 3 of ISLR tidymodels labs](https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/03-linear-regression.html). Checking linear regression assumptions using the performance package is based on [this article](https://easystats.github.io/performance/reference/check_model.html) and [this blog post](https://www.r-bloggers.com/2021/07/easystats-quickly-investigate-model-performance/) on investigating model performance. The artwork used is by [Allison Horst](https://twitter.com/allison_horst).[Allison Horst](https://twitter.com/allison_horst).
+This Chapter was largely adapted from [Chapter 3 of ISLR tidymodels labs](https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/03-linear-regression.html). Checking linear regression assumptions using the performance package is based on [this article](https://easystats.github.io/performance/reference/check_model.html) and [this blog post](https://www.r-bloggers.com/2021/07/easystats-quickly-investigate-model-performance/) on investigating model performance. The artwork used is by [Allison Horst](https://twitter.com/allison_horst).
 
🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧