From 462dc83c6717bef355138de0965c27bec474c8e6 Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Tue, 10 Jan 2023 17:05:56 -0500 Subject: [PATCH 1/2] add many-models --- .../learn/statistics/many-models/index.qmd | 546 ++++++++++++++++++ 1 file changed, 546 insertions(+) create mode 100644 content/learn/statistics/many-models/index.qmd diff --git a/content/learn/statistics/many-models/index.qmd b/content/learn/statistics/many-models/index.qmd new file mode 100644 index 00000000..cfb887a9 --- /dev/null +++ b/content/learn/statistics/many-models/index.qmd @@ -0,0 +1,546 @@ +--- +title: "Many Models" +categories: + - statistical analysis + - tidying results +type: learn-subsection +weight: 1 +description: | + Analyze the results of many linear regressions and learn about nested data frames along the way. +--- + +```{r} +#| label: "setup" +#| include: false +source(here::here("common.R")) +``` + +```{r} +#| label: "load" +#| include: false +library(tidymodels) +pkgs <- c("tidymodels", "gapminder", "stringr") +theme_set(theme_bw() + theme(legend.position = "top")) +``` + + +_These materials were in the first edition of [R for Data Science](https://r4ds.had.co.nz). They are not in later editions and are shown here -- with some modifications-- with permission_. `r article_req_pkgs(pkgs)` + +In this article, you're going to learn three powerful ideas that help you to work with large numbers of models with ease: + +1. Using many simple models to better understand complex datasets. + +1. Using list-columns to store arbitrary data structures in a data frame. + For example, this will allow you to have a column that contains linear + models. + +1. Using the __broom__ package, originally by David Robinson, to turn models into tidy + data. This is a powerful technique for working with large numbers of models. + +We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signals so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends. + +The following sections will dive into more detail about the individual techniques: + +1. In [list-columns], you'll learn more about the list-column data structure, + and why it's valid to put lists in data frames. + +1. In [creating list-columns], you'll learn the three main ways in which you'll + create list-columns. + +1. In [simplifying list-columns] you'll learn how to convert list-columns back + to regular atomic vectors (or sets of atomic vectors) so you can work + with them more easily. + +1. In [making tidy data with broom], you'll learn about the full set of tools + provided by broom, and see how they can be applied to other types of + data structure. + +First, let's load some packages: + +```{r} +library(tidymodels) +tidymodels_prefer() +``` + +## gapminder + +To motivate the power of many simple models, we're going to look into the "gapminder" data. This data was popularised by Hans Rosling, a Swedish doctor and statistician. If you've never heard of him, stop reading this chapter right now and go watch one of his videos! He is a fantastic data presenter and illustrates how you can use data to present a compelling story. A good place to start is this short video filmed in conjunction with the BBC: . 
The gapminder data summarises the progression of countries over time, looking at statistics like life expectancy and GDP. The data is easy to access in R, thanks to Jenny Bryan, who created the gapminder package:

```{r}
library(gapminder)
gapminder
```

In this case study, we're going to focus on just three variables to answer the question "How does life expectancy (`lifeExp`) change over time (`year`) for each country (`country`)?". A good place to start is with a plot:

```{r}
#| label: year-life-exp
gapminder %>%
  ggplot(aes(year, lifeExp, group = country)) +
  geom_line(alpha = 1/3)
```

This is a small dataset: it only has ~1,700 observations and 3 variables. But it's still hard to see what's going on! Overall, it looks like life expectancy has been steadily improving. However, if you look closely, you might notice some countries that don't follow this pattern. How can we make those countries easier to see?

One way is to use the same approach as in the last chapter: there's a strong signal (overall linear growth) that makes it hard to see subtler trends. We'll tease these factors apart by fitting a model with a linear trend. The model captures steady growth over time, and the residuals will show what's left.

You already know how to do that for a single country:

```{r}
#| label: year-life-exp-nz
#| out-width: 33%
#| fig-asp: 1.0
#| fig-width: 3.0
nz <- filter(gapminder, country == "New Zealand")
nz %>%
  ggplot(aes(year, lifeExp)) +
  geom_line() +
  ggtitle("Full data = ")

nz_mod <- lm(lifeExp ~ year, data = nz)
```

The broom package provides a general set of functions to turn models into tidy data. One function, `augment()`, supplements a data set with columns such as the predicted values, residuals, and so on. We'll use it to plot the fitted linear trend and the residuals that remain:

```{r}
#| label: year-life-exp-nz-resid
nz_mod_res <- augment(nz_mod)

nz_mod_res %>%
  ggplot(aes(year, .fitted)) +
  geom_line() +
  ggtitle("Linear trend + ")

nz_mod_res %>%
  ggplot(aes(year, .resid)) +
  geom_hline(yintercept = 0, colour = "white", linewidth = 3) +
  geom_line() +
  ggtitle("Remaining pattern")
```

There is a `data` argument to `augment()`; the default uses the same data points that the model was trained on.

How can we easily fit that model to every country?

### Nested data

You could imagine copying and pasting that code multiple times, but you've already learned a better way: extract the common code into a function and repeat it using a map function from purrr. This problem is structured a little differently to what you've seen before. Instead of repeating an action for each variable, we want to repeat an action for each country, a subset of rows. To do that, we need a new data structure: the __nested data frame__. To create a nested data frame, we start with a grouped data frame and "nest" it:

```{r}
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest()

by_country
```

(I'm cheating a little by grouping on both `continent` and `country`. Given `country`, `continent` is fixed, so this doesn't add any more groups, but it's an easy way to carry an extra variable along for the ride.)

This creates a data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames (or tibbles, to be precise). This seems like a crazy idea: we have a data frame with a column that is a list of other data frames!
I'll explain shortly why I think this is a good idea.

The `data` column is a little tricky to look at because it's a moderately complicated list, and we're still working on good tools to explore these objects. Unfortunately, using `str()` is not recommended, as it will often produce very long output. But if you pluck out a single element from the `data` column, you'll see that it contains all the data for that country (in this case, Afghanistan).

```{r}
by_country$data[[1]]
```

Note the difference between a standard grouped data frame and a nested data frame: in a grouped data frame, each row is an observation; in a nested data frame, each row is a group. Another way to think about a nested dataset is that we now have a meta-observation: a row that represents the complete time course for a country, rather than a single point in time.

### List-columns

Now that we have our nested data frame, we're in a good position to fit some models. We have a model-fitting function:

```{r}
country_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}
```

And we want to apply it to every data frame. The data frames are in a list, so we can use `purrr::map()` to apply `country_model` to each element:

```{r}
models <- map(by_country$data, country_model)
```

However, rather than leaving the list of models as a free-floating object, I think it's better to store it as a column in the `by_country` data frame. Storing related objects in columns is a key part of the value of data frames, and why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?

In other words, instead of creating a new object in the global environment, we're going to create a new variable in the `by_country` data frame. That's a job for `dplyr::mutate()`:

```{r}
by_country <- by_country %>%
  mutate(model = map(data, country_model))
by_country
```

This has a big advantage: because all the related objects are stored together, you don't need to manually keep them in sync when you filter or arrange. The semantics of the data frame takes care of that for you:

```{r}
by_country %>%
  filter(continent == "Europe")
by_country %>%
  arrange(continent, country)
```

If your list of data frames and list of models were separate objects, you would have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others in order to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!

### Unnesting

Previously we computed the residuals of a single model with a single dataset. Now we have 142 data frames and 142 models. To compute the residuals, we need to call `augment()` on each model:

```{r}
by_country <- by_country %>%
  mutate(
    results = map(model, augment)
  )
by_country
```

But how can you plot a list of data frames? Instead of struggling to answer that question, let's turn the list of data frames back into a regular data frame. Previously we used `nest()` to turn a regular data frame into a nested data frame, and now we do the opposite with `unnest()`:

```{r}
resids <- unnest(by_country, results)
resids
```

Note that each regular column is repeated once for each row of the nested tibble.
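
As a quick sanity check: `augment()` returns one row for every data point used to fit each model, so the unnested results should have the same number of rows as the original gapminder data.

```{r}
# The unnested results should line up with the original observations
nrow(resids)
nrow(gapminder)
```
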
Now that we have a regular data frame, we can plot the residuals:

```{r}
#| label: resid-col-country
resids %>%
  ggplot(aes(year, .resid)) +
  geom_line(aes(group = country), alpha = 1 / 3) +
  geom_smooth(se = FALSE)
```

Faceting by continent is particularly revealing:

```{r}
#| label: resid-col-country-facetting
resids %>%
  ggplot(aes(year, .resid, group = country)) +
  geom_line(alpha = 1 / 3) +
  facet_wrap(~continent)
```

It looks like we've missed some mild patterns. There's also something interesting going on in Africa: we see some very large residuals, which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.

### Model quality

Instead of looking at the residuals from the model, we could look at some general measurements of model quality. You learned how to compute some specific measures in the previous chapter. Here we'll show a different approach using the broom package: we'll use `broom::glance()` to extract some model quality metrics. If we apply it to a model, we get a data frame with a single row:

```{r}
broom::glance(nz_mod)
```

We can use `mutate()` and `unnest()` to create a data frame with a row for each country:

```{r}
by_country %>%
  mutate(glance = map(model, broom::glance)) %>%
  unnest(glance)
```

This isn't quite the output we want, because it still includes all the list-columns. This is the default behaviour when `unnest()` works on single-row data frames. To suppress these columns, we use `select()`:

```{r}
glance <- by_country %>%
  mutate(glance = map(model, broom::glance)) %>%
  unnest(glance) %>%
  select(-data, -model, -results)
glance
```

(Pay attention to the variables that aren't printed: there's a lot of useful stuff there.)

With this data frame in hand, we can start to look for models that don't fit well:

```{r}
glance %>%
  arrange(r.squared)
```

The worst models all appear to be in Africa. Let's double-check that with a plot. Here we have a relatively small number of observations and a discrete variable, so `geom_jitter()` is effective:

```{r}
glance %>%
  ggplot(aes(continent, r.squared)) +
  geom_jitter(width = 0.5)
```

We could pull out the countries with particularly bad $R^2$ and plot the data:

```{r}
#| label: bad-fit
bad_fit <- filter(glance, r.squared < 0.25)

gapminder %>%
  semi_join(bad_fit, by = "country") %>%
  ggplot(aes(year, lifeExp, colour = country)) +
  geom_line()
```

We see two main effects here: the tragedies of the HIV/AIDS epidemic and the Rwandan genocide.

## List-columns

Now that you've seen a basic workflow for managing many models, let's dive back into some of the details. In this section, we'll explore the list-column data structure in a little more detail. It's only recently that I've really appreciated the idea of the list-column. List-columns are implicit in the definition of the data frame: a data frame is a named list of equal-length vectors. A list is a vector, so it's always been legitimate to use a list as a column of a data frame. However, base R doesn't make it easy to create list-columns, and `data.frame()` treats a list as a list of columns:

```{r}
data.frame(x = list(1:3, 3:5))
```

You can prevent `data.frame()` from doing this with `I()`, but the result doesn't print particularly well:

```{r}
data.frame(
  x = I(list(1:3, 3:5)),
  y = c("1, 2", "3, 4, 5")
)
```

Tibble alleviates this problem by being lazier (`tibble()` doesn't modify its inputs) and by providing a better print method:

```{r}
tibble(
  x = list(1:3, 3:5),
  y = c("1, 2", "3, 4, 5")
)
```

It's even easier with `tribble()` as it can automatically work out that you need a list:

```{r}
tribble(
  ~x, ~y,
  1:3, "1, 2",
  3:5, "3, 4, 5"
)
```

List-columns are often most useful as an intermediate data structure. They're hard to work with directly, because most R functions work with atomic vectors or data frames, but the advantage of keeping related items together in a data frame is worth a little hassle.

Generally, there are three parts to an effective list-column pipeline:

1. You create the list-column using one of `nest()`, `summarise()` + `list()`,
   or `mutate()` + a map function, as described in [Creating list-columns].

1. You create other intermediate list-columns by transforming existing
   list-columns with `map()`, `map2()` or `pmap()`. For example,
   in the case study above, we created a list-column of models by transforming
   a list-column of data frames.

1. You simplify the list-column back down to a data frame or atomic vector,
   as described in [Simplifying list-columns].

## Creating list-columns

Typically, you won't create list-columns with `tibble()`. Instead, you'll create them from regular columns, using one of three methods:

1. With `tidyr::nest()` to convert a grouped data frame into a nested data
   frame where you have a list-column of data frames.

1. With `mutate()` and vectorised functions that return a list.

1. With `summarise()` and summary functions that return multiple results.

Alternatively, you might create them from a named list, using `tibble::enframe()`.

Generally, when creating list-columns, you should make sure they're homogeneous: each element should contain the same type of thing. There are no checks to make sure this is true, but if you use purrr and remember what you've learned about type-stable functions, you should find it happens naturally.

### With nesting

`nest()` creates a nested data frame, which is a data frame with a list-column of data frames. In a nested data frame, each row is a meta-observation: the other columns give variables that define the observation (like country and continent above), and the list-column of data frames gives the individual observations that make up the meta-observation.

There are two ways to use `nest()`. So far you've seen how to use it with a grouped data frame. When applied to a grouped data frame, `nest()` keeps the grouping columns as-is and bundles everything else into the list-column:

```{r}
gapminder %>%
  group_by(country, continent) %>%
  nest()
```

You can also use it on an ungrouped data frame, specifying which columns you want to nest:

```{r}
gapminder %>%
  nest(data = c(year:gdpPercap))
```

### From vectorised functions

Some useful functions take an atomic vector and return a list. For example, `stringr::str_split()` takes a character vector and returns a list of character vectors.
If you use that inside `mutate()`, you'll get a list-column:

```{r}
df <- tribble(
  ~x1,
  "a,b,c",
  "d,e,f,g"
)

df %>%
  mutate(x2 = stringr::str_split(x1, ","))
```

`unnest()` knows how to handle these lists of vectors:

```{r}
df %>%
  mutate(x2 = stringr::str_split(x1, ",")) %>%
  unnest(x2)
```

(If you find yourself using this pattern a lot, make sure to check out `tidyr::separate_rows()`, which is a wrapper around this common pattern.)

### From multivalued summaries

One restriction of `summarise()` is that it only works with summary functions that return a single value. That means that you can't use it with functions like `quantile()` that return a vector of arbitrary length:

```{r}
#| error: true
mtcars %>%
  group_by(cyl) %>%
  summarise(q = quantile(mpg), .groups = "drop")
```

You can, however, wrap the result in a list! This obeys the contract of `summarise()`, because each summary is now a list (a vector) of length 1.

```{r}
mtcars %>%
  group_by(cyl) %>%
  summarise(q = list(quantile(mpg)), .groups = "drop")
```

To make useful results with `unnest()`, you'll also need to capture the probabilities:

```{r}
probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)
mtcars %>%
  group_by(cyl) %>%
  summarise(p = list(probs), q = list(quantile(mpg, probs)), .groups = "drop") %>%
  unnest(c(p, q))
```

### From a named list

Data frames with list-columns provide a solution to a common problem: what do you do if you want to iterate over both the elements of a list and their names? Instead of trying to jam everything into one object, it's often easier to make a data frame: one column can contain the names, and one column can contain the elements. An easy way to create such a data frame from a list is `tibble::enframe()`.

```{r}
x <- list(
  a = 1:5,
  b = 3:4,
  c = 5:6
)

df <- enframe(x)
df
```

The advantage of this structure is that it generalises in a straightforward way: names are useful if you have a character vector of metadata, but don't help if you have other types of data, or multiple vectors.

Now if you want to iterate over names and values in parallel, you can use `map2()`:

```{r}
df %>%
  mutate(
    smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1]))
  )
```

## Simplifying list-columns

To apply the techniques of data manipulation and visualisation, you'll need to simplify the list-column back to a regular column (an atomic vector), or a set of columns. The technique you'll use to collapse back down to a simpler structure depends on whether you want a single value per element, or multiple values:

1. If you want a single value, use `mutate()` with `map_lgl()`,
   `map_int()`, `map_dbl()`, and `map_chr()` to create an atomic vector.

1. If you want many values, use `unnest()` to convert list-columns back
   to regular columns, repeating the rows as many times as necessary.

These are described in more detail below.

### List to vector

If you can reduce your list-column to an atomic vector, then it will be a regular column. For example, you can always summarise an object with its type and length, so this code will work regardless of what sort of list-column you have:

```{r}
df <- tribble(
  ~x,
  letters[1:5],
  1:3,
  runif(5)
)

df %>% mutate(
  type = map_chr(x, typeof),
  length = map_int(x, length)
)
```

This is the same basic information that you get from the default tbl print method, but now you can use it for filtering.
This is a useful technique if you have a heterogeneous list, and want to filter out the parts that aren't working for you.

Don't forget about the `map_*()` shortcuts: you can use `map_chr(x, "apple")` to extract the string stored in `apple` for each element of `x`. This is useful for pulling apart nested lists into regular columns. Use the `.default` argument to provide a value to use if the element is missing (instead of returning `NULL`):

```{r}
df <- tribble(
  ~x,
  list(a = 1, b = 2),
  list(a = 2, c = 4)
)
df %>% mutate(
  a = map_dbl(x, "a"),
  b = map_dbl(x, "b", .default = NA_real_)
)
```

### Unnesting

`unnest()` works by repeating the regular columns once for each element of the list-column. For example, in the following very simple example, we repeat the first row 4 times (because the first element of `y` has length four) and the second row once:

```{r}
tibble(x = 1:2, y = list(1:4, 1)) %>% unnest(y)
```

This means that you can't simultaneously unnest two columns that contain a different number of elements:

```{r}
#| error: true
# Ok, because y and z have the same number of elements in
# every row
df1 <- tribble(
  ~x, ~y, ~z,
  1, c("a", "b"), 1:2,
  2, "c", 3
)
df1
df1 %>% unnest(c(y, z))

# Doesn't work because y and z have a different number of elements
df2 <- tribble(
  ~x, ~y, ~z,
  1, "a", 1:2,
  2, c("b", "c"), 3
)
df2
df2 %>% unnest(c(y, z))
```

The same principle applies when unnesting list-columns of data frames. You can unnest multiple list-columns as long as all the data frames in each row have the same number of rows.


## Making tidy data with broom

The broom package provides three general tools for turning models into tidy data frames:

1. `broom::glance(model)` returns a row for each model. Each column gives a
   model summary: either a measure of model quality or complexity, or a
   combination of the two.

1. `broom::tidy(model)` returns a row for each coefficient in the model. Each
   column gives information about the estimate or its variability.

1. `broom::augment(model, data)` returns a row for each row in `data`, adding
   extra values like residuals and influence statistics.
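
We used `glance()` and `augment()` in the case study above; `tidy()` follows the same pattern. For example, one way to get the per-country intercepts and slopes is to map `tidy()` over the models and unnest the result (here the new list-column is called `coefs`), just as we did with `glance()`:

```{r}
# One row per coefficient per country, using the same mutate() + unnest() pattern
by_country %>%
  mutate(coefs = map(model, broom::tidy)) %>%
  unnest(coefs) %>%
  select(-data, -model, -results)
```
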
From 9722667d57d99a6d6241f8773d9265b34a6e46b9 Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Tue, 10 Jan 2023 17:06:04 -0500 Subject: [PATCH 2/2] re-render --- .../parsnip/index/execute-results/html.json | 2 +- .../recipes/index/execute-results/html.json | 2 +- .../index/execute-results/html.json | 2 +- .../recipes/index/execute-results/html.json | 4 +- .../index/execute-results/html.json | 2 +- .../index/execute-results/html.json | 14 + .../statistics/many-models/figs/bad-fit-1.svg | 105 ++ .../many-models/figs/resid-col-country-1.svg | 217 +++ .../figs/resid-col-country-facetting-1.svg | 431 ++++++ .../many-models/figs/year-life-exp-1.svg | 216 +++ .../many-models/figs/year-life-exp-nz-1.svg | 80 ++ .../figs/year-life-exp-nz-resid-1.svg | 81 ++ .../figs/year-life-exp-nz-resid-2.svg | 85 ++ docs/content/find/index.html | 14 +- docs/content/find/parsnip/index.html | 22 +- docs/content/find/recipes/index.html | 20 +- .../learn/develop/parameters/index.html | 36 +- docs/content/learn/develop/recipes/index.html | 58 +- docs/content/learn/index.html | 83 +- .../models/parsnip-ranger-glmnet/index.html | 38 +- .../statistics/many-models/figs/bad-fit-1.svg | 105 ++ .../many-models/figs/resid-col-country-1.svg | 217 +++ .../figs/resid-col-country-facetting-1.svg | 431 ++++++ .../many-models/figs/unnamed-chunk-16-1.svg | 218 +++ .../many-models/figs/unnamed-chunk-17-1.svg | 436 ++++++ .../many-models/figs/unnamed-chunk-22-1.svg | 215 +++ .../many-models/figs/unnamed-chunk-23-1.svg | 106 ++ .../many-models/figs/unnamed-chunk-5-1.svg | 217 +++ .../many-models/figs/unnamed-chunk-6-1.svg | 81 ++ .../many-models/figs/unnamed-chunk-7-1.svg | 82 ++ .../many-models/figs/unnamed-chunk-7-2.svg | 86 ++ .../many-models/figs/year-life-exp-1.svg | 216 +++ .../many-models/figs/year-life-exp-nz-1.svg | 80 ++ .../figs/year-life-exp-nz-resid-1.svg | 81 ++ .../figs/year-life-exp-nz-resid-2.svg | 85 ++ .../learn/statistics/many-models/index.html | 1163 +++++++++++++++++ docs/listings.json | 1 + docs/search.json | 91 +- docs/site_libs/bootstrap/bootstrap.min.css | 4 +- 39 files changed, 5298 insertions(+), 129 deletions(-) create mode 100644 _freeze/content/learn/statistics/many-models/index/execute-results/html.json create mode 100644 content/learn/statistics/many-models/figs/bad-fit-1.svg create mode 100644 content/learn/statistics/many-models/figs/resid-col-country-1.svg create mode 100644 content/learn/statistics/many-models/figs/resid-col-country-facetting-1.svg create mode 100644 content/learn/statistics/many-models/figs/year-life-exp-1.svg create mode 100644 content/learn/statistics/many-models/figs/year-life-exp-nz-1.svg create mode 100644 content/learn/statistics/many-models/figs/year-life-exp-nz-resid-1.svg create mode 100644 content/learn/statistics/many-models/figs/year-life-exp-nz-resid-2.svg create mode 100644 docs/content/learn/statistics/many-models/figs/bad-fit-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/resid-col-country-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/resid-col-country-facetting-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/unnamed-chunk-16-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/unnamed-chunk-17-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/unnamed-chunk-22-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/unnamed-chunk-23-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/unnamed-chunk-5-1.svg create mode 
100644 docs/content/learn/statistics/many-models/figs/unnamed-chunk-6-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/unnamed-chunk-7-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/unnamed-chunk-7-2.svg create mode 100644 docs/content/learn/statistics/many-models/figs/year-life-exp-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/year-life-exp-nz-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/year-life-exp-nz-resid-1.svg create mode 100644 docs/content/learn/statistics/many-models/figs/year-life-exp-nz-resid-2.svg create mode 100644 docs/content/learn/statistics/many-models/index.html diff --git a/_freeze/content/find/parsnip/index/execute-results/html.json b/_freeze/content/find/parsnip/index/execute-results/html.json index 8718c532..cd872254 100644 --- a/_freeze/content/find/parsnip/index/execute-results/html.json +++ b/_freeze/content/find/parsnip/index/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "90dba81c1c48d850e12a2f4bae0f8d64", "result": { - "markdown": "---\ntitle: Search parsnip models\nweight: 2\ndescription: | \n Find model types, engines, and arguments to fit and predict in the tidymodels framework.\n---\n\n\nTo learn about the parsnip package, see [*Get Started: Build a Model*](/start/models/). Use the tables below to find [model types and engines](#models) and to explore [model arguments](#model-args).\n\n\n\n\n\n\n\n\n\n## Explore models {#models}\n\n\n\n\n::: {.cell-output-display}\n```{=html}\n
\n\n```\n:::\n\n\n
\n\nModels can be added by the user too. The article [How to build a parsnip model](/learn/develop/models/) walks you through the steps.\n\n## Explore model arguments {#model-args}\n\nThe parsnip package provides consistent interface for working with similar models across different engines. This means that parsnip adopts standardized parameter names as arguments, and those names may be different from those used by the individual engines. The searchable table below provides a mapping between the parsnip and the engine arguments: \n\n
\n\n\n::: {.cell-output-display}\n```{=html}\n
\n\n```\n:::\n", + "markdown": "---\ntitle: Search parsnip models\nweight: 2\ndescription: | \n Find model types, engines, and arguments to fit and predict in the tidymodels framework.\n---\n\n\nTo learn about the parsnip package, see [*Get Started: Build a Model*](/start/models/). Use the tables below to find [model types and engines](#models) and to explore [model arguments](#model-args).\n\n\n\n\n\n\n\n\n\n## Explore models {#models}\n\n\n\n\n::: {.cell-output-display}\n```{=html}\n
\n\n```\n:::\n\n\n
\n\nModels can be added by the user too. The article [How to build a parsnip model](/learn/develop/models/) walks you through the steps.\n\n## Explore model arguments {#model-args}\n\nThe parsnip package provides consistent interface for working with similar models across different engines. This means that parsnip adopts standardized parameter names as arguments, and those names may be different from those used by the individual engines. The searchable table below provides a mapping between the parsnip and the engine arguments: \n\n
\n\n\n::: {.cell-output-display}\n```{=html}\n
\n\n```\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/content/find/recipes/index/execute-results/html.json b/_freeze/content/find/recipes/index/execute-results/html.json index c19dd1e2..cf7a1844 100644 --- a/_freeze/content/find/recipes/index/execute-results/html.json +++ b/_freeze/content/find/recipes/index/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "14716ad5ffee97d2204641313d0efd02", "result": { - "markdown": "---\nsubtitle: Recipes\ntitle: Search recipe steps\nweight: 3\ndescription: | \n Find recipe steps in the tidymodels framework to help you prep your data for modeling.\n---\n\n\n\n\n\n\nTo learn about the recipes package, see [*Get Started: Preprocess your data with recipes*](/start/recipes/). The table below allows you to search for recipe steps across tidymodels packages.\n\n\n\n\n::: {.cell-output-display}\n```{=html}\n
\n\n```\n:::\n", + "markdown": "---\nsubtitle: Recipes\ntitle: Search recipe steps\nweight: 3\ndescription: | \n Find recipe steps in the tidymodels framework to help you prep your data for modeling.\n---\n\n\n\n\n\n\nTo learn about the recipes package, see [*Get Started: Preprocess your data with recipes*](/start/recipes/). The table below allows you to search for recipe steps across tidymodels packages.\n\n\n\n\n::: {.cell-output-display}\n```{=html}\n
\n\n```\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/content/learn/develop/parameters/index/execute-results/html.json b/_freeze/content/learn/develop/parameters/index/execute-results/html.json index 83e91770..61fe1eb8 100644 --- a/_freeze/content/learn/develop/parameters/index/execute-results/html.json +++ b/_freeze/content/learn/develop/parameters/index/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "b25235885fee717b9b516645517237ff", "result": { - "markdown": "---\ntitle: \"How to create a tuning parameter function\"\ncategories:\n - developer tools\ntype: learn-subsection\nweight: 4\ndescription: | \n Build functions to use in tuning both quantitative and qualitative parameters.\n---\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: dials and scales.\n\nSome models and recipe steps contain parameters that dials does not know about. You can construct new quantitative and qualitative parameters using `new_quant_param()` or `new_qual_param()`, respectively. This article is a guide to creating new parameters.\n\n## Quantitative parameters\n\nAs an example, let's consider the multivariate adaptive regression spline ([MARS](https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline)) model, which creates nonlinear features from predictors and adds them to a linear regression models. The earth package is an excellent implementation of this method.\n\nMARS creates an initial set of features and then prunes them back to an appropriate size. This can be done automatically by `earth::earth()` or the number of final terms can be set by the user. The parsnip function `mars()` has a parameter called `num_terms` that defines this.\n\nWhat if we want to create a parameter for the number of *initial terms* included in the model. There is no argument in `parsnip::mars()` for this but we will make one now. The argument name in `earth::earth()` is `nk`, which is not very descriptive. Our parameter will be called `num_initial_terms`.\n\nWe use the `new_quant_param()` function since this is a numeric parameter. The main two arguments to a numeric parameter function are `range` and `trans`.\n\nThe `range` specifies the possible values of the parameter. For our example, a minimal value might be one or two. What is the upper limit? The default in the earth package is\n\n\n::: {.cell layout-align=\"center\" hash='cache/eart_0aaa451856e86c8fdc7e0c3f099c8de4'}\n\n```{.r .cell-code}\nmin(200, max(20, 2 * ncol(x))) + 1\n```\n:::\n\n\nwhere `x` is the predictor matrix. We often put in values that are either sensible defaults or are minimal enough to work for the majority of data sets. For now, let's specify an upper limit of 10 but this will be discussed more in the next section.\n\nThe other argument is `trans`, which represents a transformation that should be applied to the parameter values when working with them. For example, many regularization methods have a `penalty` parameter that tends to range between zero and some upper bound (let's say 1). The effect of going from a penalty value of 0.01 to 0.1 is much more impactful than going from 0.9 to 1.0. In such a case, it might make sense to work with this parameter in transformed units (such as the log, in this example). 
If new parameter values are generated at random, it helps if they are uniformly simulated in the transformed units and then converted back to the original units.\n\nThe `trans` parameter accepts a transformation object from the scales package. For example:\n\n\n::: {.cell layout-align=\"center\" hash='cache/scales_b25ae3bb346dde06d2a7e463ba1f4c4d'}\n\n```{.r .cell-code}\nlibrary(scales)\nlsf.str(\"package:scales\", pattern = \"_trans$\")\n#> asn_trans : function () \n#> atanh_trans : function () \n#> boxcox_trans : function (p, offset = 0) \n#> compose_trans : function (...) \n#> date_trans : function () \n#> exp_trans : function (base = exp(1)) \n#> hms_trans : function () \n#> identity_trans : function () \n#> log_trans : function (base = exp(1)) \n#> log10_trans : function () \n#> log1p_trans : function () \n#> log2_trans : function () \n#> logit_trans : function () \n#> modulus_trans : function (p, offset = 1) \n#> probability_trans : function (distribution, ...) \n#> probit_trans : function () \n#> pseudo_log_trans : function (sigma = 1, base = exp(1)) \n#> reciprocal_trans : function () \n#> reverse_trans : function () \n#> sqrt_trans : function () \n#> time_trans : function (tz = NULL) \n#> yj_trans : function (p)\nscales::log10_trans()\n#> Transformer: log-10 [1e-100, Inf]\n```\n:::\n\n\nA value of `NULL` means that no transformation should be used.\n\nA quantitative parameter function should have these two arguments and, in the function body, a call `new_quant_param()`. There are a few arguments to this function:\n\n\n::: {.cell layout-align=\"center\" hash='cache/new_quant_param_dc898a8030c6fa2847b68be9c2db5701'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nargs(new_quant_param)\n#> function (type = c(\"double\", \"integer\"), range = NULL, inclusive = NULL, \n#> default = deprecated(), trans = NULL, values = NULL, label = NULL, \n#> finalize = NULL) \n#> NULL\n```\n:::\n\n\n- Possible types are double precision and integers. The value of `type` should agree with the values of `range` in the function definition.\n\n- It's OK for our tuning to include the minimum or maximum, so we'll use `c(TRUE, TRUE)` for `inclusive`. If the value cannot include one end of the range, set one or both of these values to `FALSE`.\n\n- The `label` should be a named character string where the name is the parameter name and the value represents what will be printed automatically.\n\n- `finalize` is an argument that can set parts of the range. This is discussed more below.\n\nHere's an example of a basic quantitative parameter object:\n\n\n::: {.cell layout-align=\"center\" hash='cache/num-initial-terms_d602318800beb1ef90a7bde3e6959438'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, 10L), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = NULL\n )\n}\n\nnum_initial_terms()\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 10]\n\n# Sample from the parameter:\nset.seed(4832856)\nnum_initial_terms() %>% value_sample(5)\n#> [1] 6 4 9 10 4\n```\n:::\n\n\n### Finalizing parameters\n\nIt might be the case that the range of the parameter is unknown. For example, parameters that are related to the number of columns in a data set cannot be exactly specified in the absence of data. In those cases, a placeholder of `unknown()` can be added. This will force the user to \"finalize\" the parameter object for their particular data set. 
Let's redefine our function with an `unknown()` value:\n\n\n::: {.cell layout-align=\"center\" hash='cache/num-initial-terms-unk_9de7d72b673760c5098403e4f395b8d8'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, unknown()), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = NULL\n )\n}\nnum_initial_terms()\n\n# Can we sample? \nnum_initial_terms() %>% value_sample(5)\n```\n:::\n\n\nThe `finalize` argument of `num_initial_terms()` can take a function that uses data to set the range. For example, the package already includes a few functions for finalization:\n\n\n::: {.cell layout-align=\"center\" hash='cache/dials-final-funcs_13c8f3f4f0f277ecb3c76d762cb7a32c'}\n\n```{.r .cell-code}\nlsf.str(\"package:dials\", pattern = \"^get_\")\n#> get_batch_sizes : function (object, x, frac = c(1/10, 1/3), ...) \n#> get_log_p : function (object, x, ...) \n#> get_n : function (object, x, log_vals = FALSE, ...) \n#> get_n_frac : function (object, x, log_vals = FALSE, frac = 1/3, ...) \n#> get_n_frac_range : function (object, x, log_vals = FALSE, frac = c(1/10, 5/10), ...) \n#> get_p : function (object, x, log_vals = FALSE, ...) \n#> get_rbf_range : function (object, x, seed = sample.int(10^5, 1), ...)\n```\n:::\n\n\nThese functions generally take a data frame of predictors (in an argument called `x`) and add the range of the parameter object. Using the formula in the earth package, we might use:\n\n\n::: {.cell layout-align=\"center\" hash='cache/earth-range_e1c9bf6b8f535f761d22d3b738ea8bb2'}\n\n```{.r .cell-code}\nget_initial_mars_terms <- function(object, x) {\n upper_bound <- min(200, max(20, 2 * ncol(x))) + 1\n upper_bound <- as.integer(upper_bound)\n bounds <- range_get(object)\n bounds$upper <- upper_bound\n range_set(object, bounds)\n}\n\n# Use the mtcars are the finalize the upper bound: \nnum_initial_terms() %>% get_initial_mars_terms(x = mtcars[, -1])\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 21]\n```\n:::\n\n\nOnce we add this function to the object, the general `finalize()` method can be used:\n\n\n::: {.cell layout-align=\"center\" hash='cache/final-obj_9b8361428190d870490441c3eecf012e'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, unknown()), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = get_initial_mars_terms\n )\n}\n\nnum_initial_terms() %>% finalize(x = mtcars[, -1])\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 21]\n```\n:::\n\n\n## Qualitative parameters\n\nNow let's look at an example of a qualitative parameter. If a model includes a data aggregation step, we want to allow users to tune how our parameters are aggregated. For example, in embedding methods, possible values might be `min`, `max`, `mean`, `sum`, or to not aggregate at all (\"none\"). Since these cannot be put on a numeric scale, they are possible values of a qualitative parameter. We'll take \"character\" input (not \"logical\"), and we must specify the allowed values. 
By default we won't aggregate.\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation_39f71033809bc19015c698dbeca6311d'}\n\n```{.r .cell-code}\naggregation <- function(values = c(\"none\", \"min\", \"max\", \"mean\", \"sum\")) {\n new_qual_param(\n type = \"character\",\n values = values,\n # By default, the first value is selected as default. We'll specify that to\n # make it clear.\n default = \"none\",\n label = c(aggregation = \"Aggregation Method\")\n )\n}\n```\n:::\n\n\nWithin the dials package, the convention is to have the values contained in a separate vector whose name starts with `values_`. For example:\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation-vec_7dcc9f145f63cccd19b24d9fa23f4d10'}\n\n```{.r .cell-code}\nvalues_aggregation <- c(\"none\", \"min\", \"max\", \"mean\", \"sum\")\naggregation <- function(values = values_aggregation) {\n new_qual_param(\n type = \"character\",\n values = values,\n # By default, the first value is selected as default. We'll specify that to\n # make it clear.\n default = \"none\",\n label = c(aggregation = \"Aggregation Method\")\n )\n}\n```\n:::\n\n\nThis step may not make sense if you are using the function in a script and not keeping it within a package.\n\nWe can use our `aggregation` parameters with dials functions.\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation-use_9ac92377d564310ae312bf63617fedad'}\n\n```{.r .cell-code}\naggregation()\n#> Warning: The `default` argument of `new_qual_param()` is deprecated as of\n#> dials 1.0.1.\n#> Aggregation Method (qualitative)\n#> 5 possible values include:\n#> 'none', 'min', 'max', 'mean' and 'sum'\naggregation() %>% value_sample(3)\n#> [1] \"min\" \"sum\" \"mean\"\n```\n:::\n\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Monterey 12.6.1\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-05\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.2.0)\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 1.0.0 2022-12-20 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3 2022-11-09 [1] CRAN (R 4.2.0)\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.2.0)\n#> scales * 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-09 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-12-27 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n", + "markdown": "---\ntitle: \"How to create a tuning parameter function\"\ncategories:\n - developer tools\ntype: learn-subsection\nweight: 
4\ndescription: | \n Build functions to use in tuning both quantitative and qualitative parameters.\n---\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: dials and scales.\n\nSome models and recipe steps contain parameters that dials does not know about. You can construct new quantitative and qualitative parameters using `new_quant_param()` or `new_qual_param()`, respectively. This article is a guide to creating new parameters.\n\n## Quantitative parameters\n\nAs an example, let's consider the multivariate adaptive regression spline ([MARS](https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline)) model, which creates nonlinear features from predictors and adds them to a linear regression models. The earth package is an excellent implementation of this method.\n\nMARS creates an initial set of features and then prunes them back to an appropriate size. This can be done automatically by `earth::earth()` or the number of final terms can be set by the user. The parsnip function `mars()` has a parameter called `num_terms` that defines this.\n\nWhat if we want to create a parameter for the number of *initial terms* included in the model. There is no argument in `parsnip::mars()` for this but we will make one now. The argument name in `earth::earth()` is `nk`, which is not very descriptive. Our parameter will be called `num_initial_terms`.\n\nWe use the `new_quant_param()` function since this is a numeric parameter. The main two arguments to a numeric parameter function are `range` and `trans`.\n\nThe `range` specifies the possible values of the parameter. For our example, a minimal value might be one or two. What is the upper limit? The default in the earth package is\n\n\n::: {.cell layout-align=\"center\" hash='cache/eart_0aaa451856e86c8fdc7e0c3f099c8de4'}\n\n```{.r .cell-code}\nmin(200, max(20, 2 * ncol(x))) + 1\n```\n:::\n\n\nwhere `x` is the predictor matrix. We often put in values that are either sensible defaults or are minimal enough to work for the majority of data sets. For now, let's specify an upper limit of 10 but this will be discussed more in the next section.\n\nThe other argument is `trans`, which represents a transformation that should be applied to the parameter values when working with them. For example, many regularization methods have a `penalty` parameter that tends to range between zero and some upper bound (let's say 1). The effect of going from a penalty value of 0.01 to 0.1 is much more impactful than going from 0.9 to 1.0. In such a case, it might make sense to work with this parameter in transformed units (such as the log, in this example). If new parameter values are generated at random, it helps if they are uniformly simulated in the transformed units and then converted back to the original units.\n\nThe `trans` parameter accepts a transformation object from the scales package. For example:\n\n\n::: {.cell layout-align=\"center\" hash='cache/scales_b25ae3bb346dde06d2a7e463ba1f4c4d'}\n\n```{.r .cell-code}\nlibrary(scales)\nlsf.str(\"package:scales\", pattern = \"_trans$\")\n#> asn_trans : function () \n#> atanh_trans : function () \n#> boxcox_trans : function (p, offset = 0) \n#> compose_trans : function (...) 
\n#> date_trans : function () \n#> exp_trans : function (base = exp(1)) \n#> hms_trans : function () \n#> identity_trans : function () \n#> log_trans : function (base = exp(1)) \n#> log10_trans : function () \n#> log1p_trans : function () \n#> log2_trans : function () \n#> logit_trans : function () \n#> modulus_trans : function (p, offset = 1) \n#> probability_trans : function (distribution, ...) \n#> probit_trans : function () \n#> pseudo_log_trans : function (sigma = 1, base = exp(1)) \n#> reciprocal_trans : function () \n#> reverse_trans : function () \n#> sqrt_trans : function () \n#> time_trans : function (tz = NULL) \n#> yj_trans : function (p)\nscales::log10_trans()\n#> Transformer: log-10 [1e-100, Inf]\n```\n:::\n\n\nA value of `NULL` means that no transformation should be used.\n\nA quantitative parameter function should have these two arguments and, in the function body, a call `new_quant_param()`. There are a few arguments to this function:\n\n\n::: {.cell layout-align=\"center\" hash='cache/new_quant_param_dc898a8030c6fa2847b68be9c2db5701'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nargs(new_quant_param)\n#> function (type = c(\"double\", \"integer\"), range = NULL, inclusive = NULL, \n#> default = deprecated(), trans = NULL, values = NULL, label = NULL, \n#> finalize = NULL) \n#> NULL\n```\n:::\n\n\n- Possible types are double precision and integers. The value of `type` should agree with the values of `range` in the function definition.\n\n- It's OK for our tuning to include the minimum or maximum, so we'll use `c(TRUE, TRUE)` for `inclusive`. If the value cannot include one end of the range, set one or both of these values to `FALSE`.\n\n- The `label` should be a named character string where the name is the parameter name and the value represents what will be printed automatically.\n\n- `finalize` is an argument that can set parts of the range. This is discussed more below.\n\nHere's an example of a basic quantitative parameter object:\n\n\n::: {.cell layout-align=\"center\" hash='cache/num-initial-terms_d602318800beb1ef90a7bde3e6959438'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, 10L), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = NULL\n )\n}\n\nnum_initial_terms()\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 10]\n\n# Sample from the parameter:\nset.seed(4832856)\nnum_initial_terms() %>% value_sample(5)\n#> [1] 6 4 9 10 4\n```\n:::\n\n\n### Finalizing parameters\n\nIt might be the case that the range of the parameter is unknown. For example, parameters that are related to the number of columns in a data set cannot be exactly specified in the absence of data. In those cases, a placeholder of `unknown()` can be added. This will force the user to \"finalize\" the parameter object for their particular data set. Let's redefine our function with an `unknown()` value:\n\n\n::: {.cell layout-align=\"center\" hash='cache/num-initial-terms-unk_9de7d72b673760c5098403e4f395b8d8'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, unknown()), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = NULL\n )\n}\nnum_initial_terms()\n\n# Can we sample? 
\nnum_initial_terms() %>% value_sample(5)\n```\n:::\n\n\nThe `finalize` argument of `num_initial_terms()` can take a function that uses data to set the range. For example, the package already includes a few functions for finalization:\n\n\n::: {.cell layout-align=\"center\" hash='cache/dials-final-funcs_13c8f3f4f0f277ecb3c76d762cb7a32c'}\n\n```{.r .cell-code}\nlsf.str(\"package:dials\", pattern = \"^get_\")\n#> get_batch_sizes : function (object, x, frac = c(1/10, 1/3), ...) \n#> get_log_p : function (object, x, ...) \n#> get_n : function (object, x, log_vals = FALSE, ...) \n#> get_n_frac : function (object, x, log_vals = FALSE, frac = 1/3, ...) \n#> get_n_frac_range : function (object, x, log_vals = FALSE, frac = c(1/10, 5/10), ...) \n#> get_p : function (object, x, log_vals = FALSE, ...) \n#> get_rbf_range : function (object, x, seed = sample.int(10^5, 1), ...)\n```\n:::\n\n\nThese functions generally take a data frame of predictors (in an argument called `x`) and add the range of the parameter object. Using the formula in the earth package, we might use:\n\n\n::: {.cell layout-align=\"center\" hash='cache/earth-range_e1c9bf6b8f535f761d22d3b738ea8bb2'}\n\n```{.r .cell-code}\nget_initial_mars_terms <- function(object, x) {\n upper_bound <- min(200, max(20, 2 * ncol(x))) + 1\n upper_bound <- as.integer(upper_bound)\n bounds <- range_get(object)\n bounds$upper <- upper_bound\n range_set(object, bounds)\n}\n\n# Use the mtcars are the finalize the upper bound: \nnum_initial_terms() %>% get_initial_mars_terms(x = mtcars[, -1])\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 21]\n```\n:::\n\n\nOnce we add this function to the object, the general `finalize()` method can be used:\n\n\n::: {.cell layout-align=\"center\" hash='cache/final-obj_9b8361428190d870490441c3eecf012e'}\n\n```{.r .cell-code}\nnum_initial_terms <- function(range = c(1L, unknown()), trans = NULL) {\n new_quant_param(\n type = \"integer\",\n range = range,\n inclusive = c(TRUE, TRUE),\n trans = trans,\n label = c(num_initial_terms = \"# Initial MARS Terms\"),\n finalize = get_initial_mars_terms\n )\n}\n\nnum_initial_terms() %>% finalize(x = mtcars[, -1])\n#> # Initial MARS Terms (quantitative)\n#> Range: [1, 21]\n```\n:::\n\n\n## Qualitative parameters\n\nNow let's look at an example of a qualitative parameter. If a model includes a data aggregation step, we want to allow users to tune how our parameters are aggregated. For example, in embedding methods, possible values might be `min`, `max`, `mean`, `sum`, or to not aggregate at all (\"none\"). Since these cannot be put on a numeric scale, they are possible values of a qualitative parameter. We'll take \"character\" input (not \"logical\"), and we must specify the allowed values. By default we won't aggregate.\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation_39f71033809bc19015c698dbeca6311d'}\n\n```{.r .cell-code}\naggregation <- function(values = c(\"none\", \"min\", \"max\", \"mean\", \"sum\")) {\n new_qual_param(\n type = \"character\",\n values = values,\n # By default, the first value is selected as default. We'll specify that to\n # make it clear.\n default = \"none\",\n label = c(aggregation = \"Aggregation Method\")\n )\n}\n```\n:::\n\n\nWithin the dials package, the convention is to have the values contained in a separate vector whose name starts with `values_`. 
For example:\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation-vec_7dcc9f145f63cccd19b24d9fa23f4d10'}\n\n```{.r .cell-code}\nvalues_aggregation <- c(\"none\", \"min\", \"max\", \"mean\", \"sum\")\naggregation <- function(values = values_aggregation) {\n new_qual_param(\n type = \"character\",\n values = values,\n # By default, the first value is selected as default. We'll specify that to\n # make it clear.\n default = \"none\",\n label = c(aggregation = \"Aggregation Method\")\n )\n}\n```\n:::\n\n\nThis step may not make sense if you are using the function in a script and not keeping it within a package.\n\nWe can use our `aggregation` parameters with dials functions.\n\n\n::: {.cell layout-align=\"center\" hash='cache/aggregation-use_9ac92377d564310ae312bf63617fedad'}\n\n```{.r .cell-code}\naggregation()\n#> Warning: The `default` argument of `new_qual_param()` is deprecated as of\n#> dials 1.0.1.\n#> Aggregation Method (qualitative)\n#> 5 possible values include:\n#> 'none', 'min', 'max', 'mean' and 'sum'\naggregation() %>% value_sample(3)\n#> [1] \"min\" \"sum\" \"mean\"\n```\n:::\n\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Big Sur/Monterey 10.16\n#> system x86_64, darwin17.0\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-04\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> infer * 1.0.3 2022-08-22 [1] CRAN (R 4.2.0)\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 0.3.5 2022-10-06 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3.9000 2022-12-12 [1] local\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1.9000 2022-12-13 [1] local\n#> scales * 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-05 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-11-30 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/content/learn/develop/recipes/index/execute-results/html.json b/_freeze/content/learn/develop/recipes/index/execute-results/html.json index 8d8b8212..9b4ebc44 100644 --- a/_freeze/content/learn/develop/recipes/index/execute-results/html.json +++ b/_freeze/content/learn/develop/recipes/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "e4f84da00cddb4f9fbc6e267db2e8700", + "hash": "8db58da323045aa3496d83885e9bf155", "result": { - "markdown": "---\ntitle: \"Create your own recipe step function\"\ncategories:\n - developer tools\ntype: learn-subsection\nweight: 1\ndescription: | \n Write a new recipe step for data preprocessing.\n---\n\n\n\n\n\n\n## Introduction\n\nTo use code in 
this article, you will need to install the following packages: modeldata and tidymodels.\n\nThere are many existing recipe steps in packages like recipes, themis, textrecipes, and others. A full list of steps in CRAN packages [can be found here](/find/recipes/). However, you might need to define your own preprocessing operations; this article describes how to do that. If you are looking for good examples of existing steps, we suggest looking at the code for [centering](https://github.com/tidymodels/recipes/blob/master/R/center.R) or [PCA](https://github.com/tidymodels/recipes/blob/master/R/pca.R) to start. \n\nFor check operations (e.g. `check_class()`), the process is very similar. Notes on this are available at the end of this article. \n\nThe general process to follow is to:\n\n1. Define a step constructor function.\n\n2. Create the minimal S3 methods for `prep()`, `bake()`, and `print()`. \n\n3. Optionally add some extra methods to work with other tidymodels packages, such as `tunable()` and `tidy()`. \n\nAs an example, we will create a step for converting data into percentiles. \n\n## A new step definition\n\nLet's create a step that replaces the value of a variable with its percentile from the training set. The example data we'll use is from the modeldata package:\n\n\n::: {.cell layout-align=\"center\" hash='cache/initial_e9b15d7263a13cf19c62529abfd10bdb'}\n\n```{.r .cell-code}\nlibrary(modeldata)\ndata(biomass)\nstr(biomass)\n#> 'data.frame':\t536 obs. of 8 variables:\n#> $ sample : chr \"Akhrot Shell\" \"Alabama Oak Wood Waste\" \"Alder\" \"Alfalfa\" ...\n#> $ dataset : chr \"Training\" \"Training\" \"Training\" \"Training\" ...\n#> $ carbon : num 49.8 49.5 47.8 45.1 46.8 ...\n#> $ hydrogen: num 5.64 5.7 5.8 4.97 5.4 5.75 5.99 5.7 5.5 5.9 ...\n#> $ oxygen : num 42.9 41.3 46.2 35.6 40.7 ...\n#> $ nitrogen: num 0.41 0.2 0.11 3.3 1 2.04 2.68 1.7 0.8 1.2 ...\n#> $ sulfur : num 0 0 0.02 0.16 0.02 0.1 0.2 0.2 0 0.1 ...\n#> $ HHV : num 20 19.2 18.3 18.2 18.4 ...\n\nbiomass_tr <- biomass[biomass$dataset == \"Training\",]\nbiomass_te <- biomass[biomass$dataset == \"Testing\",]\n```\n:::\n\n\nTo illustrate the transformation with the `carbon` variable, note the training set distribution of this variable with a vertical line below for the first value of the test set. \n\n\n::: {.cell layout-align=\"center\" hash='cache/carbon_dist_1ea4da99e7a470221927e2615897ebcd'}\n\n```{.r .cell-code}\nlibrary(ggplot2)\ntheme_set(theme_bw())\nggplot(biomass_tr, aes(x = carbon)) + \n geom_histogram(binwidth = 5, col = \"blue\", fill = \"blue\", alpha = .5) + \n geom_vline(xintercept = biomass_te$carbon[1], lty = 2)\n```\n\n::: {.cell-output-display}\n![](figs/carbon_dist-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\nBased on the training set, 42.1% of the data are less than a value of 46.35. There are some applications where it might be advantageous to represent the predictor values as percentiles rather than their original values. \n\nOur new step will do this computation for any numeric variables of interest. We will call this new recipe step `step_percentile()`. The code below is designed for illustration and not speed or best practices. We've left out a lot of error trapping that we would want in a real implementation. \n\n## Create the function\n\nTo start, there is a _user-facing_ function. Let's call that `step_percentile()`. This is just a simple wrapper around a _constructor function_, which defines the rules for any step object that defines a percentile transformation. 
We'll call this constructor `step_percentile_new()`. \n\nThe function `step_percentile()` takes the same arguments as your function and simply adds it to a new recipe. The `...` signifies the variable selectors that can be used.\n\n\n::: {.cell layout-align=\"center\" hash='cache/initial_def_9e13ec18326669f1593d727bab539fa6'}\n\n```{.r .cell-code}\nstep_percentile <- function(\n recipe, \n ..., \n role = NA, \n trained = FALSE, \n ref_dist = NULL,\n options = list(probs = (0:100)/100, names = TRUE),\n skip = FALSE,\n id = rand_id(\"percentile\")\n ) {\n\n ## The variable selectors are not immediately evaluated by using\n ## the `quos()` function in `rlang`. `ellipse_check()` captures \n ## the values and also checks to make sure that they are not empty. \n terms <- ellipse_check(...) \n\n add_step(\n recipe, \n step_percentile_new(\n terms = terms, \n trained = trained,\n role = role, \n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n )\n}\n```\n:::\n\n\nYou should always keep the first four arguments (`recipe` though `trained`) the same as listed above. Some notes:\n\n * the `role` argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing and using `role = NA` will leave the existing role intact. \n * `trained` is set by the package when the estimation step has been run. You should default your function definition's argument to `FALSE`. \n * `skip` is a logical. Whenever a recipe is prepped, each step is trained and then baked. However, there are some steps that should not be applied when a call to `bake()` is used. For example, if a step is applied to the variables with roles of \"outcomes\", these data would not be available for new samples. \n * `id` is a character string that can be used to identify steps in package code. `rand_id()` will create an ID that has the prefix and a random character sequence. \n\nWe can estimate the percentiles of new data points based on the percentiles from the training set with `approx()`. Our `step_percentile` contains a `ref_dist` object to store these percentiles (pre-computed from the training set in `prep()`) for later use in `bake()`.\n\nWe will use `stats::quantile()` to compute the grid. However, we might also want to have control over the granularity of this grid, so the `options` argument will be used to define how that calculation is done. We could use the ellipses (aka `...`) so that any options passed to `step_percentile()` that are not one of its arguments will then be passed to `stats::quantile()`. However, we recommend making a separate list object with the options and use these inside the function because `...` is already used to define the variable selection. \n\nIt is also important to consider if there are any _main arguments_ to the step. For example, for spline-related steps such as `step_ns()`, users typically want to adjust the argument for the degrees of freedom in the spline (e.g. `splines::ns(x, df)`). Rather than letting users add `df` to the `options` argument: \n\n* Allow the important arguments to be main arguments to the step function. \n\n* Follow the tidymodels [conventions for naming arguments](https://tidymodels.github.io/model-implementation-principles/standardized-argument-names.html). Whenever possible, avoid jargon and keep common argument names. \n\nThere are benefits to following these principles (as shown below). 
\n\n## Initialize a new object\n\nNow, the constructor function can be created.\n\nThe function cascade is: \n\n```\nstep_percentile() calls recipes::add_step()\n└──> recipes::add_step() calls step_percentile_new()\n └──> step_percentile_new() calls recipes::step()\n```\n\n`step()` is a general constructor for recipes that mainly makes sure that the resulting step object is a list with an appropriate S3 class structure. Using `subclass = \"percentile\"` will set the class of new objects to `\"step_percentile\"`. \n\n\n::: {.cell layout-align=\"center\" hash='cache/initialize_565d7c1989cea598cb438eb330637305'}\n\n```{.r .cell-code}\nstep_percentile_new <- \n function(terms, role, trained, ref_dist, options, skip, id) {\n step(\n subclass = \"percentile\", \n terms = terms,\n role = role,\n trained = trained,\n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n }\n```\n:::\n\n\nThis constructor function should have no default argument values. Defaults should be set in the user-facing step object. \n\n## Create the `prep` method\n\nYou will need to create a new `prep()` method for your step's class. To do this, three arguments that the method should have are:\n\n```r\nfunction(x, training, info = NULL)\n```\n\nwhere\n\n * `x` will be the `step_percentile` object,\n * `training` will be a _tibble_ that has the training set data, and\n * `info` will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific `prep()` method so it may not have the variables from the original data. The columns in this tibble are `variable` (the variable name), `type` (currently either \"numeric\" or \"nominal\"), `role` (defining the variable's role), and `source` (either \"original\" or \"derived\" depending on where it originated).\n\nYou can define other arguments as well. \n\nThe first thing that you might want to do in the `prep()` function is to translate the specification listed in the `terms` argument to column names in the current data. There is a function called `recipes_eval_select()` that can be used to obtain this. \n\n{{% warning %}} The `recipes_eval_select()` function is not one you interact with as a typical recipes user, but it is helpful if you develop your own custom recipe steps. {{%/ warning %}}\n\n\n::: {.cell layout-align=\"center\" hash='cache/prep_1_e9cc5b4aa66319e4eca05dbda8b4cc2c'}\n\n```{.r .cell-code}\nprep.step_percentile <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info) \n # TODO finish the rest of the function\n}\n```\n:::\n\n\nAfter this function call, it is a good idea to check that the selected columns have the appropriate type (e.g. numeric for this example). See `recipes::check_type()` to do this for basic types. \n\nOnce we have this, we can save the approximation grid. 
For the grid, we will use a helper function that enables us to run `rlang::exec()` to splice in any extra arguments contained in the `options` list to the call to `quantile()`: \n\n\n::: {.cell layout-align=\"center\" hash='cache/splice_d7f87c1fe0ee11958f763f1b2ab1ee72'}\n\n```{.r .cell-code}\nget_train_pctl <- function(x, args = NULL) {\n res <- rlang::exec(\"quantile\", x = x, !!!args)\n # Remove duplicate percentile values\n res[!duplicated(res)]\n}\n\n# For example:\nget_train_pctl(biomass_tr$carbon, list(probs = 0:1))\n#> 0% 100% \n#> 14.61 97.18\nget_train_pctl(biomass_tr$carbon)\n#> 0% 25% 50% 75% 100% \n#> 14.610 44.715 47.100 49.725 97.180\n```\n:::\n\n\nNow, the `prep()` method can be created: \n\n\n::: {.cell layout-align=\"center\" hash='cache/prep-2_7c79c53e94b392f35b346414e8b3f731'}\n\n```{.r .cell-code}\nprep.step_percentile <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info)\n ## You can add error trapping for non-numeric data here and so on. \n \n ## We'll use the names later so make sure they are available\n if (x$options$names == FALSE) {\n rlang::abort(\"`names` should be set to TRUE\")\n }\n \n if (!any(names(x$options) == \"probs\")) {\n x$options$probs <- (0:100)/100\n } else {\n x$options$probs <- sort(unique(x$options$probs))\n }\n \n # Compute percentile grid\n ref_dist <- purrr::map(training[, col_names], get_train_pctl, args = x$options)\n\n ## Use the constructor function to return the updated object. \n ## Note that `trained` is now set to TRUE\n \n step_percentile_new(\n terms = x$terms, \n trained = TRUE,\n role = x$role, \n ref_dist = ref_dist,\n options = x$options,\n skip = x$skip,\n id = x$id\n )\n}\n```\n:::\n\n\nWe suggest favoring `rlang::abort()` and `rlang::warn()` over `stop()` and `warning()`. The former can be used for better traceback results.\n\n\n## Create the `bake` method\n\nRemember that the `prep()` function does not _apply_ the step to the data; it only estimates any required values such as `ref_dist`. We will need to create a new method for our `step_percentile()` class. The minimum arguments for this are\n\n```r\nfunction(object, new_data, ...)\n```\n\nwhere `object` is the updated step function that has been through the corresponding `prep()` code and `new_data` is a tibble of data to be processed. \n\nHere is the code to convert the new data to percentiles. The input data (`x` below) comes in as a numeric vector and the output is a vector of approximate percentiles: \n\n\n::: {.cell layout-align=\"center\" hash='cache/bake-helpers_26b81d4ec4ac1dca48a0bd67178379d4'}\n\n```{.r .cell-code}\npctl_by_approx <- function(x, ref) {\n # In case duplicates were removed, get the percentiles from\n # the names of the reference object\n grid <- as.numeric(gsub(\"%$\", \"\", names(ref))) \n approx(x = ref, y = grid, xout = x)$y/100\n}\n```\n:::\n\n\nThese computations are done column-wise using `purrr::map2_dfc()` to modify the new data in-place:\n\n\n::: {.cell layout-align=\"center\" hash='cache/bake-method_c5dab3d658e8739f1bff0630466624e5'}\n\n```{.r .cell-code}\nbake.step_percentile <- function(object, new_data, ...) 
{\n ## For illustration (and not speed), we will loop through the affected variables\n ## and do the computations\n vars <- names(object$ref_dist)\n \n new_data[, vars] <-\n purrr::map2_dfc(new_data[, vars], object$ref_dist, pctl_by_approx)\n \n ## Always convert to tibbles on the way out\n tibble::as_tibble(new_data)\n}\n```\n:::\n\n\n{{% note %}} You need to import `recipes::prep()` and `recipes::bake()` to create your own step function in a package. {{%/ note %}}\n\n## Run the example\n\nLet's use the example data to make sure that it works: \n\n\n::: {.cell layout-align=\"center\" hash='cache/example_772256fc0ca7141fc3d35b1fd93e88a9'}\n\n```{.r .cell-code}\nrec_obj <- \n recipe(HHV ~ ., data = biomass_tr) %>%\n step_percentile(ends_with(\"gen\")) %>%\n prep(training = biomass_tr)\n\nbiomass_te %>% select(ends_with(\"gen\")) %>% slice(1:2)\n#> hydrogen oxygen nitrogen\n#> 1 5.67 47.20 0.30\n#> 2 5.50 48.06 2.85\nbake(rec_obj, biomass_te %>% slice(1:2), ends_with(\"gen\"))\n#> # A tibble: 2 × 3\n#> hydrogen oxygen nitrogen\n#> \n#> 1 0.45 0.903 0.21 \n#> 2 0.38 0.922 0.928\n\n# Checking to get approximate result: \nmean(biomass_tr$hydrogen <= biomass_te$hydrogen[1])\n#> [1] 0.4517544\nmean(biomass_tr$oxygen <= biomass_te$oxygen[1])\n#> [1] 0.9013158\n```\n:::\n\n\nThe plot below shows how the original hydrogen percentiles line up with the estimated values:\n\n\n::: {.cell layout-align=\"center\" hash='cache/cdf_plot_159799b7fd72828fadd4d606fa204f3e'}\n\n```{.r .cell-code}\nhydrogen_values <- \n bake(rec_obj, biomass_te, hydrogen) %>% \n bind_cols(biomass_te %>% select(original = hydrogen))\n\nggplot(biomass_tr, aes(x = hydrogen)) + \n # Plot the empirical distribution function of the \n # hydrogen training set values as a black line\n stat_ecdf() + \n # Overlay the estimated percentiles for the new data: \n geom_point(data = hydrogen_values, \n aes(x = original, y = hydrogen), \n col = \"red\", alpha = .5, cex = 2) + \n labs(x = \"New Hydrogen Values\", y = \"Percentile Based on Training Set\")\n```\n\n::: {.cell-output-display}\n![](figs/cdf_plot-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nThese line up very nicely! \n\n## Custom check operations \n\nThe process here is exactly the same as steps; the internal functions have a similar naming convention: \n\n * `add_check()` instead of `add_step()`\n * `check()` instead of `step()`, and so on. \n \nIt is strongly recommended that:\n \n 1. The operations start with `check_` (i.e. `check_range()` and `check_range_new()`)\n 1. The check uses `rlang::abort(paste0(...))` when the conditions are not met\n 1. The original data are returned (unaltered) by the check when the conditions are satisfied. \n\n## Other step methods\n\nThere are a few other S3 methods that can be created for your step function. They are not required unless you plan on using your step in the broader tidymodels package set. \n\n### A print method\n\nIf you don't add a print method for `step_percentile`, it will still print but it will be printed as a list of (potentially large) objects and look a bit ugly. The recipes package contains a helper function called `printer()` that should be useful in most cases. We are using it here for the custom print method for `step_percentile`. It requires the original terms specification and the column names this specification is evaluated to by `prep()`. 
For the former, our step object is structured so that the list object `ref_dist` has the names of the selected variables: \n\n\n::: {.cell layout-align=\"center\" hash='cache/print-method_3703e6bb929a5780983729502421e4b4'}\n\n```{.r .cell-code}\nprint.step_percentile <-\n function(x, width = max(20, options()$width - 35), ...) {\n cat(\"Percentile transformation on \", sep = \"\")\n printer(\n # Names before prep (could be selectors)\n untr_obj = x$terms,\n # Names after prep:\n tr_obj = names(x$ref_dist),\n # Has it been prepped? \n trained = x$trained,\n # An estimate of how many characters to print on a line: \n width = width\n )\n invisible(x)\n }\n\n# Results before `prep()`:\nrecipe(HHV ~ ., data = biomass_tr) %>%\n step_percentile(ends_with(\"gen\"))\n#> Recipe\n#> \n#> Inputs:\n#> \n#> role #variables\n#> outcome 1\n#> predictor 7\n#> \n#> Operations:\n#> \n#> Percentile transformation on ends_with(\"gen\")\n\n# Results after `prep()`: \nrec_obj\n#> Recipe\n#> \n#> Inputs:\n#> \n#> role #variables\n#> outcome 1\n#> predictor 7\n#> \n#> Training data contained 456 data points and no missing data.\n#> \n#> Operations:\n#> \n#> Percentile transformation on hydrogen, oxygen, nitrogen [trained]\n```\n:::\n\n \n### Methods for declaring required packages\n\nSome recipe steps use functions from other packages. When this is the case, the `step_*()` function should check to see if the package is installed. The function `recipes::recipes_pkg_check()` will do this. For example: \n\n```\n> recipes::recipes_pkg_check(\"some_package\")\n1 package is needed for this step and is not installed. (some_package). Start \na clean R session then run: install.packages(\"some_package\")\n```\n\nThere is an S3 method that can be used to declare what packages should be loaded when using the step. For a hypothetical step that relies on the `hypothetical` package, this might look like: \n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-15_ef766f12d6be3f28d8352aa2616aa143'}\n\n```{.r .cell-code}\nrequired_pkgs.step_hypothetical <- function(x, ...) {\n c(\"hypothetical\", \"myrecipespkg\")\n}\n```\n:::\n\n\nIn this example, `myrecipespkg` is the package where the step resides (if it is in a package).\n\nThe reason to declare what packages should be loaded is parallel processing. When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters have no additional packages loaded. If the home package for a recipe step is not loaded in the worker processes, the `prep()` methods cannot be found and an error occurs. \n\nIf this S3 method is used for your step, you can rely on this for checking the installation: \n \n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-16_bb0b7690de51c7e735663ee46cceddfd'}\n\n```{.r .cell-code}\nrecipes::recipes_pkg_check(required_pkgs.step_hypothetical())\n```\n:::\n\n\nIf you'd like an example of this in a package, please take a look at the [embed](https://github.com/tidymodels/embed/) or [themis](https://github.com/tidymodels/themis/) package.\n\n### A tidy method\n\nThe `broom::tidy()` method is a means to return information about the step in a usable format. For our step, it would be helpful to know the reference values. \n\nWhen the recipe has been prepped, those data are in the list `ref_dist`. 
A small function can be used to reformat that data into a tibble. It is customary to return the main values as `value`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/tidy-calcs_4ea442a4112fcdb07bcd881ce7e823bf'}\n\n```{.r .cell-code}\nformat_pctl <- function(x) {\n tibble::tibble(\n value = unname(x),\n percentile = as.numeric(gsub(\"%$\", \"\", names(x))) \n )\n}\n\n# For example: \npctl_step_object <- rec_obj$steps[[1]]\npctl_step_object\n#> Percentile transformation on hydrogen, oxygen, nitrogen [trained]\nformat_pctl(pctl_step_object$ref_dist[[\"hydrogen\"]])\n#> # A tibble: 87 × 2\n#> value percentile\n#> \n#> 1 0.03 0\n#> 2 0.934 1\n#> 3 1.60 2\n#> 4 2.07 3\n#> 5 2.45 4\n#> 6 2.74 5\n#> 7 3.15 6\n#> 8 3.49 7\n#> 9 3.71 8\n#> 10 3.99 9\n#> # … with 77 more rows\n```\n:::\n\n\nThe tidy method could return these values for each selected column. Before `prep()`, missing values can be used as placeholders. \n\n\n::: {.cell layout-align=\"center\" hash='cache/tidy_034d8d1109dc6a42f7f14bec91cfc669'}\n\n```{.r .cell-code}\ntidy.step_percentile <- function(x, ...) {\n if (is_trained(x)) {\n res <- map_dfr(x$ref_dist, format_pctl, .id = \"term\")\n }\n else {\n term_names <- sel2char(x$terms)\n res <-\n tibble(\n terms = term_names,\n value = rlang::na_dbl,\n percentile = rlang::na_dbl\n )\n }\n # Always return the step id: \n res$id <- x$id\n res\n}\n\ntidy(rec_obj, number = 1)\n#> # A tibble: 274 × 4\n#> term value percentile id \n#> \n#> 1 hydrogen 0.03 0 percentile_HcwF5\n#> 2 hydrogen 0.934 1 percentile_HcwF5\n#> 3 hydrogen 1.60 2 percentile_HcwF5\n#> 4 hydrogen 2.07 3 percentile_HcwF5\n#> 5 hydrogen 2.45 4 percentile_HcwF5\n#> 6 hydrogen 2.74 5 percentile_HcwF5\n#> 7 hydrogen 3.15 6 percentile_HcwF5\n#> 8 hydrogen 3.49 7 percentile_HcwF5\n#> 9 hydrogen 3.71 8 percentile_HcwF5\n#> 10 hydrogen 3.99 9 percentile_HcwF5\n#> # … with 264 more rows\n```\n:::\n\n\n### Methods for tuning parameters\n\nThe tune package can be used to find reasonable values of step arguments by model tuning. There are some S3 methods that are useful to define for your step. The percentile example doesn't really have any tunable parameters, so we will demonstrate using `step_poly()`, which returns a polynomial expansion of selected columns. Its function definition has the arguments: \n\n\n::: {.cell layout-align=\"center\" hash='cache/poly-args_aeebdb3dac81ee70eba24e6ae9ec6448'}\n\n```{.r .cell-code}\nargs(step_poly)\n#> function (recipe, ..., role = \"predictor\", trained = FALSE, objects = NULL, \n#> degree = 2, options = list(), skip = FALSE, id = rand_id(\"poly\")) \n#> NULL\n```\n:::\n\n\nThe argument `degree` is tunable.\n\nTo work with tune it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how values of those arguments should be generated. \n\n`tunable()` takes the step object as its argument and returns a tibble with columns: \n\n* `name`: The name of the argument. \n\n* `call_info`: A list that describes how to call a function that returns a dials parameter object. \n\n* `source`: A character string that indicates where the tuning value comes from (i.e., a model, a recipe etc.). Here, it is just `\"recipe\"`. \n\n* `component`: A character string with more information about the source. For recipes, this is just the name of the step (e.g. `\"step_poly\"`). \n\n* `component_id`: A character string to indicate where a unique identifier is for the object. For recipes, this is just the `id` value of the step object. 
\n\nThe main piece of information that requires some detail is `call_info`. This is a list column in the tibble. Each element of the list is a list that describes the package and function that can be used to create a dials parameter object. \n\nFor example, for a nearest-neighbors `neighbors` parameter, this value is just: \n\n\n::: {.cell layout-align=\"center\" hash='cache/mtry_3c03e505855845f8e5bb7a077fd5b825'}\n\n```{.r .cell-code}\ninfo <- list(pkg = \"dials\", fun = \"neighbors\")\n\n# FYI: how it is used under-the-hood: \nnew_param_call <- rlang::call2(.fn = info$fun, .ns = info$pkg)\nrlang::eval_tidy(new_param_call)\n#> # Nearest Neighbors (quantitative)\n#> Range: [1, 10]\n```\n:::\n\n\nFor `step_poly()`, a dials object is needed that returns an integer that is the number of new columns to create. It turns out that there are a few different types of tuning parameters related to degree: \n\n```r\n> lsf.str(\"package:dials\", pattern = \"degree\")\ndegree : function (range = c(1, 3), trans = NULL) \ndegree_int : function (range = c(1L, 3L), trans = NULL) \nprod_degree : function (range = c(1L, 2L), trans = NULL) \nspline_degree : function (range = c(3L, 10L), trans = NULL) \n```\n\nLooking at the `range` values, some return doubles and others return integers. For our problem, `degree_int()` would be a good choice. \n\nFor `step_poly()` the `tunable()` S3 method could be: \n\n\n::: {.cell layout-align=\"center\" hash='cache/tunable_16f4fa39f3ddb8145e6a67412664dadb'}\n\n```{.r .cell-code}\ntunable.step_poly <- function (x, ...) {\n tibble::tibble(\n name = c(\"degree\"),\n call_info = list(list(pkg = \"dials\", fun = \"degree_int\")),\n source = \"recipe\",\n component = \"step_poly\",\n component_id = x$id\n )\n}\n```\n:::\n\n\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Monterey 12.6.1\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-05\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.2.0)\n#> modeldata * 1.0.1 2022-09-06 [1] CRAN (R 4.2.0)\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 1.0.0 2022-12-20 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3 2022-11-09 [1] CRAN (R 4.2.0)\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.2.0)\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-09 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-12-27 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n", + "markdown": "---\ntitle: \"Create your own recipe step function\"\ncategories:\n - developer tools\ntype: 
learn-subsection\nweight: 1\ndescription: | \n Write a new recipe step for data preprocessing.\n---\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: modeldata and tidymodels.\n\nThere are many existing recipe steps in packages like recipes, themis, textrecipes, and others. A full list of steps in CRAN packages [can be found here](/find/recipes/). However, you might need to define your own preprocessing operations; this article describes how to do that. If you are looking for good examples of existing steps, we suggest looking at the code for [centering](https://github.com/tidymodels/recipes/blob/master/R/center.R) or [PCA](https://github.com/tidymodels/recipes/blob/master/R/pca.R) to start. \n\nFor check operations (e.g. `check_class()`), the process is very similar. Notes on this are available at the end of this article. \n\nThe general process to follow is to:\n\n1. Define a step constructor function.\n\n2. Create the minimal S3 methods for `prep()`, `bake()`, and `print()`. \n\n3. Optionally add some extra methods to work with other tidymodels packages, such as `tunable()` and `tidy()`. \n\nAs an example, we will create a step for converting data into percentiles. \n\n## A new step definition\n\nLet's create a step that replaces the value of a variable with its percentile from the training set. The example data we'll use is from the modeldata package:\n\n\n::: {.cell layout-align=\"center\" hash='cache/initial_e9b15d7263a13cf19c62529abfd10bdb'}\n\n```{.r .cell-code}\nlibrary(modeldata)\ndata(biomass)\nstr(biomass)\n#> 'data.frame':\t536 obs. of 8 variables:\n#> $ sample : chr \"Akhrot Shell\" \"Alabama Oak Wood Waste\" \"Alder\" \"Alfalfa\" ...\n#> $ dataset : chr \"Training\" \"Training\" \"Training\" \"Training\" ...\n#> $ carbon : num 49.8 49.5 47.8 45.1 46.8 ...\n#> $ hydrogen: num 5.64 5.7 5.8 4.97 5.4 5.75 5.99 5.7 5.5 5.9 ...\n#> $ oxygen : num 42.9 41.3 46.2 35.6 40.7 ...\n#> $ nitrogen: num 0.41 0.2 0.11 3.3 1 2.04 2.68 1.7 0.8 1.2 ...\n#> $ sulfur : num 0 0 0.02 0.16 0.02 0.1 0.2 0.2 0 0.1 ...\n#> $ HHV : num 20 19.2 18.3 18.2 18.4 ...\n\nbiomass_tr <- biomass[biomass$dataset == \"Training\",]\nbiomass_te <- biomass[biomass$dataset == \"Testing\",]\n```\n:::\n\n\nTo illustrate the transformation with the `carbon` variable, note the training set distribution of this variable with a vertical line below for the first value of the test set. \n\n\n::: {.cell layout-align=\"center\" hash='cache/carbon_dist_1ea4da99e7a470221927e2615897ebcd'}\n\n```{.r .cell-code}\nlibrary(ggplot2)\ntheme_set(theme_bw())\nggplot(biomass_tr, aes(x = carbon)) + \n geom_histogram(binwidth = 5, col = \"blue\", fill = \"blue\", alpha = .5) + \n geom_vline(xintercept = biomass_te$carbon[1], lty = 2)\n```\n\n::: {.cell-output-display}\n![](figs/carbon_dist-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\nBased on the training set, 42.1% of the data are less than a value of 46.35. There are some applications where it might be advantageous to represent the predictor values as percentiles rather than their original values. \n\nOur new step will do this computation for any numeric variables of interest. We will call this new recipe step `step_percentile()`. The code below is designed for illustration and not speed or best practices. We've left out a lot of error trapping that we would want in a real implementation. \n\n## Create the function\n\nTo start, there is a _user-facing_ function. Let's call that `step_percentile()`. 
This is just a simple wrapper around a _constructor function_, which defines the rules for any step object that defines a percentile transformation. We'll call this constructor `step_percentile_new()`. \n\nThe function `step_percentile()` takes the same arguments as your function and simply adds it to a new recipe. The `...` signifies the variable selectors that can be used.\n\n\n::: {.cell layout-align=\"center\" hash='cache/initial_def_9e13ec18326669f1593d727bab539fa6'}\n\n```{.r .cell-code}\nstep_percentile <- function(\n recipe, \n ..., \n role = NA, \n trained = FALSE, \n ref_dist = NULL,\n options = list(probs = (0:100)/100, names = TRUE),\n skip = FALSE,\n id = rand_id(\"percentile\")\n ) {\n\n ## The variable selectors are not immediately evaluated by using\n ## the `quos()` function in `rlang`. `ellipse_check()` captures \n ## the values and also checks to make sure that they are not empty. \n terms <- ellipse_check(...) \n\n add_step(\n recipe, \n step_percentile_new(\n terms = terms, \n trained = trained,\n role = role, \n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n )\n}\n```\n:::\n\n\nYou should always keep the first four arguments (`recipe` through `trained`) the same as listed above. Some notes:\n\n * the `role` argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing and using `role = NA` will leave the existing role intact. \n * `trained` is set by the package when the estimation step has been run. You should default your function definition's argument to `FALSE`. \n * `skip` is a logical. Whenever a recipe is prepped, each step is trained and then baked. However, there are some steps that should not be applied when a call to `bake()` is used. For example, if a step is applied to the variables with roles of \"outcomes\", these data would not be available for new samples. \n * `id` is a character string that can be used to identify steps in package code. `rand_id()` will create an ID that has the prefix and a random character sequence. \n\nWe can estimate the percentiles of new data points based on the percentiles from the training set with `approx()`. Our `step_percentile` contains a `ref_dist` object to store these percentiles (pre-computed from the training set in `prep()`) for later use in `bake()`.\n\nWe will use `stats::quantile()` to compute the grid. However, we might also want to have control over the granularity of this grid, so the `options` argument will be used to define how that calculation is done. We could use the ellipses (aka `...`) so that any options passed to `step_percentile()` that are not one of its arguments will then be passed to `stats::quantile()`. However, we recommend making a separate list object with the options and using it inside the function because `...` is already used to define the variable selection. \n\nIt is also important to consider if there are any _main arguments_ to the step. For example, for spline-related steps such as `step_ns()`, users typically want to adjust the argument for the degrees of freedom in the spline (e.g. `splines::ns(x, df)`). Rather than letting users add `df` to the `options` argument: \n\n* Allow the important arguments to be main arguments to the step function. \n\n* Follow the tidymodels [conventions for naming arguments](https://tidymodels.github.io/model-implementation-principles/standardized-argument-names.html). 
Whenever possible, avoid jargon and keep common argument names. \n\nThere are benefits to following these principles (as shown below). \n\n## Initialize a new object\n\nNow, the constructor function can be created.\n\nThe function cascade is: \n\n```\nstep_percentile() calls recipes::add_step()\n└──> recipes::add_step() calls step_percentile_new()\n └──> step_percentile_new() calls recipes::step()\n```\n\n`step()` is a general constructor for recipes that mainly makes sure that the resulting step object is a list with an appropriate S3 class structure. Using `subclass = \"percentile\"` will set the class of new objects to `\"step_percentile\"`. \n\n\n::: {.cell layout-align=\"center\" hash='cache/initialize_565d7c1989cea598cb438eb330637305'}\n\n```{.r .cell-code}\nstep_percentile_new <- \n function(terms, role, trained, ref_dist, options, skip, id) {\n step(\n subclass = \"percentile\", \n terms = terms,\n role = role,\n trained = trained,\n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n }\n```\n:::\n\n\nThis constructor function should have no default argument values. Defaults should be set in the user-facing step object. \n\n## Create the `prep` method\n\nYou will need to create a new `prep()` method for your step's class. To do this, three arguments that the method should have are:\n\n```r\nfunction(x, training, info = NULL)\n```\n\nwhere\n\n * `x` will be the `step_percentile` object,\n * `training` will be a _tibble_ that has the training set data, and\n * `info` will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific `prep()` method so it may not have the variables from the original data. The columns in this tibble are `variable` (the variable name), `type` (currently either \"numeric\" or \"nominal\"), `role` (defining the variable's role), and `source` (either \"original\" or \"derived\" depending on where it originated).\n\nYou can define other arguments as well. \n\nThe first thing that you might want to do in the `prep()` function is to translate the specification listed in the `terms` argument to column names in the current data. There is a function called `recipes_eval_select()` that can be used to obtain this. \n\n::: {.callout-warning}\n The `recipes_eval_select()` function is not one you interact with as a typical recipes user, but it is helpful if you develop your own custom recipe steps. \n:::\n\n\n::: {.cell layout-align=\"center\" hash='cache/prep_1_e9cc5b4aa66319e4eca05dbda8b4cc2c'}\n\n```{.r .cell-code}\nprep.step_percentile <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info) \n # TODO finish the rest of the function\n}\n```\n:::\n\n\nAfter this function call, it is a good idea to check that the selected columns have the appropriate type (e.g. numeric for this example). See `recipes::check_type()` to do this for basic types. \n\nOnce we have this, we can save the approximation grid. 
For the grid, we will use a helper function that enables us to run `rlang::exec()` to splice in any extra arguments contained in the `options` list to the call to `quantile()`: \n\n\n::: {.cell layout-align=\"center\" hash='cache/splice_d7f87c1fe0ee11958f763f1b2ab1ee72'}\n\n```{.r .cell-code}\nget_train_pctl <- function(x, args = NULL) {\n res <- rlang::exec(\"quantile\", x = x, !!!args)\n # Remove duplicate percentile values\n res[!duplicated(res)]\n}\n\n# For example:\nget_train_pctl(biomass_tr$carbon, list(probs = 0:1))\n#> 0% 100% \n#> 14.61 97.18\nget_train_pctl(biomass_tr$carbon)\n#> 0% 25% 50% 75% 100% \n#> 14.610 44.715 47.100 49.725 97.180\n```\n:::\n\n\nNow, the `prep()` method can be created: \n\n\n::: {.cell layout-align=\"center\" hash='cache/prep-2_7c79c53e94b392f35b346414e8b3f731'}\n\n```{.r .cell-code}\nprep.step_percentile <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info)\n ## You can add error trapping for non-numeric data here and so on. \n \n ## We'll use the names later so make sure they are available\n if (x$options$names == FALSE) {\n rlang::abort(\"`names` should be set to TRUE\")\n }\n \n if (!any(names(x$options) == \"probs\")) {\n x$options$probs <- (0:100)/100\n } else {\n x$options$probs <- sort(unique(x$options$probs))\n }\n \n # Compute percentile grid\n ref_dist <- purrr::map(training[, col_names], get_train_pctl, args = x$options)\n\n ## Use the constructor function to return the updated object. \n ## Note that `trained` is now set to TRUE\n \n step_percentile_new(\n terms = x$terms, \n trained = TRUE,\n role = x$role, \n ref_dist = ref_dist,\n options = x$options,\n skip = x$skip,\n id = x$id\n )\n}\n```\n:::\n\n\nWe suggest favoring `rlang::abort()` and `rlang::warn()` over `stop()` and `warning()`. The former can be used for better traceback results.\n\n\n## Create the `bake` method\n\nRemember that the `prep()` function does not _apply_ the step to the data; it only estimates any required values such as `ref_dist`. We will need to create a new method for our `step_percentile()` class. The minimum arguments for this are\n\n```r\nfunction(object, new_data, ...)\n```\n\nwhere `object` is the updated step function that has been through the corresponding `prep()` code and `new_data` is a tibble of data to be processed. \n\nHere is the code to convert the new data to percentiles. The input data (`x` below) comes in as a numeric vector and the output is a vector of approximate percentiles: \n\n\n::: {.cell layout-align=\"center\" hash='cache/bake-helpers_26b81d4ec4ac1dca48a0bd67178379d4'}\n\n```{.r .cell-code}\npctl_by_approx <- function(x, ref) {\n # In case duplicates were removed, get the percentiles from\n # the names of the reference object\n grid <- as.numeric(gsub(\"%$\", \"\", names(ref))) \n approx(x = ref, y = grid, xout = x)$y/100\n}\n```\n:::\n\n\nThese computations are done column-wise using `purrr::map2_dfc()` to modify the new data in-place:\n\n\n::: {.cell layout-align=\"center\" hash='cache/bake-method_c5dab3d658e8739f1bff0630466624e5'}\n\n```{.r .cell-code}\nbake.step_percentile <- function(object, new_data, ...) 
{\n ## For illustration (and not speed), we will loop through the affected variables\n ## and do the computations\n vars <- names(object$ref_dist)\n \n new_data[, vars] <-\n purrr::map2_dfc(new_data[, vars], object$ref_dist, pctl_by_approx)\n \n ## Always convert to tibbles on the way out\n tibble::as_tibble(new_data)\n}\n```\n:::\n\n\n::: {.callout-note}\nYou need to import `recipes::prep()` and `recipes::bake()` to create your own step function in a package. \n:::\n\n## Run the example\n\nLet's use the example data to make sure that it works: \n\n\n::: {.cell layout-align=\"center\" hash='cache/example_772256fc0ca7141fc3d35b1fd93e88a9'}\n\n```{.r .cell-code}\nrec_obj <- \n recipe(HHV ~ ., data = biomass_tr) %>%\n step_percentile(ends_with(\"gen\")) %>%\n prep(training = biomass_tr)\n\nbiomass_te %>% select(ends_with(\"gen\")) %>% slice(1:2)\n#> hydrogen oxygen nitrogen\n#> 1 5.67 47.20 0.30\n#> 2 5.50 48.06 2.85\nbake(rec_obj, biomass_te %>% slice(1:2), ends_with(\"gen\"))\n#> # A tibble: 2 × 3\n#> hydrogen oxygen nitrogen\n#> \n#> 1 0.45 0.903 0.21 \n#> 2 0.38 0.922 0.928\n\n# Checking to get approximate result: \nmean(biomass_tr$hydrogen <= biomass_te$hydrogen[1])\n#> [1] 0.4517544\nmean(biomass_tr$oxygen <= biomass_te$oxygen[1])\n#> [1] 0.9013158\n```\n:::\n\n\nThe plot below shows how the original hydrogen percentiles line up with the estimated values:\n\n\n::: {.cell layout-align=\"center\" hash='cache/cdf_plot_159799b7fd72828fadd4d606fa204f3e'}\n\n```{.r .cell-code}\nhydrogen_values <- \n bake(rec_obj, biomass_te, hydrogen) %>% \n bind_cols(biomass_te %>% select(original = hydrogen))\n\nggplot(biomass_tr, aes(x = hydrogen)) + \n # Plot the empirical distribution function of the \n # hydrogen training set values as a black line\n stat_ecdf() + \n # Overlay the estimated percentiles for the new data: \n geom_point(data = hydrogen_values, \n aes(x = original, y = hydrogen), \n col = \"red\", alpha = .5, cex = 2) + \n labs(x = \"New Hydrogen Values\", y = \"Percentile Based on Training Set\")\n```\n\n::: {.cell-output-display}\n![](figs/cdf_plot-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nThese line up very nicely! \n\n## Custom check operations \n\nThe process here is exactly the same as steps; the internal functions have a similar naming convention: \n\n * `add_check()` instead of `add_step()`\n * `check()` instead of `step()`, and so on. \n \nIt is strongly recommended that:\n \n 1. The operations start with `check_` (i.e. `check_range()` and `check_range_new()`)\n 1. The check uses `rlang::abort(paste0(...))` when the conditions are not met\n 1. The original data are returned (unaltered) by the check when the conditions are satisfied. \n\n## Other step methods\n\nThere are a few other S3 methods that can be created for your step function. They are not required unless you plan on using your step in the broader tidymodels package set. \n\n### A print method\n\nIf you don't add a print method for `step_percentile`, it will still print but it will be printed as a list of (potentially large) objects and look a bit ugly. The recipes package contains a helper function called `printer()` that should be useful in most cases. We are using it here for the custom print method for `step_percentile`. It requires the original terms specification and the column names this specification is evaluated to by `prep()`. 
For the former, our step object is structured so that the list object `ref_dist` has the names of the selected variables: \n\n\n::: {.cell layout-align=\"center\" hash='cache/print-method_3703e6bb929a5780983729502421e4b4'}\n\n```{.r .cell-code}\nprint.step_percentile <-\n function(x, width = max(20, options()$width - 35), ...) {\n cat(\"Percentile transformation on \", sep = \"\")\n printer(\n # Names before prep (could be selectors)\n untr_obj = x$terms,\n # Names after prep:\n tr_obj = names(x$ref_dist),\n # Has it been prepped? \n trained = x$trained,\n # An estimate of how many characters to print on a line: \n width = width\n )\n invisible(x)\n }\n\n# Results before `prep()`:\nrecipe(HHV ~ ., data = biomass_tr) %>%\n step_percentile(ends_with(\"gen\"))\n#> Recipe\n#> \n#> Inputs:\n#> \n#> role #variables\n#> outcome 1\n#> predictor 7\n#> \n#> Operations:\n#> \n#> Percentile transformation on ends_with(\"gen\")\n\n# Results after `prep()`: \nrec_obj\n#> Recipe\n#> \n#> Inputs:\n#> \n#> role #variables\n#> outcome 1\n#> predictor 7\n#> \n#> Training data contained 456 data points and no missing data.\n#> \n#> Operations:\n#> \n#> Percentile transformation on hydrogen, oxygen, nitrogen [trained]\n```\n:::\n\n \n### Methods for declaring required packages\n\nSome recipe steps use functions from other packages. When this is the case, the `step_*()` function should check to see if the package is installed. The function `recipes::recipes_pkg_check()` will do this. For example: \n\n```\n> recipes::recipes_pkg_check(\"some_package\")\n1 package is needed for this step and is not installed. (some_package). Start \na clean R session then run: install.packages(\"some_package\")\n```\n\nThere is an S3 method that can be used to declare what packages should be loaded when using the step. For a hypothetical step that relies on the `hypothetical` package, this might look like: \n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-15_ef766f12d6be3f28d8352aa2616aa143'}\n\n```{.r .cell-code}\nrequired_pkgs.step_hypothetical <- function(x, ...) {\n c(\"hypothetical\", \"myrecipespkg\")\n}\n```\n:::\n\n\nIn this example, `myrecipespkg` is the package where the step resides (if it is in a package).\n\nThe reason to declare what packages should be loaded is parallel processing. When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters has no additional packages loaded. If the home package for a recipe step is not loaded in the worker processes, the `prep()` methods cannot be found and an error occurs. \n\nIf this S3 method is used for your step, you can rely on this for checking the installation: \n \n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-16_bb0b7690de51c7e735663ee46cceddfd'}\n\n```{.r .cell-code}\nrecipes::recipes_pkg_check(required_pkgs.step_hypothetical())\n```\n:::\n\n\nIf you'd like an example of this in a package, please take a look at the [embed](https://github.com/tidymodels/embed/) or [themis](https://github.com/tidymodels/themis/) package.\n\n### A tidy method\n\nThe `broom::tidy()` method is a means to return information about the step in a usable format. For our step, it would be helpful to know the reference values. \n\nWhen the recipe has been prepped, those data are in the list `ref_dist`. 
A small function can be used to reformat that data into a tibble. It is customary to return the main values as `value`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/tidy-calcs_4ea442a4112fcdb07bcd881ce7e823bf'}\n\n```{.r .cell-code}\nformat_pctl <- function(x) {\n tibble::tibble(\n value = unname(x),\n percentile = as.numeric(gsub(\"%$\", \"\", names(x))) \n )\n}\n\n# For example: \npctl_step_object <- rec_obj$steps[[1]]\npctl_step_object\n#> Percentile transformation on hydrogen, oxygen, nitrogen [trained]\nformat_pctl(pctl_step_object$ref_dist[[\"hydrogen\"]])\n#> # A tibble: 87 × 2\n#> value percentile\n#> \n#> 1 0.03 0\n#> 2 0.934 1\n#> 3 1.60 2\n#> 4 2.07 3\n#> 5 2.45 4\n#> 6 2.74 5\n#> 7 3.15 6\n#> 8 3.49 7\n#> 9 3.71 8\n#> 10 3.99 9\n#> # … with 77 more rows\n```\n:::\n\n\nThe tidy method could return these values for each selected column. Before `prep()`, missing values can be used as placeholders. \n\n\n::: {.cell layout-align=\"center\" hash='cache/tidy_034d8d1109dc6a42f7f14bec91cfc669'}\n\n```{.r .cell-code}\ntidy.step_percentile <- function(x, ...) {\n if (is_trained(x)) {\n res <- map_dfr(x$ref_dist, format_pctl, .id = \"term\")\n }\n else {\n term_names <- sel2char(x$terms)\n res <-\n tibble(\n terms = term_names,\n value = rlang::na_dbl,\n percentile = rlang::na_dbl\n )\n }\n # Always return the step id: \n res$id <- x$id\n res\n}\n\ntidy(rec_obj, number = 1)\n#> # A tibble: 274 × 4\n#> term value percentile id \n#> \n#> 1 hydrogen 0.03 0 percentile_Sim0q\n#> 2 hydrogen 0.934 1 percentile_Sim0q\n#> 3 hydrogen 1.60 2 percentile_Sim0q\n#> 4 hydrogen 2.07 3 percentile_Sim0q\n#> 5 hydrogen 2.45 4 percentile_Sim0q\n#> 6 hydrogen 2.74 5 percentile_Sim0q\n#> 7 hydrogen 3.15 6 percentile_Sim0q\n#> 8 hydrogen 3.49 7 percentile_Sim0q\n#> 9 hydrogen 3.71 8 percentile_Sim0q\n#> 10 hydrogen 3.99 9 percentile_Sim0q\n#> # … with 264 more rows\n```\n:::\n\n\n### Methods for tuning parameters\n\nThe tune package can be used to find reasonable values of step arguments by model tuning. There are some S3 methods that are useful to define for your step. The percentile example doesn't really have any tunable parameters, so we will demonstrate using `step_poly()`, which returns a polynomial expansion of selected columns. Its function definition has the arguments: \n\n\n::: {.cell layout-align=\"center\" hash='cache/poly-args_aeebdb3dac81ee70eba24e6ae9ec6448'}\n\n```{.r .cell-code}\nargs(step_poly)\n#> function (recipe, ..., role = \"predictor\", trained = FALSE, objects = NULL, \n#> degree = 2, options = list(), skip = FALSE, id = rand_id(\"poly\")) \n#> NULL\n```\n:::\n\n\nThe argument `degree` is tunable.\n\nTo work with tune it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how values of those arguments should be generated. \n\n`tunable()` takes the step object as its argument and returns a tibble with columns: \n\n* `name`: The name of the argument. \n\n* `call_info`: A list that describes how to call a function that returns a dials parameter object. \n\n* `source`: A character string that indicates where the tuning value comes from (i.e., a model, a recipe etc.). Here, it is just `\"recipe\"`. \n\n* `component`: A character string with more information about the source. For recipes, this is just the name of the step (e.g. `\"step_poly\"`). \n\n* `component_id`: A character string to indicate where a unique identifier is for the object. For recipes, this is just the `id` value of the step object. 
\n\nThe main piece of information that requires some detail is `call_info`. This is a list column in the tibble. Each element of the list is a list that describes the package and function that can be used to create a dials parameter object. \n\nFor example, for a nearest-neighbors `neighbors` parameter, this value is just: \n\n\n::: {.cell layout-align=\"center\" hash='cache/mtry_3c03e505855845f8e5bb7a077fd5b825'}\n\n```{.r .cell-code}\ninfo <- list(pkg = \"dials\", fun = \"neighbors\")\n\n# FYI: how it is used under-the-hood: \nnew_param_call <- rlang::call2(.fn = info$fun, .ns = info$pkg)\nrlang::eval_tidy(new_param_call)\n#> # Nearest Neighbors (quantitative)\n#> Range: [1, 10]\n```\n:::\n\n\nFor `step_poly()`, a dials object is needed that returns an integer that is the number of new columns to create. It turns out that there are a few different types of tuning parameters related to degree: \n\n```r\n> lsf.str(\"package:dials\", pattern = \"degree\")\ndegree : function (range = c(1, 3), trans = NULL) \ndegree_int : function (range = c(1L, 3L), trans = NULL) \nprod_degree : function (range = c(1L, 2L), trans = NULL) \nspline_degree : function (range = c(3L, 10L), trans = NULL) \n```\n\nLooking at the `range` values, some return doubles and others return integers. For our problem, `degree_int()` would be a good choice. \n\nFor `step_poly()` the `tunable()` S3 method could be: \n\n\n::: {.cell layout-align=\"center\" hash='cache/tunable_16f4fa39f3ddb8145e6a67412664dadb'}\n\n```{.r .cell-code}\ntunable.step_poly <- function (x, ...) {\n tibble::tibble(\n name = c(\"degree\"),\n call_info = list(list(pkg = \"dials\", fun = \"degree_int\")),\n source = \"recipe\",\n component = \"step_poly\",\n component_id = x$id\n )\n}\n```\n:::\n\n\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Big Sur/Monterey 10.16\n#> system x86_64, darwin17.0\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-04\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> infer * 1.0.3 2022-08-22 [1] CRAN (R 4.2.0)\n#> modeldata * 1.0.1.9000 2023-01-04 [1] local\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 0.3.5 2022-10-06 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3.9000 2022-12-12 [1] local\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1.9000 2022-12-13 [1] local\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-05 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-11-30 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git 
a/_freeze/content/learn/models/parsnip-ranger-glmnet/index/execute-results/html.json b/_freeze/content/learn/models/parsnip-ranger-glmnet/index/execute-results/html.json index b1a7b7e8..3f609d25 100644 --- a/_freeze/content/learn/models/parsnip-ranger-glmnet/index/execute-results/html.json +++ b/_freeze/content/learn/models/parsnip-ranger-glmnet/index/execute-results/html.json @@ -1,7 +1,7 @@ { "hash": "fa8587a14b33bc8502378013a546ffc7", "result": { - "markdown": "---\ntitle: \"Regression models two ways\"\ncategories:\n - model fitting\n - random forests\n - linear regression\ntype: learn-subsection\nweight: 1\ndescription: | \n Create and train different kinds of regression models with different computational engines.\n---\n\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: glmnet, randomForest, ranger, and tidymodels.\n\nWe can create regression models with the tidymodels package [parsnip](https://parsnip.tidymodels.org/) to predict continuous or numeric quantities. Here, let's first fit a random forest model, which does _not_ require all numeric input (see discussion [here](https://bookdown.org/max/FES/categorical-trees.html)) and discuss how to use `fit()` and `fit_xy()`, as well as _data descriptors_. \n\nSecond, let's fit a regularized linear regression model to demonstrate how to move between different types of models using parsnip. \n\n## The Ames housing data\n\nWe'll use the Ames housing data set to demonstrate how to create regression models using parsnip. First, set up the data set and create a simple training/test set split:\n\n\n::: {.cell layout-align=\"center\" hash='cache/ames-split_18b5bf0134171b332b56ced5fc3b1911'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n\ndata(ames)\n\nset.seed(4595)\ndata_split <- initial_split(ames, strata = \"Sale_Price\", prop = 0.75)\n\names_train <- training(data_split)\names_test <- testing(data_split)\n```\n:::\n\n\nThe use of the test set here is _only for illustration_; normally in a data analysis these data would be saved to the very end after many models have been evaluated. \n\n## Random forest\n\nWe'll start by fitting a random forest model to a small set of parameters. Let's create a model with the predictors `Longitude`, `Latitude`, `Lot_Area`, `Neighborhood`, and `Year_Sold`. A simple random forest model can be specified via:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic_02332292464935db3e80ae3984b88bfe'}\n\n```{.r .cell-code}\nrf_defaults <- rand_forest(mode = \"regression\")\nrf_defaults\n#> Random Forest Model Specification (regression)\n#> \n#> Computational engine: ranger\n```\n:::\n\n\nThe model will be fit with the ranger package by default. Since we didn't add any extra arguments to `fit`, _many_ of the arguments will be set to their defaults from the function `ranger::ranger()`. The help pages for the model function describe the default parameters and you can also use the `translate()` function to check out such details. 
\n\nThe parsnip package provides two different interfaces to fit a model: \n\n- the formula interface (`fit()`), and\n- the non-formula interface (`fit_xy()`).\n\nLet's start with the non-formula interface:\n\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-xy_b14276d12ae7cf9ce85ecf44ee6a9bb4'}\n\n```{.r .cell-code}\npreds <- c(\"Longitude\", \"Latitude\", \"Lot_Area\", \"Neighborhood\", \"Year_Sold\")\n\nrf_xy_fit <- \n rf_defaults %>%\n set_engine(\"ranger\") %>%\n fit_xy(\n x = ames_train[, preds],\n y = log10(ames_train$Sale_Price)\n )\n\nrf_xy_fit\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 500 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 2 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.008500188 \n#> R squared (OOB): 0.7239116\n```\n:::\n\n\nThe non-formula interface doesn't do anything to the predictors before passing them to the underlying model function. This particular model does _not_ require indicator variables (sometimes called \"dummy variables\") to be created prior to fitting the model. Note that the output shows \"Number of independent variables: 5\".\n\nFor regression models, we can use the basic `predict()` method, which returns a tibble with a column named `.pred`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-xy-pred_6e4354a6451e46bc8faa718908d9a912'}\n\n```{.r .cell-code}\ntest_results <- \n ames_test %>%\n select(Sale_Price) %>%\n mutate(Sale_Price = log10(Sale_Price)) %>%\n bind_cols(\n predict(rf_xy_fit, new_data = ames_test[, preds])\n )\ntest_results %>% slice(1:5)\n#> # A tibble: 5 × 2\n#> Sale_Price .pred\n#> \n#> 1 5.39 5.25\n#> 2 5.28 5.29\n#> 3 5.23 5.26\n#> 4 5.21 5.30\n#> 5 5.60 5.51\n\n# summarize performance\ntest_results %>% metrics(truth = Sale_Price, estimate = .pred) \n#> # A tibble: 3 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 rmse standard 0.0945\n#> 2 rsq standard 0.733 \n#> 3 mae standard 0.0629\n```\n:::\n\n\nNote that: \n\n * If the model required indicator variables, we would have to create them manually prior to using `fit()` (perhaps using the recipes package).\n * We had to manually log the outcome prior to modeling. \n\nNow, for illustration, let's use the formula method using some new parameter values:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-form_1e1aadae8c7bde9360f11e8062781218'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = 3, trees = 1000) %>%\n set_engine(\"ranger\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~3, x), num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 1000 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 3 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.008402569 \n#> R squared (OOB): 0.7270823\n```\n:::\n\n \nSuppose that we would like to use the randomForest package instead of ranger. 
To do so, the only part of the syntax that needs to change is the `set_engine()` argument:\n\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-rf_7f5cc80f129451dce218438d3e2b5856'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = 3, trees = 1000) %>%\n set_engine(\"randomForest\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> \n#> Call:\n#> randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~3, x)) \n#> Type of random forest: regression\n#> Number of trees: 1000\n#> No. of variables tried at each split: 3\n#> \n#> Mean of squared residuals: 0.008472074\n#> % Var explained: 72.47\n```\n:::\n\n\nLook at the formula code that was printed out; one function uses the argument name `ntree` and the other uses `num.trees`. The parsnip models don't require you to know the specific names of the main arguments. \n\nNow suppose that we want to modify the value of `mtry` based on the number of predictors in the data. Usually, a good default value is `floor(sqrt(num_predictors))` but a pure bagging model requires an `mtry` value equal to the total number of parameters. There may be cases where you may not know how many predictors are going to be present when the model will be fit (perhaps due to the generation of indicator variables or a variable filter) so this might be difficult to know exactly ahead of time when you write your code. \n\nWhen the model it being fit by parsnip, [_data descriptors_](https://parsnip.tidymodels.org/reference/descriptors.html) are made available. These attempt to let you know what you will have available when the model is fit. When a model object is created (say using `rand_forest()`), the values of the arguments that you give it are _immediately evaluated_ unless you delay them. To delay the evaluation of any argument, you can used `rlang::expr()` to make an expression. \n\nTwo relevant data descriptors for our example model are:\n\n * `.preds()`: the number of predictor _variables_ in the data set that are associated with the predictors **prior to dummy variable creation**.\n * `.cols()`: the number of predictor _columns_ after dummy variables (or other encodings) are created.\n\nSince ranger won't create indicator values, `.preds()` would be appropriate for `mtry` for a bagging model. \n\nFor example, let's use an expression with the `.preds()` descriptor to fit a bagging model: \n\n\n::: {.cell layout-align=\"center\" hash='cache/bagged_2b76f70b641acbdb2616b84443585217'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = .preds(), trees = 1000) %>%\n set_engine(\"ranger\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~.preds(), x), num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 1000 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 5 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.00867085 \n#> R squared (OOB): 0.7183685\n```\n:::\n\n\n\n## Regularized regression\n\nA linear model might work for this data set as well. We can use the `linear_reg()` parsnip model. 
There are two engines that can perform regularization/penalization, the glmnet and sparklyr packages. Let's use the former here. The glmnet package only implements a non-formula method, but parsnip will allow either one to be used. \n\nWhen regularization is used, the predictors should first be centered and scaled before being passed to the model. The formula method won't do that automatically so we will need to do this ourselves. We'll use the [recipes](https://recipes.tidymodels.org/) package for these steps. \n\n\n::: {.cell layout-align=\"center\" hash='cache/glmn-form_a0ca81e5cfdf6601081373c7b271e499'}\n\n```{.r .cell-code}\nnorm_recipe <- \n recipe(\n Sale_Price ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold, \n data = ames_train\n ) %>%\n step_other(Neighborhood) %>% \n step_dummy(all_nominal()) %>%\n step_center(all_predictors()) %>%\n step_scale(all_predictors()) %>%\n step_log(Sale_Price, base = 10) %>% \n # estimate the means and standard deviations\n prep(training = ames_train, retain = TRUE)\n\n# Now let's fit the model using the processed version of the data\n\nglmn_fit <- \n linear_reg(penalty = 0.001, mixture = 0.5) %>% \n set_engine(\"glmnet\") %>%\n fit(Sale_Price ~ ., data = bake(norm_recipe, new_data = NULL))\nglmn_fit\n#> parsnip model object\n#> \n#> \n#> Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = \"gaussian\", alpha = ~0.5) \n#> \n#> Df %Dev Lambda\n#> 1 0 0.00 0.138300\n#> 2 1 1.96 0.126000\n#> 3 1 3.72 0.114800\n#> 4 1 5.28 0.104600\n#> 5 2 7.07 0.095320\n#> 6 3 9.64 0.086850\n#> 7 4 12.58 0.079140\n#> 8 5 15.45 0.072110\n#> 9 5 17.93 0.065700\n#> 10 7 20.81 0.059860\n#> 11 7 23.51 0.054550\n#> 12 7 25.82 0.049700\n#> 13 8 28.20 0.045290\n#> 14 8 30.31 0.041260\n#> 15 8 32.12 0.037600\n#> 16 8 33.66 0.034260\n#> 17 8 34.97 0.031210\n#> 18 8 36.08 0.028440\n#> 19 8 37.02 0.025910\n#> 20 9 37.90 0.023610\n#> 21 9 38.65 0.021510\n#> 22 9 39.29 0.019600\n#> 23 9 39.83 0.017860\n#> 24 9 40.28 0.016270\n#> 25 10 40.68 0.014830\n#> 26 11 41.06 0.013510\n#> 27 11 41.38 0.012310\n#> 28 11 41.65 0.011220\n#> 29 11 41.88 0.010220\n#> 30 12 42.09 0.009313\n#> 31 12 42.27 0.008486\n#> 32 12 42.43 0.007732\n#> 33 12 42.56 0.007045\n#> 34 12 42.66 0.006419\n#> 35 12 42.75 0.005849\n#> 36 12 42.83 0.005329\n#> 37 12 42.90 0.004856\n#> 38 12 42.95 0.004424\n#> 39 12 42.99 0.004031\n#> 40 12 43.03 0.003673\n#> 41 12 43.06 0.003347\n#> 42 12 43.09 0.003050\n#> 43 12 43.11 0.002779\n#> 44 12 43.13 0.002532\n#> 45 12 43.15 0.002307\n#> 46 12 43.16 0.002102\n#> 47 12 43.17 0.001915\n#> 48 12 43.18 0.001745\n#> 49 12 43.19 0.001590\n#> 50 12 43.19 0.001449\n#> 51 12 43.20 0.001320\n#> 52 12 43.20 0.001203\n#> 53 12 43.21 0.001096\n#> 54 12 43.21 0.000999\n#> 55 12 43.21 0.000910\n#> 56 12 43.21 0.000829\n#> 57 12 43.22 0.000755\n#> 58 12 43.22 0.000688\n#> 59 12 43.22 0.000627\n#> 60 12 43.22 0.000571\n#> 61 12 43.22 0.000521\n#> 62 12 43.22 0.000474\n#> 63 12 43.22 0.000432\n#> 64 12 43.22 0.000394\n#> 65 12 43.22 0.000359\n```\n:::\n\n\nIf `penalty` were not specified, all of the `lambda` values would be computed. 
\n\nTo get the predictions for this specific value of `lambda` (aka `penalty`):\n\n\n::: {.cell layout-align=\"center\" hash='cache/glmn-pred_673611c19e448251aeb977fec5788162'}\n\n```{.r .cell-code}\n# First, get the processed version of the test set predictors:\ntest_normalized <- bake(norm_recipe, new_data = ames_test, all_predictors())\n\ntest_results <- \n test_results %>%\n rename(`random forest` = .pred) %>%\n bind_cols(\n predict(glmn_fit, new_data = test_normalized) %>%\n rename(glmnet = .pred)\n )\ntest_results\n#> # A tibble: 733 × 3\n#> Sale_Price `random forest` glmnet\n#> \n#> 1 5.39 5.25 5.16\n#> 2 5.28 5.29 5.27\n#> 3 5.23 5.26 5.24\n#> 4 5.21 5.30 5.24\n#> 5 5.60 5.51 5.24\n#> 6 5.32 5.29 5.26\n#> 7 5.17 5.14 5.18\n#> 8 5.06 5.13 5.17\n#> 9 4.98 5.01 5.18\n#> 10 5.11 5.14 5.19\n#> # … with 723 more rows\n\ntest_results %>% metrics(truth = Sale_Price, estimate = glmnet) \n#> # A tibble: 3 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 rmse standard 0.142 \n#> 2 rsq standard 0.391 \n#> 3 mae standard 0.0979\n\ntest_results %>% \n gather(model, prediction, -Sale_Price) %>% \n ggplot(aes(x = prediction, y = Sale_Price)) + \n geom_abline(col = \"green\", lty = 2) + \n geom_point(alpha = .4) + \n facet_wrap(~model) + \n coord_fixed()\n```\n\n::: {.cell-output-display}\n![](figs/glmn-pred-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nThis final plot compares the performance of the random forest and regularized regression models.\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Monterey 12.6.1\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-05\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> glmnet * 4.1-6 2022-11-27 [1] CRAN (R 4.2.0)\n#> infer * 1.0.4 2022-12-02 [1] CRAN (R 4.2.0)\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 1.0.0 2022-12-20 [1] CRAN (R 4.2.0)\n#> randomForest * 4.7-1.1 2022-05-23 [1] CRAN (R 4.2.0)\n#> ranger * 0.14.1 2022-06-18 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3 2022-11-09 [1] CRAN (R 4.2.0)\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1 2022-12-07 [1] CRAN (R 4.2.0)\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-09 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-12-27 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n", + "markdown": "---\ntitle: \"Regression models two ways\"\ncategories:\n - model fitting\n - random forests\n - linear regression\ntype: learn-subsection\nweight: 1\ndescription: | \n Create and train different kinds of regression models with different computational engines.\n---\n\n\n\n\n\n\n\n## 
Introduction\n\nTo use code in this article, you will need to install the following packages: glmnet, randomForest, ranger, and tidymodels.\n\nWe can create regression models with the tidymodels package [parsnip](https://parsnip.tidymodels.org/) to predict continuous or numeric quantities. Here, let's first fit a random forest model, which does _not_ require all numeric input (see discussion [here](https://bookdown.org/max/FES/categorical-trees.html)) and discuss how to use `fit()` and `fit_xy()`, as well as _data descriptors_. \n\nSecond, let's fit a regularized linear regression model to demonstrate how to move between different types of models using parsnip. \n\n## The Ames housing data\n\nWe'll use the Ames housing data set to demonstrate how to create regression models using parsnip. First, set up the data set and create a simple training/test set split:\n\n\n::: {.cell layout-align=\"center\" hash='cache/ames-split_18b5bf0134171b332b56ced5fc3b1911'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n\ndata(ames)\n\nset.seed(4595)\ndata_split <- initial_split(ames, strata = \"Sale_Price\", prop = 0.75)\n\names_train <- training(data_split)\names_test <- testing(data_split)\n```\n:::\n\n\nThe use of the test set here is _only for illustration_; normally in a data analysis these data would be saved to the very end after many models have been evaluated. \n\n## Random forest\n\nWe'll start by fitting a random forest model to a small set of parameters. Let's create a model with the predictors `Longitude`, `Latitude`, `Lot_Area`, `Neighborhood`, and `Year_Sold`. A simple random forest model can be specified via:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic_02332292464935db3e80ae3984b88bfe'}\n\n```{.r .cell-code}\nrf_defaults <- rand_forest(mode = \"regression\")\nrf_defaults\n#> Random Forest Model Specification (regression)\n#> \n#> Computational engine: ranger\n```\n:::\n\n\nThe model will be fit with the ranger package by default. Since we didn't add any extra arguments to `fit`, _many_ of the arguments will be set to their defaults from the function `ranger::ranger()`. The help pages for the model function describe the default parameters and you can also use the `translate()` function to check out such details. \n\nThe parsnip package provides two different interfaces to fit a model: \n\n- the formula interface (`fit()`), and\n- the non-formula interface (`fit_xy()`).\n\nLet's start with the non-formula interface:\n\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-xy_b14276d12ae7cf9ce85ecf44ee6a9bb4'}\n\n```{.r .cell-code}\npreds <- c(\"Longitude\", \"Latitude\", \"Lot_Area\", \"Neighborhood\", \"Year_Sold\")\n\nrf_xy_fit <- \n rf_defaults %>%\n set_engine(\"ranger\") %>%\n fit_xy(\n x = ames_train[, preds],\n y = log10(ames_train$Sale_Price)\n )\n\nrf_xy_fit\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 500 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 2 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.008500188 \n#> R squared (OOB): 0.7239116\n```\n:::\n\n\nThe non-formula interface doesn't do anything to the predictors before passing them to the underlying model function. 
This particular model does _not_ require indicator variables (sometimes called \"dummy variables\") to be created prior to fitting the model. Note that the output shows \"Number of independent variables: 5\".\n\nFor regression models, we can use the basic `predict()` method, which returns a tibble with a column named `.pred`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-xy-pred_6e4354a6451e46bc8faa718908d9a912'}\n\n```{.r .cell-code}\ntest_results <- \n ames_test %>%\n select(Sale_Price) %>%\n mutate(Sale_Price = log10(Sale_Price)) %>%\n bind_cols(\n predict(rf_xy_fit, new_data = ames_test[, preds])\n )\ntest_results %>% slice(1:5)\n#> # A tibble: 5 × 2\n#> Sale_Price .pred\n#> \n#> 1 5.39 5.25\n#> 2 5.28 5.29\n#> 3 5.23 5.26\n#> 4 5.21 5.30\n#> 5 5.60 5.51\n\n# summarize performance\ntest_results %>% metrics(truth = Sale_Price, estimate = .pred) \n#> # A tibble: 3 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 rmse standard 0.0945\n#> 2 rsq standard 0.733 \n#> 3 mae standard 0.0629\n```\n:::\n\n\nNote that: \n\n * If the model required indicator variables, we would have to create them manually prior to using `fit()` (perhaps using the recipes package).\n * We had to manually log the outcome prior to modeling. \n\nNow, for illustration, let's use the formula method using some new parameter values:\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-basic-form_1e1aadae8c7bde9360f11e8062781218'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = 3, trees = 1000) %>%\n set_engine(\"ranger\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~3, x), num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 1000 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 3 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.008402569 \n#> R squared (OOB): 0.7270823\n```\n:::\n\n \nSuppose that we would like to use the randomForest package instead of ranger. To do so, the only part of the syntax that needs to change is the `set_engine()` argument:\n\n\n\n::: {.cell layout-align=\"center\" hash='cache/rf-rf_7f5cc80f129451dce218438d3e2b5856'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = 3, trees = 1000) %>%\n set_engine(\"randomForest\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> \n#> Call:\n#> randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~3, x)) \n#> Type of random forest: regression\n#> Number of trees: 1000\n#> No. of variables tried at each split: 3\n#> \n#> Mean of squared residuals: 0.008472074\n#> % Var explained: 72.47\n```\n:::\n\n\nLook at the formula code that was printed out; one function uses the argument name `ntree` and the other uses `num.trees`. The parsnip models don't require you to know the specific names of the main arguments. \n\nNow suppose that we want to modify the value of `mtry` based on the number of predictors in the data. Usually, a good default value is `floor(sqrt(num_predictors))` but a pure bagging model requires an `mtry` value equal to the total number of parameters. 
There may be cases where you may not know how many predictors are going to be present when the model will be fit (perhaps due to the generation of indicator variables or a variable filter) so this might be difficult to know exactly ahead of time when you write your code. \n\nWhen the model is being fit by parsnip, [_data descriptors_](https://parsnip.tidymodels.org/reference/descriptors.html) are made available. These attempt to let you know what you will have available when the model is fit. When a model object is created (say using `rand_forest()`), the values of the arguments that you give it are _immediately evaluated_ unless you delay them. To delay the evaluation of any argument, you can use `rlang::expr()` to make an expression. \n\nTwo relevant data descriptors for our example model are:\n\n * `.preds()`: the number of predictor _variables_ in the data set that are associated with the predictors **prior to dummy variable creation**.\n * `.cols()`: the number of predictor _columns_ after dummy variables (or other encodings) are created.\n\nSince ranger won't create indicator values, `.preds()` would be appropriate for `mtry` for a bagging model. \n\nFor example, let's use an expression with the `.preds()` descriptor to fit a bagging model: \n\n\n::: {.cell layout-align=\"center\" hash='cache/bagged_2b76f70b641acbdb2616b84443585217'}\n\n```{.r .cell-code}\nrand_forest(mode = \"regression\", mtry = .preds(), trees = 1000) %>%\n set_engine(\"ranger\") %>%\n fit(\n log10(Sale_Price) ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,\n data = ames_train\n )\n#> parsnip model object\n#> \n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~.preds(), x), num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) \n#> \n#> Type: Regression \n#> Number of trees: 1000 \n#> Sample size: 2197 \n#> Number of independent variables: 5 \n#> Mtry: 5 \n#> Target node size: 5 \n#> Variable importance mode: none \n#> Splitrule: variance \n#> OOB prediction error (MSE): 0.00867085 \n#> R squared (OOB): 0.7183685\n```\n:::\n\n\n\n## Regularized regression\n\nA linear model might work for this data set as well. We can use the `linear_reg()` parsnip model. There are two engines that can perform regularization/penalization, the glmnet and sparklyr packages. Let's use the former here. The glmnet package only implements a non-formula method, but parsnip will allow either one to be used. \n\nWhen regularization is used, the predictors should first be centered and scaled before being passed to the model. The formula method won't do that automatically so we will need to do this ourselves. We'll use the [recipes](https://recipes.tidymodels.org/) package for these steps. 
\n\n\n::: {.cell layout-align=\"center\" hash='cache/glmn-form_a0ca81e5cfdf6601081373c7b271e499'}\n\n```{.r .cell-code}\nnorm_recipe <- \n recipe(\n Sale_Price ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold, \n data = ames_train\n ) %>%\n step_other(Neighborhood) %>% \n step_dummy(all_nominal()) %>%\n step_center(all_predictors()) %>%\n step_scale(all_predictors()) %>%\n step_log(Sale_Price, base = 10) %>% \n # estimate the means and standard deviations\n prep(training = ames_train, retain = TRUE)\n\n# Now let's fit the model using the processed version of the data\n\nglmn_fit <- \n linear_reg(penalty = 0.001, mixture = 0.5) %>% \n set_engine(\"glmnet\") %>%\n fit(Sale_Price ~ ., data = bake(norm_recipe, new_data = NULL))\nglmn_fit\n#> parsnip model object\n#> \n#> \n#> Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = \"gaussian\", alpha = ~0.5) \n#> \n#> Df %Dev Lambda\n#> 1 0 0.00 0.138300\n#> 2 1 1.96 0.126000\n#> 3 1 3.72 0.114800\n#> 4 1 5.28 0.104600\n#> 5 2 7.07 0.095320\n#> 6 3 9.64 0.086850\n#> 7 4 12.58 0.079140\n#> 8 5 15.45 0.072110\n#> 9 5 17.93 0.065700\n#> 10 7 20.81 0.059860\n#> 11 7 23.51 0.054550\n#> 12 7 25.82 0.049700\n#> 13 8 28.20 0.045290\n#> 14 8 30.31 0.041260\n#> 15 8 32.12 0.037600\n#> 16 8 33.66 0.034260\n#> 17 8 34.97 0.031210\n#> 18 8 36.08 0.028440\n#> 19 8 37.02 0.025910\n#> 20 9 37.90 0.023610\n#> 21 9 38.65 0.021510\n#> 22 9 39.29 0.019600\n#> 23 9 39.83 0.017860\n#> 24 9 40.28 0.016270\n#> 25 10 40.68 0.014830\n#> 26 11 41.06 0.013510\n#> 27 11 41.38 0.012310\n#> 28 11 41.65 0.011220\n#> 29 11 41.88 0.010220\n#> 30 12 42.09 0.009313\n#> 31 12 42.27 0.008486\n#> 32 12 42.43 0.007732\n#> 33 12 42.56 0.007045\n#> 34 12 42.66 0.006419\n#> 35 12 42.75 0.005849\n#> 36 12 42.83 0.005329\n#> 37 12 42.90 0.004856\n#> 38 12 42.95 0.004424\n#> 39 12 42.99 0.004031\n#> 40 12 43.03 0.003673\n#> 41 12 43.06 0.003347\n#> 42 12 43.09 0.003050\n#> 43 12 43.11 0.002779\n#> 44 12 43.13 0.002532\n#> 45 12 43.15 0.002307\n#> 46 12 43.16 0.002102\n#> 47 12 43.17 0.001915\n#> 48 12 43.18 0.001745\n#> 49 12 43.19 0.001590\n#> 50 12 43.19 0.001449\n#> 51 12 43.20 0.001320\n#> 52 12 43.20 0.001203\n#> 53 12 43.21 0.001096\n#> 54 12 43.21 0.000999\n#> 55 12 43.21 0.000910\n#> 56 12 43.21 0.000829\n#> 57 12 43.22 0.000755\n#> 58 12 43.22 0.000688\n#> 59 12 43.22 0.000627\n#> 60 12 43.22 0.000571\n#> 61 12 43.22 0.000521\n#> 62 12 43.22 0.000474\n#> 63 12 43.22 0.000432\n#> 64 12 43.22 0.000394\n#> 65 12 43.22 0.000359\n```\n:::\n\n\nIf `penalty` were not specified, all of the `lambda` values would be computed. 
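Because glmnet estimates the whole regularization path, predictions at several penalty values can also be requested in one call. A minimal sketch, assuming the `norm_recipe` and `glmn_fit` objects created above (an editorial aside; output not shown, and the penalty values are illustrative):

```r
# A sketch: parsnip's multi_predict() asks the glmnet engine for predictions
# at several penalty values from the path fit above. The penalties used here
# are illustrative and not taken from the original article.
glmn_fit %>%
  multi_predict(
    new_data = bake(norm_recipe, new_data = ames_test, all_predictors()),
    penalty = c(0.001, 0.01, 0.1)
  )
```

The result should be a tibble with a `.pred` list-column, one row per test set observation, each element holding the predictions for the requested penalties.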
\n\nTo get the predictions for this specific value of `lambda` (aka `penalty`):\n\n\n::: {.cell layout-align=\"center\" hash='cache/glmn-pred_673611c19e448251aeb977fec5788162'}\n\n```{.r .cell-code}\n# First, get the processed version of the test set predictors:\ntest_normalized <- bake(norm_recipe, new_data = ames_test, all_predictors())\n\ntest_results <- \n test_results %>%\n rename(`random forest` = .pred) %>%\n bind_cols(\n predict(glmn_fit, new_data = test_normalized) %>%\n rename(glmnet = .pred)\n )\ntest_results\n#> # A tibble: 733 × 3\n#> Sale_Price `random forest` glmnet\n#> \n#> 1 5.39 5.25 5.16\n#> 2 5.28 5.29 5.27\n#> 3 5.23 5.26 5.24\n#> 4 5.21 5.30 5.24\n#> 5 5.60 5.51 5.24\n#> 6 5.32 5.29 5.26\n#> 7 5.17 5.14 5.18\n#> 8 5.06 5.13 5.17\n#> 9 4.98 5.01 5.18\n#> 10 5.11 5.14 5.19\n#> # … with 723 more rows\n\ntest_results %>% metrics(truth = Sale_Price, estimate = glmnet) \n#> # A tibble: 3 × 3\n#> .metric .estimator .estimate\n#> \n#> 1 rmse standard 0.142 \n#> 2 rsq standard 0.391 \n#> 3 mae standard 0.0979\n\ntest_results %>% \n gather(model, prediction, -Sale_Price) %>% \n ggplot(aes(x = prediction, y = Sale_Price)) + \n geom_abline(col = \"green\", lty = 2) + \n geom_point(alpha = .4) + \n facet_wrap(~model) + \n coord_fixed()\n```\n\n::: {.cell-output-display}\n![](figs/glmn-pred-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nThis final plot compares the performance of the random forest and regularized regression models.\n\n## Session information\n\n\n::: {.cell layout-align=\"center\" hash='cache/si_43a75b68dcc94565ba13180d7ad26a69'}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.2.0 (2022-04-22)\n#> os macOS Big Sur/Monterey 10.16\n#> system x86_64, darwin17.0\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/New_York\n#> date 2023-01-04\n#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.1 2022-08-29 [1] CRAN (R 4.2.0)\n#> dials * 1.1.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)\n#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)\n#> glmnet * 4.1-4 2022-04-15 [1] CRAN (R 4.2.0)\n#> infer * 1.0.3 2022-08-22 [1] CRAN (R 4.2.0)\n#> parsnip * 1.0.3 2022-11-11 [1] CRAN (R 4.2.0)\n#> purrr * 0.3.5 2022-10-06 [1] CRAN (R 4.2.0)\n#> randomForest * 4.7-1.1 2022-05-23 [1] CRAN (R 4.2.0)\n#> ranger * 0.14.1 2022-06-18 [1] CRAN (R 4.2.0)\n#> recipes * 1.0.3.9000 2022-12-12 [1] local\n#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)\n#> rsample * 1.1.1.9000 2022-12-13 [1] local\n#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)\n#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.0)\n#> tune * 1.0.1.9001 2022-12-05 [1] Github (tidymodels/tune@e23abdf)\n#> workflows * 1.1.2 2022-11-16 [1] CRAN (R 4.2.0)\n#> yardstick * 1.1.0.9000 2022-11-30 [1] Github (tidymodels/yardstick@5f1b9ce)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/content/learn/statistics/many-models/index/execute-results/html.json b/_freeze/content/learn/statistics/many-models/index/execute-results/html.json new file mode 100644 index 00000000..47703bfd --- /dev/null +++ 
b/_freeze/content/learn/statistics/many-models/index/execute-results/html.json @@ -0,0 +1,14 @@ +{ + "hash": "8b507d2b8c5252519f803be34ba5a47b", + "result": { + "markdown": "---\ntitle: \"Many Models\"\ncategories:\n - statistical analysis\n - tidying results\ntype: learn-subsection\nweight: 1\ndescription: | \n Analyze the results of many linear regressions and learn about nested data frames along the way.\n---\n\n\n\n\n\n\n\n_These materials were in the first edition of [R for Data Science](https://r4ds.had.co.nz). They are not in later editions and are shown here -- with some modifications-- with permission_. To use code in this article, you will need to install the following packages: gapminder, stringr, and tidymodels.\n\nIn this article, you're going to learn three powerful ideas that help you to work with large numbers of models with ease:\n\n1. Using many simple models to better understand complex datasets.\n\n1. Using list-columns to store arbitrary data structures in a data frame.\n For example, this will allow you to have a column that contains linear \n models.\n \n1. Using the __broom__ package, originally by David Robinson, to turn models into tidy \n data. This is a powerful technique for working with large numbers of models.\n\nWe'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signals so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.\n\nThe following sections will dive into more detail about the individual techniques:\n\n1. In [list-columns], you'll learn more about the list-column data structure,\n and why it's valid to put lists in data frames.\n \n1. In [creating list-columns], you'll learn the three main ways in which you'll\n create list-columns.\n \n1. In [simplifying list-columns] you'll learn how to convert list-columns back\n to regular atomic vectors (or sets of atomic vectors) so you can work\n with them more easily.\n \n1. In [making tidy data with broom], you'll learn about the full set of tools\n provided by broom, and see how they can be applied to other types of \n data structure.\n\nFirst, let's load some packages:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-3_68b436f9fd1b8a93a8377ef202bb7c9c'}\n\n```{.r .cell-code}\nlibrary(tidymodels)\ntidymodels_prefer()\n```\n:::\n\n\n## gapminder\n\nTo motivate the power of many simple models, we're going to look into the \"gapminder\" data. This data was popularised by Hans Rosling, a Swedish doctor and statistician. If you've never heard of him, stop reading this chapter right now and go watch one of his videos! He is a fantastic data presenter and illustrates how you can use data to present a compelling story. A good place to start is this short video filmed in conjunction with the BBC: .\n\nThe gapminder data summarises the progression of countries over time, looking at statistics like life expectancy and GDP. 
The data is easy to access in R, thanks to Jenny Bryan who created the gapminder package:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-4_61bfa05dd2f83dae069375d16a77211a'}\n\n```{.r .cell-code}\nlibrary(gapminder)\ngapminder\n#> # A tibble: 1,704 × 6\n#> country continent year lifeExp pop gdpPercap\n#> \n#> 1 Afghanistan Asia 1952 28.8 8425333 779.\n#> 2 Afghanistan Asia 1957 30.3 9240934 821.\n#> 3 Afghanistan Asia 1962 32.0 10267083 853.\n#> 4 Afghanistan Asia 1967 34.0 11537966 836.\n#> 5 Afghanistan Asia 1972 36.1 13079460 740.\n#> 6 Afghanistan Asia 1977 38.4 14880372 786.\n#> 7 Afghanistan Asia 1982 39.9 12881816 978.\n#> 8 Afghanistan Asia 1987 40.8 13867957 852.\n#> 9 Afghanistan Asia 1992 41.7 16317921 649.\n#> 10 Afghanistan Asia 1997 41.8 22227415 635.\n#> # … with 1,694 more rows\n```\n:::\n\n\nIn this case study, we're going to focus on just three variables to answer the question \"How does life expectancy (`lifeExp`) change over time (`year`) for each country (`country`)?\". A good place to start is with a plot:\n\n\n::: {.cell layout-align=\"center\" hash='cache/year-life-exp_2ae61e7d6272489aa9ea110790fd6fc6'}\n\n```{.r .cell-code}\ngapminder %>% \n ggplot(aes(year, lifeExp, group = country)) +\n geom_line(alpha = 1/3)\n```\n\n::: {.cell-output-display}\n![](figs/year-life-exp-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nThis is a small dataset: it only has ~1,700 observations and 3 variables. But it's still hard to see what's going on! Overall, it looks like life expectancy has been steadily improving. However, if you look closely, you might notice some countries that don't follow this pattern. How can we make those countries easier to see?\n\nOne way is to use the same approach as in the last chapter: there's a strong signal (overall linear growth) that makes it hard to see subtler trends. We'll tease these factors apart by fitting a model with a linear trend. The model captures steady growth over time, and the residuals will show what's left.\n\nYou already know how to do that if we had a single country:\n\n\n::: {.cell layout-align=\"center\" fig.asp='1' hash='cache/year-life-exp-nz_06fcd418a5a9e9d7c356ed3c88d34748'}\n\n```{.r .cell-code}\nnz <- filter(gapminder, country == \"New Zealand\")\nnz %>% \n ggplot(aes(year, lifeExp)) + \n geom_line() + \n ggtitle(\"Full data = \")\n```\n\n::: {.cell-output-display}\n![](figs/year-life-exp-nz-1.svg){fig-align='center' width=33%}\n:::\n\n```{.r .cell-code}\n\nnz_mod <- lm(lifeExp ~ year, data = nz)\n```\n:::\n\n\nThe broom package provides a general set of functions to turn models into tidy data. One function, `augment()`, is used to supplement a data set with columns such as the predicted values, residuals, and so on. We'll use this to plot the predicted versus observed data: \n\n\n::: {.cell layout-align=\"center\" hash='cache/year-life-exp-nz-resid_6f887ff3d02adfedd7ea443d09af9830'}\n\n```{.r .cell-code}\nnz_mod_res <- augment(nz_mod)\n\nnz_mod_res %>%\n ggplot(aes(year, .fitted)) + \n geom_line() + \n ggtitle(\"Linear trend + \")\n```\n\n::: {.cell-output-display}\n![](figs/year-life-exp-nz-resid-1.svg){fig-align='center' width=672}\n:::\n\n```{.r .cell-code}\n\nnz_mod_res %>% \n ggplot(aes(year, .resid)) + \n geom_hline(yintercept = 0, colour = \"white\", linewidth = 3) + \n geom_line() + \n ggtitle(\"Remaining pattern\")\n```\n\n::: {.cell-output-display}\n![](figs/year-life-exp-nz-resid-2.svg){fig-align='center' width=672}\n:::\n:::\n\n\nThere is a `data` argument to `augment()`. 
The default uses the same data points that the model was trained on. \n\nHow can we easily fit that model to every country?\n\n### Nested data\n\nYou could imagine copy and pasting that code multiple times; but you've already learned a better way! Extract out the common code with a function and repeat using a map function from purrr. This problem is structured a little differently to what you've seen before. Instead of repeating an action for each variable, we want to repeat an action for each country, a subset of rows. To do that, we need a new data structure: the __nested data frame__. To create a nested data frame we start with a grouped data frame, and \"nest\" it:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-8_cc1ce25bb88024098422348cd2fe7ee9'}\n\n```{.r .cell-code}\nby_country <- gapminder %>% \n group_by(country, continent) %>% \n nest()\n\nby_country\n#> # A tibble: 142 × 3\n#> # Groups: country, continent [142]\n#> country continent data \n#> \n#> 1 Afghanistan Asia \n#> 2 Albania Europe \n#> 3 Algeria Africa \n#> 4 Angola Africa \n#> 5 Argentina Americas \n#> 6 Australia Oceania \n#> 7 Austria Europe \n#> 8 Bahrain Asia \n#> 9 Bangladesh Asia \n#> 10 Belgium Europe \n#> # … with 132 more rows\n```\n:::\n\n\n(I'm cheating a little by grouping on both `continent` and `country`. Given `country`, `continent` is fixed, so this doesn't add any more groups, but it's an easy way to carry an extra variable along for the ride.)\n\nThis creates a data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames (or tibbles, to be precise). This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea.\n\nThe `data` column is a little tricky to look at because it's a moderately complicated list, and we're still working on good tools to explore these objects. Unfortunately using `str()` is not recommended as it will often produce very long output. But if you pluck out a single element from the `data` column you'll see that it contains all the data for that country (in this case, Afghanistan).\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-9_8d4fea0867420a2fefe97b6580e4e5be'}\n\n```{.r .cell-code}\nby_country$data[[1]]\n#> # A tibble: 12 × 4\n#> year lifeExp pop gdpPercap\n#> \n#> 1 1952 28.8 8425333 779.\n#> 2 1957 30.3 9240934 821.\n#> 3 1962 32.0 10267083 853.\n#> 4 1967 34.0 11537966 836.\n#> 5 1972 36.1 13079460 740.\n#> 6 1977 38.4 14880372 786.\n#> 7 1982 39.9 12881816 978.\n#> 8 1987 40.8 13867957 852.\n#> 9 1992 41.7 16317921 649.\n#> 10 1997 41.8 22227415 635.\n#> 11 2002 42.1 25268405 727.\n#> 12 2007 43.8 31889923 975.\n```\n:::\n\n\nNote the difference between a standard grouped data frame and a nested data frame: in a grouped data frame, each row is an observation; in a nested data frame, each row is a group. Another way to think about a nested dataset is we now have a meta-observation: a row that represents the complete time course for a country, rather than a single point in time.\n\n### List-columns\n\nNow that we have our nested data frame, we're in a good position to fit some models. We have a model-fitting function:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-10_64a76b96fa6c7b356edac37e78cd41e0'}\n\n```{.r .cell-code}\ncountry_model <- function(df) {\n lm(lifeExp ~ year, data = df)\n}\n```\n:::\n\n\nAnd we want to apply it to every data frame. 
The data frames are in a list, so we can use `purrr::map()` to apply `country_model` to each element:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-11_bc3c5e98ba55e9a5ca853844c95293af'}\n\n```{.r .cell-code}\nmodels <- map(by_country$data, country_model)\n```\n:::\n\n\nHowever, rather than leaving the list of models as a free-floating object, I think it's better to store it as a column in the `by_country` data frame. Storing related objects in columns is a key part of the value of data frames, and why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?\n\nIn other words, instead of creating a new object in the global environment, we're going to create a new variable in the `by_country` data frame. That's a job for `dplyr::mutate()`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-12_290ff8735e10dce4997cd3751707c272'}\n\n```{.r .cell-code}\nby_country <- by_country %>% \n mutate(model = map(data, country_model))\nby_country\n#> # A tibble: 142 × 4\n#> # Groups: country, continent [142]\n#> country continent data model \n#> \n#> 1 Afghanistan Asia \n#> 2 Albania Europe \n#> 3 Algeria Africa \n#> 4 Angola Africa \n#> 5 Argentina Americas \n#> 6 Australia Oceania \n#> 7 Austria Europe \n#> 8 Bahrain Asia \n#> 9 Bangladesh Asia \n#> 10 Belgium Europe \n#> # … with 132 more rows\n```\n:::\n\n\nThis has a big advantage: because all the related objects are stored together, you don't need to manually keep them in sync when you filter or arrange. The semantics of the data frame takes care of that for you:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-13_eb34a6390d7d26b87d93852a737b52df'}\n\n```{.r .cell-code}\nby_country %>% \n filter(continent == \"Europe\")\n#> # A tibble: 30 × 4\n#> # Groups: country, continent [30]\n#> country continent data model \n#> \n#> 1 Albania Europe \n#> 2 Austria Europe \n#> 3 Belgium Europe \n#> 4 Bosnia and Herzegovina Europe \n#> 5 Bulgaria Europe \n#> 6 Croatia Europe \n#> 7 Czech Republic Europe \n#> 8 Denmark Europe \n#> 9 Finland Europe \n#> 10 France Europe \n#> # … with 20 more rows\nby_country %>% \n arrange(continent, country)\n#> # A tibble: 142 × 4\n#> # Groups: country, continent [142]\n#> country continent data model \n#> \n#> 1 Algeria Africa \n#> 2 Angola Africa \n#> 3 Benin Africa \n#> 4 Botswana Africa \n#> 5 Burkina Faso Africa \n#> 6 Burundi Africa \n#> 7 Cameroon Africa \n#> 8 Central African Republic Africa \n#> 9 Chad Africa \n#> 10 Comoros Africa \n#> # … with 132 more rows\n```\n:::\n\n\nIf your list of data frames and list of models were separate objects, you have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others in order to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!\n\n### Unnesting\n\nPreviously we computed the residuals of a single model with a single dataset. Now we have 142 data frames and 142 models. 
To compute the residuals, we need to call `augment()` with each model pair:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-14_9ac5aa978a8091b63a9a551404452704'}\n\n```{.r .cell-code}\nby_country <- by_country %>% \n mutate(\n results = map(model, augment)\n )\nby_country\n#> # A tibble: 142 × 5\n#> # Groups: country, continent [142]\n#> country continent data model results \n#> \n#> 1 Afghanistan Asia \n#> 2 Albania Europe \n#> 3 Algeria Africa \n#> 4 Angola Africa \n#> 5 Argentina Americas \n#> 6 Australia Oceania \n#> 7 Austria Europe \n#> 8 Bahrain Asia \n#> 9 Bangladesh Asia \n#> 10 Belgium Europe \n#> # … with 132 more rows\n```\n:::\n\n\nBut how can you plot a list of data frames? Instead of struggling to answer that question, let's turn the list of data frames back into a regular data frame. Previously we used `nest()` to turn a regular data frame into a nested data frame, and now we do the opposite with `unnest()`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-15_dac9af647026b33256698116cb78dd11'}\n\n```{.r .cell-code}\nresids <- unnest(by_country, results)\nresids\n#> # A tibble: 1,704 × 12\n#> # Groups: country, continent [142]\n#> country conti…¹ data model lifeExp year .fitted .resid .hat .sigma\n#> \n#> 1 Afghanist… Asia 28.8 1952 29.9 -1.11 0.295 1.21\n#> 2 Afghanist… Asia 30.3 1957 31.3 -0.952 0.225 1.24\n#> 3 Afghanist… Asia 32.0 1962 32.7 -0.664 0.169 1.27\n#> 4 Afghanist… Asia 34.0 1967 34.0 -0.0172 0.127 1.29\n#> 5 Afghanist… Asia 36.1 1972 35.4 0.674 0.0991 1.27\n#> 6 Afghanist… Asia 38.4 1977 36.8 1.65 0.0851 1.15\n#> 7 Afghanist… Asia 39.9 1982 38.2 1.69 0.0851 1.15\n#> 8 Afghanist… Asia 40.8 1987 39.5 1.28 0.0991 1.21\n#> 9 Afghanist… Asia 41.7 1992 40.9 0.754 0.127 1.26\n#> 10 Afghanist… Asia 41.8 1997 42.3 -0.534 0.169 1.27\n#> # … with 1,694 more rows, 2 more variables: .cooksd , .std.resid ,\n#> # and abbreviated variable name ¹​continent\n```\n:::\n\n\nNote that each regular column is repeated once for each row of the nested tibble.\n\nNow that we have a regular data frame, we can plot the residuals:\n\n\n::: {.cell layout-align=\"center\" hash='cache/resid-col-country_2c4f5aac18cc015b75b9190ec1728554'}\n\n```{.r .cell-code}\nresids %>% \n ggplot(aes(year, .resid)) +\n geom_line(aes(group = country), alpha = 1 / 3) + \n geom_smooth(se = FALSE)\n#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs =\n#> \"cs\")'\n```\n\n::: {.cell-output-display}\n![](figs/resid-col-country-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nFacetting by continent is particularly revealing:\n \n\n::: {.cell layout-align=\"center\" hash='cache/resid-col-country-facetting_14d0637611062dbe6c16304bcf2c5b9a'}\n\n```{.r .cell-code}\nresids %>% \n ggplot(aes(year, .resid, group = country)) +\n geom_line(alpha = 1 / 3) + \n facet_wrap(~continent)\n```\n\n::: {.cell-output-display}\n![](figs/resid-col-country-facetting-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nIt looks like we've missed some mild patterns. There's also something interesting going on in Africa: we see some very large residuals, which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.\n\n### Model quality\n\nInstead of looking at the residuals from the model, we could look at some general measurements of model quality. You learned how to compute some specific measures in the previous chapter. 
Here we'll show a different approach using the broom package and we'll use `broom::glance()` to extract some model quality metrics. If we apply it to a model, we get a data frame with a single row:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-18_29fbf5cd328907283445f10a58d7920e'}\n\n```{.r .cell-code}\nbroom::glance(nz_mod)\n#> # A tibble: 1 × 12\n#> r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵\n#> \n#> 1 0.954 0.949 0.804 205. 5.41e-8 1 -13.3 32.6 34.1 6.47 10\n#> # … with 1 more variable: nobs , and abbreviated variable names\n#> # ¹​r.squared, ²​adj.r.squared, ³​statistic, ⁴​deviance, ⁵​df.residual\n```\n:::\n\n\nWe can use `mutate()` and `unnest()` to create a data frame with a row for each country:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-19_f0d129ad2719cc5c9d7ebe4ebe270d50'}\n\n```{.r .cell-code}\nby_country %>% \n mutate(glance = map(model, broom::glance)) %>% \n unnest(glance)\n#> # A tibble: 142 × 17\n#> # Groups: country, continent [142]\n#> country continent data model results r.squa…¹ adj.r…² sigma stati…³\n#> \n#> 1 Afghanistan Asia 0.948 0.942 1.22 181. \n#> 2 Albania Europe 0.911 0.902 1.98 102. \n#> 3 Algeria Africa 0.985 0.984 1.32 662. \n#> 4 Angola Africa 0.888 0.877 1.41 79.1\n#> 5 Argentina Americas 0.996 0.995 0.292 2246. \n#> 6 Australia Oceania 0.980 0.978 0.621 481. \n#> 7 Austria Europe 0.992 0.991 0.407 1261. \n#> 8 Bahrain Asia 0.967 0.963 1.64 291. \n#> 9 Bangladesh Asia 0.989 0.988 0.977 930. \n#> 10 Belgium Europe 0.995 0.994 0.293 1822. \n#> # … with 132 more rows, 8 more variables: p.value , df ,\n#> # logLik , AIC , BIC , deviance , df.residual ,\n#> # nobs , and abbreviated variable names ¹​r.squared, ²​adj.r.squared,\n#> # ³​statistic\n```\n:::\n\n\nThis isn't quite the output we want, because it still includes all the list columns. This is default behaviour when `unnest()` works on single row data frames. To suppress these columns we use `select()`:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-20_d27b527c4581198609b7d833d613ff9f'}\n\n```{.r .cell-code}\nglance <- by_country %>% \n mutate(glance = map(model, broom::glance)) %>% \n unnest(glance) %>% \n select(-data, -model, -results)\nglance\n#> # A tibble: 142 × 14\n#> # Groups: country, continent [142]\n#> country conti…¹ r.squ…² adj.r…³ sigma stati…⁴ p.value df logLik AIC\n#> \n#> 1 Afghanistan Asia 0.948 0.942 1.22 181. 9.84e- 8 1 -18.3 42.7 \n#> 2 Albania Europe 0.911 0.902 1.98 102. 1.46e- 6 1 -24.1 54.3 \n#> 3 Algeria Africa 0.985 0.984 1.32 662. 1.81e-10 1 -19.3 44.6 \n#> 4 Angola Africa 0.888 0.877 1.41 79.1 4.59e- 6 1 -20.0 46.1 \n#> 5 Argentina Americ… 0.996 0.995 0.292 2246. 4.22e-13 1 -1.17 8.35\n#> 6 Australia Oceania 0.980 0.978 0.621 481. 8.67e-10 1 -10.2 26.4 \n#> 7 Austria Europe 0.992 0.991 0.407 1261. 7.44e-12 1 -5.16 16.3 \n#> 8 Bahrain Asia 0.967 0.963 1.64 291. 1.02e- 8 1 -21.9 49.7 \n#> 9 Bangladesh Asia 0.989 0.988 0.977 930. 3.37e-11 1 -15.7 37.3 \n#> 10 Belgium Europe 0.995 0.994 0.293 1822. 
1.20e-12 1 -1.20 8.40\n#> # … with 132 more rows, 4 more variables: BIC , deviance ,\n#> # df.residual , nobs , and abbreviated variable names ¹​continent,\n#> # ²​r.squared, ³​adj.r.squared, ⁴​statistic\n```\n:::\n\n\n(Pay attention to the variables that aren't printed: there's a lot of useful stuff there.)\n\nWith this data frame in hand, we can start to look for models that don't fit well:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-21_d83a1d55ca945dbc465db816ec04bfbf'}\n\n```{.r .cell-code}\nglance %>% \n arrange(r.squared)\n#> # A tibble: 142 × 14\n#> # Groups: country, continent [142]\n#> country conti…¹ r.squ…² adj.r.…³ sigma stati…⁴ p.value df logLik AIC\n#> \n#> 1 Rwanda Africa 0.0172 -0.0811 6.56 0.175 0.685 1 -38.5 83.0\n#> 2 Botswana Africa 0.0340 -0.0626 6.11 0.352 0.566 1 -37.7 81.3\n#> 3 Zimbabwe Africa 0.0562 -0.0381 7.21 0.596 0.458 1 -39.6 85.3\n#> 4 Zambia Africa 0.0598 -0.0342 4.53 0.636 0.444 1 -34.1 74.1\n#> 5 Swaziland Africa 0.0682 -0.0250 6.64 0.732 0.412 1 -38.7 83.3\n#> 6 Lesotho Africa 0.0849 -0.00666 5.93 0.927 0.358 1 -37.3 80.6\n#> 7 Cote d'Ivo… Africa 0.283 0.212 3.93 3.95 0.0748 1 -32.3 70.7\n#> 8 South Afri… Africa 0.312 0.244 4.74 4.54 0.0588 1 -34.6 75.2\n#> 9 Uganda Africa 0.342 0.276 3.19 5.20 0.0457 1 -29.8 65.7\n#> 10 Congo, Dem… Africa 0.348 0.283 2.43 5.34 0.0434 1 -26.6 59.2\n#> # … with 132 more rows, 4 more variables: BIC , deviance ,\n#> # df.residual , nobs , and abbreviated variable names ¹​continent,\n#> # ²​r.squared, ³​adj.r.squared, ⁴​statistic\n```\n:::\n\n\nThe worst models all appear to be in Africa. Let's double check that with a plot. Here we have a relatively small number of observations and a discrete variable, so `geom_jitter()` is effective:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-22_b17e73e36f9d4742520ff132c9efdb12'}\n\n```{.r .cell-code}\nglance %>% \n ggplot(aes(continent, r.squared)) + \n geom_jitter(width = 0.5)\n```\n\n::: {.cell-output-display}\n![](figs/unnamed-chunk-22-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nWe could pull out the countries with particularly bad $R^2$ and plot the data:\n\n\n::: {.cell layout-align=\"center\" hash='cache/bad-fit_13c41bcbbbf2d973e20cd72d72d0a3c5'}\n\n```{.r .cell-code}\nbad_fit <- filter(glance, r.squared < 0.25)\n\ngapminder %>% \n semi_join(bad_fit, by = \"country\") %>% \n ggplot(aes(year, lifeExp, colour = country)) +\n geom_line()\n```\n\n::: {.cell-output-display}\n![](figs/bad-fit-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\nWe see two main effects here: the tragedies of the HIV/AIDS epidemic and the Rwandan genocide.\n\n## List-columns\n\nNow that you've seen a basic workflow for managing many models, let's dive back into some of the details. In this section, we'll explore the list-column data structure in a little more detail. It's only recently that I've really appreciated the idea of the list-column. List-columns are implicit in the definition of the data frame: a data frame is a named list of equal length vectors. A list is a vector, so it's always been legitimate to use a list as a column of a data frame. 
However, base R doesn't make it easy to create list-columns, and `data.frame()` treats a list as a list of columns:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-24_97763c91d39b29a3f52f57ee8179a474'}\n\n```{.r .cell-code}\ndata.frame(x = list(1:3, 3:5))\n#> x.1.3 x.3.5\n#> 1 1 3\n#> 2 2 4\n#> 3 3 5\n```\n:::\n\n\nYou can prevent `data.frame()` from doing this with `I()`, but the result doesn't print particularly well:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-25_fb3451cc239058315af5f7b64281bb29'}\n\n```{.r .cell-code}\ndata.frame(\n x = I(list(1:3, 3:5)), \n y = c(\"1, 2\", \"3, 4, 5\")\n)\n#> x y\n#> 1 1, 2, 3 1, 2\n#> 2 3, 4, 5 3, 4, 5\n```\n:::\n\n\nTibble alleviates this problem by being lazier (`tibble()` doesn't modify its inputs) and by providing a better print method:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-26_29701685194683f636ee283a989410c2'}\n\n```{.r .cell-code}\ntibble(\n x = list(1:3, 3:5), \n y = c(\"1, 2\", \"3, 4, 5\")\n)\n#> # A tibble: 2 × 2\n#> x y \n#> \n#> 1 1, 2 \n#> 2 3, 4, 5\n```\n:::\n\n\nIt's even easier with `tribble()` as it can automatically work out that you need a list:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-27_52dbc2d077eb5bf7e84d8cd78ab0fe23'}\n\n```{.r .cell-code}\ntribble(\n ~x, ~y,\n 1:3, \"1, 2\",\n 3:5, \"3, 4, 5\"\n)\n#> # A tibble: 2 × 2\n#> x y \n#> \n#> 1 1, 2 \n#> 2 3, 4, 5\n```\n:::\n\n\nList-columns are often most useful as an intermediate data structure. They're hard to work with directly, because most R functions work with atomic vectors or data frames, but the advantage of keeping related items together in a data frame is worth a little hassle.\n\nGenerally there are three parts of an effective list-column pipeline:\n\n1. You create the list-column using one of `nest()`, `summarise()` + `list()`,\n or `mutate()` + a map function, as described in [Creating list-columns].\n\n1. You create other intermediate list-columns by transforming existing\n list columns with `map()`, `map2()` or `pmap()`. For example, \n in the case study above, we created a list-column of models by transforming\n a list-column of data frames.\n \n1. You simplify the list-column back down to a data frame or atomic vector,\n as described in [Simplifying list-columns].\n\n## Creating list-columns\n\nTypically, you won't create list-columns with `tibble()`. Instead, you'll create them from regular columns, using one of three methods: \n\n1. With `tidyr::nest()` to convert a grouped data frame into a nested data \n frame where you have a list-column of data frames.\n \n1. With `mutate()` and vectorised functions that return a list.\n\n1. With `summarise()` and summary functions that return multiple results. \n\nAlternatively, you might create them from a named list, using `tibble::enframe()`.\n\nGenerally, when creating list-columns, you should make sure they're homogeneous: each element should contain the same type of thing. There are no checks to make sure this is true, but if you use purrr and remember what you've learned about type-stable functions, you should find it happens naturally.\n\n### With nesting\n\n`nest()` creates a nested data frame, which is a data frame with a list-column of data frames. 
In a nested data frame each row is a meta-observation: the other columns give variables that define the observation (like country and continent above), and the list-column of data frames gives the individual observations that make up the meta-observation.\n\nThere are two ways to use `nest()`. So far you've seen how to use it with a grouped data frame. When applied to a grouped data frame, `nest()` keeps the grouping columns as is, and bundles everything else into the list-column:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-28_f11a57ba961e752149808dbd5ddaf183'}\n\n```{.r .cell-code}\ngapminder %>% \n group_by(country, continent) %>% \n nest()\n#> # A tibble: 142 × 3\n#> # Groups: country, continent [142]\n#> country continent data \n#> \n#> 1 Afghanistan Asia \n#> 2 Albania Europe \n#> 3 Algeria Africa \n#> 4 Angola Africa \n#> 5 Argentina Americas \n#> 6 Australia Oceania \n#> 7 Austria Europe \n#> 8 Bahrain Asia \n#> 9 Bangladesh Asia \n#> 10 Belgium Europe \n#> # … with 132 more rows\n```\n:::\n\n\nYou can also use it on an ungrouped data frame, specifying which columns you want to nest:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-29_ad45d8ba48d66c90e9ccb0470adb9866'}\n\n```{.r .cell-code}\ngapminder %>% \n nest(data = c(year:gdpPercap))\n#> # A tibble: 142 × 3\n#> country continent data \n#> \n#> 1 Afghanistan Asia \n#> 2 Albania Europe \n#> 3 Algeria Africa \n#> 4 Angola Africa \n#> 5 Argentina Americas \n#> 6 Australia Oceania \n#> 7 Austria Europe \n#> 8 Bahrain Asia \n#> 9 Bangladesh Asia \n#> 10 Belgium Europe \n#> # … with 132 more rows\n```\n:::\n\n\n### From vectorised functions\n\nSome useful functions take an atomic vector and return a list. For example, in [strings] you learned about `stringr::str_split()` which takes a character vector and returns a list of character vectors. If you use that inside mutate, you'll get a list-column:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-30_6772ea994865952a8e7fd26904022394'}\n\n```{.r .cell-code}\ndf <- tribble(\n ~x1,\n \"a,b,c\", \n \"d,e,f,g\"\n) \n\ndf %>% \n mutate(x2 = stringr::str_split(x1, \",\"))\n#> # A tibble: 2 × 2\n#> x1 x2 \n#> \n#> 1 a,b,c \n#> 2 d,e,f,g \n```\n:::\n\n\n`unnest()` knows how to handle these lists of vectors:\n\n\n::: {.cell layout-align=\"center\" hash='cache/unnamed-chunk-31_5383a0d2ba7df5eff168bb876a5d5ea8'}\n\n```{.r .cell-code}\ndf %>% \n mutate(x2 = stringr::str_split(x1, \",\")) %>% \n unnest(x2)\n#> # A tibble: 7 × 2\n#> x1 x2 \n#> \n#> 1 a,b,c a \n#> 2 a,b,c b \n#> 3 a,b,c c \n#> 4 d,e,f,g d \n#> 5 d,e,f,g e \n#> 6 d,e,f,g f \n#> 7 d,e,f,g g\n```\n:::\n\n\n(If you find yourself using this pattern a lot, make sure to check out `tidyr::separate_rows()` which is a wrapper around this common pattern).\n\n### From multivalued summaries\n\nOne restriction of `summarise()` is that it only works with summary functions that return a single value. 
### From multivalued summaries

One restriction of `summarise()` is that it only works well with summary functions that return a single value. That means that functions like `quantile()`, which return a vector of arbitrary length, don't fit cleanly: the summary gets spread across multiple rows per group, as shown below:

```{r}
mtcars %>% 
  group_by(cyl) %>% 
  summarise(q = quantile(mpg), .groups = "drop")
#> # A tibble: 15 × 2
#>      cyl     q
#>    <dbl> <dbl>
#>  1     4  21.4
#>  2     4  22.8
#>  3     4  26  
#>  4     4  30.4
#>  5     4  33.9
#>  6     6  17.8
#>  7     6  18.6
#>  8     6  19.7
#>  9     6  21  
#> 10     6  21.4
#> 11     8  10.4
#> 12     8  14.4
#> 13     8  15.2
#> 14     8  16.2
#> 15     8  19.2
```

You can, however, wrap the result in a list! This obeys the contract of `summarise()`, because each summary is now a list (a vector) of length 1.

```{r}
mtcars %>% 
  group_by(cyl) %>% 
  summarise(q = list(quantile(mpg)), .groups = "drop")
#> # A tibble: 3 × 2
#>     cyl q        
#>   <dbl> <list>   
#> 1     4 <dbl [5]>
#> 2     6 <dbl [5]>
#> 3     8 <dbl [5]>
```

To make useful results with `unnest()`, you'll also need to capture the probabilities:

```{r}
probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)
mtcars %>% 
  group_by(cyl) %>% 
  summarise(p = list(probs), q = list(quantile(mpg, probs)), .groups = "drop") %>% 
  unnest(c(p, q))
#> # A tibble: 15 × 3
#>      cyl     p     q
#>    <dbl> <dbl> <dbl>
#>  1     4  0.01  21.4
#>  2     4  0.25  22.8
#>  3     4  0.5   26  
#>  4     4  0.75  30.4
#>  5     4  0.99  33.8
#>  6     6  0.01  17.8
#>  7     6  0.25  18.6
#>  8     6  0.5   19.7
#>  9     6  0.75  21  
#> 10     6  0.99  21.4
#> 11     8  0.01  10.4
#> 12     8  0.25  14.4
#> 13     8  0.5   15.2
#> 14     8  0.75  16.2
#> 15     8  0.99  19.1
```

### From a named list

Data frames with list-columns provide a solution to a common problem: what do you do if you want to iterate over both the contents of a list and its names? Instead of trying to jam everything into one object, it's often easier to make a data frame: one column can contain the names, and one column can contain the elements of the list. An easy way to create such a data frame from a list is `tibble::enframe()`:

```{r}
x <- list(
  a = 1:5,
  b = 3:4, 
  c = 5:6
) 

df <- enframe(x)
df
#> # A tibble: 3 × 2
#>   name  value    
#>   <chr> <list>   
#> 1 a     <int [5]>
#> 2 b     <int [2]>
#> 3 c     <int [2]>
```

The advantage of this structure is that it generalises in a straightforward way: names are useful if you have a character vector of metadata, but don't help if you have other types of data, or multiple vectors.

Now if you want to iterate over names and values in parallel, you can use `map2()`:

```{r}
df %>% 
  mutate(
    smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1]))
  )
#> # A tibble: 3 × 3
#>   name  value     smry 
#>   <chr> <list>    <chr>
#> 1 a     <int [5]> a: 1 
#> 2 b     <int [2]> b: 3 
#> 3 c     <int [2]> c: 5
```
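If you would rather see those values as ordinary rows, the same `unnest()` idea applies here too. This is just a small illustrative aside using the `df` created above:

```{r}
# Each element of value is expanded into its own row, with name repeated
df %>% 
  unnest(value)
```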
## Simplifying list-columns

To apply the techniques of data manipulation and visualisation, you'll need to simplify the list-column back to a regular column (an atomic vector), or a set of columns. The technique you'll use to collapse back down to a simpler structure depends on whether you want a single value per element, or multiple values:

1. If you want a single value, use `mutate()` with `map_lgl()`, 
   `map_int()`, `map_dbl()`, and `map_chr()` to create an atomic vector.
   
1. If you want many values, use `unnest()` to convert list-columns back
   to regular columns, repeating the rows as many times as necessary.

These are described in more detail below.

### List to vector

If you can reduce your list-column to an atomic vector then it will be a regular column. For example, you can always summarise an object with its type and length, so this code will work regardless of what sort of list-column you have:

```{r}
df <- tribble(
  ~x,
  letters[1:5],
  1:3,
  runif(5)
)
 
df %>% mutate(
  type = map_chr(x, typeof),
  length = map_int(x, length)
)
#> # A tibble: 3 × 3
#>   x         type      length
#>   <list>    <chr>      <int>
#> 1 <chr [5]> character      5
#> 2 <int [3]> integer        3
#> 3 <dbl [5]> double         5
```

This is the same basic information that you get from the default tbl print method, but now you can use it for filtering. This is a useful technique if you have a heterogeneous list and want to filter out the parts that aren't working for you.

Don't forget about the `map_*()` shortcuts: you can use `map_chr(x, "apple")` to extract the string stored in `apple` for each element of `x`. This is useful for pulling apart nested lists into regular columns. Use the `.null` argument to provide a value to use if the element is missing (instead of returning `NULL`):

```{r}
df <- tribble(
  ~x,
  list(a = 1, b = 2),
  list(a = 2, c = 4)
)
df %>% mutate(
  a = map_dbl(x, "a"),
  b = map_dbl(x, "b", .null = NA_real_)
)
#> # A tibble: 2 × 3
#>   x                    a     b
#>   <list>           <dbl> <dbl>
#> 1 <named list [2]>     1     2
#> 2 <named list [2]>     2    NA
```

### Unnesting

`unnest()` works by repeating the regular columns once for each element of the list-column. For example, in the following very simple example we repeat the first row four times (because the first element of `y` has length four), and the second row once:

```{r}
tibble(x = 1:2, y = list(1:4, 1)) %>% unnest(y)
#> # A tibble: 5 × 2
#>       x     y
#>   <int> <dbl>
#> 1     1     1
#> 2     1     2
#> 3     1     3
#> 4     1     4
#> 5     2     1
```

This means that you can't simultaneously unnest two columns whose elements have incompatible lengths in the same row; the lengths must match, although length-one elements are recycled:

```{r}
# Ok, because y and z have the same number of elements in
# every row
df1 <- tribble(
  ~x, ~y,           ~z,
   1, c("a", "b"), 1:2,
   2, "c",           3
)
df1
#> # A tibble: 2 × 3
#>       x y         z        
#>   <dbl> <list>    <list>   
#> 1     1 <chr [2]> <int [2]>
#> 2     2 <chr [1]> <dbl [1]>
df1 %>% unnest(c(y, z))
#> # A tibble: 3 × 3
#>       x y         z
#>   <dbl> <chr> <dbl>
#> 1     1 a         1
#> 2     1 b         2
#> 3     2 c         3

# y and z have different numbers of elements in each row, so the
# length-one elements get recycled to match the longer one
df2 <- tribble(
  ~x, ~y,           ~z,
   1, "a",         1:2, 
   2, c("b", "c"),   3
)
df2
#> # A tibble: 2 × 3
#>       x y         z        
#>   <dbl> <list>    <list>   
#> 1     1 <chr [1]> <int [2]>
#> 2     2 <chr [2]> <dbl [1]>
df2 %>% unnest(c(y, z))
#> # A tibble: 4 × 3
#>       x y         z
#>   <dbl> <chr> <dbl>
#> 1     1 a         1
#> 2     1 a         2
#> 3     2 b         3
#> 4     2 c         3
```

The same principle applies when unnesting list-columns of data frames. You can unnest multiple list-columns as long as all the data frames in each row have the same number of rows.
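As a quick sanity check, and assuming the `by_country` nested data frame created earlier is still in your session, unnesting its `data` column simply stacks the per-country tibbles back into one row per country and year:

```{r}
# Round trip: nest() followed by unnest() recovers the original observations
by_country %>% 
  unnest(data)
```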
## Making tidy data with broom

The broom package provides three general tools for turning models into tidy data frames:

1. `broom::glance(model)` returns a row for each model. Each column gives a 
   model summary: either a measure of model quality or complexity, or a 
   combination of the two.
   
1. `broom::tidy(model)` returns a row for each coefficient in the model. Each 
   column gives information about the estimate or its variability.
   
1. `broom::augment(model, data)` returns a row for each row in `data`, adding
   extra values like residuals and influence statistics.
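To see what the first two functions return in practice, here is a brief sketch applied to `nz_mod`, the single-country linear model fit earlier in the article; any `lm` fit would work the same way, and `augment()` was already shown above:

```{r}
# One row summarizing the whole New Zealand model (R^2, sigma, AIC, ...)
glance(nz_mod)

# One row per coefficient: the intercept and the year slope
tidy(nz_mod)
```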