
# Explore the data {#eda}

Before starting with the actual modeling it is crucial to get to know the data and to bring it into the correct format. This process of getting familiar with the data is known as Exploratory Data Analysis (EDA). Many packages are used for this [@tidyverse; @tidymodels; @viridis; @ggtext; @viridisLite; @patchwork; @visdat; @lubridate; @latexplots; @ggally; @plotly]. The most important ones are loaded below.

```{r load_packages, message=FALSE, warning=FALSE}
library(tidyverse) # general data handling tools
library(tidymodels) # data modeling and preprocessing
# color palettes
library(viridis)
library(viridisLite)
library(patchwork) # composing of ggplots
```

Before the fun can begin, here is a quick outline of the steps performed on each data set.

-   A general overview of the classes of the features and a visualization to detect any missing values.
-   The distribution and the main effect on the outcome variable of each feature.
-   The pairwise relationships of the predictors.
-   Optionally some feature engineering.
-   Using the gained knowledge to fix all pre-processing and feature engineering steps by using a `recipes::recipe`.

If one is mainly interested in the models themselves one can just have a look at the recipes and skip the rest of this section.

Some might ask why no interactions between predictors are covered in this EDA. For a standard OLS, lasso or ridge regression it would be very important to have a look at them, but the focus here is on tree-based gradient boosting, which already includes interactions as long as the trees are not restricted to regression stumps.

It should also be mentioned that, especially during the EDA, most of the plotting code is not shown in order to support a better reading flow. The code for any of these visualizations, as well as the full bookdown project, can be found in this [GitHub repository](https://github.com/EmanuelSommer/boosting_methods){.uri}. For example, the code for the EDA is in the `03-eda.Rmd` file and the code for the modeling is in the `04-modeling.Rmd` file. With this said, let's start with the first data set!

## Burnout data

The data comes from the machine learning challenge *HackerEarth Machine Learning Challenge: Are your employees burning out?* and can be downloaded here: <https://www.kaggle.com/blurredmachine/are-your-employees-burning-out?select=train.csv>

```{r loadBurn, message=FALSE}
# load the data
burnout_data <- read_csv("_data/burn_out_train.csv")
# convert colnames to snake_case
colnames(burnout_data) <- tolower(
  stringr::str_replace_all(
    colnames(burnout_data),
    " ",
    "_"
  ))
# omit missing values in the outcome variable
burnout_data <- burnout_data[!is.na(burnout_data$burn_rate),]
```

### Train-test split

To prevent information leakage, the train-test split is performed at the very start of the whole analysis.

```{r traintest}
set.seed(2)
burnout_split <- rsample::initial_split(burnout_data, prop = 0.80)
burnout_train <- rsample::training(burnout_split)
burnout_test  <- rsample::testing(burnout_split)
```

```{r, include=FALSE}
remove(burnout_data)
remove(burnout_split)
```

The training data set contains `r nrow(burnout_train)` rows and `r ncol(burnout_train)` variables.

The test data set contains `r nrow(burnout_test)` observations and naturally also `r ncol(burnout_test)` variables.

### Quick general overview

First look at the classes of the variables.

```{r, echo=FALSE}
col_classes_burn <- burnout_train %>%
  summarise_all(class)
knitr::kable(
  data.frame(column = colnames(burnout_train),
             class = as.character(col_classes_burn[1, ]))
)
```

The following visualization of the whole data set helps to detect missing values.

```{r visdat, echo=FALSE}
visdat::vis_dat(burnout_train)
```

```{r}
# share of rows with at least one missing value in the training data set
mean(rowSums(is.na(burnout_train)) > 0)
```

As XGBoost can handle missing values, this is not a concern for the modeling. One could of course still think about imputation or even removal.
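The row-wise share above does not show which columns are affected. A minimal sketch of a per-column breakdown follows. It uses a small made-up data frame (`toy_burnout` and all its values are purely illustrative and not taken from the data).

```{r}
# Toy data for illustration only; the values are made up.
toy_burnout <- data.frame(
  designation = c(1, 2, NA, 3),
  resource_allocation = c(NA, 4, NA, 6),
  mental_fatigue_score = c(2.5, NA, 5.1, 7.0)
)

# Share of missing values per column
toy_na_per_column <- colMeans(is.na(toy_burnout))
toy_na_per_column

# Share of rows with at least one missing value
toy_na_rows <- mean(rowSums(is.na(toy_burnout)) > 0)
toy_na_rows
```

Applied to the training data, the same idea would read `colMeans(is.na(burnout_train))`.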

### What about the outcome variable?

`burn_rate`: The rate of burnout of each employee, which should lie in $[0,1]$. The greater the score, the worse the burnout (0 means no burnout at all). As the variable is continuous, this is a regression task. Yet it is bounded, which has to be dealt with when predicting.

The summary below shows that the full range is covered and that no invalid values are in the data.

```{r, echo=FALSE}
summary(burnout_train$burn_rate)
```

Now the distribution of the outcome.

```{r , echo=FALSE, warning=FALSE, message=FALSE}
burn_rate_box <- ggplot(burnout_train, aes(x = 1, y = burn_rate)) +
  geom_jitter(alpha = 0.05, col = plasma(1)) +
  geom_boxplot(col = "black", size = .8, fill = NA) +
  labs(x = "", y = "Burn rate") +
  coord_flip() +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank())


burn_rate_hist <- ggplot(burnout_train, aes(x = burn_rate)) +
  geom_histogram(aes(y = after_stat(density)),
                 fill = plasma(1), binwidth = 0.05) +
  stat_function(fun = dnorm, args = 
                  list(
                    mean = mean(burnout_train$burn_rate),
                    sd = sd(burnout_train$burn_rate)),
                col = plasma(3)[2],
                size = 1
    ) +
  annotate("text",
           x = 0.9, y = 1.6,
           label = latex2exp::TeX("$N(\\hat{\\mu}, \\hat{\\sigma}^2)$") ,
           col = plasma(3)[2], size = 5) +
  labs(x = "", y = "", subtitle = "binwidth = 0.05",
       title = "Outcome variable: **Burn Rate**") +
  geom_line(
    data = data.frame(
      x = c(0.75, 0.9),
      y = c(0.8, 1.3)
    ),
    aes(x = x, y = y),
    inherit.aes = FALSE,
    col = plasma(3)[2],
    size = .5) +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))
burn_rate_raw <- burn_rate_hist / burn_rate_box
burn_rate_raw
ggsave("_pictures/burn_rate_raw.png", plot = burn_rate_raw)
remove(burn_rate_box, burn_rate_raw)
remove(burn_rate_hist)
```

The distribution of the outcome is very symmetrical and bell shaped around 0.5, and the whole defined region $[0,1]$ is covered quite well. Overlaying a normal distribution with the sample mean $\hat{\mu}$ and the sample standard deviation $\hat{\sigma}$ as parameters shows that the outcome follows a normal distribution almost perfectly. One could further draw a Q-Q plot to visualize the normality. **BUT** of course the domain here is bounded, while the normal distribution has the whole of $\mathbb{R}$ as its domain. This bounded domain does not interfere with the boosted model, as tree-based models do not impose any distributional assumptions on the target variable. Nevertheless one can transform the outcome with the empirical logit $\log\left(\frac{y_i+0.5}{1-y_i+0.5}\right)$. By doing this one moves the target off the $[0,1]$ scale: the observed range is mapped onto $[-\log 3, \log 3]$, and any real-valued prediction $z$ on the transformed scale can be re-transformed in the end by applying $\frac{2}{\exp(-z)+1}-0.5$. To see whether this transformation changes the behavior of the boosting model or improves it, both the untransformed target `burn_rate` and the transformed one `burn_rate_trans` will be considered. The focus will be on the untransformed target, as the low pre-processing requirements of such boosting models should be emphasized. Below is the distribution of the transformed outcome.

```{r}
# Add transformed outcome
burnout_train$burn_rate_trans <- log((burnout_train$burn_rate + 0.5) /
                                       (1.5 - burnout_train$burn_rate))
burnout_test$burn_rate_trans <- log((burnout_test$burn_rate + 0.5) /
                                       (1.5 - burnout_test$burn_rate))
```
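As a quick sanity check, the back-transformation stated in the text can be verified against the forward transformation with a round trip. This is only a sketch: the helper names `burn_rate_to_trans` and `burn_rate_from_trans` are hypothetical and do not appear elsewhere in the project.

```{r}
# Forward transform as used above: log((y + 0.5) / (1.5 - y))
burn_rate_to_trans <- function(y) log((y + 0.5) / (1.5 - y))

# Back-transform stated in the text: 2 / (exp(-z) + 1) - 0.5
burn_rate_from_trans <- function(z) 2 / (exp(-z) + 1) - 0.5

# Round trip on a small grid covering [0, 1]
toy_y <- seq(0, 1, by = 0.25)
toy_z <- burn_rate_to_trans(toy_y)
all.equal(burn_rate_from_trans(toy_z), toy_y)

# Range of the transformed values for y in [0, 1]
range(toy_z)
```

Note that the back-transformation maps any real number into $(-0.5, 1.5)$, so back-transformed predictions are not guaranteed to lie in $[0,1]$.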

```{r , echo=FALSE, warning=FALSE, message=FALSE}
burn_rate_box <- ggplot(burnout_train, aes(x = 1, y = burn_rate_trans)) +
  geom_jitter(alpha = 0.05, col = plasma(1)) +
  geom_boxplot(col = "black", size = .8, fill = NA) +
  labs(x = "", y = "Burn rate transformed") +
  coord_flip() +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank())


burn_rate_hist <- ggplot(burnout_train, aes(x = burn_rate_trans)) +
  geom_histogram(aes(y = after_stat(density)),
                 fill = plasma(1), binwidth = 0.1) +
  stat_function(fun = dnorm, args = 
                  list(
                    mean = mean(burnout_train$burn_rate_trans),
                    sd = sd(burnout_train$burn_rate_trans)),
                col = plasma(3)[2],
                size = 1
    ) +
  annotate("text",
           x = 0.8, y = 0.7,
           label = latex2exp::TeX("$N(\\hat{\\mu}, \\hat{\\sigma}^2)$") ,
           col = plasma(3)[2], size = 5) +
  labs(x = "", y = "", subtitle = "binwidth = 0.1",
       title = "Outcome variable: **Burn Rate transformed**") +
  geom_line(
    data = data.frame(
      x = c(0.5, 0.63),
      y = c(0.4, 0.6)
    ),
    aes(x = x, y = y),
    inherit.aes = FALSE,
    col = plasma(3)[2],
    size = .5) +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))
burn_rate_trans <- burn_rate_hist / burn_rate_box
burn_rate_trans
ggsave("_pictures/burn_rate_trans.png", plot = burn_rate_trans)
remove(burn_rate_box, burn_rate_trans)
remove(burn_rate_hist)
```

The transformed outcome shows essentially the same properties as the untransformed one, but the nice thing is that the target is no longer confined to $[0,1]$. The further EDA will be based on the untransformed variable, but the implications are the same due to the choice of the empirical logit, which is a monotone transformation.

### Distribution and main effects of the predictors

#### Employee ID

`employee_id` is just an ID variable and thus is not useful for any prediction model. But one has to check for duplicates.

```{r}
# TRUE if there are NO duplicates
burnout_train %>%
  group_by(employee_id) %>%
  summarise(n = n()) %>%
  nrow() == nrow(burnout_train)
```

Thus there are no duplicates which is good.

#### Date of joining

`date_of_joining` is the date on which the employee joined the company. It is thus a continuous variable that most likely needs some kind of feature engineering.

```{r, echo=FALSE}
burnout_train %>%
  group_by(date_of_joining) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = date_of_joining, y = count)) +
  geom_line(col = plasma(1), na.rm = TRUE) +
  theme_light() +
  labs(title = "Distribution of the variable **Date of joining**",
       x = "Date of joining") +
  theme(plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))
```

Although there is a lot of variation, no major trends in hiring are visible from this plot. Overall the variable seems to be fairly evenly distributed over the year 2008.

```{r echo=FALSE}
ggplot(burnout_train,
       aes(y = date_of_joining,
           x = burn_rate)) +
  geom_point(alpha = 0.1, col = plasma(1),
             na.rm = TRUE) +
  labs(y = "Date of joining", x = "Burn rate",
       title = "Main effect of the variable **Date of joining**") +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))
```

In its raw form the variable `date_of_joining` does not seem to have a notable main effect on the outcome variable. Nevertheless the feature will be used in the model, and as tree-based models have a built-in feature selection one can see after the fitting whether the feature was helpful overall. The feature will not be included just as an integer (the default internal representation of dates); rather, some additional features like weekday or month will be extracted from the raw variable further down the road.
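Extracting weekday and month from a date could look roughly like the following sketch. It uses `lubridate` on a few made-up dates; the column names `join_month`, `join_wday` and `join_int` are hypothetical, and how the features are finally derived in the pre-processing is not shown in this section.

```{r}
# Toy dates for illustration only
toy_dates <- as.Date(c("2008-01-07", "2008-06-15", "2008-12-31"))

toy_date_features <- data.frame(
  date_of_joining = toy_dates,
  # month as an integer from 1 to 12
  join_month = lubridate::month(toy_dates),
  # weekday as an integer, with Monday = 1 and Sunday = 7
  join_wday = lubridate::wday(toy_dates, week_start = 1),
  # default integer representation: days since 1970-01-01
  join_int = as.integer(toy_dates)
)
toy_date_features
```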

#### Gender

`gender` represents the gender of the employee. Definitely a categorical variable.

```{r}
# have a look at the discrete distribution
summary(factor(burnout_train$gender))
```

The two classes are well balanced. Now a look at the main effect of the feature.

```{r echo=FALSE}
gender_burn_rate_box <- ggplot(burnout_train,
                               aes(x = as.factor(gender),
                                   y = burn_rate,
                                   col = as.factor(gender))) +
  geom_jitter(alpha = 0.05) +
  geom_boxplot(size = 0.8, col = "black", fill = NA) +
  labs(x = "", y = "Burn rate", col = "") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        legend.position = "None",
        axis.text.y = element_blank())

gender_burn_rate_hist <- ggplot(burnout_train,
                                aes(x = burn_rate,
                                    fill = as.factor(gender))) +
  geom_histogram(binwidth = 0.05, alpha = 0.7,
                 position = "identity") +
  scale_fill_viridis_d(option = "C") +
  labs(x = "", y = "", subtitle = "binwidth = 0.05",
       fill = "",title = "Main effect of the variable **gender**") +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        legend.position = "top",
        axis.text.y = element_blank(),
        plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))

gender_burn_rate_hist / gender_burn_rate_box
remove(gender_burn_rate_hist)
remove(gender_burn_rate_box)
```

For both classes the distributions are very similar and symmetrical. Male employees seem to have a slightly higher risk of a high burn score, i.e. of a burnout.

#### Company type

`company_type` is a binary categorical variable that indicates whether the company is a service or product company.

```{r}
# have a look at the discrete distribution
summary(factor(burnout_train$company_type))
```

In this case the classes are not fully balanced but each class is still well represented. Now a look at the main effect of the feature.

```{r echo=FALSE}
comptype_burn_rate_box <- ggplot(burnout_train,
                               aes(x = as.factor(company_type),
                                   y = burn_rate,
                                   col = as.factor(company_type))) +
  geom_jitter(alpha = 0.05) +
  geom_boxplot(size = 0.8, col = "black", fill = NA) +
  labs(x = "", y = "Burn rate", col = "") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        legend.position = "None",
        axis.text.y = element_blank())

comptype_burn_rate_hist <- ggplot(burnout_train,
                                aes(x = burn_rate,
                                    fill = as.factor(company_type))) +
  geom_histogram(binwidth = 0.05, alpha = 0.6,
                 position = "identity") +
  scale_fill_viridis_d(option = "C") +
  labs(x = "", y = "", subtitle = "binwidth = 0.05",
       fill = "",title = "Main effect of the variable **Company Type**") +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        legend.position = "top",
        axis.text.y = element_blank(),
        plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))

comptype_burn_rate_hist / comptype_burn_rate_box
remove(comptype_burn_rate_hist)
remove(comptype_burn_rate_box)
```

For both classes the distributions are almost identical and symmetrical. From a univariate point of view no notable main effect is visible from these visualizations.

#### Work from home setup

`wfh_setup_available` indicates whether a working from home setup is available for the employee. So this is again a binary variable.

```{r}
# have a look at the discrete distribution
summary(factor(burnout_train$wfh_setup_available))
```

The two classes are well balanced. Now a look at the main effect of the feature.

```{r echo=FALSE}
wfh_burn_rate_box <- ggplot(burnout_train,
                               aes(x = as.factor(wfh_setup_available),
                                   y = burn_rate,
                                   col = as.factor(wfh_setup_available))) +
  geom_jitter(alpha = 0.05) +
  geom_boxplot(size = 0.8, col = "black", fill = NA) +
  labs(x = "", y = "Burn rate", col = "") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        legend.position = "None",
        axis.text.y = element_blank())

wfh_burn_rate_hist <- ggplot(burnout_train,
                                aes(x = burn_rate,
                                    fill = as.factor(wfh_setup_available))) +
  geom_histogram(binwidth = 0.05, alpha = 0.6,
                 position = "identity") +
  scale_fill_viridis_d(option = "C") +
  labs(x = "", y = "", subtitle = "binwidth = 0.05",
       fill = "",title = "Main effect of the variable **Work from home setup**") +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        legend.position = "top",
        axis.text.y = element_blank(),
        plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))

wfh_burn_rate_hist / wfh_burn_rate_box
remove(wfh_burn_rate_hist)
remove(wfh_burn_rate_box)
```

Again both distributions are quite similar, i.e. bell shaped and symmetrical. Here a clear main effect is visible. A work from home setup most likely has a positive influence on the wellbeing and thus lowers the risk of a high burn rate.

#### Designation

`designation` is a rating within $[0,5]$ that represents the designation of the employee in the company. High values indicate a higher designation.

```{r}
# unique values of the feature
unique(burnout_train$designation)
```

As the feature has a natural ordering this variable will be treated as an ordinal one i.e. be encoded with the integers and not by one-hot-encoding.
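The difference between the two encodings can be made concrete with a small sketch in base R. The vector `toy_designation` and the column names are made up for illustration only.

```{r}
# Toy designation levels for illustration only
toy_designation <- c(0, 2, 5, 2)

# Ordinal treatment: keep the integer scores as a single numeric column
toy_ordinal <- data.frame(designation = toy_designation)

# One-hot treatment: one indicator column per observed level
toy_one_hot <- model.matrix(~ factor(toy_designation) - 1)
colnames(toy_one_hot) <- paste0("designation_", sort(unique(toy_designation)))

ncol(toy_ordinal)
ncol(toy_one_hot)
```

With the integer encoding a tree can split on thresholds such as `designation < 3`, which respects the ordering, while the one-hot encoding treats each level as unrelated to the others.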

```{r, echo=FALSE}
burnout_train %>%
  group_by(designation) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = factor(designation), y = count)) +
  geom_bar(fill = plasma(1), stat = "identity") +
  coord_flip() +
  theme_light() +
  labs(title = "Distribution of the variable **Designation**",
       x = "level of designation") +
  theme(plot.title = ggtext::element_markdown(size = 11))
```

Here clearly the more extreme levels of designation are less represented in the data. This makes total sense w.r.t. the meaning of the variable.

```{r echo=FALSE}
ggplot(burnout_train,
       aes(x = as.factor(designation),
           y = burn_rate,
           col = as.factor(designation))) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(size = 0.8, col = "black", fill = NA) +
  labs(x = "level of designation", y = "Burn rate", col = "",
       title = "Main effect of the variable **Designation**") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))
```

A strong main effect is visible in the plot. The plot also further strengthens the hypothesis that we should treat the feature as ordinal. A higher level of designation seems to increase the risk of having a burnout. For example, employees from the training data set with a level of designation below 3 never reached the maximal burn score of one.

#### Resource allocation

`resource_allocation` is a rating within $[1,10]$ that represents the resource allocation to the employee. High values indicate more resources allocated to the employee.

```{r}
# unique values of the feature
unique(burnout_train$resource_allocation)
```

Here again the question is whether one should encode this variable as a nominal or as an ordinal categorical feature. As there are quite a few levels and again a natural ordering, the variable will be encoded as a continuous integer score.

```{r, echo=FALSE}
burnout_train %>%
  group_by(resource_allocation) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = factor(resource_allocation), y = count)) +
  geom_bar(fill = plasma(1), stat = "identity") +
  coord_flip() +
  theme_light() +
  labs(title = "Distribution of the variable **Resource Allocation**",
       x = "Resource Allocation") +
  theme(plot.title = ggtext::element_markdown(size = 11))
```

A similar behavior as the one of the previous variable is visible. But here there are some missing values (NA's).

```{r echo=FALSE}
burnout_train %>%
  mutate(resource_allocation = as.character(resource_allocation),
         resource_allocation = if_else(is.na(resource_allocation),
                                       "NA",
                                       resource_allocation),
         resource_allocation = as_factor(resource_allocation),
         resource_allocation = fct_relevel(resource_allocation,
                                           c("NA", paste(1:10)))) %>%
  ggplot(
       aes(x = resource_allocation,
           y = burn_rate,
           col = resource_allocation)) +
  geom_jitter(alpha = 0.1, na.rm = TRUE) +
  geom_boxplot(size = 0.8, col = "black", fill = NA, na.rm = TRUE) +
  labs(x = "Resource Allocation", y = "Burn rate", col = "",
       title = "Main effect of the variable **Resource Allocation**") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))
```

A strong main effect is visible in the plot. The plot again further strengthens the hypothesis that we should treat this feature as ordinal. A higher amount of resources assigned to an employee seems to increase the risk of having a burnout. The missing values do not seem to follow any structure, as they replicate the base distribution of the outcome variable.

#### Mental fatigue score

`mental_fatigue_score` is the level of mental fatigue the employee is facing.

```{r}
# number of unique values
length(unique(burnout_train$mental_fatigue_score)) 
```

This variable will without question be treated as continuous.

```{r, echo=FALSE}
ggplot(burnout_train, aes(x = mental_fatigue_score)) +
  geom_density(fill = plasma(1), col = plasma(1),
               na.rm = TRUE, bw = .8) +
  theme_light() +
  labs(title = "Distribution of the variable **Mental fatigue score**",
       subtitle = "bw = 0.8",
       x = "Mental fatigue score") +
  theme(plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))
```

Although there is a very slight skew towards a higher mental fatigue score the overall distribution is still more or less bell shaped and quite symmetrical. Moreover the whole allowed range is covered and the bounds are not violated. Next the main effect of the variable.

```{r echo=FALSE, message=FALSE}
mfs_main <- ggplot(burnout_train,
       aes(y = mental_fatigue_score,
           x = burn_rate)) +
  geom_point(alpha = 0.1, col = plasma(1),
             na.rm = TRUE) +
  labs(y = "Mental fatigue score", x = "Burn rate",
       title = "Main effect of the variable **Mental fatigue score**") +
  annotate("text", x = .75, y = 2.5,
           label = paste("Pearson Correlation:",
                         round(
                           cor(burnout_train$burn_rate,
                               burnout_train$mental_fatigue_score,
                               use = "complete.obs"),
                           3
                         )),
           col = plasma(1), size = 5) +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))
mfs_main
ggsave("_pictures/mfs_main.png", plot = mfs_main)
remove(mfs_main)
```

This scatterplot shows drastic results! The mental fatigue score has an almost perfectly linear relationship with the outcome variable. This is also underlined by the very high Pearson correlation. It indicates that the mental fatigue score will be one of the most important predictors. If communication with the data collector were possible, it would be important to check whether the two scores have common confounding variables, as then one would have to question the practical usability of this predictor. This comes from the fact that no model would be needed if the predictor data were as hard to collect as the outcome data. Moreover there are `r sum(is.na(burnout_train$mental_fatigue_score))` missing values in the feature, so for those the model has to rely on the other, possibly weaker, predictors. It should be noted that when evaluating the final model one should consider comparing its performance to a trivial model (like an intercept-only model). When constructing such a trivial model one could, and maybe should, also use this variable (when available) to get a trivial prediction by scaling the `mental_fatigue_score` feature by a simple scalar.
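Such trivial models could be sketched as follows. The data frame `toy_train` and its values are made up, and estimating the scalar by least squares through the origin is an assumption of this sketch, as the text only asks for "a simple scalar".

```{r}
# Toy values for illustration only; they are made up
toy_train <- data.frame(
  mental_fatigue_score = c(2, 4, NA, 6, 8),
  burn_rate = c(0.2, 0.35, 0.5, 0.6, 0.85)
)

# Trivial model 1: intercept only, i.e. predict the training mean
toy_intercept <- mean(toy_train$burn_rate)

# Trivial model 2: burn_rate ~ scalar * mental_fatigue_score (no intercept),
# with the least squares scalar estimated on the complete cases
toy_complete <- toy_train[!is.na(toy_train$mental_fatigue_score), ]
toy_scalar <- sum(toy_complete$mental_fatigue_score * toy_complete$burn_rate) /
  sum(toy_complete$mental_fatigue_score^2)

# Use the scaled score when available, otherwise fall back to the mean
toy_pred <- ifelse(is.na(toy_train$mental_fatigue_score),
                   toy_intercept,
                   toy_scalar * toy_train$mental_fatigue_score)

# Root mean squared error of both trivial models
toy_rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
toy_rmse(toy_train$burn_rate, toy_intercept)
toy_rmse(toy_train$burn_rate, toy_pred)
```

The scalar computed here equals the coefficient of `lm(burn_rate ~ 0 + mental_fatigue_score)`. In practice the error of such baselines would be evaluated on held-out data, not on the data used to estimate them.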

```{r, include=FALSE}
sum(is.na(burnout_train$mental_fatigue_score))
```

### Relationships between the predictors

An exploration of the relationships between the predictors could also be done by having a look at a correlation and scatterplot matrix. That approach is much quicker than looking at each pairwise relationship individually but also not as precise as the one presented here. Especially for a lot of features such a matrix can get too big to grasp the subtle details. Whether this level of detail is necessary depends on the use case. A very good option if one wants an initial overview is the function `GGally::ggpairs`. In the following, however, each pairwise relationship will be covered individually.
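For reference, the quicker matrix-based overview could be sketched like this. The data frame `toy_numeric` is made up, and the `GGally` call is guarded so that the chunk also runs when the package is not installed.

```{r}
# Toy data for illustration only; the values are made up
toy_numeric <- data.frame(
  designation = c(1, 2, 2, 3, 4, NA),
  resource_allocation = c(2, 3, 4, 5, 7, 6),
  mental_fatigue_score = c(3.1, 4.0, NA, 6.2, 8.0, 7.1)
)

# Pairwise Pearson correlations, using all complete pairs per column pair
toy_cor <- cor(toy_numeric, use = "pairwise.complete.obs")
round(toy_cor, 2)

# Scatterplot matrix via GGally, built only if the package is installed;
# print toy_pairs to draw it
if (requireNamespace("GGally", quietly = TRUE)) {
  toy_pairs <- GGally::ggpairs(toy_numeric)
}
```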

#### Date of joining vs. the others

```{r ggpairs, echo=FALSE, message=FALSE, warning=FALSE, out.width='90%'}
# first date of joining vs the purely continuous feature
date_vs_mental <- ggplot(burnout_train,
                         aes(y = date_of_joining,
                             x = mental_fatigue_score)) +
  geom_point(alpha = 0.1, col = plasma(1),
             na.rm = TRUE) +
  labs(y = "Date of joining", x = "Mental fatigue score",
       title = "") +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))
# now the categorical ones
date_vs_cat <- sapply(colnames(burnout_train)[3:7], function(var) {
  ggplot(burnout_train, aes(y = date_of_joining,
                            x = factor(.data[[var]]),
                            col = factor(.data[[var]]))) +
    geom_jitter(alpha = 0.1, na.rm = TRUE) +
    geom_boxplot(size = 0.8, col = "black", fill = NA, na.rm = TRUE) +
    labs(y = "Date of joining", x = var, col = "",
         title = "") +
    scale_color_viridis_d(option = "C") +
    coord_flip() +
    theme_light() +
    theme(legend.position = "None")
}, simplify = FALSE)

(date_vs_mental | date_vs_cat[[1]]) /
  (date_vs_cat[[2]] | date_vs_cat[[3]]) /
  (date_vs_cat[[4]] | date_vs_cat[[5]])

remove(date_vs_cat, date_vs_mental)
```

No major relationship can be detected here.

#### Gender vs. the remaining

**Contingency tables for the comparison of two binary features:**

```{r}
# Gender vs Company type
table(burnout_train$gender, burnout_train$company_type)
```

No huge tendency visible.

```{r}
# Gender vs Work from home setup
table(burnout_train$gender, burnout_train$wfh_setup_available)
```

Slightly more women have a work from home setup available.
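Raw counts are harder to compare when the groups differ in size, so row-wise proportions can help with statements like the one above. A minimal sketch with made-up values (the vectors `toy_gender` and `toy_wfh` are purely illustrative):

```{r}
# Toy data for illustration only; the values are made up
toy_gender <- c("Female", "Female", "Female", "Female",
                "Male", "Male", "Male", "Male")
toy_wfh <- c("Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "No")

toy_tab <- table(gender = toy_gender, wfh_setup_available = toy_wfh)

# Row-wise proportions: share of each WFH status within each gender
toy_row_props <- prop.table(toy_tab, margin = 1)
toy_row_props
```

On the training data this would read `prop.table(table(burnout_train$gender, burnout_train$wfh_setup_available), margin = 1)`.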

**Now the ordinal variables:**

```{r, echo=FALSE}
burnout_train %>%
  group_by(designation, gender) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = factor(designation), y = count,
             fill = gender)) +
  geom_bar(position = "dodge", stat = "identity") +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  theme_light() +
  labs(title = "**Designation** vs **Gender**",
       x = "level of designation") +
  theme(plot.title = ggtext::element_markdown(size = 11))
```

It has to be noted again that female and male employees are almost equally represented in the data set. Thus one can see from the above plot that the biggest difference in distribution is for the levels 1 and 4, with opposing effects. While male employees more often have a quite high designation of 4, female employees are much more frequent at designation level 1.

```{r, echo=FALSE}
burnout_train %>%
  group_by(resource_allocation, gender) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = factor(resource_allocation), y = count,
             fill = gender)) +
  geom_bar(position = "dodge", stat = "identity") +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  theme_light() +
  labs(title = "**Resource allocation** vs **Gender**",
       x = "Resource allocation") +
  theme(plot.title = ggtext::element_markdown(size = 11))
```

Here a major shift in distribution is visible towards men getting more resources allocated to them. This may reflect a society that still promotes men much more often into high-paying jobs, which usually come with resource responsibility.

**Now the mental fatigue score:**

```{r, echo=FALSE}
ggplot(burnout_train, 
       aes(x = as.factor(gender),
       y = mental_fatigue_score,
       col = as.factor(gender))) +
  geom_jitter(alpha = 0.05, na.rm = TRUE) +
  geom_boxplot(size = 0.8, col = "black", fill = NA, na.rm = TRUE) +
  labs(x = "", y = "Mental fatigue score", col = "",
       title = "**Mental fatigue score** vs **Gender**") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))
```

This is of course very similar to the main effect of the `gender` variable as the outcome and the feature `mental_fatigue_score` are highly linearly correlated.

#### Company type vs. the remaining

```{r}
# Company type vs Work from home setup
table(burnout_train$company_type, burnout_train$wfh_setup_available)
```

No notable trend.

```{r, echo=FALSE}
comp_type_plot_list <- list()
comp_type_plot_list$des <- burnout_train %>%
  group_by(designation, company_type) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(company_type) %>%
  mutate(rel_count = count / sum(count)) %>%
  ggplot(aes(x = factor(designation), y = rel_count,
             fill = company_type)) +
  geom_bar(position = "dodge", stat = "identity") +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  theme_light() +
  labs(title = "**Designation** vs **Company type**",
       fill = "Company type", y = "relative frequency",
       x = "level of designation") +
  theme(plot.title = ggtext::element_markdown(size = 11))

comp_type_plot_list$ress <- burnout_train %>%
  group_by(resource_allocation, company_type) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(company_type) %>%
  mutate(rel_count = count / sum(count)) %>%
  ggplot(aes(x = factor(resource_allocation), y = rel_count,
             fill = company_type)) +
  geom_bar(position = "dodge", stat = "identity") +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  theme_light() +
  labs(title = "**Resource allocation** vs **Company type**",
       fill = "Company type", y = "relative frequency",
       x = "Resource allocation") +
  theme(plot.title = ggtext::element_markdown(size = 11))

comp_type_plot_list$mfs <- ggplot(burnout_train, 
       aes(x = as.factor(company_type),
       y = mental_fatigue_score,
       col = as.factor(company_type))) +
  geom_jitter(alpha = 0.05, na.rm = TRUE) +
  geom_boxplot(size = 0.8, col = "black", fill = NA, na.rm = TRUE) +
  labs(x = "", y = "Mental fatigue score", col = "Company type",
       title = "**Mental fatigue score** vs **Company type**") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))

(comp_type_plot_list$des | comp_type_plot_list$ress) /
  comp_type_plot_list$mfs + plot_layout(guides = 'collect')

remove(comp_type_plot_list)
```

No trend here either.

#### Work from home setup vs. the remaining

```{r, echo=FALSE}
burnout_train %>%
  group_by(designation, wfh_setup_available) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = factor(designation), y = count,
             fill = wfh_setup_available)) +
  geom_bar(position = "dodge", stat = "identity") +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  theme_light() +
  labs(title = "**Designation** vs **Work from home setup**",
       fill = "Work from home setup",
       x = "level of designation") +
  theme(plot.title = ggtext::element_markdown(size = 11))
```

A work from home setup is available far more often for employees with a lower designation ($\leq 2$).

```{r, echo=FALSE}
burnout_train %>%
  group_by(resource_allocation, wfh_setup_available) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = factor(resource_allocation), y = count,
             fill = wfh_setup_available)) +
  geom_bar(position = "dodge", stat = "identity") +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  theme_light() +
  labs(title = "**Resource allocation** vs **Work from home setup**",
       fill = "Work from home setup",
       x = "Resource allocation") +
  theme(plot.title = ggtext::element_markdown(size = 11))
```

The same structure as in the previous comparison is visible here again. Employees with fewer resources allocated to them more often have a work from home setup available. This could be due to the fewer responsibilities they have in the business.

```{r, echo=FALSE}
ggplot(burnout_train, 
       aes(x = as.factor(wfh_setup_available),
       y = mental_fatigue_score,
       col = as.factor(wfh_setup_available))) +
  geom_jitter(alpha = 0.05, na.rm = TRUE) +
  geom_boxplot(size = 0.8, col = "black", fill = NA, na.rm = TRUE) +
  labs(y = "Mental fatigue score", col = "",
       x = "Work from home setup",
       title = "**Mental fatigue score** vs **Work from home setup**") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))
```

Again this is of course very similar to the main effect of the `wfh_setup_available` variable as the outcome and the feature `mental_fatigue_score` are highly linearly correlated.

#### Designation vs the remaining

```{r, echo=FALSE, message=FALSE}
designationVsRessources <- burnout_train %>%
  group_by(designation, resource_allocation) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = factor(designation),
             y = factor(resource_allocation),
             fill = count,
             label = as.character(count))) +
  geom_tile() +
  geom_text() +
  scale_fill_viridis_c(option = "C") +
  labs(x = "Level of designation",
       y = "Resource allocation",
       title = "**Designation** vs **Resource allocation**") +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))
designationVsRessources
ggsave("_pictures/designationVsRessources.png",
       plot = designationVsRessources)
remove(designationVsRessources)
```

Here a strong, fairly linear relationship is visible. This is sensible as employees with a high designation are often given more resource responsibility.

The last two relationships will be omitted here as both for the variable `designation` as well as for `resource_allocation` the comparison with `mental_fatigue_score` will be very similar to the main effect of the two variables. This comes again from the high correlation of the latter with the outcome.

Overall a few stronger and mainly weaker relationships between the predictors could be detected. Unlike ordinary least squares regression, gradient tree boosting needs no decorrelation or normalization of the features. But before one fixes the pre-processing one can try to extract some more information from some features through feature engineering.

### Some feature engineering

The only variable that allows for reasonable feature engineering is the date of joining predictor. One can try to extract some underlying patterns and see if an effect on the outcome is visible.

First extract the day of the week:

```{r echo=FALSE}
burnout_train %>%
  mutate(wday = lubridate::wday(date_of_joining),
         wday = ordered(wday,
      levels = c("2", "3", "4", "5", "6", "7", "1"),
      labels = c(
        "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday",
        "Sunday"
      )
    )) %>%
  ggplot(
       aes(x = as.factor(wday),
           y = burn_rate,
           col = as.factor(wday))
         ) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(size = 0.8, col = "black", fill = NA) +
  labs(x = "Weekday", y = "Burn rate", col = "",
       title = "Main effect of the variable **Weekday**") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))
```

No main effect is visible here. So try the month next.

```{r echo=FALSE}
burnout_train %>%
  mutate(month = lubridate::month(date_of_joining)) %>%
  ggplot(
       aes(x = as.factor(month),
           y = burn_rate,
           col = as.factor(month))
         ) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(size = 0.8, col = "black", fill = NA) +
  labs(x = "Month", y = "Burn rate", col = "",
       title = "Main effect of the variable **Month**") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))
```

Again no main effect is visible.
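The two date features used in the plots above can be derived as in the following minimal sketch. The toy dates are made up for illustration. The `week_start = 1` argument of `lubridate::wday()` numbers the days starting from Monday, which is one way to avoid reordering the levels by hand.

```r
library(lubridate)

# hypothetical joining dates, not taken from the burnout data
toy_dates <- as.Date(c("2008-01-07", "2008-06-15", "2008-12-31"))

# day of the week as a number, Monday = 1, ..., Sunday = 7
toy_wday <- wday(toy_dates, week_start = 1)

# month as a number, January = 1, ..., December = 12
toy_month <- month(toy_dates)

data.frame(date = toy_dates, wday = toy_wday, month = toy_month)
```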

Nevertheless one can include those two variables in the model because, as mentioned before, tree-based models effectively perform a feature selection at each split. So including them just comes at a small computational cost. The minimal pre-processing needed when dealing with tree-based models is one of their biggest strengths. They are very robust against, for example, an odd selection of features with different scales. This is one of the reasons, besides the strong predictive power, for the heavy use of such models in data mining applications.[@elements]
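The robustness against feature scales can be illustrated with a single regression tree. The following is a minimal sketch, assuming `rpart` as a stand-in for the boosted trees and using simulated data; all object names are hypothetical. A strictly increasing transformation of a feature, here the logarithm, leaves the ordering of the observations and therefore the fitted partition unchanged.

```r
library(rpart)

set.seed(1)
n <- 200
x <- runif(n, min = 1, max = 1000)
y <- sin(x / 150) + rnorm(n, sd = 0.1)

# same outcome, feature once on the raw and once on the log scale
fit_raw <- rpart(y ~ x, data = data.frame(y = y, x = x))
fit_log <- rpart(y ~ x, data = data.frame(y = y, x = log(x)))

# compare the fitted values on the training data
same_fit <- isTRUE(all.equal(unname(predict(fit_raw)),
                             unname(predict(fit_log))))
same_fit
```

An ordinary least squares fit on the same two data frames would in general give different fitted values.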

### Create the recipe

A recipe is an object that defines a series of steps for data pre-processing and feature engineering.

```{r recipeBurn}
### recipe for xgboost (nominal variables must be dummy variables)
# define outcome, predictors and training data set
burnout_rec_boost <- recipe(burn_rate ~ date_of_joining + gender +
                            company_type + wfh_setup_available +
                            designation + resource_allocation +
                            mental_fatigue_score,
                            data = burnout_train) %>%
  # extract the date features day of the week and month
  step_date(date_of_joining, features = c("dow", "month")) %>%
  # dummify all nominal features
  step_dummy(all_nominal()) %>%
  # encode the date as integers
  step_mutate(date_of_joining = as.integer(date_of_joining))

### recipe for xgboost (nominal variables must be dummy variables)
### HERE the TRANSFORMED target
burnout_rec_boost_trans <- recipe(burn_rate_trans ~ date_of_joining + gender +
                            company_type + wfh_setup_available +
                            designation + resource_allocation +
                            mental_fatigue_score,
                            data = burnout_train) %>%
  # extract the date features day of the week and month
  step_date(date_of_joining, features = c("dow", "month")) %>%
  # dummify all nominal features
  step_dummy(all_nominal()) %>%
  # encode the date as integers
  step_mutate(date_of_joining = as.integer(date_of_joining))

### recipe for a random forest model for comparison (no dummy encoding needed)
# same as above without dummification but here the na's have to be
# imputed (here via knn)
burnout_rec_rf <- recipe(burn_rate ~ date_of_joining + gender +
                         company_type + wfh_setup_available +
                         designation + resource_allocation +
                         mental_fatigue_score,
                         data = burnout_train) %>%
  step_string2factor(all_nominal()) %>%
  step_impute_knn(resource_allocation, neighbors = 5) %>%
  step_impute_knn(mental_fatigue_score, neighbors = 5) %>%
  step_date(date_of_joining, features = c("dow", "month")) %>%
  step_mutate(date_of_joining = as.integer(date_of_joining))
```
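To see what such a recipe produces, one can estimate it with `prep()` and apply it with `bake()`. The following is a minimal sketch on a made-up toy data frame that only mimics three columns of the burnout data; the object names `toy`, `toy_rec` and `toy_baked` are hypothetical.

```r
library(recipes)

# hypothetical toy data mimicking three columns of the burnout data
toy <- data.frame(
  burn_rate = c(0.2, 0.5, 0.4, 0.7),
  date_of_joining = as.Date(c("2008-01-07", "2008-03-15",
                              "2008-07-01", "2008-11-30")),
  gender = factor(c("Female", "Male", "Male", "Female"))
)

toy_rec <- recipe(burn_rate ~ date_of_joining + gender, data = toy) %>%
  # adds the factors date_of_joining_dow and date_of_joining_month
  step_date(date_of_joining, features = c("dow", "month")) %>%
  # replaces every nominal column by 0/1 indicator columns
  step_dummy(all_nominal()) %>%
  # the original date column becomes the number of days since 1970-01-01
  step_mutate(date_of_joining = as.integer(date_of_joining))

# prep() estimates the steps on the training data,
# bake(new_data = NULL) returns the processed training data
toy_baked <- bake(prep(toy_rec), new_data = NULL)
names(toy_baked)
```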

```{r, eval=FALSE, include=FALSE}
bake(prep(burnout_rec_boost), new_data = NULL) %>%
  select(!starts_with("date_of_joining_")) %>%
  visdat::gather_cor() %>%
  mutate(
    row_1 = str_replace_all(row_1, "_", " "),
    row_2 = str_replace_all(row_2, "_", " "),
    across(starts_with("row"), function(string) {
      string <- str_remove(string, " Yes")
      string <- str_remove(string, " Male")
      string <- str_remove(string, " Service")
      fac <- as.factor(string)
      fac
    }),
    row_1 = fct_reorder(row_1, value, sum),
    row_2 = fct_reorder(row_2, value, sum)
  ) %>%
  ggplot(aes(x = row_1,
             y = row_2,
             fill = value,
             label = as.character(round(value, 2)))) +
  geom_tile() +
  geom_text() +
  scale_fill_gradient2(low = plasma(3)[1],
                       high = plasma(3)[2],
                       mid = "white",
                       midpoint = 0) +
  labs(x = "",
       y = "", fill = "",
       title = "Pearson correlation matrix") +
  theme_light() +
  coord_equal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1,
                                   vjust = 1),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        plot.title = ggtext::element_markdown(size = 11),
        axis.text = ggtext::element_markdown(size = 11))
```

## Insurance data

The insurance data set is part of the book *Machine Learning with R* by Brett Lantz. It can be downloaded here: <https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv>

```{r loadIns, message=FALSE}
# load the data
insurance_data <- read_csv("_data/insurance.csv")
```

### Train-test split

```{r traintestIns}
set.seed(2)
ins_split <- rsample::initial_split(insurance_data, prop = 0.80)
ins_train <- rsample::training(ins_split)
ins_test  <- rsample::testing(ins_split)
```

```{r, include=FALSE}
remove(insurance_data)
remove(ins_split)
```

The training data set contains `r nrow(ins_train)` rows and `r ncol(ins_train)` variables.

The test data set contains `r nrow(ins_test)` observations and naturally also `r ncol(ins_test)` variables.

### A general overview

A general visualization of the whole data set in order to detect missing values.

```{r visdatIns, echo=FALSE}
visdat::vis_dat(ins_train)
```

So there are no missing values!

### What about the outcome?

`charges`: Individual medical costs billed by health insurance.

The summary statistics below show that there are no invalid values, i.e. negative ones, in the data.

```{r, echo=FALSE}
summary(ins_train$charges)
```

Now the distribution of the outcome.

```{r , echo=FALSE}
charges_box <- ggplot(ins_train, aes(x = 1, y = charges)) +
  geom_jitter(alpha = 0.3, col = plasma(1)) +
  geom_boxplot(col = "black", size = .8, fill = NA) +
  labs(x = "", y = "Charges") +
  coord_flip() +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank())


charges_hist <- ggplot(ins_train, aes(x = charges)) +
  geom_histogram(fill = plasma(1), binwidth = 800) +
  labs(x = "", y = "", subtitle = "binwidth = 800",
       title = "Outcome variable: **Charges**") +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))
charges_hist / charges_box
remove(charges_box)
remove(charges_hist)
```

Unlike the `burn_rate` before, this target distribution is not at all symmetric but highly right-skewed. A natural thing to do would be a log transformation of the outcome. The resulting `log10_charges` outcome variable is shown below.

```{r , echo=FALSE, message=FALSE, warning=FALSE, include=FALSE}
charges_box <- ggplot(ins_train, aes(x = 1, y = log10(charges))) +
  geom_jitter(alpha = 0.3, col = plasma(1)) +
  geom_boxplot(col = "black", size = .8, fill = NA) +
  labs(x = "", y = "log<sub>10</sub>(Charges)") +
  coord_flip() +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = ggtext::element_markdown())


charges_hist <- ggplot(ins_train, aes(x = log10(charges))) +
  geom_histogram(fill = plasma(1), binwidth = 0.1) +
  labs(x = "", y = "", subtitle = "binwidth = 0.1",
       title = "Outcome variable: **log<sub>10</sub>(Charges)**") +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))
charges_ins <- charges_hist / charges_box

```

```{r, message=FALSE, echo=FALSE}
charges_ins
ggsave("_pictures/charges_ins.png", plot = charges_ins)
remove(charges_box)
remove(charges_hist, charges_ins)
```

Although such a transformation of the outcome variable is not needed for tree-based modeling it can make the job of the algorithm somewhat easier.
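How strongly a log transformation can reduce right skewness is illustrated by the following minimal sketch. It uses simulated log-normal values as a hypothetical stand-in for `charges`, and the helper `skewness()` is defined here only for illustration.

```r
# sample skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(1)
# hypothetical right-skewed "charges", simulated from a log-normal distribution
toy_charges <- rlnorm(1000, meanlog = 9, sdlog = 0.9)

skew_raw <- skewness(toy_charges)
skew_log <- skewness(log10(toy_charges))

# values near 0 indicate a roughly symmetric distribution
c(raw = skew_raw, log10 = skew_log)
```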

### Distribution and main effects of the predictors

#### Age

`age`: The age of the insurance contractor. This is naturally a continuous variable.

```{r, echo=FALSE}
ggplot(ins_train, aes(x = age)) +
  geom_density(fill = plasma(1), col = plasma(1),
               na.rm = TRUE, bw = .8) +
  theme_light() +
  labs(title = "Distribution of the variable **Age**",
       subtitle = "bw = 0.8",
       x = "Age") +
  theme(plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))
```

A wide range of ages is covered. Notably there is a peak at roughly 18 which means that many young adults were observed in this data set.

```{r echo=FALSE}
ggplot(ins_train,
       aes(y = age,
           x = log10(charges))) +
  geom_point(alpha = 0.4, col = plasma(1),
             na.rm = TRUE) +
  geom_smooth(method = "loess", se = FALSE, 
              col = plasma(1), formula = 'y ~ x',
              size = 1.2) +
  labs(y = "Age", x = "log<sub>10</sub>(Charges)",
       title = "Main effect of the variable **Age**") +
  theme_light() +
  theme(legend.position = "None",
        axis.title.x = ggtext::element_markdown(),
        plot.title = ggtext::element_markdown(size = 11))
```

There seems to be a strong main effect although it does not seem to be linear. The general trend is that older contractors accumulate more medical costs. This is very intuitive.

#### Sex

`sex`: The insurance contractor's gender. Here either female or male. This means it is a binary variable and will be treated as such.

```{r}
summary(as.factor(ins_train$sex))
```

The classes are very well balanced. Now the main effect.

```{r echo=FALSE}
sex_charges_box <- ggplot(ins_train,
                               aes(x = as.factor(sex),
                                   y = log10(charges),
                                   col = as.factor(sex))) +
  geom_jitter(alpha = 0.3) +
  geom_boxplot(size = 0.8, col = "black", fill = NA) +
  labs(x = "", y = "log<sub>10</sub>(Charges)", col = "") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        axis.title.x = ggtext::element_markdown(),
        legend.position = "None",
        axis.text.y = element_blank())

sex_charges_hist <- ggplot(ins_train,
                                aes(x = log10(charges),
                                    fill = as.factor(sex))) +
  geom_histogram(binwidth = 0.1, alpha = 0.7,
                 position = "identity") +
  scale_fill_viridis_d(option = "C") +
  labs(x = "", y = "", subtitle = "binwidth = 0.1",
       fill = "",title = "Main effect of the variable **Sex**") +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        legend.position = "top",
        axis.text.y = element_blank(),
        plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))

sex_charges_hist / sex_charges_box
remove(sex_charges_hist)
remove(sex_charges_box)
```

No notable difference can be detected here.

#### Body mass index

`bmi`: The body mass index gives an indication of the body composition. It is the weight divided by the squared height, measured in $\frac{kg}{m^2}$. Ideally the ratio is between 18.5 and 24.9. The variable is obviously continuous.
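As a small worked example, the BMI of a hypothetical person can be computed as follows. The function name `bmi()` and the numbers are chosen only for illustration.

```r
# BMI = weight in kg divided by the squared height in m
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# hypothetical person: 70 kg and 1.75 m
toy_bmi <- bmi(70, 1.75)
round(toy_bmi, 1)  # 22.9

# inside the range of 18.5 to 24.9?
in_range <- toy_bmi >= 18.5 & toy_bmi <= 24.9
in_range
```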

```{r, echo=FALSE}
ggplot(ins_train, aes(x = bmi)) +
  geom_density(fill = plasma(1), col = plasma(1),
               na.rm = TRUE, bw = 1) +
  theme_light() +
  labs(title = "Distribution of the variable **BMI**",
       subtitle = "bw = 1",
       x = "BMI") +
  geom_vline(xintercept = c(18.5, 24.9), linetype = "longdash",
             col = plasma(2)[2], size = 1) +
  annotate("text", x = 21.7, y = 0.005, label = "ideal range",
           col = plasma(2)[2]) +
  theme(plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))
```

The distribution is bell-shaped and roughly symmetric around a BMI of 30, which is above the ideal range. Actually only a small share of the data falls into the ideal range here. Moreover the right tail is heavier than the left one. Now a look at the main effect of the variable.

```{r echo=FALSE}
ggplot(ins_train,
       aes(y = bmi,
           x = log10(charges))) +
  geom_point(alpha = 0.5, col = plasma(1),
             na.rm = TRUE) +
  geom_smooth(method = "loess", se = FALSE, 
              col = plasma(1), formula = 'y ~ x',
              size = 1.2) +
  labs(y = "BMI", x = "log<sub>10</sub>(Charges)",
       title = "Main effect of the variable **BMI**") +
  theme_light() +
  theme(legend.position = "None",
        axis.title.x = ggtext::element_markdown(),
        plot.title = ggtext::element_markdown(size = 11))
```

With some imagination one can make out some non-linear patterns on the right side of the plot, but besides that no strong main effect is visible here.

#### Number of children

`children`: The number of children or dependents covered by the health insurance.

```{r}
# unique values of the feature
unique(ins_train$children)
```

This could be treated as categorical, but as there is a natural ordering it will be encoded as integers and thus treated like a continuous feature.

```{r, echo=FALSE}
ins_train %>%
  group_by(children) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = factor(children), y = count)) +
  geom_bar(fill = plasma(1), stat = "identity") +
  coord_flip() +
  theme_light() +
  labs(title = "Distribution of the variable **Children**",
       x = "# children/ dependents") +
  theme(plot.title = ggtext::element_markdown(size = 11))
```

As one would expect, the more children the fewer observations there are. Especially the numbers greater than 3 are not well represented. With one-hot encoding one would have to think about removing the resulting near-zero-variance indicator variables. But as the feature will be encoded in a continuous way this is no problem at all. A look at the main effects can now strengthen or weaken this hypothesis of a natural ordering.
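The near-zero-variance issue of a one-hot encoding can be made concrete with a minimal sketch. The counts below are made up and only mimic the shape of the `children` distribution; `model.matrix()` is used here as one possible way to build the indicator columns.

```r
# hypothetical counts of children, with few observations for 4 and 5
toy_children <- rep(0:5, times = c(45, 25, 18, 10, 1, 1))

# one-hot encoding: one indicator column per level
one_hot <- model.matrix(~ factor(toy_children) - 1)
colnames(one_hot) <- paste0("children_", 0:5)

# share of ones per indicator column; the last two are almost constant
share_ones <- colMeans(one_hot)
round(share_ones, 2)

# the integer encoding keeps a single numeric column instead
summary(toy_children)
```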

```{r echo=FALSE}
ggplot(ins_train,
       aes(x = as.factor(children),
           y = log10(charges),
           col = as.factor(children))) +
  geom_jitter(alpha = 0.7) +
  geom_boxplot(size = 0.8, col = "black", fill = NA) +
  labs(x = "# children/ dependents", y = "log<sub>10</sub>(Charges)", 
       col = "",
       title = "Main effect of the variable **Children**") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(legend.position = "None",
        axis.title.x = ggtext::element_markdown(),
        plot.title = ggtext::element_markdown(size = 11))
```

This plot is quite similar to the main effect plot for the `age` feature. As the age is most likely (this will be checked later) positively correlated with the number of children, one can observe a rise of the minimal observed charges towards a greater number of children. The upper two boxplots are built from just a few observations so they should not be interpreted in great detail. Overall there seems to be some kind of main effect.

#### Smoking

`smoker`: Does the contractor smoke? Of course a binary variable.

```{r}
summary(as.factor(ins_train$smoker))
```

The classes are not balanced but the class of the smokers is still represented with a good amount of observations. Now a look at the main effect.

```{r echo=FALSE}
smoke_charges_box <- ggplot(ins_train,
                               aes(x = as.factor(smoker),
                                   y = log10(charges),
                                   col = as.factor(smoker))) +
  geom_jitter(alpha = 0.3) +
  geom_boxplot(size = 0.8, col = "black", fill = NA) +
  labs(x = "", y = "log<sub>10</sub>(Charges)", col = "") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        axis.title.x = ggtext::element_markdown(),
        legend.position = "None",
        axis.text.y = element_blank())

smoke_charges_hist <- ggplot(ins_train,
                                aes(x = log10(charges),
                                    fill = as.factor(smoker))) +
  geom_histogram(binwidth = 0.05, alpha = 0.7,
                 position = "identity") +
  scale_fill_viridis_d(option = "C") +
  labs(x = "", y = "", subtitle = "binwidth = 0.05",
       fill = "",title = "Main effect of the variable **Smoker**") +
  theme_light() +
  theme(axis.ticks.y = element_blank(),
        legend.position = "top",
        axis.text.y = element_blank(),
        plot.title = ggtext::element_markdown(size = 11),
        plot.subtitle = ggtext::element_markdown(size = 8))

smoke_charges_hist / smoke_charges_box
remove(smoke_charges_hist)
remove(smoke_charges_box)
```

This main effect is as drastic as it is intuitive. Smoking clearly seems to increase the charges. This means that this variable probably has a lot of predictive power.

#### Region

`region`: The beneficiary's residential area in the US. Either northeast, southeast, southwest or northwest. This definitely is a categorical variable.

```{r, echo=FALSE}
ins_train %>%
  group_by(region) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = factor(region), y = count)) +
  geom_bar(fill = plasma(1), stat = "identity") +
  coord_flip() +
  theme_light() +
  labs(title = "Distribution of the variable **Region**",
       x = "Region") +
  theme(plot.title = ggtext::element_markdown(size = 11))
```

The four regions are balanced. Now to the main effect.

```{r echo=FALSE}
ggplot(ins_train,
       aes(x = as.factor(region),
           y = log10(charges),
           col = as.factor(region))) +
  geom_jitter(alpha = 0.7) +
  geom_boxplot(size = 0.8, col = "black", fill = NA) +
  labs(x = "Region", y = "log<sub>10</sub>(Charges)", 
       col = "",
       title = "Main effect of the variable **Region**") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(legend.position = "None",
        axis.title.x = ggtext::element_markdown(),
        plot.title = ggtext::element_markdown(size = 11))
```

No important main effect is detectable from this plot.

### Relationships between the predictors

#### Age vs the others

First the continuous one: `bmi`

```{r echo=FALSE}
ggplot(ins_train,
       aes(y = bmi,
           x = age)) +
  geom_point(alpha = 0.5, col = plasma(1),
             na.rm = TRUE) +
  geom_smooth(method = "loess", se = FALSE, 
              col = plasma(1), formula = 'y ~ x',
              size = 1.2) +
  labs(y = "BMI", x = "Age",
       title = "**Age** vs **BMI**") +
  theme_light() +
  theme(legend.position = "None",
        axis.title.x = ggtext::element_markdown(),
        plot.title = ggtext::element_markdown(size = 11))
```

No relationship is detectable. The Pearson correlation of `r round(cor(ins_train$age, ins_train$bmi), 3)` is also low.

```{r, include=FALSE}
cor(ins_train$age, ins_train$bmi)

```

```{r, echo=FALSE}
# now the categorical ones
age_vs_cat <- sapply(colnames(ins_train)[c(2, 4:6)], function(var) {
  ggplot(ins_train, aes(y = age,
                            x = factor(.data[[var]]),
                            col = factor(.data[[var]]))) +
    geom_jitter(alpha = 0.2, na.rm = TRUE) +
    geom_boxplot(size = 0.8, col = "black", fill = NA, na.rm = TRUE) +
    labs(y = "Age", x = var, col = "",
         title = "") +
    scale_color_viridis_d(option = "C") +
    coord_flip() +
    theme_light() +
    theme(legend.position = "None")
}, simplify = FALSE)

(age_vs_cat[[1]] | age_vs_cat[[2]]) /
  (age_vs_cat[[3]] | age_vs_cat[[4]]) 

remove(age_vs_cat)
```

The most interesting takeaway from these four plots is that the hypothesis about the age of the contractors with children seems to hold roughly, except for the ones with more than 3 children. But again this counterintuitive behavior could also be due to the small number of samples. At this point one might think about encoding the children variable as categorical but in the following it will be left continuous.

#### Sex vs the remaining

```{r, echo=FALSE}
sex_plot_list <- list()
sex_plot_list$bmi <- ggplot(ins_train, 
       aes(x = as.factor(sex),
       y = bmi,
       col = as.factor(sex))) +
  geom_jitter(alpha = 0.7, na.rm = TRUE) +
  geom_boxplot(size = 0.8, col = "black", fill = NA, na.rm = TRUE) +
  labs(y = "BMI", col = "",
       x = "Sex",
       title = "**Sex** vs **BMI**") +
  scale_color_viridis_d(option = "C") +
  coord_flip() +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))

sex_plot_list$child <- ins_train %>%
  group_by(children, sex) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = factor(children), y = count,
             fill = sex)) +
  geom_bar(position = "dodge", stat = "identity") +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  theme_light() +
  labs(title = "**Sex** vs **Children**", fill = "",
       x = "# children/ dependents") +
  theme(plot.title = ggtext::element_markdown(size = 11))

sex_plot_list$region <- ins_train %>%
  group_by(region, sex) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = factor(region), y = count,
             fill = sex)) +
  geom_bar(position = "dodge", stat = "identity") +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  theme_light() +
  labs(title = "**Sex** vs **Region**", fill = "",
       x = "Region") +
  theme(plot.title = ggtext::element_markdown(size = 11))

(sex_plot_list$region | sex_plot_list$child) / 
  sex_plot_list$bmi + plot_layout(guides = "collect")

remove(sex_plot_list)
```

No notable differences.

Contingency table for the binary variable `smoker`:

```{r}
# sex vs smoker
table(ins_train$sex, ins_train$smoker)
```

Slightly more men smoke but the difference is small.
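Such a comparison is easier to read in relative terms. The following minimal sketch, on made-up toy data, uses `prop.table()` to turn a contingency table into the share of smokers within each sex.

```r
# hypothetical toy data, not the insurance data
toy <- data.frame(
  sex = c("female", "female", "female", "female",
          "male", "male", "male", "male", "male"),
  smoker = c("no", "no", "no", "yes",
             "no", "no", "no", "yes", "yes")
)

counts <- table(toy$sex, toy$smoker)

# margin = 1 normalizes each row, i.e. the smoker share within each sex
row_shares <- prop.table(counts, margin = 1)
round(row_shares, 2)
```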

#### BMI vs the remaining

```{r, echo=FALSE}
bmi_vs_cat <- sapply(colnames(ins_train)[c(4:6)], function(var) {
  ggplot(ins_train, aes(y = bmi,
                            x = factor(.data[[var]]),
                            col = factor(.data[[var]]))) +
    geom_jitter(alpha = 0.3, na.rm = TRUE) +
    geom_boxplot(size = 0.8, col = "black", fill = NA, na.rm = TRUE) +
    labs(y = "BMI", x = var, col = "",
         title = "") +
    scale_color_viridis_d(option = "C") +
    coord_flip() +
    theme_light() +
    theme(legend.position = "None")
}, simplify = FALSE)

(bmi_vs_cat[[1]] | bmi_vs_cat[[2]]) / bmi_vs_cat[[3]]

remove(bmi_vs_cat)
```

The most notable fact here is that the southeast of the US seems to have a slightly higher BMI than the other regions.

#### Children vs the remaining

```{r, echo=FALSE}
ins_train %>%
  group_by(children, smoker) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(smoker) %>%
  mutate(count = count / sum(count)) %>%
  ggplot(aes(x = factor(children), y = count,
             fill = smoker)) +
  geom_bar(position = "dodge", stat = "identity") +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  theme_light() +
  labs(title = "**Children** vs **Smoker**",
       y = "relative frequency",
       x = "Children") +
  theme(plot.title = ggtext::element_markdown(size = 11))
```

In relative terms actually the contractors with three children smoke the most but the other levels seem quite balanced.

```{r, echo=FALSE}
ins_train %>%
  group_by(children, region) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = factor(children),
             y = factor(region),
             fill = count,
             label = as.character(count))) +
  geom_tile() +
  geom_text() +
  scale_fill_viridis_c(option = "C") +
  labs(x = "# children/ dependents",
       y = "Region",
       title = "**Children** vs **Region**") +
  theme_light() +
  theme(legend.position = "None",
        plot.title = ggtext::element_markdown(size = 11))
```

Here no trend is visible.

#### Smoker vs Region

```{r, echo=FALSE}
ins_train %>%
  group_by(region, smoker) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(smoker) %>%
  mutate(count = count / sum(count)) %>%
  ggplot(aes(x = factor(region), y = count,
             fill = smoker)) +
  geom_bar(position = "dodge", stat = "identity") +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  theme_light() +
  labs(title = "**Smoker** vs **Region**",
       y = "relative frequency",
       x = "Region") +
  theme(plot.title = ggtext::element_markdown(size = 11))
```

So the southeast is not only the most overweight region but also the one with the most smokers in relative terms.

This concludes the tour of the pairwise relationships. Of course such a detailed look at all pairwise relationships for both data sets was only possible because there are quite few predictors, and it is not always needed to this extent. Besides giving a crystal clear understanding of the data, it shows that there is not much room left for feature engineering for the insurance data set. Thus one can go on and define the recipe.

### Create the recipe

Transformations of the outcome variable are not good practice within a recipe, so this is done beforehand by adding a new column, i.e. `log10_charges`, to the train and test data sets.

```{r}
# add the log transformed outcome variable to the data
ins_train$log10_charges <- log10(ins_train$charges)
ins_test$log10_charges <- log10(ins_test$charges)
```
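A model trained on `log10_charges` predicts on the log scale, so predictions have to be back-transformed with `10^` before they can be compared to the original charges. The following is a minimal sketch with made-up numbers; the helper `rmse()` and all `toy_` objects are hypothetical.

```r
# hypothetical observed charges and predictions on the log10 scale
toy_charges <- c(1200, 5400, 13000, 42000)
toy_pred_log10 <- c(3.10, 3.70, 4.15, 4.60)

# back-transform the predictions to the original scale
toy_pred <- 10^toy_pred_log10

# root mean squared error on both scales
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
rmse_log10 <- rmse(log10(toy_charges), toy_pred_log10)
rmse_orig <- rmse(toy_charges, toy_pred)
c(log10 = rmse_log10, original = rmse_orig)
```

The two error values live on different scales and should not be compared with each other directly.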

```{r}
### recipe for xgboost (nominal variables must be dummy variables)
# define outcome, predictors and training data set
ins_rec_boost <- recipe(log10_charges ~ age + sex +
                            bmi + children +
                            smoker + region,
                            data = ins_train) %>%
  # dummify all nominal features (sex, smoker, region)
  step_dummy(all_nominal())

### recipe for a random forest model for comparison (no dummy encoding needed)
# same as above without dummification 
ins_rec_rf <- recipe(log10_charges ~ age + sex +
                         bmi + children +
                         smoker + region,
                         data = ins_train)
```

Having now all the recipes ready one can proceed with modeling. Finally!