# Residual-Diagnostics Plots

**Learning Objectives:**

This section presents graphical methods for a detailed examination of **model performance** at both the **overall** and **instance-specific** levels.

- Residuals can be used to:
  - **Identify potentially problematic instances.** This can help determine which factors contribute most to prediction errors.
  - **Detect systematic deviations from the expected behavior** that could be due to:
    - The omission of explanatory variables
    - The inclusion of a variable in an incorrect functional form
  - **Identify the largest prediction errors**, irrespective of the overall performance of a predictive model.

## Quality of predictions {-}

- In a **"perfect" predictive model**, the `predicted value` equals the `actual value` of the variable for every observation.

- We want the predictions to be **reasonably close** to the actual values.

- To quantify the **quality of predictions**, we can use the *difference* between the `predicted value` and the `actual value`, called the **residual**.

For a continuous dependent variable $Y$, the residual $r_i$ for the $i$-th observation in a dataset is:

```{=tex}
\begin{equation}
r_i = y_i - f(\underline{x}_i) = y_i - \widehat{y}_i
\end{equation}
```
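
As a minimal sketch (assuming the `apartments` data that ships with `DALEX`), a residual is simply the observed value minus the fitted value:

```{r}
# Minimal sketch: residuals computed by hand for a simple linear model,
# using the `apartments` data from the DALEX package
library("DALEX")

fit <- lm(m2.price ~ ., data = apartments)
r <- apartments$m2.price - fitted(fit)        # r_i = y_i - y_hat_i

head(round(r, 2))
all.equal(unname(r), unname(residuals(fit)))  # same as residuals()
```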

## Characteristics of a good model {-}

To evaluate a model, we need to study the *"behavior" of residuals* for a group of observations and confirm that:

- They **deviate from zero randomly**, implying that:
  - Their distribution should be **symmetric around zero**, so their mean (or median) value should be zero.
  - Their values should be **close to zero**, i.e., show low variability.

## Graphical methods to verify properties {-}

- **Histogram**: to check the **symmetry** and **location** of the distribution of the residuals *without any distributional assumption*.

- **Quantile-quantile plot**: to check whether the residuals follow a specific distribution (e.g., the normal).
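
Continuing the sketch above, both checks are available in base R (any residual vector could stand in for `r`):

```{r}
# Histogram: is the distribution symmetric and centered at zero?
hist(r, breaks = 30, main = "Residuals", xlab = "Residual")
abline(v = 0, lty = 2)

# Quantile-quantile plot: do the residuals follow a normal distribution?
qqnorm(r)
qqline(r)
```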

## Standardized *(Pearson)* residuals {-}

```{=tex}
\begin{equation}
\tilde{r}_i = \frac{r_i}{\sqrt{\mbox{Var}(r_i)}}
\end{equation}
```

where $\mbox{Var}(r_i)$ is the **estimated variance** of the residual $r_i$.

| **Model** | **How $\mbox{Var}(r_i)$ is estimated** |
|:----------|:---------------------------------------|
| Classical linear-regression model | From the design matrix. |
| Poisson regression | By the expected value of the count. |
| Complicated models | By a constant for all residuals. |
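
For the classical linear-regression model this standardization is built into base R; a quick sketch using the `fit` object from the earlier chunk:

```{r}
# rstandard() divides each residual by its estimated standard deviation,
# which is derived from the design matrix (via the leverages)
r_std <- rstandard(fit)

summary(r_std)  # roughly standard-normal if the model fits well
```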

## Exploring residuals for classification models {-}

- Because they are restricted to the range $[-1,1]$, the raw residuals $r_i$ are not very useful for exploring the probability of observing $y_i$.

- If **all explanatory variables are categorical** with a limited number of categories, a **standard-normal approximation** can be obtained with the following steps (see the sketch after this list):

  - Divide the observations into $K$ groups sharing the same predicted value $f_k$.
  - Average the residuals $r_i$ per group and standardize the averages by $\sqrt{f_k(1-f_k)/n_k}$, where $n_k$ is the size of group $k$.
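
A sketch of this grouping idea on simulated binary data (all names and probability values below are illustrative, not from the book):

```{r}
# Simulated binary outcome with one categorical predictor; the "model"
# predicts a fixed probability per category (illustrative values)
set.seed(42)
g <- sample(c("a", "b", "c"), size = 300, replace = TRUE)
f <- c(a = 0.2, b = 0.5, c = 0.8)[g]   # model-predicted probabilities f_k
y <- rbinom(300, size = 1, prob = f)   # observed outcomes

r <- y - f                             # raw residuals, confined to [-1, 1]

# Average residuals per group and standardize by sqrt(f_k (1 - f_k) / n_k)
z <- tapply(seq_along(y), g, function(idx) {
  n_k <- length(idx)
  f_k <- f[idx][1]
  mean(r[idx]) / sqrt(f_k * (1 - f_k) / n_k)
})
round(z, 2)  # approximately standard-normal for a well-calibrated model
```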

## Exploring residuals for classification models {-}

In datasets where different observations lead to **different model predictions**, the distribution of the *Pearson residuals* will **not** be well approximated by the **standard normal**.

However, the **index plot** may still be useful to detect observations with **large residuals**.

{width="45%" height="60%"}

## Exploring residuals for classical linear-regression models {-}

- Residuals should be normally distributed with mean zero.
- The leverage values are the diagonal elements of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$, which maps the observed values to the fitted ones:

$$
\mathbf{\hat{y}} =
\mathbf{X}\hat{\beta} =
\mathbf{X}[(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}] =
\mathbf{H}\mathbf{y}
$$

- The expected variance of a residual is $\text{Var}(r_i) = \sigma^2 (1 - h_{ii})$, where $h_{ii}$ is the leverage of the $i$-th observation.
- For independent explanatory variables, this should lead to an (approximately) constant variance of the residuals.
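
These quantities are available directly in base R; a short sketch with the `fit` object from earlier:

```{r}
# Leverages: the diagonal of the hat matrix
h <- hatvalues(fit)

# Reconstruct standardized residuals from the variance formula above
sigma_hat <- summary(fit)$sigma
r_manual  <- residuals(fit) / (sigma_hat * sqrt(1 - h))

all.equal(unname(r_manual), unname(rstandard(fit)))  # should be TRUE
```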

## Residuals $r_i$ as a function of predicted values {-}

The plot should show points scattered **symmetrically** around the horizontal **line at 0**; instead:

- It has the shape of a funnel, reflecting **increasing variability** of the residuals for increasing fitted values: the variance is not constant *(a violation of homoscedasticity)*.

- The smoothed line suggests that the mean of the residuals becomes **increasingly positive** for increasing fitted values: the residuals do not appear to have *zero mean*.

{width="45%" height="60%"}

## Square root of absolute standardized residuals $\sqrt{|\tilde{r}_i|}$ as a function of predicted values {-}

The plot should show points spread evenly along the horizontal axis, with a flat smoothed trend.

- The increase in $\sqrt{|\tilde{r}_i|}$ with the predicted values indicates a violation of the *homoscedasticity assumption*.

{width="45%" height="60%"}

## Standardized residuals $\tilde{r}_i$ as a function of leverage $l_i$ {-}

- **Leverage** $l_i$ is a measure of how far away the **independent-variable values** of an observation are from those of the other observations.

- Data points with **large residuals (outliers)** and/or **high leverage** may distort the outcome and **accuracy of a regression**.

- The **predicted sum of squares (PRESS)** can be computed from the residuals and leverages without refitting the model:

```{=tex}
\begin{equation}
PRESS = \sum_{i=1}^{n} (\widehat{y}_{i(-i)} - y_i)^2 = \sum_{i=1}^{n} \frac{r_i^2}{(1-l_{i})^2}
\end{equation}
```

- **Cook's distance** measures the effect of deleting a given observation on the fitted model.
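
A quick sketch computing both quantities in base R for the `fit` object from earlier:

```{r}
# PRESS via the leverage shortcut: no model refitting required
h     <- hatvalues(fit)
press <- sum((residuals(fit) / (1 - h))^2)
press

# Cook's distance: influence of deleting each observation
d <- cooks.distance(fit)
head(sort(d, decreasing = TRUE))  # most influential observations
```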

{width="45%" height="60%"}

## Standardized residuals $\tilde{r}_i$ as a function of leverage $l_i$ {-}

Given that the $\tilde{r}_i$ should have an approximately **standard-normal distribution**, only about 0.5% of them should be **larger than 2.57** (and, by symmetry, about 0.5% smaller than $-2.57$).

An **excess of such observations** could be taken as a signal of issues with the **fit of the model**. The plot shows at least two such observations (59 and 143).
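
A quick version of this check for the `fit` object from earlier:

```{r}
# Under a good fit, roughly 1% of the standardized residuals
# should exceed 2.57 in absolute value
r_std <- rstandard(fit)
mean(abs(r_std) > 2.57)   # observed proportion
which(abs(r_std) > 2.57)  # candidate problematic observations
```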

{width="45%" height="60%"}

## Standardized residuals $\tilde{r}_i$ as a function of values expected from the standard normal distribution {-}

If the normality assumption is fulfilled, the plot should show points scattered close to the $45^{\circ}$ diagonal, but **this is not the case** here.

{width="45%" height="60%"}

## Apartment-prices: Model Performance {-}

Both models have almost the same overall performance.

```{r message=FALSE, results='hide'}
library("DALEX")
library("randomForest")

# Pre-trained models retrieved from the archivist repository
model_apart_lm <- archivist::aread("pbiecek/models/55f19")
model_apart_rf <- archivist::aread("pbiecek/models/fe7a5")

explain_apart_lm <- DALEX::explain(model = model_apart_lm,
                                   data = apartments_test[, -1],
                                   y = apartments_test$m2.price,
                                   label = "Linear Regression")

explain_apart_rf <- DALEX::explain(model = model_apart_rf,
                                   data = apartments_test[, -1],
                                   y = apartments_test$m2.price,
                                   label = "Random Forest")

mr_lm <- DALEX::model_performance(explain_apart_lm)
mr_rf <- DALEX::model_performance(explain_apart_rf)
```

```{r}
# Side-by-side comparison of the performance measures
list(lm = mr_lm,
     rf = mr_rf) |>
  lapply(\(x) unlist(x$measures) |> round(4)) |>
  as.data.frame()
```

## Apartment-prices: Residual distribution {-}

- The distributions of the residuals differ between the two models.

- The residuals of the **random forest** model:

  - Are centered around zero, so the predictions are, on average, **close to the actual values**.
  - Are skewed, indicating that for some observations the model significantly underestimated the actual values.

- The residuals of the **linear-regression** model:

  - Are split into two separate, normal-like parts located around $-200$ and $400$, which may suggest the **omission of a binary explanatory variable**.

- The **random forest** residuals seem to be **centered at a value closer to zero** than those of the **linear-regression** model, but they show larger variation.

```{r}
# Histograms of the residuals for both models, with a reference line at zero
plot(mr_rf, mr_lm, geom = "histogram") +
  ggplot2::geom_vline(xintercept = 0)
```

## Apartment-prices: Residual distribution {-}

The RMSE is comparable for the two models because:

- The residuals of the **random forest model are more frequently smaller** than the residuals of the linear-regression model.

- A **small fraction** of the random-forest residuals is very large.

```{r}
# Run ?DALEX:::plot.model_performance to check the documentation
plot(mr_rf, mr_lm,
     geom = "boxplot",
     show_outliers = 1)
```

## Apartment-prices: Residual distribution {-}

The **linear-regression** model does not capture the **non-linear relationship** between the `price` and the `year of construction`, which the partial-dependence profiles below make visible.

```{r}
# Partial-dependence profiles for construction year, one per model
pdp_lm_year <- model_profile(explainer = explain_apart_lm,
                             variables = "construction.year")

pdp_rf_year <- model_profile(explainer = explain_apart_rf,
                             variables = "construction.year")

plot(pdp_rf_year, pdp_lm_year)
```

## Random Forest: Residuals $r_i$ as a function of observed values {-}

> The random forest model, like the linear-regression model, is expected to yield homoscedastic residuals, i.e., residuals with constant variance.

The plot suggests that the predictions are shifted (biased) towards the average:

- For large observed values, the residuals are mostly positive.
- For small observed values, the residuals are mostly negative.

```{r}
# Diagnostics object collecting residuals, predictions, and observed values
md_rf <- model_diagnostics(explain_apart_rf)

plot(md_rf, variable = "y", yvariable = "residuals")
```

> For models such as linear regression, such heteroscedasticity of the residuals would be worrying. In random forest models, however, it may be less of a concern.

## Random Forest: Predicted values as a function of observed values {-}

The plot confirms that the predictions are shifted (biased) towards the average:

- For large observed values, the predictions are too small (positive residuals).
- For small observed values, the predictions are too large (negative residuals).

```{r}
# Predicted vs observed, with the diagonal y_hat = y as reference
plot(md_rf, variable = "y", yvariable = "y_hat") +
  ggplot2::geom_abline(colour = "red", intercept = 0, slope = 1)
```

## Random Forest: Residuals $r_i$ as a function of an (arbitrary) identifier of the observation {-}

The plot indicates:

- An **asymmetric** distribution of the residuals around zero.
- An **excess** of large positive residuals (larger than 500) without a corresponding fraction of large negative values.

```{r}
# Index plot: residuals against observation identifiers
plot(md_rf, variable = "ids", yvariable = "residuals")
```

## Random Forest: Residuals $r_i$ as a function of predicted values {-}

The plot again suggests that the predictions are shifted (biased) towards the average.

```{r}
plot(md_rf, variable = "y_hat", yvariable = "residuals")
```

## Random Forest: Absolute residuals as a function of predicted values {-}

> A variant of the scale-location plot.

- For homoscedastic residuals, we would expect a symmetric scatter around a horizontal line; the smoothed trend should also be horizontal.

- The plot deviates from the expected pattern and indicates that the variability of the residuals depends on the (predicted) value of the dependent variable.

```{r}
# Scale-location variant: |residuals| against predicted values
plot(md_rf, variable = "y_hat", yvariable = "abs_residuals")
```

## Pros and cons {-}

- Diagnostic methods based on residuals are very useful for identifying:

  - Problems with distributional assumptions.

  - Problems with the assumed structure of the model *(in terms of the selection of the explanatory variables and their form)*.

  - Groups of observations for which a model's predictions are biased.

- They have the following limitations:

  - *Interpretation of the patterns seen in graphs may not be straightforward.*

  - It may not be immediately obvious *which element of the model* has to be changed.

## Meeting Videos {-}

`r knitr::include_url("https://www.youtube.com/embed/URL")`

<details>

<summary>

Meeting chat log

</summary>

    LOG

</details>