Commit 63df051: Chapter 19 (#20)

* ending theoretical part
* adding examples
* ending summary of chapter 19

1 parent f2eec16 · 7 files changed: +306 -10 lines

19_residual-diagnostics-plots.Rmd (+306 -10)

# Residual-Diagnostics Plots

**Learning Objectives:**

This section presents graphical methods for a detailed examination of **model performance**, at both the **overall** and the **instance-specific** level.

Residuals can be used to:

- **Identify potentially problematic instances**, which can help determine the factors that contribute most to prediction errors.
- **Detect systematic deviations from the expected behavior**, which could be due to:
  - the omission of an explanatory variable, or
  - the inclusion of a variable in an incorrect functional form.
- **Identify the largest prediction errors**, irrespective of the overall performance of the predictive model.

## Quality of predictions {-}

- In a **"perfect" predictive model**, the `predicted value` equals the `actual value` of the variable for every observation.

- In practice, we want the predictions to be **reasonably close** to the actual values.

- To quantify the **quality of predictions**, we can use the *difference* between the `predicted value` and the `actual value`, called the **residual**.

For a continuous dependent variable $Y$, the residual $r_i$ for the $i$-th observation in a dataset is:

```{=tex}
\begin{equation}
r_i = y_i - f(\underline{x}_i) = y_i - \widehat{y}_i
\end{equation}
```

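As a concrete illustration, here is a minimal sketch, assuming the `apartments` data shipped with `DALEX` and an illustrative `lm` fit, that computes the residuals by hand and checks them against R's `residuals()`:

```{r}
# Minimal sketch: a residual is the observed minus the predicted value.
# `fit` is an illustrative linear model on the DALEX `apartments` data.
library("DALEX")
fit <- lm(m2.price ~ construction.year + surface, data = apartments)
r <- apartments$m2.price - predict(fit, newdata = apartments)  # r_i = y_i - y_hat_i
all.equal(unname(r), unname(residuals(fit)))
```
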
## Characteristics of a good model {-}

To evaluate a model, we need to study the *"behavior" of the residuals* for a group of observations and confirm that they **deviate from zero randomly**, implying that:

- their distribution should be **symmetric around zero**, so their mean (or median) value should be zero;
- their values must be **close to zero**, showing low variability.

## Graphical methods to verify properties {-}

- **Histogram**: to check the **symmetry** and **location** of the distribution of the residuals *without any distributional assumption*.

- **Quantile-quantile plot**: to check whether the residuals follow a specific distribution (e.g., the normal one).

![Source: <https://rpubs.com/stevenlsenior/normal_residuals_with_code>](img/19-residual-diagnostics-plots/01-histogram-quantile-plot.png)

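As a sketch, both checks can be produced with base R (reusing the illustrative `fit` from above):

```{r}
# Histogram: symmetry and location of the residuals, assumption-free.
# Q-Q plot: residual quantiles against standard-normal quantiles.
r <- residuals(fit)
hist(r, breaks = 30, main = "Histogram of residuals", xlab = "Residual")
qqnorm(r)
qqline(r)  # points close to the line suggest normality
```
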
## Standardized *(Pearson)* residuals {-}

```{=tex}
\begin{equation}
\tilde{r}_i = \frac{r_i}{\sqrt{\mbox{Var}(r_i)}}
\end{equation}
```

where $\mbox{Var}(r_i)$ is the **estimated variance** of the residual $r_i$.

| **Model** | **How $\mbox{Var}(r_i)$ is estimated** |
|:----------------------------------|:---------------------------------------------|
| Classical linear-regression model | From the design matrix. |
| Poisson regression | By the expected value of the count. |
| More complicated models | By a constant for all residuals. |

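For a classical linear-regression model this is built into R as `rstandard()`; a minimal sketch (reusing the illustrative `fit`) showing how it uses the design matrix through the leverages:

```{r}
# Standardized residuals: r_i divided by its estimated standard deviation,
# which for lm depends on the leverages h_ii derived from the design matrix.
h <- hatvalues(fit)
s <- summary(fit)$sigma
all.equal(unname(rstandard(fit)),
          unname(residuals(fit) / (s * sqrt(1 - h))))
```
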
## Exploring residuals for classification models {-}

- Because of their limited range ($[-1,1]$), the raw residuals $r_i$ are not very useful for exploring the probability of observing $y_i$.

- If **all explanatory variables are categorical**, with a limited number of categories, a **standard-normal approximation** can be obtained with the following steps (see the sketch below):

  - divide the observations into $K$ groups sharing the same predicted value $f_k$;
  - average the residuals $r_i$ per group and standardize them with $\sqrt{f_k(1-f_k)/n_k}$.

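A small simulated sketch of these two steps (the data and all names here are illustrative, not from the book):

```{r}
# Illustrative only: logistic regression with one categorical predictor,
# so observations fall into K = 3 groups with identical predictions f_k.
set.seed(1)
g <- factor(sample(c("A", "B", "C"), 300, replace = TRUE))
y <- rbinom(300, size = 1, prob = c(A = 0.2, B = 0.5, C = 0.8)[as.character(g)])
m <- glm(y ~ g, family = binomial)

f_k <- tapply(fitted(m), g, mean)         # shared predicted value per group
r_k <- tapply(y - fitted(m), g, mean)     # average residual per group
n_k <- as.numeric(table(g))
r_k / sqrt(f_k * (1 - f_k) / n_k)         # approximately standard normal
```
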
## Exploring residuals for classification models {-}

In datasets where different observations lead to **different model predictions**, the distribution of the *Pearson residuals* will **not** be well approximated by the **standard-normal** one.

The **index plot** may still be useful, however, to detect observations with **large residuals**.

![Source: <http://www.philender.com/courses/linearmodels/notes1/index.html>](img/19-residual-diagnostics-plots/02-index-plot.png){width="45%" height="60%"}

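A minimal index-plot sketch (reusing the illustrative classification model `m` from the previous slide):

```{r}
# Index plot: residuals against an (arbitrary) observation number;
# isolated spikes flag observations with unusually large residuals.
pr <- residuals(m, type = "pearson")
plot(seq_along(pr), pr, xlab = "Observation index", ylab = "Pearson residual")
abline(h = 0, lty = 2)
```
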
## Exploring residuals for classical linear-regression models {-}

- Residuals should be normally distributed, with mean zero.
- The leverage values come from the diagonal of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$, which maps the observed values to the predictions:

$$
\mathbf{\hat{y}} =
\mathbf{X}\hat{\beta} =
\mathbf{X}[(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}] =
\mathbf{H}\mathbf{y}
$$

- The expected variance of a residual is given by $\text{Var}(r_i) = \sigma^2 (1 - h_{ii})$, where $h_{ii}$ is the $i$-th diagonal element of $\mathbf{H}$.
- For independent explanatory variables, this should lead to an approximately constant variance of the residuals. (A numerical check follows below.)

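A sketch verifying both relations numerically for the illustrative `fit`:

```{r}
# The hat matrix maps y onto y_hat; its diagonal holds the leverages h_ii.
X <- model.matrix(fit)
y <- model.response(model.frame(fit))
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(unname(diag(H)), unname(hatvalues(fit)))     # leverages match
all.equal(unname(drop(H %*% y)), unname(fitted(fit)))  # H y equals y_hat
```
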
## Residuals $r_i$ as a function of predicted values {-}

The plot should show points scattered **symmetrically** around the horizontal **line at 0**; here, however:

- The scatter has a funnel shape, reflecting **increasing variability** of the residuals for increasing fitted values, so the variance is not constant *(a violation of homoscedasticity)*.

- The smoothed line suggests that the mean of the residuals becomes **increasingly positive** for increasing fitted values, so the residuals do not seem to have *zero mean*.

![](img/19-residual-diagnostics-plots/03-residuals-vs-fitted.png){width="45%" height="60%"}

## Square root of absolute standardized residuals $\sqrt{|\tilde{r}_i|}$ as a function of predicted values {-}

The plot should show points scattered **symmetrically** around a horizontal trend.

- The increase in $\sqrt{|\tilde{r}_i|}$ with the fitted values indicates a violation of the *homoscedasticity assumption*.

![](img/19-residual-diagnostics-plots/04-scale-location.png){width="45%" height="60%"}

## Standardized residuals $\tilde{r}_i$ as a function of leverage $l_i$ {-}

- **Leverage** $l_i$ measures how far the **explanatory-variable values** of an observation are from those of the other observations.

- Data points with **large residuals (outliers)** and/or **high leverage** may distort the outcome and **accuracy of a regression**.

- The **predicted sum-of-squares** combines the two (a sketch follows below):

```{=tex}
\begin{equation}
PRESS = \sum_{i=1}^{n} (\widehat{y}_{i(-i)} - y_i)^2 = \sum_{i=1}^{n} \frac{r_i^2}{(1-l_{i})^2}
\end{equation}
```

where $\widehat{y}_{i(-i)}$ is the prediction for the $i$-th observation obtained from the model fitted without it.

- **Cook's distance** measures the effect of deleting a given observation.

![](img/19-residual-diagnostics-plots/05-residuals-vs-leverage.png){width="45%" height="60%"}

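A sketch computing PRESS via the leverage shortcut, plus Cook's distances, for the illustrative `fit`; thanks to the identity above, no model ever needs to be refitted:

```{r}
# Leave-one-out residual for lm is r_i / (1 - l_i), so PRESS needs no refits.
l <- hatvalues(fit)
sum((residuals(fit) / (1 - l))^2)  # PRESS

# Cook's distance: the influence of deleting each observation on the fit.
head(sort(cooks.distance(fit), decreasing = TRUE))
```
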
## Standardized residuals $\tilde{r}_i$ as a function of leverage $l_i$ {-}

Given that the $\tilde{r}_i$ should have an approximately **standard-normal distribution**, only about 0.5% of them should be **larger than 2.57** (and another 0.5% smaller than -2.57).

An **excess of such observations** can be taken as a signal of issues with the **fit of the model**. The plot shows at least two such observations (59 and 143).

![](img/19-residual-diagnostics-plots/05-residuals-vs-leverage.png){width="45%" height="60%"}

## Standardized residuals $\tilde{r}_i$ as a function of values expected from the standard normal distribution {-}

If the normality assumption were fulfilled, the plot would show a scatter of points close to the $45^{\circ}$ diagonal, but **this is not the case** here.

![](img/19-residual-diagnostics-plots/06-normal-q-q.png){width="45%" height="60%"}

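For reference, the four diagnostic plots discussed above can be reproduced for any `lm` object with base R's `plot()` method; a sketch using the illustrative `fit`:

```{r}
# 1: residuals vs fitted, 2: normal Q-Q, 3: scale-location,
# 5: standardized residuals vs leverage (with Cook's distance contours).
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))
par(mfrow = c(1, 1))
```
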
## Apartment-prices: Model Performance {-}

Both models have almost the same performance.

```{r message=FALSE, results='hide'}
library("DALEX")
library("randomForest")

# Retrieve the pre-trained models from the archivist repository
model_apart_lm <- archivist::aread("pbiecek/models/55f19")
model_apart_rf <- archivist::aread("pbiecek/models/fe7a5")

explain_apart_lm <- DALEX::explain(model = model_apart_lm,
                                   data = apartments_test[, -1],
                                   y = apartments_test$m2.price,
                                   label = "Linear Regression")

explain_apart_rf <- DALEX::explain(model = model_apart_rf,
                                   data = apartments_test[, -1],
                                   y = apartments_test$m2.price,
                                   label = "Random Forest")

mr_lm <- DALEX::model_performance(explain_apart_lm)
mr_rf <- DALEX::model_performance(explain_apart_rf)
```

```{r}
# Compare the performance measures of both models side by side
list(lm = mr_lm,
     rf = mr_rf) |>
  lapply(\(x) unlist(x$measures) |> round(4)) |>
  as.data.frame()
```

## Apartment-prices: Residual distribution {-}

- The distributions of the residuals for the two models are different.

- The residuals of the **random forest** model:

  - are centered around zero, so the predictions are, on average, **close to the actual values**;
  - are skewed, indicating some predictions where the model significantly underestimated the actual values.

- The residuals of the **linear-regression** model:

  - are split into two separate, normal-like parts, located around -200 and 400, which may suggest the **omission of a binary explanatory variable**.

- The **random forest** residuals seem to be **centered at a value closer to zero** than those of the **linear-regression** model, but they show larger variation.

```{r}
plot(mr_rf, mr_lm, geom = "histogram") +
  ggplot2::geom_vline(xintercept = 0)
```

## Apartment-prices: Residual distribution {-}

The RMSE is comparable for the two models because:

- the residuals for the **random forest model are more frequently smaller** than the residuals for the linear-regression model, but

- a **small fraction** of the random-forest residuals is very large.

```{r}
# Run ?DALEX:::plot.model_performance to check the documentation
plot(mr_rf, mr_lm,
     geom = "boxplot",
     show_outliers = 1)
```

## Apartment-prices: Partial-dependence profiles {-}

The **linear-regression** model does not capture the **non-linear relationship** between the `price` and the `year of construction`.

```{r}
pdp_lm_year <- model_profile(explainer = explain_apart_lm,
                             variables = "construction.year")

pdp_rf_year <- model_profile(explainer = explain_apart_rf,
                             variables = "construction.year")

plot(pdp_rf_year, pdp_lm_year)
```

## Random Forest: Residuals $r_i$ as a function of observed values {-}

> For the random forest model, as for the linear-regression model, the residuals should be homoscedastic, i.e., they should have a constant variance.

The plot suggests that the predictions are shifted (biased) towards the average:

- for large observed values the residuals are positive;
- for small observed values the residuals are negative.

```{r}
md_rf <- model_diagnostics(explain_apart_rf)

plot(md_rf, variable = "y", yvariable = "residuals")
```

> For models like linear regression, such heteroscedasticity of the residuals would be worrying. In random forest models, however, it may be less of a concern.

## Random Forest: Predicted values as a function of observed values {-}

This plot, too, suggests that the predictions are shifted (biased) towards the average:

- for large observed values the residuals are positive (points fall below the diagonal);
- for small observed values the residuals are negative (points fall above the diagonal).

```{r}
plot(md_rf, variable = "y", yvariable = "y_hat") +
  ggplot2::geom_abline(colour = "red", intercept = 0, slope = 1)
```

## Random Forest: Residuals $r_i$ as a function of an (arbitrary) identifier of the observation {-}

The plot indicates:

- an **asymmetric** distribution of the residuals around zero;
- an **excess** of large positive residuals (larger than 500) without a corresponding fraction of large negative values.

```{r}
plot(md_rf, variable = "ids", yvariable = "residuals")
```

## Random Forest: Residuals $r_i$ as a function of predicted values {-}

This plot also suggests that the predictions are shifted (biased) towards the average.

```{r}
plot(md_rf, variable = "y_hat", yvariable = "residuals")
```

## Random Forest: Absolute value of residuals as a function of predicted values {-}

> A variant of the scale-location plot.

- For homoscedastic residuals, we would expect a symmetric scatter around a horizontal line; the smoothed trend should also be horizontal.

- The plot deviates from this expected pattern and indicates that the variability of the residuals depends on the (predicted) value of the dependent variable.

```{r}
plot(md_rf, variable = "y_hat", yvariable = "abs_residuals")
```

## Pros and cons {-}

- Diagnostic methods based on residuals are very useful for identifying:

  - problems with distributional assumptions;
  - problems with the assumed structure of the model *(in terms of the selection of the explanatory variables and their form)*;
  - groups of observations for which a model's predictions are biased.

- They have the following limitations:

  - the interpretation of the patterns seen in the graphs may not be straightforward;
  - it may not be immediately obvious *which element of the model* has to be changed.

## Meeting Videos {-}

`r knitr::include_url("https://www.youtube.com/embed/URL")`

<details>
<summary>
Meeting chat log
</summary>

LOG
</details>