Changed file: `examples/bart/bart_categorical_hawks.myst.md` (37 additions, 10 deletions)
````diff
@@ -5,9 +5,9 @@ jupytext:
     format_name: myst
     format_version: 0.13
 kernelspec:
-  display_name: Python 3 (ipykernel)
+  display_name: pymc-examples
   language: python
-  name: python3
+  name: pymc-examples
 myst:
   substitutions:
     conda_dependencies: pymc-bart
````
````diff
@@ -52,6 +52,11 @@ RANDOM_SEED = 8457
 az.style.use("arviz-darkgrid")
 ```
 
+```{code-cell} ipython3
+%load_ext autoreload
+%autoreload 2
+```
+
 ## Hawks dataset
 
 Here we will use a dataset that contains information about 3 species of hawks (*CH*=Cooper's, *RT*=Red-tailed, *SS*=Sharp-Shinned). This dataset has information for 908 individuals in total, each one containing 16 variables, in addition to the species. To simplify the example, we will use the following 5 covariables:
````
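As context for the dataset setup described above, here is a minimal sketch of the kind of preprocessing involved: encoding the categorical species response as integer codes for a classifier. The values and the exact column set are made up for illustration (the scrape elides the notebook's real covariable list), only the names `Species`, `Wing`, `Hallux`, `Culmen`, and `Tail` appear in the text.

```python
import pandas as pd

# Toy stand-in for the Hawks data (the real dataset has 908 rows, 3 species).
# Column names come from the text; values here are invented.
df = pd.DataFrame(
    {
        "Species": ["CH", "RT", "SS", "RT", "SS", "RT"],
        "Wing": [240.0, 380.0, 170.0, 390.0, 175.0, 385.0],
        "Hallux": [18.0, 31.0, 12.0, 30.0, 13.0, 32.0],
        "Culmen": [15.0, 26.0, 11.0, 27.0, 12.0, 26.0],
        "Tail": [190.0, 220.0, 140.0, 225.0, 145.0, 222.0],
    }
)

# Encode the categorical response as integer codes, as a categorical
# likelihood expects; categories are sorted alphabetically by pandas.
species_cat = pd.Categorical(df["Species"])
y = species_cat.codes                      # ints in {0, 1, 2}
species = species_cat.categories.tolist()
print(species)  # ['CH', 'RT', 'SS']
```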
````diff
@@ ... @@
-It can be observed that with the covariables `Hallux`, `Culmen`, and `Wing` we achieve the same R$^2$ value that we obtained with all the covariables, this is that the last two covariables contribute less than the other three to the classification. One thing we have to take into account in this is that the HDI is quite wide, which gives us less precision on the results, later we are going to see a way to reduce this.
+It can be observed that with the covariables `Hallux`, `Culmen`, and `Wing` we achieve the same $R^2$ value that we obtained with all the covariables; that is, the last two covariables contribute less than the other three to the classification. One thing to take into account is that the HDI is quite wide, which gives us less precision on the results; later we will see a way to reduce this.
 
-+++
+We can also plot the scatter plot of the submodels' predictions against the full model's predictions to get an idea of how each new covariate improves the submodel's predictions.
````
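The $R^2$ comparison between submodel and full-model predictions discussed above can be illustrated with plain NumPy. This is a hedged sketch, not the notebook's code: `r2_score` is a helper defined here, and the prediction arrays are synthetic.

```python
import numpy as np

def r2_score(y_ref, y_pred):
    """Coefficient of determination of y_pred against y_ref: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_ref - y_pred) ** 2)
    ss_tot = np.sum((y_ref - np.mean(y_ref)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(8457)  # seed mirrors RANDOM_SEED in the notebook
full = rng.normal(size=200)                    # stand-in for full-model predictions
sub = full + rng.normal(scale=0.1, size=200)   # a submodel close to the full model
print(r2_score(full, sub))
```

A submodel whose predictions track the full model's closely gives an $R^2$ near 1, which is the criterion the text uses to drop the last two covariables.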
````diff
@@ ... @@
 plt.suptitle("Comparison of submodels' predictions to full model's\n", fontsize=18)
+for ax, cat in zip(axes, np.repeat(species, len(vi_results["labels"]))):
+    ax.set(title=f"Species {cat}")
+```
@@ ... @@
 ### Partial Dependence Plot
 
-Let's check the behavior of each covariable for each species with `pmb.plot_pdp()`, which shows the marginal effect a covariate has on the predicted variable, while we average over all the other covariates.
+Let's check the behavior of each covariable for each species with `pmb.plot_pdp()`, which shows the marginal effect a covariate has on the predicted variable, while we average over all the other covariates. Since our response variable is categorical, we use the `softmax_link=True` parameter to get the partial dependence plot in the probability space.
@@ ... @@
+for (i, ax), cat in zip(enumerate(axes), np.tile(species, len(vi_results["labels"]))):
+    ax.set(title=f"Species {cat}")
 ```
````
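The `softmax_link=True` behavior described above maps the latent per-species outputs into probabilities. A minimal sketch of that transform in plain NumPy follows; this is an illustration of the softmax link itself, not pymc-bart internals, and the latent values are made up.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax: exponentiate shifted values, then normalize."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Latent values for one observation, one per species (CH, RT, SS); invented numbers.
latent = np.array([0.2, 1.5, -0.7])
probs = softmax(latent)
print(probs.round(3), probs.sum())
```

On the probability scale the three curves for each covariable sum to 1, which is what makes the per-species partial dependence panels comparable.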
````diff
@@ ... @@
 
-The pdp plot, together with the Variable Importance plot, confirms that `Tail` is the covariable with the smaller effect over the predicted variable. In the Variable Importance plot `Tail` is the last covariable to be added and does not improve the result, in the pdp plot `Tail` has the flattest response. For the rest of the covariables in this plot, it's hard to see which of them have more effect over the predicted variable, because they have great variability, showed in the HDI wide, same as before later we are going to see a way to reduce this variability. Finally, some variability depends on the amount of data for each species, which we can see in the `counts` from one of the covariables using Pandas `.describe()` and grouping the data from "Species" with `.groupby("Species")`.
+The Partial Dependence Plot, together with the Variable Importance plot, confirms that `Tail` is the covariable with the smallest effect on the predicted variable: in the Variable Importance plot, `Tail` is the last covariate to be added and does not improve the result; in the PDP plot, `Tail` has the flattest response.
 
-+++
+For the rest of the covariates in this plot, it's hard to see which of them has more effect on the predicted variable, because they have great variability, shown by the wide HDI.
+
+Finally, some variability depends on the amount of data for each species, which we can see in the `counts` of each covariable for each species:
````
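The per-species counts the new text refers to can be obtained exactly as described, with `.groupby("Species")` and `.describe()`. A toy illustration with invented values:

```python
import pandas as pd

# Invented values; the real notebook runs this on the Hawks data.
df = pd.DataFrame(
    {
        "Species": ["CH", "RT", "RT", "SS", "RT", "SS"],
        "Wing": [244.0, 380.0, 390.0, 170.0, 375.0, 180.0],
    }
)

# describe() per group yields count, mean, std, quartiles; `count` shows
# how many observations each species contributes.
summary = df.groupby("Species")["Wing"].describe()
print(summary["count"])
```

Species with fewer rows (here `CH`) will generally show wider uncertainty bands in the plots above.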
````diff
@@ ... @@
+for (i, ax), cat in zip(enumerate(axes), np.tile(species, len(vi_results["labels"]))):
+    ax.set(title=f"Species {cat}")
 ```
 
-Comparing these two plots with the previous ones shows a marked reduction in the variance for each one. In the case of `pmb.plot_variable_importance()` there are smallers error bands with an R$^{2}$ value more close to 1. And for `pm.plot_pdp()` we can see thinner bands and a reduction in the limits on the y-axis, this is a representation of the reduction of the uncertainty due to adjusting the trees separately. Another benefit of this is that is more visible the behavior of each covariable for each one of the species.
+Comparing these two plots with the previous ones shows a marked reduction in the variance for each one. In the case of `pmb.plot_variable_importance()` there are smaller error bands with an $R^{2}$ value closer to 1. And for `pmb.plot_pdp()` we can see thinner HDI bands. This reflects the reduction in uncertainty due to fitting the trees separately. Another benefit is that the behavior of each covariable for each one of the species is more visible.
 
 With all these together, we can select `Hallux`, `Culmen`, and `Wing` as covariables to make the classification.
````
````diff
@@ -259,6 +285,7 @@ all
 ## Authors
 
 - Authored by [Pablo Garay](https://github.com/PabloGGaray) and [Osvaldo Martin](https://aloctavodia.github.io/) in May, 2024
 - Updated by Osvaldo Martin in Dec, 2024
+- Expanded by [Alex Andorra](https://github.com/AlexAndorra) in Feb, 2025
````