Commit 7093c62

MNT Improve wording in Exercise M4.03 (#822)
1 parent eec5165 commit 7093c62

File tree: 4 files changed (+98, -60 lines)

4 files changed

+98
-60
lines changed

notebooks/linear_models_ex_03.ipynb (+23, -13)
@@ -6,8 +6,8 @@
 "source": [
 "# \ud83d\udcdd Exercise M4.03\n",
 "\n",
-"Now, we tackle a more realistic classification problem instead of making a\n",
-"synthetic dataset. We start by loading the Adult Census dataset with the\n",
+"Now, we tackle a (relatively) realistic classification problem instead of making\n",
+"a synthetic dataset. We start by loading the Adult Census dataset with the\n",
 "following snippet. For the moment we retain only the **numerical features**."
 ]
 },
@@ -32,10 +32,12 @@
 "source": [
 "We confirm that all the selected features are numerical.\n",
 "\n",
-"Compute the generalization performance in terms of accuracy of a linear model\n",
-"composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n",
-"cross-validation with `return_estimator=True` to be able to inspect the\n",
-"trained estimators."
+"Define a linear model composed of a `StandardScaler` followed by a\n",
+"`LogisticRegression` with default parameters.\n",
+"\n",
+"Then use a 10-fold cross-validation to estimate its generalization performance\n",
+"in terms of accuracy. Also set `return_estimator=True` to be able to inspect\n",
+"the trained estimators."
 ]
 },
 {
@@ -93,11 +95,12 @@
 "- The numerical data must be scaled.\n",
 "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n",
 "  group categories concerning less than 1% of the total samples.\n",
-"- The predictor is a `LogisticRegression`. You may need to increase the number\n",
-"  of `max_iter`, which is 100 by default.\n",
+"- The predictor is a `LogisticRegression` with default parameters, except that\n",
+"  you may need to increase the number of `max_iter`, which is 100 by default.\n",
 "\n",
 "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n",
-"above to evaluate this complex pipeline."
+"above to evaluate the full pipeline, including the feature scaling and encoding\n",
+"preprocessing."
 ]
 },
 {
@@ -186,10 +189,11 @@
 "metadata": {},
 "source": [
 "Now create a similar pipeline consisting of the same preprocessor as above,\n",
-"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n",
-"Set `degree=2` and `interaction_only=True` to the feature engineering step.\n",
-"Remember not to include a \"bias\" feature to avoid introducing a redundancy\n",
-"with the intercept of the subsequent logistic regression."
+"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and\n",
+"enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature\n",
+"engineering step. Remember not to include a \"bias\" feature to avoid\n",
+"introducing a redundancy with the intercept of the subsequent logistic\n",
+"regression."
 ]
 },
 {
@@ -205,6 +209,12 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"Use the same 10-fold cross-validation strategy as above to evaluate this\n",
+"pipeline with interactions. In this case there is no need to return the\n",
+"estimator, as the number of features generated by the `PolynomialFeatures` step\n",
+"is much too large to be able to visually explore the learned coefficients of the\n",
+"final classifier.\n",
+"\n",
 "By comparing the cross-validation test scores of both models fold-to-fold,\n",
 "count the number of times the model using multiplicative interactions and both\n",
 "numerical and categorical features has a better test score than the model\n",

notebooks/linear_models_sol_03.ipynb (+26, -17)
@@ -6,8 +6,8 @@
 "source": [
 "# \ud83d\udcc3 Solution for Exercise M4.03\n",
 "\n",
-"Now, we tackle a more realistic classification problem instead of making a\n",
-"synthetic dataset. We start by loading the Adult Census dataset with the\n",
+"Now, we tackle a (relatively) realistic classification problem instead of making\n",
+"a synthetic dataset. We start by loading the Adult Census dataset with the\n",
 "following snippet. For the moment we retain only the **numerical features**."
 ]
 },
@@ -32,10 +32,12 @@
 "source": [
 "We confirm that all the selected features are numerical.\n",
 "\n",
-"Compute the generalization performance in terms of accuracy of a linear model\n",
-"composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n",
-"cross-validation with `return_estimator=True` to be able to inspect the\n",
-"trained estimators."
+"Define a linear model composed of a `StandardScaler` followed by a\n",
+"`LogisticRegression` with default parameters.\n",
+"\n",
+"Then use a 10-fold cross-validation to estimate its generalization performance\n",
+"in terms of accuracy. Also set `return_estimator=True` to be able to inspect\n",
+"the trained estimators."
 ]
 },
 {
@@ -130,11 +132,12 @@
 "- The numerical data must be scaled.\n",
 "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n",
 "  group categories concerning less than 1% of the total samples.\n",
-"- The predictor is a `LogisticRegression`. You may need to increase the number\n",
-"  of `max_iter`, which is 100 by default.\n",
+"- The predictor is a `LogisticRegression` with default parameters, except that\n",
+"  you may need to increase the number of `max_iter`, which is 100 by default.\n",
 "\n",
 "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n",
-"above to evaluate this complex pipeline."
+"above to evaluate the full pipeline, including the feature scaling and encoding\n",
+"preprocessing."
 ]
 },
 {
@@ -293,10 +296,11 @@
 "metadata": {},
 "source": [
 "Now create a similar pipeline consisting of the same preprocessor as above,\n",
-"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n",
-"Set `degree=2` and `interaction_only=True` to the feature engineering step.\n",
-"Remember not to include a \"bias\" feature to avoid introducing a redundancy\n",
-"with the intercept of the subsequent logistic regression."
+"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and\n",
+"enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature\n",
+"engineering step. Remember not to include a \"bias\" feature to avoid\n",
+"introducing a redundancy with the intercept of the subsequent logistic\n",
+"regression."
 ]
 },
 {
@@ -308,18 +312,24 @@
 "# solution\n",
 "from sklearn.preprocessing import PolynomialFeatures\n",
 "\n",
-"model_with_interaction = make_pipeline(\n",
+"model_with_interactions = make_pipeline(\n",
 "    preprocessor,\n",
 "    PolynomialFeatures(degree=2, include_bias=False, interaction_only=True),\n",
 "    LogisticRegression(C=0.01, max_iter=5_000),\n",
 ")\n",
-"model_with_interaction"
+"model_with_interactions"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"Use the same 10-fold cross-validation strategy as above to evaluate this\n",
+"pipeline with interactions. In this case there is no need to return the\n",
+"estimator, as the number of features generated by the `PolynomialFeatures` step\n",
+"is much too large to be able to visually explore the learned coefficients of the\n",
+"final classifier.\n",
+"\n",
 "By comparing the cross-validation test scores of both models fold-to-fold,\n",
 "count the number of times the model using multiplicative interactions and both\n",
 "numerical and categorical features has a better test score than the model\n",
@@ -334,11 +344,10 @@
 "source": [
 "# solution\n",
 "cv_results_interactions = cross_validate(\n",
-"    model_with_interaction,\n",
+"    model_with_interactions,\n",
 "    data,\n",
 "    target,\n",
 "    cv=10,\n",
-"    return_estimator=True,\n",
 "    n_jobs=2,\n",
 ")\n",
 "test_score_interactions = cv_results_interactions[\"test_score\"]\n",

python_scripts/linear_models_ex_03.py (+23, -13)
@@ -14,8 +14,8 @@
 # %% [markdown]
 # # 📝 Exercise M4.03
 #
-# Now, we tackle a more realistic classification problem instead of making a
-# synthetic dataset. We start by loading the Adult Census dataset with the
+# Now, we tackle a (relatively) realistic classification problem instead of making
+# a synthetic dataset. We start by loading the Adult Census dataset with the
 # following snippet. For the moment we retain only the **numerical features**.

 # %%
@@ -30,10 +30,12 @@
 # %% [markdown]
 # We confirm that all the selected features are numerical.
 #
-# Compute the generalization performance in terms of accuracy of a linear model
-# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold
-# cross-validation with `return_estimator=True` to be able to inspect the
-# trained estimators.
+# Define a linear model composed of a `StandardScaler` followed by a
+# `LogisticRegression` with default parameters.
+#
+# Then use a 10-fold cross-validation to estimate its generalization performance
+# in terms of accuracy. Also set `return_estimator=True` to be able to inspect
+# the trained estimators.

 # %%
 # Write your code here.
@@ -61,11 +63,12 @@
 # - The numerical data must be scaled.
 # - The categorical data must be one-hot encoded, set `min_frequency=0.01` to
 #   group categories concerning less than 1% of the total samples.
-# - The predictor is a `LogisticRegression`. You may need to increase the number
-#   of `max_iter`, which is 100 by default.
+# - The predictor is a `LogisticRegression` with default parameters, except that
+#   you may need to increase the number of `max_iter`, which is 100 by default.
 #
 # Use the same 10-fold cross-validation strategy with `return_estimator=True` as
-# above to evaluate this complex pipeline.
+# above to evaluate the full pipeline, including the feature scaling and encoding
+# preprocessing.

 # %%
 # Write your code here.
@@ -110,15 +113,22 @@

 # %% [markdown]
 # Now create a similar pipeline consisting of the same preprocessor as above,
-# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.
-# Set `degree=2` and `interaction_only=True` to the feature engineering step.
-# Remember not to include a "bias" feature to avoid introducing a redundancy
-# with the intercept of the subsequent logistic regression.
+# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and
+# enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature
+# engineering step. Remember not to include a "bias" feature to avoid
+# introducing a redundancy with the intercept of the subsequent logistic
+# regression.

 # %%
 # Write your code here.

 # %% [markdown]
+# Use the same 10-fold cross-validation strategy as above to evaluate this
+# pipeline with interactions. In this case there is no need to return the
+# estimator, as the number of features generated by the `PolynomialFeatures` step
+# is much too large to be able to visually explore the learned coefficients of the
+# final classifier.
+#
 # By comparing the cross-validation test scores of both models fold-to-fold,
 # count the number of times the model using multiplicative interactions and both
 # numerical and categorical features has a better test score than the model
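The remark added in this hunk about the `PolynomialFeatures` output being "much too large" can be checked directly: with `degree=2`, `interaction_only=True` and `include_bias=False`, n input columns become n + n(n-1)/2 output columns, which grows quadratically once the one-hot encoding has produced many columns. A small sketch (the 6-column input is arbitrary):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.RandomState(0).rand(10, 6)  # 6 arbitrary input features

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
X_poly = poly.fit_transform(X)

# 6 original columns + C(6, 2) = 15 pairwise products = 21 columns in total.
assert X_poly.shape[1] == 6 + 6 * 5 // 2
```

With the dozens of columns the one-hot encoder produces on Adult Census, the same formula easily yields hundreds of features, which is why inspecting per-feature coefficients stops being practical.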

python_scripts/linear_models_sol_03.py (+26, -17)
@@ -8,8 +8,8 @@
 # %% [markdown]
 # # 📃 Solution for Exercise M4.03
 #
-# Now, we tackle a more realistic classification problem instead of making a
-# synthetic dataset. We start by loading the Adult Census dataset with the
+# Now, we tackle a (relatively) realistic classification problem instead of making
+# a synthetic dataset. We start by loading the Adult Census dataset with the
 # following snippet. For the moment we retain only the **numerical features**.

 # %%
@@ -24,10 +24,12 @@
 # %% [markdown]
 # We confirm that all the selected features are numerical.
 #
-# Compute the generalization performance in terms of accuracy of a linear model
-# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold
-# cross-validation with `return_estimator=True` to be able to inspect the
-# trained estimators.
+# Define a linear model composed of a `StandardScaler` followed by a
+# `LogisticRegression` with default parameters.
+#
+# Then use a 10-fold cross-validation to estimate its generalization performance
+# in terms of accuracy. Also set `return_estimator=True` to be able to inspect
+# the trained estimators.

 # %%
 # solution
@@ -84,11 +86,12 @@
 # - The numerical data must be scaled.
 # - The categorical data must be one-hot encoded, set `min_frequency=0.01` to
 #   group categories concerning less than 1% of the total samples.
-# - The predictor is a `LogisticRegression`. You may need to increase the number
-#   of `max_iter`, which is 100 by default.
+# - The predictor is a `LogisticRegression` with default parameters, except that
+#   you may need to increase the number of `max_iter`, which is 100 by default.
 #
 # Use the same 10-fold cross-validation strategy with `return_estimator=True` as
-# above to evaluate this complex pipeline.
+# above to evaluate the full pipeline, including the feature scaling and encoding
+# preprocessing.

 # %%
 # solution
@@ -195,23 +198,30 @@

 # %% [markdown]
 # Now create a similar pipeline consisting of the same preprocessor as above,
-# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.
-# Set `degree=2` and `interaction_only=True` to the feature engineering step.
-# Remember not to include a "bias" feature to avoid introducing a redundancy
-# with the intercept of the subsequent logistic regression.
+# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and
+# enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature
+# engineering step. Remember not to include a "bias" feature to avoid
+# introducing a redundancy with the intercept of the subsequent logistic
+# regression.

 # %%
 # solution
 from sklearn.preprocessing import PolynomialFeatures

-model_with_interaction = make_pipeline(
+model_with_interactions = make_pipeline(
     preprocessor,
     PolynomialFeatures(degree=2, include_bias=False, interaction_only=True),
     LogisticRegression(C=0.01, max_iter=5_000),
 )
-model_with_interaction
+model_with_interactions

 # %% [markdown]
+# Use the same 10-fold cross-validation strategy as above to evaluate this
+# pipeline with interactions. In this case there is no need to return the
+# estimator, as the number of features generated by the `PolynomialFeatures` step
+# is much too large to be able to visually explore the learned coefficients of the
+# final classifier.
+#
 # By comparing the cross-validation test scores of both models fold-to-fold,
 # count the number of times the model using multiplicative interactions and both
 # numerical and categorical features has a better test score than the model
@@ -220,11 +230,10 @@
 # %%
 # solution
 cv_results_interactions = cross_validate(
-    model_with_interaction,
+    model_with_interactions,
     data,
     target,
     cv=10,
-    return_estimator=True,
     n_jobs=2,
 )
 test_score_interactions = cv_results_interactions["test_score"]
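The fold-to-fold comparison requested at the end of the exercise reduces to an elementwise comparison of the two `test_score` arrays returned by `cross_validate`. A sketch with made-up per-fold scores (the numbers are illustrative, not results from the exercise):

```python
import numpy as np

# Hypothetical per-fold accuracies for the two models (10 folds each).
test_score = np.array(
    [0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.78, 0.80, 0.82, 0.79]
)
test_score_interactions = np.array(
    [0.84, 0.80, 0.83, 0.85, 0.82, 0.83, 0.80, 0.84, 0.85, 0.82]
)

# Count the folds where the model with interactions has the better score.
wins = int((test_score_interactions > test_score).sum())
print(f"interactions model wins {wins} out of {len(test_score)} folds")
```

Because both models were evaluated with the same `cv=10` splitter, the scores at each index correspond to the same train/test split, which is what makes this paired comparison meaningful.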

0 commit comments
