Commit 41519d4

Merge branch 'main' into update-to-scikit-learn-1.6
2 parents: a1f0aef + 7093c62

9 files changed: 138 additions & 80 deletions

build_tools/generate-exercise-from-solution.py

Lines changed: 25 additions & 2 deletions
@@ -6,6 +6,9 @@
 import jupytext
 
 
+WRITE_YOUR_CODE_COMMENT = "# Write your code here."
+
+
 def replace_simple_text(input_py_str):
     result = input_py_str.replace("📃 Solution for", "📝")
     return result
@@ -44,7 +47,24 @@ def remove_solution(input_py_str):
     ]
 
     for c in cells_to_modify:
-        c["source"] = pattern.sub("# Write your code here.", c["source"])
+        c["source"] = pattern.sub(WRITE_YOUR_CODE_COMMENT, c["source"])
+
+    previous_cell_is_write_your_code = False
+    all_cells_before_deduplication = nb.cells
+    nb.cells = []
+    for c in all_cells_before_deduplication:
+        if c["cell_type"] == "code" and c["source"] == WRITE_YOUR_CODE_COMMENT:
+            current_cell_is_write_your_code = True
+        else:
+            current_cell_is_write_your_code = False
+        if (
+            current_cell_is_write_your_code
+            and previous_cell_is_write_your_code
+        ):
+            # Drop duplicated "write your code here" cells.
+            continue
+        nb.cells.append(c)
+        previous_cell_is_write_your_code = current_cell_is_write_your_code
 
     # TODO: we could potentially try to avoid changing the input file jupytext
     # header since this info is rarely useful. Let's keep it simple for now.
@@ -53,6 +73,7 @@ def remove_solution(input_py_str):
 
 
 def write_exercise(solution_path, exercise_path):
+    print(f"Writing exercise to {exercise_path} from solution {solution_path}")
     input_str = solution_path.read_text()
 
     output_str = input_str
@@ -67,7 +88,9 @@ def write_all_exercises(python_scripts_folder):
     for solution_path in solution_paths:
         exercise_path = Path(str(solution_path).replace("_sol_", "_ex_"))
         if not exercise_path.exists():
-            print(f"{exercise_path} does not exist")
+            print(
+                f"{exercise_path} does not exist, generating it from solution."
+            )
 
         write_exercise(solution_path, exercise_path)
 
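The new deduplication pass drops a placeholder cell only when the previous kept cell was also a placeholder, so runs of two or more consecutive "write your code here" cells collapse to one. A minimal standalone sketch of the same logic, using plain dicts as stand-in notebook cells (the jupytext round-trip of the real script is omitted):

# Sketch of the deduplication pass above, on stand-in cells.
WRITE_YOUR_CODE_COMMENT = "# Write your code here."


def drop_duplicated_placeholder_cells(cells):
    deduplicated = []
    previous_is_placeholder = False
    for cell in cells:
        current_is_placeholder = (
            cell["cell_type"] == "code"
            and cell["source"] == WRITE_YOUR_CODE_COMMENT
        )
        if current_is_placeholder and previous_is_placeholder:
            # Skip every placeholder that follows another placeholder, so
            # runs of any length collapse to a single cell.
            continue
        deduplicated.append(cell)
        previous_is_placeholder = current_is_placeholder
    return deduplicated


cells = [
    {"cell_type": "markdown", "source": "Some instructions."},
    {"cell_type": "code", "source": WRITE_YOUR_CODE_COMMENT},
    {"cell_type": "code", "source": WRITE_YOUR_CODE_COMMENT},
]
assert len(drop_duplicated_placeholder_cells(cells)) == 2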

notebooks/cross_validation_ex_01.ipynb

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@
 "exercise.\n",
 "\n",
 "Also, this classifier can become more flexible/expressive by using a so-called\n",
-"kernel that makes the model become non-linear. Again, no undestanding regarding\n",
+"kernel that makes the model become non-linear. Again, no understanding regarding\n",
 "the mathematics is required to accomplish this exercise.\n",
 "\n",
 "We will use an RBF kernel where a parameter `gamma` allows to tune the\n",

notebooks/linear_models_ex_03.ipynb

Lines changed: 23 additions & 22 deletions
@@ -6,8 +6,8 @@
 "source": [
 "# \ud83d\udcdd Exercise M4.03\n",
 "\n",
-"Now, we tackle a more realistic classification problem instead of making a\n",
-"synthetic dataset. We start by loading the Adult Census dataset with the\n",
+"Now, we tackle a (relatively) realistic classification problem instead of making\n",
+"a synthetic dataset. We start by loading the Adult Census dataset with the\n",
 "following snippet. For the moment we retain only the **numerical features**."
 ]
 },
@@ -32,10 +32,12 @@
 "source": [
 "We confirm that all the selected features are numerical.\n",
 "\n",
-"Compute the generalization performance in terms of accuracy of a linear model\n",
-"composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n",
-"cross-validation with `return_estimator=True` to be able to inspect the\n",
-"trained estimators."
+"Define a linear model composed of a `StandardScaler` followed by a\n",
+"`LogisticRegression` with default parameters.\n",
+"\n",
+"Then use a 10-fold cross-validation to estimate its generalization performance\n",
+"in terms of accuracy. Also set `return_estimator=True` to be able to inspect\n",
+"the trained estimators."
 ]
 },
 {
@@ -93,11 +95,12 @@
 "- The numerical data must be scaled.\n",
 "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n",
 "  group categories concerning less than 1% of the total samples.\n",
-"- The predictor is a `LogisticRegression`. You may need to increase the number\n",
-"  of `max_iter`, which is 100 by default.\n",
+"- The predictor is a `LogisticRegression` with default parameters, except that\n",
+"  you may need to increase the number of `max_iter`, which is 100 by default.\n",
 "\n",
 "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n",
-"above to evaluate this complex pipeline."
+"above to evaluate the full pipeline, including the feature scaling and encoding\n",
+"preprocessing."
 ]
 },
 {
@@ -186,10 +189,11 @@
 "metadata": {},
 "source": [
 "Now create a similar pipeline consisting of the same preprocessor as above,\n",
-"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n",
-"Set `degree=2` and `interaction_only=True` to the feature engineering step.\n",
-"Remember not to include a \"bias\" feature to avoid introducing a redundancy\n",
-"with the intercept of the subsequent logistic regression."
+"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and\n",
+"enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature\n",
+"engineering step. Remember not to include a \"bias\" feature to avoid\n",
+"introducing a redundancy with the intercept of the subsequent logistic\n",
+"regression."
 ]
 },
 {
@@ -205,21 +209,18 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"Use the same 10-fold cross-validation strategy as above to evaluate this\n",
+"pipeline with interactions. In this case there is no need to return the\n",
+"estimator, as the number of features generated by the `PolynomialFeatures` step\n",
+"is much too large to be able to visually explore the learned coefficients of the\n",
+"final classifier.\n",
+"\n",
 "By comparing the cross-validation test scores of both models fold-to-fold,\n",
 "count the number of times the model using multiplicative interactions and both\n",
 "numerical and categorical features has a better test score than the model\n",
 "without interactions."
 ]
 },
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"# Write your code here."
-]
-},
 {
 "cell_type": "code",
 "execution_count": null,

notebooks/linear_models_sol_03.ipynb

Lines changed: 26 additions & 17 deletions
@@ -6,8 +6,8 @@
 "source": [
 "# \ud83d\udcc3 Solution for Exercise M4.03\n",
 "\n",
-"Now, we tackle a more realistic classification problem instead of making a\n",
-"synthetic dataset. We start by loading the Adult Census dataset with the\n",
+"Now, we tackle a (relatively) realistic classification problem instead of making\n",
+"a synthetic dataset. We start by loading the Adult Census dataset with the\n",
 "following snippet. For the moment we retain only the **numerical features**."
 ]
 },
@@ -32,10 +32,12 @@
 "source": [
 "We confirm that all the selected features are numerical.\n",
 "\n",
-"Compute the generalization performance in terms of accuracy of a linear model\n",
-"composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n",
-"cross-validation with `return_estimator=True` to be able to inspect the\n",
-"trained estimators."
+"Define a linear model composed of a `StandardScaler` followed by a\n",
+"`LogisticRegression` with default parameters.\n",
+"\n",
+"Then use a 10-fold cross-validation to estimate its generalization performance\n",
+"in terms of accuracy. Also set `return_estimator=True` to be able to inspect\n",
+"the trained estimators."
 ]
 },
 {
@@ -130,11 +132,12 @@
 "- The numerical data must be scaled.\n",
 "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n",
 "  group categories concerning less than 1% of the total samples.\n",
-"- The predictor is a `LogisticRegression`. You may need to increase the number\n",
-"  of `max_iter`, which is 100 by default.\n",
+"- The predictor is a `LogisticRegression` with default parameters, except that\n",
+"  you may need to increase the number of `max_iter`, which is 100 by default.\n",
 "\n",
 "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n",
-"above to evaluate this complex pipeline."
+"above to evaluate the full pipeline, including the feature scaling and encoding\n",
+"preprocessing."
 ]
 },
 {
@@ -293,10 +296,11 @@
 "metadata": {},
 "source": [
 "Now create a similar pipeline consisting of the same preprocessor as above,\n",
-"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n",
-"Set `degree=2` and `interaction_only=True` to the feature engineering step.\n",
-"Remember not to include a \"bias\" feature to avoid introducing a redundancy\n",
-"with the intercept of the subsequent logistic regression."
+"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and\n",
+"enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature\n",
+"engineering step. Remember not to include a \"bias\" feature to avoid\n",
+"introducing a redundancy with the intercept of the subsequent logistic\n",
+"regression."
 ]
 },
 {
@@ -308,18 +312,24 @@
 "# solution\n",
 "from sklearn.preprocessing import PolynomialFeatures\n",
 "\n",
-"model_with_interaction = make_pipeline(\n",
+"model_with_interactions = make_pipeline(\n",
 "    preprocessor,\n",
 "    PolynomialFeatures(degree=2, include_bias=False, interaction_only=True),\n",
 "    LogisticRegression(C=0.01, max_iter=5_000),\n",
 ")\n",
-"model_with_interaction"
+"model_with_interactions"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"Use the same 10-fold cross-validation strategy as above to evaluate this\n",
+"pipeline with interactions. In this case there is no need to return the\n",
+"estimator, as the number of features generated by the `PolynomialFeatures` step\n",
+"is much too large to be able to visually explore the learned coefficients of the\n",
+"final classifier.\n",
+"\n",
 "By comparing the cross-validation test scores of both models fold-to-fold,\n",
 "count the number of times the model using multiplicative interactions and both\n",
 "numerical and categorical features has a better test score than the model\n",
@@ -334,11 +344,10 @@
 "source": [
 "# solution\n",
 "cv_results_interactions = cross_validate(\n",
-"    model_with_interaction,\n",
+"    model_with_interactions,\n",
 "    data,\n",
 "    target,\n",
 "    cv=10,\n",
-"    return_estimator=True,\n",
 "    n_jobs=2,\n",
 ")\n",
 "test_score_interactions = cv_results_interactions[\"test_score\"]\n",

notebooks/metrics_ex_02.ipynb

Lines changed: 11 additions & 2 deletions
@@ -80,8 +80,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Then, instead of using the $R^2$ score, use the mean absolute error. You need\n",
-"to refer to the documentation for the `scoring` parameter."
+"Then, instead of using the $R^2$ score, use the mean absolute error (MAE). You\n",
+"may need to refer to the documentation for the `scoring` parameter."
 ]
 },
 {
@@ -102,6 +102,15 @@
 "compute the $R^2$ score and the mean absolute error for instance."
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# Write your code here."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,

notebooks/parameter_tuning_ex_02.ipynb

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@
 "source": [
 "Use the previously defined model (called `model`) and using two nested `for`\n",
 "loops, make a search of the best combinations of the `learning_rate` and\n",
-"`max_leaf_nodes` parameters. In this regard, you have to train and test the\n",
+"`max_leaf_nodes` parameters. In this regard, you need to train and test the\n",
 "model by setting the parameters. The evaluation of the model should be\n",
 "performed using `cross_val_score` on the training set. Use the following\n",
 "parameters search:\n",

notebooks/parameter_tuning_ex_03.ipynb

Lines changed: 2 additions & 2 deletions
@@ -31,8 +31,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In this exercise, we progressively define the regression pipeline and\n",
-"later tune its hyperparameters.\n",
+"In this exercise, we progressively define the regression pipeline and later\n",
+"tune its hyperparameters.\n",
 "\n",
 "Start by defining a pipeline that:\n",
 "* uses a `StandardScaler` to normalize the numerical data;\n",

python_scripts/linear_models_ex_03.py

Lines changed: 23 additions & 16 deletions
@@ -14,8 +14,8 @@
 # %% [markdown]
 # # 📝 Exercise M4.03
 #
-# Now, we tackle a more realistic classification problem instead of making a
-# synthetic dataset. We start by loading the Adult Census dataset with the
+# Now, we tackle a (relatively) realistic classification problem instead of making
+# a synthetic dataset. We start by loading the Adult Census dataset with the
 # following snippet. For the moment we retain only the **numerical features**.
 
 # %%
@@ -30,10 +30,12 @@
 # %% [markdown]
 # We confirm that all the selected features are numerical.
 #
-# Compute the generalization performance in terms of accuracy of a linear model
-# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold
-# cross-validation with `return_estimator=True` to be able to inspect the
-# trained estimators.
+# Define a linear model composed of a `StandardScaler` followed by a
+# `LogisticRegression` with default parameters.
+#
+# Then use a 10-fold cross-validation to estimate its generalization performance
+# in terms of accuracy. Also set `return_estimator=True` to be able to inspect
+# the trained estimators.
 
 # %%
 # Write your code here.
@@ -61,11 +63,12 @@
 # - The numerical data must be scaled.
 # - The categorical data must be one-hot encoded, set `min_frequency=0.01` to
 #   group categories concerning less than 1% of the total samples.
-# - The predictor is a `LogisticRegression`. You may need to increase the number
-#   of `max_iter`, which is 100 by default.
+# - The predictor is a `LogisticRegression` with default parameters, except that
+#   you may need to increase the number of `max_iter`, which is 100 by default.
 #
 # Use the same 10-fold cross-validation strategy with `return_estimator=True` as
-# above to evaluate this complex pipeline.
+# above to evaluate the full pipeline, including the feature scaling and encoding
+# preprocessing.
 
 # %%
 # Write your code here.
@@ -110,22 +113,26 @@
 
 # %% [markdown]
 # Now create a similar pipeline consisting of the same preprocessor as above,
-# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.
-# Set `degree=2` and `interaction_only=True` to the feature engineering step.
-# Remember not to include a "bias" feature to avoid introducing a redundancy
-# with the intercept of the subsequent logistic regression.
+# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and
+# enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature
+# engineering step. Remember not to include a "bias" feature to avoid
+# introducing a redundancy with the intercept of the subsequent logistic
+# regression.
 
 # %%
 # Write your code here.
 
 # %% [markdown]
+# Use the same 10-fold cross-validation strategy as above to evaluate this
+# pipeline with interactions. In this case there is no need to return the
+# estimator, as the number of features generated by the `PolynomialFeatures` step
+# is much too large to be able to visually explore the learned coefficients of the
+# final classifier.
+#
 # By comparing the cross-validation test scores of both models fold-to-fold,
 # count the number of times the model using multiplicative interactions and both
 # numerical and categorical features has a better test score than the model
 # without interactions.
 
 # %%
 # Write your code here.
-
-# %%
-# Write your code here.
