Commit 41519d4

Merge branch 'main' into update-to-scikit-learn-1.6
2 parents: a1f0aef + 7093c62

9 files changed: 138 additions & 80 deletions

build_tools/generate-exercise-from-solution.py

Lines changed: 25 additions & 2 deletions
@@ -6,6 +6,9 @@
 import jupytext
 
 
+WRITE_YOUR_CODE_COMMENT = "# Write your code here."
+
+
 def replace_simple_text(input_py_str):
     result = input_py_str.replace("📃 Solution for", "📝")
     return result
@@ -44,7 +47,24 @@ def remove_solution(input_py_str):
     ]
 
     for c in cells_to_modify:
-        c["source"] = pattern.sub("# Write your code here.", c["source"])
+        c["source"] = pattern.sub(WRITE_YOUR_CODE_COMMENT, c["source"])
+
+    previous_cell_is_write_your_code = False
+    all_cells_before_deduplication = nb.cells
+    nb.cells = []
+    for c in all_cells_before_deduplication:
+        if c["cell_type"] == "code" and c["source"] == WRITE_YOUR_CODE_COMMENT:
+            current_cell_is_write_your_code = True
+        else:
+            current_cell_is_write_your_code = False
+        if (
+            current_cell_is_write_your_code
+            and previous_cell_is_write_your_code
+        ):
+            # Drop duplicated "write your code here" cells.
+            continue
+        nb.cells.append(c)
+        previous_cell_is_write_your_code = current_cell_is_write_your_code
 
     # TODO: we could potentially try to avoid changing the input file jupytext
     # header since this info is rarely useful. Let's keep it simple for now.
@@ -53,6 +73,7 @@ def remove_solution(input_py_str):
 
 
 def write_exercise(solution_path, exercise_path):
+    print(f"Writing exercise to {exercise_path} from solution {solution_path}")
     input_str = solution_path.read_text()
 
     output_str = input_str
@@ -67,7 +88,9 @@ def write_all_exercises(python_scripts_folder):
     for solution_path in solution_paths:
         exercise_path = Path(str(solution_path).replace("_sol_", "_ex_"))
         if not exercise_path.exists():
-            print(f"{exercise_path} does not exist")
+            print(
+                f"{exercise_path} does not exist, generating it from solution."
+            )
 
         write_exercise(solution_path, exercise_path)
 
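The new deduplication pass drops a placeholder cell only when the previous kept cell was also a placeholder, so runs of two or more consecutive "write your code here" cells collapse to one. A minimal standalone sketch of the same logic, using plain dicts as stand-in notebook cells (the jupytext round-trip of the real script is omitted):

# Sketch of the deduplication pass above, on stand-in cells.
WRITE_YOUR_CODE_COMMENT = "# Write your code here."


def drop_duplicated_placeholder_cells(cells):
    deduplicated = []
    previous_is_placeholder = False
    for cell in cells:
        current_is_placeholder = (
            cell["cell_type"] == "code"
            and cell["source"] == WRITE_YOUR_CODE_COMMENT
        )
        if current_is_placeholder and previous_is_placeholder:
            # Skip every placeholder that follows another placeholder, so
            # runs of any length collapse to a single cell.
            continue
        deduplicated.append(cell)
        previous_is_placeholder = current_is_placeholder
    return deduplicated


cells = [
    {"cell_type": "markdown", "source": "Some instructions."},
    {"cell_type": "code", "source": WRITE_YOUR_CODE_COMMENT},
    {"cell_type": "code", "source": WRITE_YOUR_CODE_COMMENT},
]
assert len(drop_duplicated_placeholder_cells(cells)) == 2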

notebooks/cross_validation_ex_01.ipynb

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@
 "exercise.\n",
 "\n",
 "Also, this classifier can become more flexible/expressive by using a so-called\n",
-"kernel that makes the model become non-linear. Again, no undestanding regarding\n",
+"kernel that makes the model become non-linear. Again, no understanding regarding\n",
 "the mathematics is required to accomplish this exercise.\n",
 "\n",
 "We will use an RBF kernel where a parameter `gamma` allows to tune the\n",

notebooks/linear_models_ex_03.ipynb

Lines changed: 23 additions & 22 deletions
@@ -6,8 +6,8 @@
 "source": [
 "# \ud83d\udcdd Exercise M4.03\n",
 "\n",
-"Now, we tackle a more realistic classification problem instead of making a\n",
-"synthetic dataset. We start by loading the Adult Census dataset with the\n",
+"Now, we tackle a (relatively) realistic classification problem instead of making\n",
+"a synthetic dataset. We start by loading the Adult Census dataset with the\n",
 "following snippet. For the moment we retain only the **numerical features**."
 ]
 },
@@ -32,10 +32,12 @@
 "source": [
 "We confirm that all the selected features are numerical.\n",
 "\n",
-"Compute the generalization performance in terms of accuracy of a linear model\n",
-"composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n",
-"cross-validation with `return_estimator=True` to be able to inspect the\n",
-"trained estimators."
+"Define a linear model composed of a `StandardScaler` followed by a\n",
+"`LogisticRegression` with default parameters.\n",
+"\n",
+"Then use a 10-fold cross-validation to estimate its generalization performance\n",
+"in terms of accuracy. Also set `return_estimator=True` to be able to inspect\n",
+"the trained estimators."
 ]
 },
 {
@@ -93,11 +95,12 @@
 "- The numerical data must be scaled.\n",
 "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n",
 "  group categories concerning less than 1% of the total samples.\n",
-"- The predictor is a `LogisticRegression`. You may need to increase the number\n",
-"  of `max_iter`, which is 100 by default.\n",
+"- The predictor is a `LogisticRegression` with default parameters, except that\n",
+"  you may need to increase the number of `max_iter`, which is 100 by default.\n",
 "\n",
 "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n",
-"above to evaluate this complex pipeline."
+"above to evaluate the full pipeline, including the feature scaling and encoding\n",
+"preprocessing."
 ]
 },
 {
@@ -186,10 +189,11 @@
 "metadata": {},
 "source": [
 "Now create a similar pipeline consisting of the same preprocessor as above,\n",
-"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n",
-"Set `degree=2` and `interaction_only=True` to the feature engineering step.\n",
-"Remember not to include a \"bias\" feature to avoid introducing a redundancy\n",
-"with the intercept of the subsequent logistic regression."
+"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and\n",
+"enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature\n",
+"engineering step. Remember not to include a \"bias\" feature to avoid\n",
+"introducing a redundancy with the intercept of the subsequent logistic\n",
+"regression."
 ]
 },
 {
@@ -205,21 +209,18 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"Use the same 10-fold cross-validation strategy as above to evaluate this\n",
+"pipeline with interactions. In this case there is no need to return the\n",
+"estimator, as the number of features generated by the `PolynomialFeatures` step\n",
+"is much too large to be able to visually explore the learned coefficients of the\n",
+"final classifier.\n",
+"\n",
 "By comparing the cross-validation test scores of both models fold-to-fold,\n",
 "count the number of times the model using multiplicative interactions and both\n",
 "numerical and categorical features has a better test score than the model\n",
 "without interactions."
 ]
 },
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"# Write your code here."
-]
-},
 {
 "cell_type": "code",
 "execution_count": null,

notebooks/linear_models_sol_03.ipynb

Lines changed: 26 additions & 17 deletions
@@ -6,8 +6,8 @@
 "source": [
 "# \ud83d\udcc3 Solution for Exercise M4.03\n",
 "\n",
-"Now, we tackle a more realistic classification problem instead of making a\n",
-"synthetic dataset. We start by loading the Adult Census dataset with the\n",
+"Now, we tackle a (relatively) realistic classification problem instead of making\n",
+"a synthetic dataset. We start by loading the Adult Census dataset with the\n",
 "following snippet. For the moment we retain only the **numerical features**."
 ]
 },
@@ -32,10 +32,12 @@
 "source": [
 "We confirm that all the selected features are numerical.\n",
 "\n",
-"Compute the generalization performance in terms of accuracy of a linear model\n",
-"composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n",
-"cross-validation with `return_estimator=True` to be able to inspect the\n",
-"trained estimators."
+"Define a linear model composed of a `StandardScaler` followed by a\n",
+"`LogisticRegression` with default parameters.\n",
+"\n",
+"Then use a 10-fold cross-validation to estimate its generalization performance\n",
+"in terms of accuracy. Also set `return_estimator=True` to be able to inspect\n",
+"the trained estimators."
 ]
 },
 {
@@ -130,11 +132,12 @@
 "- The numerical data must be scaled.\n",
 "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n",
 "  group categories concerning less than 1% of the total samples.\n",
-"- The predictor is a `LogisticRegression`. You may need to increase the number\n",
-"  of `max_iter`, which is 100 by default.\n",
+"- The predictor is a `LogisticRegression` with default parameters, except that\n",
+"  you may need to increase the number of `max_iter`, which is 100 by default.\n",
 "\n",
 "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n",
-"above to evaluate this complex pipeline."
+"above to evaluate the full pipeline, including the feature scaling and encoding\n",
+"preprocessing."
 ]
 },
 {
@@ -293,10 +296,11 @@
 "metadata": {},
 "source": [
 "Now create a similar pipeline consisting of the same preprocessor as above,\n",
-"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n",
-"Set `degree=2` and `interaction_only=True` to the feature engineering step.\n",
-"Remember not to include a \"bias\" feature to avoid introducing a redundancy\n",
-"with the intercept of the subsequent logistic regression."
+"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and\n",
+"enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature\n",
+"engineering step. Remember not to include a \"bias\" feature to avoid\n",
+"introducing a redundancy with the intercept of the subsequent logistic\n",
+"regression."
 ]
 },
 {
@@ -308,18 +312,24 @@
 "# solution\n",
 "from sklearn.preprocessing import PolynomialFeatures\n",
 "\n",
-"model_with_interaction = make_pipeline(\n",
+"model_with_interactions = make_pipeline(\n",
 "    preprocessor,\n",
 "    PolynomialFeatures(degree=2, include_bias=False, interaction_only=True),\n",
 "    LogisticRegression(C=0.01, max_iter=5_000),\n",
 ")\n",
-"model_with_interaction"
+"model_with_interactions"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"Use the same 10-fold cross-validation strategy as above to evaluate this\n",
+"pipeline with interactions. In this case there is no need to return the\n",
+"estimator, as the number of features generated by the `PolynomialFeatures` step\n",
+"is much too large to be able to visually explore the learned coefficients of the\n",
+"final classifier.\n",
+"\n",
 "By comparing the cross-validation test scores of both models fold-to-fold,\n",
 "count the number of times the model using multiplicative interactions and both\n",
 "numerical and categorical features has a better test score than the model\n",
@@ -334,11 +344,10 @@
 "source": [
 "# solution\n",
 "cv_results_interactions = cross_validate(\n",
-"    model_with_interaction,\n",
+"    model_with_interactions,\n",
 "    data,\n",
 "    target,\n",
 "    cv=10,\n",
-"    return_estimator=True,\n",
 "    n_jobs=2,\n",
 ")\n",
 "test_score_interactions = cv_results_interactions[\"test_score\"]\n",

notebooks/metrics_ex_02.ipynb

Lines changed: 11 additions & 2 deletions
@@ -80,8 +80,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Then, instead of using the $R^2$ score, use the mean absolute error. You need\n",
-"to refer to the documentation for the `scoring` parameter."
+"Then, instead of using the $R^2$ score, use the mean absolute error (MAE). You\n",
+"may need to refer to the documentation for the `scoring` parameter."
 ]
 },
 {
@@ -102,6 +102,15 @@
 "compute the $R^2$ score and the mean absolute error for instance."
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# Write your code here."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,

notebooks/parameter_tuning_ex_02.ipynb

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@
 "source": [
 "Use the previously defined model (called `model`) and using two nested `for`\n",
 "loops, make a search of the best combinations of the `learning_rate` and\n",
-"`max_leaf_nodes` parameters. In this regard, you have to train and test the\n",
+"`max_leaf_nodes` parameters. In this regard, you need to train and test the\n",
 "model by setting the parameters. The evaluation of the model should be\n",
 "performed using `cross_val_score` on the training set. Use the following\n",
 "parameters search:\n",

notebooks/parameter_tuning_ex_03.ipynb

Lines changed: 2 additions & 2 deletions
@@ -31,8 +31,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In this exercise, we progressively define the regression pipeline and\n",
-"later tune its hyperparameters.\n",
+"In this exercise, we progressively define the regression pipeline and later\n",
+"tune its hyperparameters.\n",
 "\n",
 "Start by defining a pipeline that:\n",
 "* uses a `StandardScaler` to normalize the numerical data;\n",

python_scripts/linear_models_ex_03.py

Lines changed: 23 additions & 16 deletions
@@ -14,8 +14,8 @@
 # %% [markdown]
 # # 📝 Exercise M4.03
 #
-# Now, we tackle a more realistic classification problem instead of making a
-# synthetic dataset. We start by loading the Adult Census dataset with the
+# Now, we tackle a (relatively) realistic classification problem instead of making
+# a synthetic dataset. We start by loading the Adult Census dataset with the
 # following snippet. For the moment we retain only the **numerical features**.
 
 # %%
@@ -30,10 +30,12 @@
 # %% [markdown]
 # We confirm that all the selected features are numerical.
 #
-# Compute the generalization performance in terms of accuracy of a linear model
-# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold
-# cross-validation with `return_estimator=True` to be able to inspect the
-# trained estimators.
+# Define a linear model composed of a `StandardScaler` followed by a
+# `LogisticRegression` with default parameters.
+#
+# Then use a 10-fold cross-validation to estimate its generalization performance
+# in terms of accuracy. Also set `return_estimator=True` to be able to inspect
+# the trained estimators.
 
 # %%
 # Write your code here.
@@ -61,11 +63,12 @@
 # - The numerical data must be scaled.
 # - The categorical data must be one-hot encoded, set `min_frequency=0.01` to
 #   group categories concerning less than 1% of the total samples.
-# - The predictor is a `LogisticRegression`. You may need to increase the number
-#   of `max_iter`, which is 100 by default.
+# - The predictor is a `LogisticRegression` with default parameters, except that
+#   you may need to increase the number of `max_iter`, which is 100 by default.
 #
 # Use the same 10-fold cross-validation strategy with `return_estimator=True` as
-# above to evaluate this complex pipeline.
+# above to evaluate the full pipeline, including the feature scaling and encoding
+# preprocessing.
 
 # %%
 # Write your code here.
@@ -110,22 +113,26 @@
 
 # %% [markdown]
 # Now create a similar pipeline consisting of the same preprocessor as above,
-# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.
-# Set `degree=2` and `interaction_only=True` to the feature engineering step.
-# Remember not to include a "bias" feature to avoid introducing a redundancy
-# with the intercept of the subsequent logistic regression.
+# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and
+# enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature
+# engineering step. Remember not to include a "bias" feature to avoid
+# introducing a redundancy with the intercept of the subsequent logistic
+# regression.
 
 # %%
 # Write your code here.
 
 # %% [markdown]
+# Use the same 10-fold cross-validation strategy as above to evaluate this
+# pipeline with interactions. In this case there is no need to return the
+# estimator, as the number of features generated by the `PolynomialFeatures` step
+# is much too large to be able to visually explore the learned coefficients of the
+# final classifier.
+#
 # By comparing the cross-validation test scores of both models fold-to-fold,
 # count the number of times the model using multiplicative interactions and both
 # numerical and categorical features has a better test score than the model
 # without interactions.
 
 # %%
 # Write your code here.
-
-# %%
-# Write your code here.
