Commit 7093c62

MNT Improve wording in Exercise M4.03 (#822)
1 parent eec5165 commit 7093c62

File tree: 4 files changed (+98, -60 lines)

4 files changed

+98
-60
lines changed

notebooks/linear_models_ex_03.ipynb (+23, -13)
@@ -6,8 +6,8 @@
 "source": [
 "# \ud83d\udcdd Exercise M4.03\n",
 "\n",
-"Now, we tackle a more realistic classification problem instead of making a\n",
-"synthetic dataset. We start by loading the Adult Census dataset with the\n",
+"Now, we tackle a (relatively) realistic classification problem instead of making\n",
+"a synthetic dataset. We start by loading the Adult Census dataset with the\n",
 "following snippet. For the moment we retain only the **numerical features**."
 ]
 },
@@ -32,10 +32,12 @@
 "source": [
 "We confirm that all the selected features are numerical.\n",
 "\n",
-"Compute the generalization performance in terms of accuracy of a linear model\n",
-"composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n",
-"cross-validation with `return_estimator=True` to be able to inspect the\n",
-"trained estimators."
+"Define a linear model composed of a `StandardScaler` followed by a\n",
+"`LogisticRegression` with default parameters.\n",
+"\n",
+"Then use a 10-fold cross-validation to estimate its generalization performance\n",
+"in terms of accuracy. Also set `return_estimator=True` to be able to inspect\n",
+"the trained estimators."
 ]
 },
 {
@@ -93,11 +95,12 @@
 "- The numerical data must be scaled.\n",
 "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n",
 "  group categories concerning less than 1% of the total samples.\n",
-"- The predictor is a `LogisticRegression`. You may need to increase the number\n",
-"  of `max_iter`, which is 100 by default.\n",
+"- The predictor is a `LogisticRegression` with default parameters, except that\n",
+"  you may need to increase the number of `max_iter`, which is 100 by default.\n",
 "\n",
 "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n",
-"above to evaluate this complex pipeline."
+"above to evaluate the full pipeline, including the feature scaling and encoding\n",
+"preprocessing."
 ]
 },
 {
@@ -186,10 +189,11 @@
 "metadata": {},
 "source": [
 "Now create a similar pipeline consisting of the same preprocessor as above,\n",
-"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n",
-"Set `degree=2` and `interaction_only=True` to the feature engineering step.\n",
-"Remember not to include a \"bias\" feature to avoid introducing a redundancy\n",
-"with the intercept of the subsequent logistic regression."
+"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and\n",
+"enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature\n",
+"engineering step. Remember not to include a \"bias\" feature to avoid\n",
+"introducing a redundancy with the intercept of the subsequent logistic\n",
+"regression."
 ]
 },
 {
@@ -205,6 +209,12 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"Use the same 10-fold cross-validation strategy as above to evaluate this\n",
+"pipeline with interactions. In this case there is no need to return the\n",
+"estimator, as the number of features generated by the `PolynomialFeatures` step\n",
+"is much too large to be able to visually explore the learned coefficients of the\n",
+"final classifier.\n",
+"\n",
 "By comparing the cross-validation test scores of both models fold-to-fold,\n",
 "count the number of times the model using multiplicative interactions and both\n",
 "numerical and categorical features has a better test score than the model\n",

notebooks/linear_models_sol_03.ipynb (+26, -17)
@@ -6,8 +6,8 @@
 "source": [
 "# \ud83d\udcc3 Solution for Exercise M4.03\n",
 "\n",
-"Now, we tackle a more realistic classification problem instead of making a\n",
-"synthetic dataset. We start by loading the Adult Census dataset with the\n",
+"Now, we tackle a (relatively) realistic classification problem instead of making\n",
+"a synthetic dataset. We start by loading the Adult Census dataset with the\n",
 "following snippet. For the moment we retain only the **numerical features**."
 ]
 },
@@ -32,10 +32,12 @@
 "source": [
 "We confirm that all the selected features are numerical.\n",
 "\n",
-"Compute the generalization performance in terms of accuracy of a linear model\n",
-"composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n",
-"cross-validation with `return_estimator=True` to be able to inspect the\n",
-"trained estimators."
+"Define a linear model composed of a `StandardScaler` followed by a\n",
+"`LogisticRegression` with default parameters.\n",
+"\n",
+"Then use a 10-fold cross-validation to estimate its generalization performance\n",
+"in terms of accuracy. Also set `return_estimator=True` to be able to inspect\n",
+"the trained estimators."
 ]
 },
 {
@@ -130,11 +132,12 @@
 "- The numerical data must be scaled.\n",
 "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n",
 "  group categories concerning less than 1% of the total samples.\n",
-"- The predictor is a `LogisticRegression`. You may need to increase the number\n",
-"  of `max_iter`, which is 100 by default.\n",
+"- The predictor is a `LogisticRegression` with default parameters, except that\n",
+"  you may need to increase the number of `max_iter`, which is 100 by default.\n",
 "\n",
 "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n",
-"above to evaluate this complex pipeline."
+"above to evaluate the full pipeline, including the feature scaling and encoding\n",
+"preprocessing."
 ]
 },
 {
@@ -293,10 +296,11 @@
 "metadata": {},
 "source": [
 "Now create a similar pipeline consisting of the same preprocessor as above,\n",
-"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n",
-"Set `degree=2` and `interaction_only=True` to the feature engineering step.\n",
-"Remember not to include a \"bias\" feature to avoid introducing a redundancy\n",
-"with the intercept of the subsequent logistic regression."
+"followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and\n",
+"enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature\n",
+"engineering step. Remember not to include a \"bias\" feature to avoid\n",
+"introducing a redundancy with the intercept of the subsequent logistic\n",
+"regression."
 ]
 },
 {
@@ -308,18 +312,24 @@
 "# solution\n",
 "from sklearn.preprocessing import PolynomialFeatures\n",
 "\n",
-"model_with_interaction = make_pipeline(\n",
+"model_with_interactions = make_pipeline(\n",
 "    preprocessor,\n",
 "    PolynomialFeatures(degree=2, include_bias=False, interaction_only=True),\n",
 "    LogisticRegression(C=0.01, max_iter=5_000),\n",
 ")\n",
-"model_with_interaction"
+"model_with_interactions"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"Use the same 10-fold cross-validation strategy as above to evaluate this\n",
+"pipeline with interactions. In this case there is no need to return the\n",
+"estimator, as the number of features generated by the `PolynomialFeatures` step\n",
+"is much too large to be able to visually explore the learned coefficients of the\n",
+"final classifier.\n",
+"\n",
 "By comparing the cross-validation test scores of both models fold-to-fold,\n",
 "count the number of times the model using multiplicative interactions and both\n",
 "numerical and categorical features has a better test score than the model\n",
@@ -334,11 +344,10 @@
 "source": [
 "# solution\n",
 "cv_results_interactions = cross_validate(\n",
-"    model_with_interaction,\n",
+"    model_with_interactions,\n",
 "    data,\n",
 "    target,\n",
 "    cv=10,\n",
-"    return_estimator=True,\n",
 "    n_jobs=2,\n",
 ")\n",
 "test_score_interactions = cv_results_interactions[\"test_score\"]\n",

python_scripts/linear_models_ex_03.py (+23, -13)
@@ -14,8 +14,8 @@
 # %% [markdown]
 # # 📝 Exercise M4.03
 #
-# Now, we tackle a more realistic classification problem instead of making a
-# synthetic dataset. We start by loading the Adult Census dataset with the
+# Now, we tackle a (relatively) realistic classification problem instead of making
+# a synthetic dataset. We start by loading the Adult Census dataset with the
 # following snippet. For the moment we retain only the **numerical features**.

 # %%
@@ -30,10 +30,12 @@
 # %% [markdown]
 # We confirm that all the selected features are numerical.
 #
-# Compute the generalization performance in terms of accuracy of a linear model
-# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold
-# cross-validation with `return_estimator=True` to be able to inspect the
-# trained estimators.
+# Define a linear model composed of a `StandardScaler` followed by a
+# `LogisticRegression` with default parameters.
+#
+# Then use a 10-fold cross-validation to estimate its generalization performance
+# in terms of accuracy. Also set `return_estimator=True` to be able to inspect
+# the trained estimators.

 # %%
 # Write your code here.
@@ -61,11 +63,12 @@
 # - The numerical data must be scaled.
 # - The categorical data must be one-hot encoded, set `min_frequency=0.01` to
 #   group categories concerning less than 1% of the total samples.
-# - The predictor is a `LogisticRegression`. You may need to increase the number
-#   of `max_iter`, which is 100 by default.
+# - The predictor is a `LogisticRegression` with default parameters, except that
+#   you may need to increase the number of `max_iter`, which is 100 by default.
 #
 # Use the same 10-fold cross-validation strategy with `return_estimator=True` as
-# above to evaluate this complex pipeline.
+# above to evaluate the full pipeline, including the feature scaling and encoding
+# preprocessing.

 # %%
 # Write your code here.
@@ -110,15 +113,22 @@

 # %% [markdown]
 # Now create a similar pipeline consisting of the same preprocessor as above,
-# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.
-# Set `degree=2` and `interaction_only=True` to the feature engineering step.
-# Remember not to include a "bias" feature to avoid introducing a redundancy
-# with the intercept of the subsequent logistic regression.
+# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and
+# enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature
+# engineering step. Remember not to include a "bias" feature to avoid
+# introducing a redundancy with the intercept of the subsequent logistic
+# regression.

 # %%
 # Write your code here.

 # %% [markdown]
+# Use the same 10-fold cross-validation strategy as above to evaluate this
+# pipeline with interactions. In this case there is no need to return the
+# estimator, as the number of features generated by the `PolynomialFeatures` step
+# is much too large to be able to visually explore the learned coefficients of the
+# final classifier.
+#
 # By comparing the cross-validation test scores of both models fold-to-fold,
 # count the number of times the model using multiplicative interactions and both
 # numerical and categorical features has a better test score than the model
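The remark added in this hunk about the `PolynomialFeatures` output being "much too large" can be checked directly: with `degree=2`, `interaction_only=True` and `include_bias=False`, n input columns become n + n(n-1)/2 output columns, which grows quadratically once the one-hot encoding has produced many columns. A small sketch (the 6-column input is arbitrary):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.RandomState(0).rand(10, 6)  # 6 arbitrary input features

poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
X_poly = poly.fit_transform(X)

# 6 original columns + C(6, 2) = 15 pairwise products = 21 columns in total.
assert X_poly.shape[1] == 6 + 6 * 5 // 2
```

With the dozens of columns the one-hot encoder produces on Adult Census, the same formula easily yields hundreds of features, which is why inspecting per-feature coefficients stops being practical.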

python_scripts/linear_models_sol_03.py (+26, -17)
@@ -8,8 +8,8 @@
 # %% [markdown]
 # # 📃 Solution for Exercise M4.03
 #
-# Now, we tackle a more realistic classification problem instead of making a
-# synthetic dataset. We start by loading the Adult Census dataset with the
+# Now, we tackle a (relatively) realistic classification problem instead of making
+# a synthetic dataset. We start by loading the Adult Census dataset with the
 # following snippet. For the moment we retain only the **numerical features**.

 # %%
@@ -24,10 +24,12 @@
 # %% [markdown]
 # We confirm that all the selected features are numerical.
 #
-# Compute the generalization performance in terms of accuracy of a linear model
-# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold
-# cross-validation with `return_estimator=True` to be able to inspect the
-# trained estimators.
+# Define a linear model composed of a `StandardScaler` followed by a
+# `LogisticRegression` with default parameters.
+#
+# Then use a 10-fold cross-validation to estimate its generalization performance
+# in terms of accuracy. Also set `return_estimator=True` to be able to inspect
+# the trained estimators.

 # %%
 # solution
@@ -84,11 +86,12 @@
 # - The numerical data must be scaled.
 # - The categorical data must be one-hot encoded, set `min_frequency=0.01` to
 #   group categories concerning less than 1% of the total samples.
-# - The predictor is a `LogisticRegression`. You may need to increase the number
-#   of `max_iter`, which is 100 by default.
+# - The predictor is a `LogisticRegression` with default parameters, except that
+#   you may need to increase the number of `max_iter`, which is 100 by default.
 #
 # Use the same 10-fold cross-validation strategy with `return_estimator=True` as
-# above to evaluate this complex pipeline.
+# above to evaluate the full pipeline, including the feature scaling and encoding
+# preprocessing.

 # %%
 # solution
@@ -195,23 +198,30 @@

 # %% [markdown]
 # Now create a similar pipeline consisting of the same preprocessor as above,
-# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.
-# Set `degree=2` and `interaction_only=True` to the feature engineering step.
-# Remember not to include a "bias" feature to avoid introducing a redundancy
-# with the intercept of the subsequent logistic regression.
+# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01` and
+# enough `max_iter`. Set `degree=2` and `interaction_only=True` to the feature
+# engineering step. Remember not to include a "bias" feature to avoid
+# introducing a redundancy with the intercept of the subsequent logistic
+# regression.

 # %%
 # solution
 from sklearn.preprocessing import PolynomialFeatures

-model_with_interaction = make_pipeline(
+model_with_interactions = make_pipeline(
     preprocessor,
     PolynomialFeatures(degree=2, include_bias=False, interaction_only=True),
     LogisticRegression(C=0.01, max_iter=5_000),
 )
-model_with_interaction
+model_with_interactions

 # %% [markdown]
+# Use the same 10-fold cross-validation strategy as above to evaluate this
+# pipeline with interactions. In this case there is no need to return the
+# estimator, as the number of features generated by the `PolynomialFeatures` step
+# is much too large to be able to visually explore the learned coefficients of the
+# final classifier.
+#
 # By comparing the cross-validation test scores of both models fold-to-fold,
 # count the number of times the model using multiplicative interactions and both
 # numerical and categorical features has a better test score than the model
@@ -220,11 +230,10 @@
 # %%
 # solution
 cv_results_interactions = cross_validate(
-    model_with_interaction,
+    model_with_interactions,
     data,
     target,
     cv=10,
-    return_estimator=True,
     n_jobs=2,
 )
 test_score_interactions = cv_results_interactions["test_score"]
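The fold-to-fold comparison requested at the end of the exercise reduces to an elementwise comparison of the two `test_score` arrays returned by `cross_validate`. A sketch with made-up per-fold scores (the numbers are illustrative, not results from the exercise):

```python
import numpy as np

# Hypothetical per-fold accuracies for the two models (10 folds each).
test_score = np.array(
    [0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.78, 0.80, 0.82, 0.79]
)
test_score_interactions = np.array(
    [0.84, 0.80, 0.83, 0.85, 0.82, 0.83, 0.80, 0.84, 0.85, 0.82]
)

# Count the folds where the model with interactions has the better score.
wins = int((test_score_interactions > test_score).sum())
print(f"interactions model wins {wins} out of {len(test_score)} folds")
```

Because both models were evaluated with the same `cv=10` splitter, the scores at each index correspond to the same train/test split, which is what makes this paired comparison meaningful.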

0 commit comments
