16 | 16 | "(as in \"independent and identically distributed random variables\").</p>\n",
17 | 17 | "</div>\n",
18 | 18 | "\n",
19 | | - "This assumption is usually violated when dealing with time series. A sample\n",
20 | | - "depends on past information.\n",
| 19 | + "This assumption is usually violated in time series, where each sample can be\n",
| 20 | + "influenced by previous samples (both their feature and target values) in an\n",
| 21 | + "inherently ordered sequence.\n",
21 | 22 | "\n",
22 | | - "We will take an example to highlight such issues with non-i.i.d. data in the\n",
23 | | - "previous cross-validation strategies presented. We are going to load financial\n",
| 23 | + "In this notebook we demonstrate the issues that arise when using the\n",
| 24 | + "cross-validation strategies we have presented so far on non-i.i.d.\n",
| 25 | + "data. For this purpose, we load financial\n",
24 | 26 | "quotations from some energy companies."
25 | 27 | ]
26 | 28 | },

91 | 93 | "data, target = quotes.drop(columns=[\"Chevron\"]), quotes[\"Chevron\"]\n",
92 | 94 | "data_train, data_test, target_train, target_test = train_test_split(\n",
93 | 95 | "    data, target, shuffle=True, random_state=0\n",
94 | | - ")"
| 96 | + ")\n",
| 97 | + "\n",
| 98 | + "# Shuffling breaks the index order, but we still want it to be time-ordered\n",
| 99 | + "data_train.sort_index(ascending=True, inplace=True)\n",
| 100 | + "data_test.sort_index(ascending=True, inplace=True)\n",
| 101 | + "target_train.sort_index(ascending=True, inplace=True)\n",
| 102 | + "target_test.sort_index(ascending=True, inplace=True)"
95 | 103 | ]
96 | 104 | },
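A note on the re-sorting step added above: `sort_index` restores chronological order on each object independently, and rows stay aligned because the features and targets share the same index. A minimal sketch, using a toy time-indexed frame as a hypothetical stand-in for the quotes data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy time-indexed data, a hypothetical stand-in for the financial quotes
index = pd.date_range("2020-01-01", periods=8, freq="D")
data = pd.DataFrame({"x": range(8)}, index=index)
target = pd.Series(range(8), index=index, name="y")

data_train, data_test, target_train, target_test = train_test_split(
    data, target, shuffle=True, random_state=0
)

# Re-sorting by the datetime index restores chronological order; rows stay
# aligned because data and target carry the same index labels
data_train = data_train.sort_index()
target_train = target_train.sort_index()
assert data_train.index.is_monotonic_increasing
assert (data_train.index == target_train.index).all()
```

Note that sorting only restores the display order inside each set; the train/test membership chosen by the shuffle is unchanged.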
|
97 | 105 | {
|
98 | 106 | "cell_type": "markdown",
|
99 | 107 | "metadata": {},
|
100 | 108 | "source": [
|
101 | 109 | "We will use a decision tree regressor that we expect to overfit and thus not\n",
|
102 |
| - "generalize to unseen data. We will use a `ShuffleSplit` cross-validation to\n", |
| 110 | + "generalize to unseen data. We use a `ShuffleSplit` cross-validation to\n", |
103 | 111 | "check the generalization performance of our model.\n",
|
104 | 112 | "\n",
|
105 | 113 | "Let's first define our model"
|
|
138 | 146 | "cell_type": "markdown",
139 | 147 | "metadata": {},
140 | 148 | "source": [
141 | | - "Finally, we perform the evaluation."
| 149 | + "We then perform the evaluation using the `ShuffleSplit` strategy."
142 | 150 | ]
143 | 151 | },
144 | 152 | {

161 | 169 | "source": [
162 | 170 | "Surprisingly, we get outstanding generalization performance. We will\n",
163 | 171 | "investigate and find the reason for such good results with a model that is\n",
164 | | - "expected to fail. We previously mentioned that `ShuffleSplit` is an iterative\n",
165 | | - "cross-validation scheme that shuffles data and split. We will simplify this\n",
| 172 | + "expected to fail. We previously mentioned that `ShuffleSplit` is a\n",
| 173 | + "cross-validation method that iteratively shuffles and splits the data.\n",
| 174 | + "\n",
| 175 | + "We can simplify the\n",
166 | 176 | "procedure with a single split and plot the prediction. We can use\n",
167 | 177 | "`train_test_split` for this purpose."
168 | 178 | ]
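To make the reworded description of `ShuffleSplit` concrete, here is a small sketch (the ten-sample array is illustrative, not from the notebook) showing that each iteration draws a fresh random test subset, ignoring any time order:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10).reshape(-1, 1)
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_idx, test_idx in cv.split(X):
    # Indices are redrawn at random on each iteration, ignoring time order
    print("train:", train_idx, "test:", test_idx)
```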
202 | 212 | "cell_type": "markdown",
203 | 213 | "metadata": {},
204 | 214 | "source": [
205 | | - "Similarly, we obtain good results in terms of $R^2$. We will plot the\n",
| 215 | + "Similarly, we obtain good results in terms of $R^2$. We now plot the\n",
206 | 216 | "training, testing and prediction samples."
207 | 217 | ]
208 | 218 | },

225 | 235 | "cell_type": "markdown",
226 | 236 | "metadata": {},
227 | 237 | "source": [
228 | | - "So in this context, it seems that the model predictions are following the\n",
229 | | - "testing. But we can also see that the testing samples are next to some\n",
230 | | - "training sample. And with these time-series, we see a relationship between a\n",
231 | | - "sample at the time `t` and a sample at `t+1`. In this case, we are violating\n",
232 | | - "the i.i.d. assumption. The insight to get is the following: a model can output\n",
233 | | - "of its training set at the time `t` for a testing sample at the time `t+1`.\n",
234 | | - "This prediction would be close to the true value even if our model did not\n",
235 | | - "learn anything, but just memorized the training dataset.\n",
| 238 | + "From the plot above, we can see that the training and testing samples\n",
| 239 | + "alternate. This structure effectively evaluates the model\u2019s ability to\n",
| 240 | + "interpolate between neighboring data points, rather than its true\n",
| 241 | + "generalization ability. As a result, the model\u2019s predictions are close to the\n",
| 242 | + "actual values, even if it has not learned anything meaningful from the data.\n",
| 243 | + "This is a form of **data leakage**, where the model gains access to future\n",
| 244 | + "information (testing data) while training, leading to an over-optimistic\n",
| 245 | + "estimate of the generalization performance.\n",
236 | 246 | "\n",
237 | | - "An easy way to verify this hypothesis is to not shuffle the data when doing\n",
| 247 | + "An easy way to verify this is to not shuffle the data during\n",
238 | 248 | "the split. In this case, we will use the first 75% of the data to train and\n",
239 | | - "the remaining data to test."
| 249 | + "the remaining data to test. This way we preserve the time order of the data\n",
| 250 | + "and ensure that we train on past data and evaluate on future data."
240 | 251 | ]
241 | 252 | },
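The interpolation-versus-extrapolation point above can also be shown on synthetic data. A hedged sketch, assuming a random walk as a stand-in for the autocorrelated quotes: an overfitted decision tree scores highly under a shuffled split, purely by interpolating between neighbors, but much lower under a chronological split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# A random walk: a toy stand-in for autocorrelated financial quotes
target = np.cumsum(rng.normal(size=500))
data = np.arange(500).reshape(-1, 1)  # the only feature is the time step

def r2_for(shuffle):
    X_train, X_test, y_train, y_test = train_test_split(
        data, target, shuffle=shuffle, random_state=0
    )
    tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
    return tree.score(X_test, y_test)

# Shuffled split: test points sit between memorized training neighbors
print(f"shuffled R2:      {r2_for(True):.2f}")
# Chronological split: the tree must extrapolate beyond the training range
print(f"chronological R2: {r2_for(False):.2f}")
```

With `shuffle=False` the fully grown tree predicts a constant for every future time step (trees cannot extrapolate), so the gap between the two scores exposes the leakage directly.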
242 | 253 | {

343 | 354 | "from sklearn.model_selection import TimeSeriesSplit\n",
344 | 355 | "\n",
345 | 356 | "cv = TimeSeriesSplit(n_splits=groups.nunique())\n",
346 | | - "test_score = cross_val_score(\n",
347 | | - "    regressor, data, target, cv=cv, groups=groups, n_jobs=2\n",
348 | | - ")\n",
| 357 | + "test_score = cross_val_score(regressor, data, target, cv=cv, n_jobs=2)\n",
349 | 358 | "print(f\"The mean R2 is: {test_score.mean():.2f} \u00b1 {test_score.std():.2f}\")"
350 | 359 | ]
351 | 360 | },
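For intuition on the splits `TimeSeriesSplit` produces, a small sketch (the ten-sample array is illustrative, not from the notebook): each fold trains on all samples that come before a contiguous test window, so no future information leaks into training:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
cv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in cv.split(X):
    # Every training index precedes every test index: no future leakage
    print("train:", train_idx, "test:", test_idx)
```

This also explains why the `groups` argument could be dropped from `cross_val_score` above: `TimeSeriesSplit` relies only on sample order, not on group labels.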
352 | 361 | {
353 | 362 | "cell_type": "markdown",
354 | 363 | "metadata": {},
355 | 364 | "source": [
356 | | - "In conclusion, it is really important to not use an out of the shelves\n",
| 365 | + "In conclusion, it is really important not to carelessly use a\n",
357 | 366 | "cross-validation strategy that does not respect assumptions such as having\n",
358 | | - "i.i.d data. It might lead to absurd results which could make think that a\n",
359 | | - "predictive model might work."
| 367 | + "i.i.d. data. Doing so might lead to misleading outcomes, creating the false\n",
| 368 | + "impression that a predictive model performs well when it would not in the\n",
| 369 | + "intended real-world scenario."
360 | 370 | ]
361 | 371 | }
362 | 372 | ],