16 | 16 | "(as in \"independent and identically distributed random variables\").</p>\n",
17 | 17 | "</div>\n",
18 | 18 | "\n",
19 | | - "This assumption is usually violated when dealing with time series. A sample\n",
20 | | - "depends on past information.\n",
| 19 | + "This assumption is usually violated in time series, where each sample can be\n",
| 20 | + "influenced by previous samples (both their feature and target values) in an\n",
| 21 | + "inherently ordered sequence.\n",
21 | 22 | "\n",
22 | | - "We will take an example to highlight such issues with non-i.i.d. data in the\n",
23 | | - "previous cross-validation strategies presented. We are going to load financial\n",
| 23 | + "In this notebook we demonstrate the issues that arise when using the\n",
| 24 | + "cross-validation strategies we have presented so far on non-i.i.d.\n",
| 25 | + "data. For this purpose, we load financial\n",
24 | 26 | "quotations from some energy companies."
25 | 27 | ]
26 | 28 | },

91 | 93 | "data, target = quotes.drop(columns=[\"Chevron\"]), quotes[\"Chevron\"]\n",
92 | 94 | "data_train, data_test, target_train, target_test = train_test_split(\n",
93 | 95 | "    data, target, shuffle=True, random_state=0\n",
94 | | - ")"
| 96 | + ")\n",
| 97 | + "\n",
| 98 | + "# Shuffling breaks the index order, but we still want it to be time-ordered\n",
| 99 | + "data_train.sort_index(ascending=True, inplace=True)\n",
| 100 | + "data_test.sort_index(ascending=True, inplace=True)\n",
| 101 | + "target_train.sort_index(ascending=True, inplace=True)\n",
| 102 | + "target_test.sort_index(ascending=True, inplace=True)"
95 | 103 | ]
96 | 104 | },
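A note on the re-sorting step added above: `sort_index` restores chronological order on each object independently, and rows stay aligned because the features and targets share the same index. A minimal sketch, using a toy time-indexed frame as a hypothetical stand-in for the quotes data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy time-indexed data, a hypothetical stand-in for the financial quotes
index = pd.date_range("2020-01-01", periods=8, freq="D")
data = pd.DataFrame({"x": range(8)}, index=index)
target = pd.Series(range(8), index=index, name="y")

data_train, data_test, target_train, target_test = train_test_split(
    data, target, shuffle=True, random_state=0
)

# Re-sorting by the datetime index restores chronological order; rows stay
# aligned because data and target carry the same index labels
data_train = data_train.sort_index()
target_train = target_train.sort_index()
assert data_train.index.is_monotonic_increasing
assert (data_train.index == target_train.index).all()
```

Note that sorting only restores the display order inside each set; the train/test membership chosen by the shuffle is unchanged.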
|
97 | 105 | {
|
98 | 106 | "cell_type": "markdown",
|
99 | 107 | "metadata": {},
|
100 | 108 | "source": [
|
101 | 109 | "We will use a decision tree regressor that we expect to overfit and thus not\n",
|
102 |
| - "generalize to unseen data. We will use a `ShuffleSplit` cross-validation to\n", |
| 110 | + "generalize to unseen data. We use a `ShuffleSplit` cross-validation to\n", |
103 | 111 | "check the generalization performance of our model.\n",
|
104 | 112 | "\n",
|
105 | 113 | "Let's first define our model"
|
|
138 | 146 | "cell_type": "markdown",
139 | 147 | "metadata": {},
140 | 148 | "source": [
141 | | - "Finally, we perform the evaluation."
| 149 | + "We then perform the evaluation using the `ShuffleSplit` strategy."
142 | 150 | ]
143 | 151 | },
144 | 152 | {

161 | 169 | "source": [
162 | 170 | "Surprisingly, we get outstanding generalization performance. We will\n",
163 | 171 | "investigate and find the reason for such good results with a model that is\n",
164 | | - "expected to fail. We previously mentioned that `ShuffleSplit` is an iterative\n",
165 | | - "cross-validation scheme that shuffles data and split. We will simplify this\n",
| 172 | + "expected to fail. We previously mentioned that `ShuffleSplit` is a\n",
| 173 | + "cross-validation method that iteratively shuffles and splits the data.\n",
| 174 | + "\n",
| 175 | + "We can simplify the\n",
166 | 176 | "procedure with a single split and plot the prediction. We can use\n",
167 | 177 | "`train_test_split` for this purpose."
168 | 178 | ]
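To make the reworded description of `ShuffleSplit` concrete, here is a small sketch (the ten-sample array is illustrative, not from the notebook) showing that each iteration draws a fresh random test subset, ignoring any time order:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10).reshape(-1, 1)
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_idx, test_idx in cv.split(X):
    # Indices are redrawn at random on each iteration, ignoring time order
    print("train:", train_idx, "test:", test_idx)
```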
202 | 212 | "cell_type": "markdown",
203 | 213 | "metadata": {},
204 | 214 | "source": [
205 | | - "Similarly, we obtain good results in terms of $R^2$. We will plot the\n",
| 215 | + "Similarly, we obtain good results in terms of $R^2$. We now plot the\n",
206 | 216 | "training, testing and prediction samples."
207 | 217 | ]
208 | 218 | },

225 | 235 | "cell_type": "markdown",
226 | 236 | "metadata": {},
227 | 237 | "source": [
228 | | - "So in this context, it seems that the model predictions are following the\n",
229 | | - "testing. But we can also see that the testing samples are next to some\n",
230 | | - "training sample. And with these time-series, we see a relationship between a\n",
231 | | - "sample at the time `t` and a sample at `t+1`. In this case, we are violating\n",
232 | | - "the i.i.d. assumption. The insight to get is the following: a model can output\n",
233 | | - "of its training set at the time `t` for a testing sample at the time `t+1`.\n",
234 | | - "This prediction would be close to the true value even if our model did not\n",
235 | | - "learn anything, but just memorized the training dataset.\n",
| 238 | + "From the plot above, we can see that the training and testing samples\n",
| 239 | + "alternate. This structure effectively evaluates the model\u2019s ability to\n",
| 240 | + "interpolate between neighboring data points, rather than its true\n",
| 241 | + "generalization ability. As a result, the model\u2019s predictions are close to the\n",
| 242 | + "actual values, even if it has not learned anything meaningful from the data.\n",
| 243 | + "This is a form of **data leakage**, where the model gains access to future\n",
| 244 | + "information (testing data) while training, leading to an over-optimistic\n",
| 245 | + "estimate of the generalization performance.\n",
236 | 246 | "\n",
237 | | - "An easy way to verify this hypothesis is to not shuffle the data when doing\n",
| 247 | + "An easy way to verify this is to not shuffle the data during\n",
238 | 248 | "the split. In this case, we will use the first 75% of the data to train and\n",
239 | | - "the remaining data to test."
| 249 | + "the remaining data to test. This way we preserve the time order of the data\n",
| 250 | + "and ensure that we train on past data and evaluate on future data."
240 | 251 | ]
241 | 252 | },
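The interpolation-versus-extrapolation point above can also be shown on synthetic data. A hedged sketch, assuming a random walk as a stand-in for the autocorrelated quotes: an overfitted decision tree scores highly under a shuffled split, purely by interpolating between neighbors, but much lower under a chronological split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# A random walk: a toy stand-in for autocorrelated financial quotes
target = np.cumsum(rng.normal(size=500))
data = np.arange(500).reshape(-1, 1)  # the only feature is the time step

def r2_for(shuffle):
    X_train, X_test, y_train, y_test = train_test_split(
        data, target, shuffle=shuffle, random_state=0
    )
    tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
    return tree.score(X_test, y_test)

# Shuffled split: test points sit between memorized training neighbors
print(f"shuffled R2:      {r2_for(True):.2f}")
# Chronological split: the tree must extrapolate beyond the training range
print(f"chronological R2: {r2_for(False):.2f}")
```

With `shuffle=False` the fully grown tree predicts a constant for every future time step (trees cannot extrapolate), so the gap between the two scores exposes the leakage directly.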
242 | 253 | {

343 | 354 | "from sklearn.model_selection import TimeSeriesSplit\n",
344 | 355 | "\n",
345 | 356 | "cv = TimeSeriesSplit(n_splits=groups.nunique())\n",
346 | | - "test_score = cross_val_score(\n",
347 | | - "    regressor, data, target, cv=cv, groups=groups, n_jobs=2\n",
348 | | - ")\n",
| 357 | + "test_score = cross_val_score(regressor, data, target, cv=cv, n_jobs=2)\n",
349 | 358 | "print(f\"The mean R2 is: {test_score.mean():.2f} \u00b1 {test_score.std():.2f}\")"
350 | 359 | ]
351 | 360 | },
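For intuition on the splits `TimeSeriesSplit` produces, a small sketch (the ten-sample array is illustrative, not from the notebook): each fold trains on all samples that come before a contiguous test window, so no future information leaks into training:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
cv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in cv.split(X):
    # Every training index precedes every test index: no future leakage
    print("train:", train_idx, "test:", test_idx)
```

This also explains why the `groups` argument could be dropped from `cross_val_score` above: `TimeSeriesSplit` relies only on sample order, not on group labels.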
352 | 361 | {
353 | 362 | "cell_type": "markdown",
354 | 363 | "metadata": {},
355 | 364 | "source": [
356 | | - "In conclusion, it is really important to not use an out of the shelves\n",
| 365 | + "In conclusion, it is really important not to carelessly use a\n",
357 | 366 | "cross-validation strategy that does not respect assumptions such as having\n",
358 | | - "i.i.d data. It might lead to absurd results which could make think that a\n",
359 | | - "predictive model might work."
| 367 | + "i.i.d. data. Doing so might lead to misleading outcomes, creating the false\n",
| 368 | + "impression that a predictive model performs well when it would not in the\n",
| 369 | + "intended real-world scenario."
360 | 370 | ]
361 | 371 | }
362 | 372 | ],