MLForecast|forecast.py : possible data leakage risk in cross_validation (through X_df future exogenous variables)? #465

TAZ1657 · 2024-12-14T04:46:36Z

What happened + What you expected to happen

I am working with the VN1 competition dataset, trying to use GBM algoythm (XGBoost, LGBM, CatBoost) in both direct and recursive mode.

I have been using cross_validation for these models, and I noted unexpectedtly good results on cross_validation for recursive models: those results are substantially better than the results for the cross validation of the direct models, but also substantially better than the results of the recurssive model on a test set (for which I used naive forecast for the exogenous future values in X_df).
These observations have led me to believe there might be a data leakage issue.

I have looked the code of the cross_validation function in forecast.py, and specifically at how the X_df is created for each fold:
if my understanding is correct, X_df is crteated by copying the "valid" dataframe (ie the validation set for that fold), and removing some columns.

I am wondering if there could be a risk of data leakage in the cross_validation, if X_df is using actual values from the validation set of a given fold.

The relevant code section is in lines 947 to 957 of forecast.py

dynamic = [
c
for c in valid.columns
if c not in static + [id_col, time_col, target_col]
]
if dynamic:
X_df: Optional[DataFrame] = ufp.drop_columns(
valid, static + [target_col]
)
else:
X_df = None

Thank you for your attention to this.

Versions / Dependencies

Python 3.10.12
mlforecast 1.0.0
nixtla 0.6.4
lightgbm 4.5.0
catboost 1.2.7
xgboost 2.1.3

Reproduction script

This is not a bug, the code runs, the possible issue might rest in the logic used to create X_df future exogenous variables for a validation fold.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jmoralez · 2024-12-16T16:28:47Z

Hey @TAZ1657, thanks for using mlforecast.

The X_df argument is supposed to hold the (real) future values of the exogenous features, so that's what the cross validation method does automatically. If you want to do something custom like the naive forecast you're doing here you can't use the cross validation method.

github-actions · 2025-01-16T04:06:01Z

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one.

TAZ1657 added the bug label Dec 14, 2024

jmoralez added the awaiting response label Dec 16, 2024

github-actions bot closed this as completed Jan 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MLForecast|forecast.py : possible data leakage risk in cross_validation (through X_df future exogenous variables)? #465

MLForecast|forecast.py : possible data leakage risk in cross_validation (through X_df future exogenous variables)? #465

TAZ1657 commented Dec 14, 2024

jmoralez commented Dec 16, 2024

github-actions bot commented Jan 16, 2025

MLForecast|forecast.py : possible data leakage risk in cross_validation (through X_df future exogenous variables)? #465

MLForecast|forecast.py : possible data leakage risk in cross_validation (through X_df future exogenous variables)? #465

Comments

TAZ1657 commented Dec 14, 2024

What happened + What you expected to happen

Versions / Dependencies

Reproduction script

Issue Severity

jmoralez commented Dec 16, 2024

github-actions bot commented Jan 16, 2025