Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MLForecast|forecast.py : possible data leakage risk in cross_validation (through X_df future exogenous variables)? #465

Closed
TAZ1657 opened this issue Dec 14, 2024 · 2 comments

Comments

@TAZ1657
Copy link

TAZ1657 commented Dec 14, 2024

What happened + What you expected to happen

I am working with the VN1 competition dataset, trying to use GBM algoythm (XGBoost, LGBM, CatBoost) in both direct and recursive mode.

I have been using cross_validation for these models, and I noted unexpectedtly good results on cross_validation for recursive models: those results are substantially better than the results for the cross validation of the direct models, but also substantially better than the results of the recurssive model on a test set (for which I used naive forecast for the exogenous future values in X_df).
These observations have led me to believe there might be a data leakage issue.

I have looked the code of the cross_validation function in forecast.py, and specifically at how the X_df is created for each fold:
if my understanding is correct, X_df is crteated by copying the "valid" dataframe (ie the validation set for that fold), and removing some columns.

I am wondering if there could be a risk of data leakage in the cross_validation, if X_df is using actual values from the validation set of a given fold.

The relevant code section is in lines 947 to 957 of forecast.py

dynamic = [
c
for c in valid.columns
if c not in static + [id_col, time_col, target_col]
]
if dynamic:
X_df: Optional[DataFrame] = ufp.drop_columns(
valid, static + [target_col]
)
else:
X_df = None

Thank you for your attention to this.

Versions / Dependencies

Python 3.10.12
mlforecast 1.0.0
nixtla 0.6.4
lightgbm 4.5.0
catboost 1.2.7
xgboost 2.1.3

Reproduction script

This is not a bug, the code runs, the possible issue might rest in the logic used to create X_df future exogenous variables for a validation fold.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@TAZ1657 TAZ1657 added the bug label Dec 14, 2024
@jmoralez
Copy link
Member

Hey @TAZ1657, thanks for using mlforecast.

The X_df argument is supposed to hold the (real) future values of the exogenous features, so that's what the cross validation method does automatically. If you want to do something custom like the naive forecast you're doing here you can't use the cross validation method.

Copy link
Contributor

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants