I am working with the VN1 competition dataset, trying to use GBM algorithms (XGBoost, LightGBM, CatBoost) in both direct and recursive mode.
I have been using cross_validation for these models, and I noticed unexpectedly good results on cross_validation for the recursive models: those results are substantially better than the cross-validation results for the direct models, but also substantially better than the results of the recursive model on a test set (for which I used a naive forecast for the future exogenous values in X_df).
These observations led me to believe there might be a data leakage issue.
I have looked at the code of the cross_validation function in forecast.py, specifically at how X_df is created for each fold:
if my understanding is correct, X_df is created by copying the "valid" dataframe (i.e. the validation set for that fold) and removing some columns.
I am wondering whether there is a risk of data leakage in cross_validation if X_df uses actual values from the validation set of a given fold.
The relevant code section is in lines 947 to 957 of forecast.py
dynamic = [
    c
    for c in valid.columns
    if c not in static + [id_col, time_col, target_col]
]
if dynamic:
    X_df: Optional[DataFrame] = ufp.drop_columns(
        valid, static + [target_col]
    )
else:
    X_df = None
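To make the snippet above concrete, here is a minimal stdlib sketch of the same column-selection logic (the data is hypothetical, and a plain dict comprehension stands in for `ufp.drop_columns`). It shows that the dynamic columns kept in X_df carry the actual values from the validation fold:

```python
# Hypothetical stand-in for the snippet above: "valid" is the validation
# fold, represented here as a dict of column -> values instead of a
# DataFrame, and drop_columns is replaced by a dict comprehension.
valid = {
    "unique_id": ["A", "A"],
    "ds": [11, 12],
    "y": [105.0, 98.0],     # target (dropped)
    "store_size": [3, 3],   # static feature (dropped)
    "price": [9.99, 7.49],  # dynamic exogenous (kept, with ACTUAL fold values)
}
id_col, time_col, target_col = "unique_id", "ds", "y"
static = ["store_size"]

# Same filter as in forecast.py: anything that is not static, id, time,
# or target is treated as a dynamic exogenous feature.
dynamic = [
    c for c in valid if c not in static + [id_col, time_col, target_col]
]
if dynamic:
    # X_df keeps id/time plus the dynamic columns, copied straight from
    # the validation fold -- i.e. the real future values of "price".
    X_df = {c: v for c, v in valid.items() if c not in static + [target_col]}
else:
    X_df = None

print(dynamic)         # ['price']
print(sorted(X_df))    # ['ds', 'price', 'unique_id']
print(X_df["price"])   # [9.99, 7.49]
```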
The X_df argument is supposed to hold the (real) future values of the exogenous features, so that's what the cross_validation method builds automatically. If you want to do something custom, like the naive forecast you're using here, you can't use the cross_validation method.
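For a manual backtest with naively forecast exogenous values, one approach is to construct X_df yourself before each predict call, carrying the last observed value of each dynamic column forward over the horizon. The helper below is a hypothetical stdlib sketch (the function name, the `ds` time column, and the row layout are assumptions, not mlforecast API):

```python
def naive_future_exog(last_values, future_times, dynamic_cols):
    """Build future exogenous rows by repeating the last observed value
    of each dynamic column over the forecast horizon (a naive forecast
    of X_df for one series). Hypothetical helper, not part of mlforecast.

    last_values: dict mapping column name -> last observed value
    future_times: list of future timestamps for this series
    dynamic_cols: names of the dynamic exogenous columns
    """
    rows = []
    for t in future_times:
        row = {"ds": t}
        for c in dynamic_cols:
            row[c] = last_values[c]  # repeat the last known value
        rows.append(row)
    return rows

# Example: last observed price is 7.49, forecast two steps ahead.
rows = naive_future_exog({"price": 7.49}, [13, 14], ["price"])
print(rows)  # [{'ds': 13, 'price': 7.49}, {'ds': 14, 'price': 7.49}]
```

A frame built this way (converted to the DataFrame layout predict expects) would then replace the actual-valued X_df that cross_validation builds internally.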
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one.
What happened + What you expected to happen
I am working with the VN1 competition dataset, trying to use GBM algorithms (XGBoost, LightGBM, CatBoost) in both direct and recursive mode.
I have been using cross_validation for these models, and I noticed unexpectedly good results on cross_validation for the recursive models: those results are substantially better than the cross-validation results for the direct models, but also substantially better than the results of the recursive model on a test set (for which I used a naive forecast for the future exogenous values in X_df).
These observations led me to believe there might be a data leakage issue.
I have looked at the code of the cross_validation function in forecast.py, specifically at how X_df is created for each fold:
if my understanding is correct, X_df is created by copying the "valid" dataframe (i.e. the validation set for that fold) and removing some columns.
I am wondering whether there is a risk of data leakage in cross_validation if X_df uses actual values from the validation set of a given fold.
The relevant code section is in lines 947 to 957 of forecast.py
dynamic = [
    c
    for c in valid.columns
    if c not in static + [id_col, time_col, target_col]
]
if dynamic:
    X_df: Optional[DataFrame] = ufp.drop_columns(
        valid, static + [target_col]
    )
else:
    X_df = None
Thank you for your attention to this.
Versions / Dependencies
Python 3.10.12
mlforecast 1.0.0
nixtla 0.6.4
lightgbm 4.5.0
catboost 1.2.7
xgboost 2.1.3
Reproduction script
This is not a bug in the sense that the code fails to run; the possible issue lies in the logic used to create the future exogenous values in X_df for a validation fold.
Issue Severity
Medium: It is a significant difficulty but I can work around it.