[python-package] on scikit-learn 1.6+, predict() raises misleading warning "X does not have valid feature names" #6798

Open
jameslamb opened this issue Jan 25, 2025 · 0 comments

Description

Starting with scikit-learn 1.6 (I think), lightgbm.sklearn estimators emit this warning in predict(), even when the data passed to fit() did not have feature names (e.g. was just a numpy array):

/Users/jlamb/miniforge3/envs/lgb-dev/lib/python3.11/site-packages/sklearn/utils/validation.py:2739: UserWarning: X does not have valid feature names, but LGBMClassifier was fitted with feature names
warnings.warn(

This is confusing. lightgbm shouldn't raise this warning when the input data did not have feature names and LightGBM generated them automatically.

Reproducible example

This is sufficient to produce the warning.

import lightgbm as lgb
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000, n_features=5, centers=2)
clf = lgb.LGBMClassifier(verbose=-1).fit(X, y)
clf.predict(X[:5])

It is also showing up in CI logs, for example: https://github.com/microsoft/LightGBM/actions/runs/12922031182/job/36037058867#step:3:6864

I think that what's happening here is that scikit-learn will always say that the model "was fitted with feature names" because LightGBM automatically assigns feature names of the form Column_0, Column_1, etc.

# lightgbm automatic feature names
clf.feature_names_in_
# array(['Column_0', 'Column_1', 'Column_2', 'Column_3', 'Column_4'], dtype='<U8')

If input does not contain feature names, they will be added during fitting in the format ``Column_0``, ``Column_1``, ..., ``Column_N``.
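
To illustrate the suspected mechanism without depending on either library, here is a minimal standard-library sketch. It is a simplified model of the predict-time comparison, not scikit-learn's actual code: `get_feature_names`, `check_feature_names_at_predict`, and `AutoNamingEstimator` are hypothetical names made up for this example, and the only assumption carried over from the issue is that the estimator always ends up with `feature_names_in_` set, even for unnamed input.

```python
import warnings


def get_feature_names(X):
    """Return column names if X carries them, else None (simplified stand-in)."""
    # Assumption: only DataFrame-like objects with a `columns` attribute have names.
    columns = getattr(X, "columns", None)
    return list(columns) if columns is not None else None


def check_feature_names_at_predict(estimator, X):
    """Simplified model of the predict-time check, not scikit-learn's real code."""
    fitted_names = getattr(estimator, "feature_names_in_", None)
    x_names = get_feature_names(X)
    if x_names is None and fitted_names is not None:
        warnings.warn(
            "X does not have valid feature names, but "
            f"{type(estimator).__name__} was fitted with feature names",
            UserWarning,
        )


class AutoNamingEstimator:
    """Hypothetical estimator that always generates names, as LightGBM does."""

    def fit(self, X):
        names = get_feature_names(X)
        if names is None:
            # Mirror the Column_0, Column_1, ... naming described above.
            names = [f"Column_{i}" for i in range(len(X[0]))]
        self.feature_names_in_ = names
        return self


X = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # plain nested lists, no feature names
est = AutoNamingEstimator().fit(X)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_feature_names_at_predict(est, X)

messages = [str(w.message) for w in caught]
print(messages)
```

In this model the warning fires even though the same unnamed data is used for both fitting and prediction, because the check only looks at whether `feature_names_in_` exists and cannot tell generated names from user-supplied ones.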

Environment info

LightGBM version or commit hash: 3654eca

Command(s) you used to install LightGBM

cmake -B build -S .
cmake --build build --target _lightgbm
sh build-python.sh install --precompile

Additional Comments

Here's where the warning comes from in scikit-learn:

https://github.com/scikit-learn/scikit-learn/blob/e44742ea6c06ee891e92facb886f268f7cfc033b/sklearn/utils/validation.py#L2736-L2741

Any solution to this should comply with all of the expectations for scikit-learn estimators at https://scikit-learn.org/stable/developers/develop.html

@jameslamb jameslamb added the bug label Jan 25, 2025