From 7aabb2a3f3ffffc3c936c4cedb59665397164f0c Mon Sep 17 00:00:00 2001
From: Guillaume Lemaitre
Date: Thu, 9 Mar 2023 09:42:04 +0100
Subject: [PATCH 01/11] Initial barebone draft

---
 slep021/proposal.rst | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)
 create mode 100644 slep021/proposal.rst

diff --git a/slep021/proposal.rst b/slep021/proposal.rst
new file mode 100644
index 0000000..e2dad71
--- /dev/null
+++ b/slep021/proposal.rst
@@ -0,0 +1,36 @@
+.. _slep_021:
+
+==================================================
+SLEP021: Unified API to compute feature importance
+==================================================
+
+:Author: Thomas J Fan, Guillaume Lemaitre
+:Status: Draft
+:Type: Standards Track
+:Created: 2023-03-09
+
+Abstract
+--------
+
+Detailed description
+--------------------
+
+Discussion
+----------
+
+References and Footnotes
+------------------------
+
+.. [1] Each SLEP must either be explicitly labeled as placed in the public
+   domain (see this SLEP as an example) or licensed under the `Open Publication
+   License`_.
+.. [2] `scikit-learn Governance and Decision-Making
+   <https://scikit-learn.org/stable/governance.html>`__
+
+.. _Open Publication License: https://www.opencontent.org/openpub/
+
+
+Copyright
+---------
+
+This document has been placed in the public domain. [1]_

From e753bad9a5e55d404f98f377c56ff2c3f07dd12c Mon Sep 17 00:00:00 2001
From: Guillaume Lemaitre
Date: Thu, 9 Mar 2023 10:46:26 +0100
Subject: [PATCH 02/11] draft motivation

---
 slep021/proposal.rst | 40 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)

diff --git a/slep021/proposal.rst b/slep021/proposal.rst
index e2dad71..f3579e3 100644
--- a/slep021/proposal.rst
+++ b/slep021/proposal.rst
@@ -12,9 +12,44 @@ SLEP021: Unified API to compute feature importance
 Abstract
 --------
 
+This SLEP proposes a common API for computing feature importance.
+
 Detailed description
 --------------------
 
+Motivation
+~~~~~~~~~~
+
+Data scientists rely on feature importance when inspecting a trained model.
+Feature importance is a measure of how much a feature contributes to the
+prediction and thus gives insights on the model and the predictions it
+provides.
+
+However, there is currently no single method to compute feature importance.
+All available methods are designed upon axioms or hypotheses that are not
+necessarily respected in practice.
+
+Some work in scikit-learn has been done to provide documentation highlighting
+the limitations of some implemented methods. However, there is currently
+no common way to expose feature importance in scikit-learn. In addition, for
+some historical reasons, some estimators (e.g. decision trees) provide a single
+feature importance that could be used as the "method-to-use" to analyse the
+model. It is problematic since there is no de facto standard to analyse the
+feature importance of a model.
+
+Therefore, this SLEP proposes an API for providing feature importance that is
+flexible to switch between methods and extensible to add new methods. It
+is a follow-up of initial discussions from [2]_.
+
+Current state
+~~~~~~~~~~~~~
+
+Current pitfalls
+~~~~~~~~~~~~~~~~
+
+Solution
+~~~~~~~~
+
 Discussion
 ----------
 
@@ -24,8 +59,7 @@ References and Footnotes
 .. [1] Each SLEP must either be explicitly labeled as placed in the public
    domain (see this SLEP as an example) or licensed under the `Open Publication
    License`_.
-.. [2] `scikit-learn Governance and Decision-Making
-   <https://scikit-learn.org/stable/governance.html>`__
+.. [2]
 
 .. _Open Publication License: https://www.opencontent.org/openpub/
@@ -33,4 +67,4 @@ References and Footnotes
 Copyright
 ---------
 
-This document has been placed in the public domain. [1]_
+This document has been placed in the public domain [1]_.

From 93dcd2f99220b5f4faabae83286c1cbd559ea49e Mon Sep 17 00:00:00 2001
From: Guillaume Lemaitre
Date: Thu, 9 Mar 2023 14:33:16 +0100
Subject: [PATCH 03/11] add link

---
 slep021/proposal.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/slep021/proposal.rst b/slep021/proposal.rst
index f3579e3..1d675e2 100644
--- a/slep021/proposal.rst
+++ b/slep021/proposal.rst
@@ -59,7 +59,7 @@ References and Footnotes
 .. [1] Each SLEP must either be explicitly labeled as placed in the public
    domain (see this SLEP as an example) or licensed under the `Open Publication
    License`_.
-.. [2]
+.. [2] https://github.com/scikit-learn/scikit-learn/pull/25659#pullrequestreview-1330861709
 
 .. _Open Publication License: https://www.opencontent.org/openpub/

From bcf37e738b8d90ced457012adf044f0268ee3236 Mon Sep 17 00:00:00 2001
From: Guillaume Lemaitre
Date: Thu, 9 Mar 2023 14:34:36 +0100
Subject: [PATCH 04/11] wrong link

---
 slep021/proposal.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/slep021/proposal.rst b/slep021/proposal.rst
index 1d675e2..d4d642b 100644
--- a/slep021/proposal.rst
+++ b/slep021/proposal.rst
@@ -59,7 +59,7 @@ References and Footnotes
 .. [1] Each SLEP must either be explicitly labeled as placed in the public
    domain (see this SLEP as an example) or licensed under the `Open Publication
    License`_.
-.. [2] https://github.com/scikit-learn/scikit-learn/pull/25659#pullrequestreview-1330861709
+.. [2] https://github.com/scikit-learn/scikit-learn/issues/20059#issuecomment-869811256
 
 .. _Open Publication License: https://www.opencontent.org/openpub/

From 0459715f019b9c8f76cf3a9ca038c6db6c7d1823 Mon Sep 17 00:00:00 2001
From: Guillaume Lemaitre
Date: Thu, 9 Mar 2023 16:45:12 +0100
Subject: [PATCH 05/11] add available methods and first use case

---
 index.rst            |  1 +
 slep021/proposal.rst | 41 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/index.rst b/index.rst
index ff7d43c..2a9ee3d 100644
--- a/index.rst
+++ b/index.rst
@@ -25,6 +25,7 @@
    slep012/proposal
    slep017/proposal
    slep019/proposal
+   slep021/proposal
 
 .. toctree::
    :maxdepth: 1

diff --git a/slep021/proposal.rst b/slep021/proposal.rst
index d4d642b..fe350b5 100644
--- a/slep021/proposal.rst
+++ b/slep021/proposal.rst
@@ -39,11 +39,46 @@ feature importance of a model.
 
 Therefore, this SLEP proposes an API for providing feature importance that is
 flexible to switch between methods and extensible to add new methods. It
-is a follow-up of initial discussions from [2]_.
+is a follow-up of initial discussions from :issue:`20059`.
 
 Current state
 ~~~~~~~~~~~~~
 
+Available methods
+^^^^^^^^^^^^^^^^^
+
+The following methods are available in scikit-learn to provide some feature
+importance:
+
+- The function :func:`sklearn.inspection.permutation_importance`. It takes
+  a fitted estimator and a dataset. Additional parameters can be provided. The
+  method returns a `Bunch` containing 3 attributes: all decreases in score for
+  all repetitions, the mean, and the standard deviation across the repeats.
+  This method is therefore estimator agnostic (see the sketch after this list).
+- The linear estimators have a `coef_` attribute once fitted.
+- The decision tree-based estimators have a `feature_importances_` attribute
+  once fitted.
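+
+As a minimal sketch of the permutation importance flow (assuming a fitted
+`tree` and a held-out `X_test`, `y_test`; the parameters are illustrative)::
+
+    >>> from sklearn.inspection import permutation_importance
+    >>> result = permutation_importance(tree, X_test, y_test, n_repeats=10)
+    >>> result.importances.shape  # (n_features, n_repeats)
+    >>> result.importances_mean  # mean decrease in score per feature
+    >>> result.importances_std  # standard deviation across the repeats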
+
+Use cases
+^^^^^^^^^
+
+The first usage of feature importance is to inspect a fitted model. Usually,
+the feature importance will be plotted to visualize the importance of the
+features::
+
+    >>> tree = DecisionTreeClassifier().fit(X_train, y_train)
+    >>> plt.barh(X_train.columns, tree.feature_importances_)
+
+The analysis can be taken further by checking the variance of the feature
+importance. :func:`sklearn.inspection.permutation_importance` already provides
+a way to do that since it repeats the computation several times. For the model
+specific feature importance, the user can use cross-validation to get an idea
+of the dispersion::
+
+    >>> cv_results = cross_validate(tree, X_train, y_train, return_estimator=True)
+    >>> feature_importances = [est.feature_importances_ for est in cv_results["estimator"]]
+    >>> plt.boxplot(feature_importances, labels=X_train.columns)
+
 Current pitfalls
 ~~~~~~~~~~~~~~~~
 
@@ -53,13 +88,15 @@ Solution
 Discussion
 ----------
 
+Issues where some aspects of feature importance have been discussed:
+:issue:`20059`, :issue:`21170`.
+
 References and Footnotes
 ------------------------
 
 .. [1] Each SLEP must either be explicitly labeled as placed in the public
    domain (see this SLEP as an example) or licensed under the `Open Publication
    License`_.
-.. [2] https://github.com/scikit-learn/scikit-learn/issues/20059#issuecomment-869811256
 
 .. _Open Publication License: https://www.opencontent.org/openpub/

From 341c79a4bd1dede62a75f433f62c9002b699f531 Mon Sep 17 00:00:00 2001
From: Guillaume Lemaitre
Date: Thu, 9 Mar 2023 17:09:16 +0100
Subject: [PATCH 06/11] iter

---
 slep021/proposal.rst | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/slep021/proposal.rst b/slep021/proposal.rst
index fe350b5..6419a3c 100644
--- a/slep021/proposal.rst
+++ b/slep021/proposal.rst
@@ -79,12 +79,49 @@ of the dispersion::
     >>> feature_importances = [est.feature_importances_ for est in cv_results["estimator"]]
     >>> plt.boxplot(feature_importances, labels=X_train.columns)
 
+The second usage concerns model selection. Meta-estimators such as
+:class:`sklearn.feature_selection.SelectFromModel` internally use an array of
+length `(n_features,)` to select features and retrain a model on this subset of
+features.
+
+By default, :class:`sklearn.feature_selection.SelectFromModel` relies on the
+estimator to expose `coef_` or `feature_importances_`::
+
+    >>> SelectFromModel(tree).fit(X_train, y_train)  # `tree` exposes `feature_importances_`
+
+For more flexbilibity, a string can be provided::
+
+    >>> linear_model = make_pipeline(StandardScaler(), LogisticRegression())
+    >>> SelectFromModel(
+    ...     linear_model, importance_getter="named_steps.logisticregression.coef_"
+    ... ).fit(X_train, y_train)
+
+:class:`sklearn.feature_selection.SelectFromModel` relies by default on
+the estimator to expose a `coef_` or `feature_importances_` attribute. It is
+also possible to provide a string corresponding to the attribute name returning
+the feature importance. It allows dealing with an estimator embedded inside a
+pipeline, for instance. Finally, a callable taking an estimator and returning
+a NumPy array can also be provided.
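+
+As a minimal sketch of the callable variant (assuming a binary problem and the
+`linear_model` pipeline above; the helper below is purely illustrative)::
+
+    >>> import numpy as np
+    >>> def importance_from_coef(estimator):
+    ...     # `estimator[-1]` is the fitted `LogisticRegression` step
+    ...     return np.abs(estimator[-1].coef_).ravel()
+    >>> SelectFromModel(
+    ...     linear_model, importance_getter=importance_from_coef
+    ... ).fit(X_train, y_train)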
+
 Current pitfalls
 ~~~~~~~~~~~~~~~~
 
 Solution
 ~~~~~~~~
 
+Plotting
+~~~~~~~~
+
+Add a new :class:`sklearn.inspection.FeatureImportanceDisplay` class to
+:mod:`sklearn.inspection`. Two methods could be useful for this display: (i)
+:meth:`sklearn.inspection.FeatureImportanceDisplay.from_estimator` to plot
+a single estimate of feature importance and (ii)
+:meth:`sklearn.inspection.FeatureImportanceDisplay.from_cv_results` to plot
+an estimate of the feature importance together with the variance.
+
+The display should therefore be aware of how to retrieve the feature importance
+given the esimator.
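+
+As an illustration, a hypothetical usage could look as follows (the class and
+the signatures below are part of the proposal, not of the existing API)::
+
+    >>> FeatureImportanceDisplay.from_estimator(tree, X_test, y_test)
+    >>> FeatureImportanceDisplay.from_cv_results(cv_results, X_train, y_train)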
+
 Discussion
 ----------

From f95a7a3d483e9ea2f8fe3b6ad7dac0bfa8c9d124 Mon Sep 17 00:00:00 2001
From: Guillaume Lemaitre
Date: Fri, 10 Mar 2023 14:48:38 +0100
Subject: [PATCH 07/11] iter

---
 slep021/proposal.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/slep021/proposal.rst b/slep021/proposal.rst
index 6419a3c..7127b50 100644
--- a/slep021/proposal.rst
+++ b/slep021/proposal.rst
@@ -110,7 +110,7 @@ Solution
 ~~~~~~~~
 
 Plotting
-~~~~~~~~
+^^^^^^^^
 
 Add a new :class:`sklearn.inspection.FeatureImportanceDisplay` class to
 :mod:`sklearn.inspection`. Two methods could be useful for this display: (i)
@@ -120,7 +120,7 @@ a single estimate of feature importance and (ii)
 :meth:`sklearn.inspection.FeatureImportanceDisplay.from_cv_results` to plot
 an estimate of the feature importance together with the variance.
 
 The display should therefore be aware of how to retrieve the feature importance
-given the esimator.
+given the estimator.
 
 Discussion
 ----------

From 7f1fcb507666f92af1903dd5dd70e99320f9cffd Mon Sep 17 00:00:00 2001
From: Guillaume Lemaitre
Date: Fri, 10 Mar 2023 15:02:28 +0100
Subject: [PATCH 08/11] iter

---
 slep021/proposal.rst | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/slep021/proposal.rst b/slep021/proposal.rst
index 7127b50..eed0c95 100644
--- a/slep021/proposal.rst
+++ b/slep021/proposal.rst
@@ -89,7 +89,7 @@ estimator to expose `coef_` or `feature_importances_`::
 
     >>> SelectFromModel(tree).fit(X_train, y_train)  # `tree` exposes `feature_importances_`
 
-For more flexbilibity, a string can be provided::
+For more flexibility, a string can be provided::
 
     >>> linear_model = make_pipeline(StandardScaler(), LogisticRegression())
     >>> SelectFromModel(
     ...     linear_model, importance_getter="named_steps.logisticregression.coef_"
     ... ).fit(X_train, y_train)
@@ -106,6 +106,22 @@ a NumPy array can also be provided.
 Current pitfalls
 ~~~~~~~~~~~~~~~~
 
+From a methodological perspective, scikit-learn does not encourage good
+practice. Indeed, since it provides a de facto `feature_importances_` attribute
+for decision trees, it is tempting for users to believe that this method is
+the best one.
+
+In the same spirit, the :class:`sklearn.feature_selection.SelectFromModel`
+meta-estimator relies de facto on `feature_importances_` or `coef_` for
+selecting features.
+
+In both cases, it would be better to request the user to be more explicit and
+to choose a specific method to compute the feature importance for
+inspection or feature selection.
+
+From an API perspective, the current functionalities for feature importance are
+available via functions or attributes, with no common API.
+
 Solution
 ~~~~~~~~

From 487e7620f12e6cec690bfdf9adf815b9aa5a8a Mon Sep 17 00:00:00 2001
From: Guillaume Lemaitre
Date: Fri, 10 Mar 2023 15:39:41 +0100
Subject: [PATCH 09/11] iter

---
 slep021/proposal.rst | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/slep021/proposal.rst b/slep021/proposal.rst
index eed0c95..eb0b28f 100644
--- a/slep021/proposal.rst
+++ b/slep021/proposal.rst
@@ -119,12 +119,38 @@
 In both cases, it would be better to request the user to be more explicit and
 to choose a specific method to compute the feature importance for
 inspection or feature selection.
 
+Additionally, `feature_importances_` and `coef_` are statistics derived from
+the training set. We already documented that the reported
+`feature_importances_` can be biased towards features used by the
+model to overfit. Thus, it can negatively impact the feature
+selection once used in the :class:`sklearn.feature_selection.SelectFromModel`
+meta-estimator.
+
 From an API perspective, the current functionalities for feature importance are
 available via functions or attributes, with no common API.
 
 Solution
 ~~~~~~~~
 
+A common API
+^^^^^^^^^^^^
+
+**Proposal 1**: Expose a parameter in `__init__` to select the method to use
+to compute the feature importance. The computation will be done using a method,
+e.g. `get_feature_importance`, that could take additional parameters requested
+by the feature importance method. This method could therefore be used
+internally by :class:`sklearn.feature_selection.SelectFromModel`.
+
+**Proposal 2**: Create a meta-estimator that takes a model and a method in
+`__init__`. Then, a method `fit` could compute the feature importance given
+some data. Then, the feature importance could be available through a fitted
+attribute `feature_importances_` (or a method?). We could reuse such a
+meta-estimator in the :class:`sklearn.feature_selection.SelectFromModel`.
+
+Then, we should rely on a common API for the methods computing the feature
+importance. It seems that they should all at least accept a fitted estimator,
+some dataset, and potentially some extra parameters.
+
 Plotting
 ^^^^^^^^
@@ -144,12 +170,18 @@ Discussion
 ----------
 
 Issues where some aspects of feature importance have been discussed:
 :issue:`20059`, :issue:`21170`.
 
+In the SHAP package [2]_, the API is similar to proposal 2. A class `Explainer`
+takes a model, an algorithm, and some additional parameters (that could be
+used by some algorithm). The computation of the Shapley values is done and
+returned using the method `shap_values`.
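+
+For reference, a minimal sketch of that flow (based on the SHAP documentation;
+only illustrative here)::
+
+    >>> import shap
+    >>> explainer = shap.TreeExplainer(model)  # the algorithm is tied to the class
+    >>> shap_values = explainer.shap_values(X)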
+
 References and Footnotes
 ------------------------
 
 .. [1] Each SLEP must either be explicitly labeled as placed in the public
    domain (see this SLEP as an example) or licensed under the `Open Publication
    License`_.
+.. [2] https://shap.readthedocs.io/en/latest/
 
 .. _Open Publication License: https://www.opencontent.org/openpub/

From 71fed85d5480d28b2c15d27669c927adb996a0cf Mon Sep 17 00:00:00 2001
From: Guillaume Lemaitre
Date: Wed, 15 Mar 2023 16:01:11 +0100
Subject: [PATCH 10/11] some rephrasing and new proposal

---
 slep021/proposal.rst | 48 ++++++++++++++++++++++++++------------------
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/slep021/proposal.rst b/slep021/proposal.rst
index eb0b28f..a11a72b 100644
--- a/slep021/proposal.rst
+++ b/slep021/proposal.rst
@@ -21,21 +21,16 @@ Motivation
 ~~~~~~~~~~
 
 Data scientists rely on feature importance when inspecting a trained model.
-Feature importance is a measure of how much a feature contributes to the
-prediction and thus gives insights on the model and the predictions it
-provides.
-
-However, there is currently no single method to compute feature importance.
-All available methods are designed upon axioms or hypotheses that are not
-necessarily respected in practice.
-
-Some work in scikit-learn has been done to provide documentation highlighting
-the limitations of some implemented methods. However, there is currently
-no common way to expose feature importance in scikit-learn. In addition, for
-some historical reasons, some estimators (e.g. decision trees) provide a single
-feature importance that could be used as the "method-to-use" to analyse the
-model. It is problematic since there is no de facto standard to analyse the
-feature importance of a model.
+However, there is currently no single algorithm providing **the** feature
+importance. In practice, several algorithms are available, all having their
+pros and cons.
+
+In scikit-learn, there are different ways to compute and inspect feature
+importances. Some models, e.g. some tree-based models, expose a
+`feature_importances_` attribute upon `fit`, and we also have utilities such as
+the `permutation_importance` function to compute a different type of feature
+importance. There has been some work documenting their limitations, but we have
+not provided a nice API to implement alternatives.
 
 Therefore, this SLEP proposes an API for providing feature importance that is
 flexible to switch between methods and extensible to add new methods. It
@@ -55,7 +50,9 @@ importance:
   method returns a `Bunch` containing 3 attributes: all decreases in score for
   all repetitions, the mean, and the standard deviation across the repeats.
   This method is therefore estimator agnostic (see the sketch after this list).
-- The linear estimators have a `coef_` attribute once fitted.
+- The linear estimators have a `coef_` attribute once fitted, which is
+  sometimes used as their corresponding importance. We documented the
+  limitations when it comes to interpreting those coefficients.
 - The decision tree-based estimators have a `feature_importances_` attribute
   once fitted.
@@ -79,7 +76,7 @@ of the dispersion::
     >>> feature_importances = [est.feature_importances_ for est in cv_results["estimator"]]
     >>> plt.boxplot(feature_importances, labels=X_train.columns)
 
-The second usage concerns model selection. Meta-estimators such as
+The second usage concerns feature selection. Meta-estimators such as
 :class:`sklearn.feature_selection.SelectFromModel` internally use an array of
@@ -144,13 +141,26 @@ A common API
 **Proposal 2**: Create a meta-estimator that takes a model and a method in
 `__init__`. Then, a method `fit` could compute the feature importance given
 some data. Then, the feature importance could be available through a fitted
-attribute `feature_importances_` (or a method?). We could reuse such a
-meta-estimator in the :class:`sklearn.feature_selection.SelectFromModel`.
+attribute `feature_importances_` or a method `get_feature_importance`. We could
+reuse such a meta-estimator in the
+:class:`sklearn.feature_selection.SelectFromModel`.
 
 Then, we should rely on a common API for the methods computing the feature
 importance. It seems that they should all at least accept a fitted estimator,
 some dataset, and potentially some extra parameters.
 
+**Proposal 3**: Similarly to proposal 2 and taking inspiration from the
+SHAP package [2]_, we could create a class `Explainer` providing a
+`get_feature_importance` method given some data.
+
+Currently, scikit-learn provides only global feature importance. The previous
+API could be extended with a `get_samples_importance` method to compute an
+explanation per sample if the given method supports it (e.g. Shapley values).
+
+**Proposal 4**: Create a meta-estimator `FeatureImportanceCalculator` that
+could be passed to plotting displays or to an
+`estimator.get_feature_importance` method.
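+
+To make these proposals concrete, a hypothetical user-facing sketch could look
+as follows (none of the names or signatures below exist yet; they only
+illustrate proposals 1 and 2)::
+
+    >>> # Proposal 1: the method is selected on the estimator itself
+    >>> tree = DecisionTreeClassifier(feature_importance="permutation")
+    >>> tree.fit(X_train, y_train).get_feature_importance(X_test, y_test)
+    >>> # Proposal 2: a meta-estimator wrapping a model and a method
+    >>> fi = FeatureImportance(tree, method="permutation").fit(X_test, y_test)
+    >>> fi.feature_importances_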
+
 Plotting
 ^^^^^^^^

From 934904c8d8907a8f04c8bfc9af9513f8187c7fcf Mon Sep 17 00:00:00 2001
From: Guillaume Lemaitre
Date: Thu, 16 May 2024 22:05:46 +0200
Subject: [PATCH 11/11] add related method for feature importances

---
 slep021/proposal.rst | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/slep021/proposal.rst b/slep021/proposal.rst
index a11a72b..8824ed6 100644
--- a/slep021/proposal.rst
+++ b/slep021/proposal.rst
@@ -185,6 +185,18 @@
 takes a model, an algorithm, and some additional parameters (that could be
 used by some algorithm). The computation of the Shapley values is done and
 returned using the method `shap_values`.
 
+Related issues
+--------------
+
+Some discussions happened in the past. In this section, we aggregate the issues
+related to this topic:
+
+- :issue:`15132`: proposal to add `feature_importances_` to the
+  `HistGradientBoosting` classifier and regressor models.
+- :issue:`18223`: proposal to implement the PIMP feature importance.
+- :issue:`18603`: implement OOB permutation importance for `RandomForest`.
+- :issue:`21170`: implement variable importances for linear models.
+
 References and Footnotes
 ------------------------