[python] add return_cvbooster flag to cv func and publish _CVBooster (#283,#2105,#1445) #3204

Merged (9 commits) on Aug 2, 2020

Conversation

momijiame (Contributor)

This PR allows users to get the trained boosters from the cv function directly.
This feature is useful for ensemble techniques and OOF (out-of-fold) predictions.

It adds the changes proposed in the #283 discussion:

  • Add a return_cvbooster flag to the cv function
  • Rename _CVBooster to make it public
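
For reference, a minimal sketch of the new flow (the toy data, parameter values, and the X_new matrix are only illustrative assumptions; the return_cvbooster flag and the 'cvbooster' result key are the ones added by this PR):

import numpy as np
import lightgbm as lgb

# Toy data for illustration only; use your own features/labels in practice.
X = np.random.rand(500, 10)
y = np.random.randint(0, 2, size=500)
train_set = lgb.Dataset(X, label=y)
params = {'objective': 'binary', 'metric': 'binary_logloss', 'verbosity': -1}

cv_results = lgb.cv(params, train_set, num_boost_round=100, nfold=5,
                    return_cvbooster=True)

cvbooster = cv_results['cvbooster']   # the now-public CVBooster
boosters = cvbooster.boosters         # one trained Booster per fold

# Ensemble/OOF-style usage: e.g. average the per-fold predictions on new data.
X_new = np.random.rand(5, 10)
preds = np.mean([bst.predict(X_new) for bst in boosters], axis=0)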

@ghost commented Jul 3, 2020

CLA assistant check
All CLA requirements met.

@momijiame (Contributor, Author)

Hmm... at a glance, I don't think this change is what broke CI.

/__w/1/s/.ci/test.sh: line 103: pytest: command not found
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=6577&view=logs&j=02a2c3ba-81f8-54e3-0767-5d5adbb0daa9&t=720ee3fa-96d4-5b47-dbf4-01607b74ade2&l=4772

@StrikerRUS (Collaborator)

Hmm... at a glance, I don't think this change is what broke CI.

Random network error. Fixed. Sorry for the inconvenience.

@StrikerRUS (Collaborator)

@momijiame Please add the new public class to the documentation: https://github.com/microsoft/LightGBM/blob/master/docs/Python-API.rst.

@matsuken92 Can you please help to review this PR?

@momijiame (Contributor, Author)

@StrikerRUS Thank you for the quick response! I added CVBooster to the documentation.

@StrikerRUS (Collaborator) left a comment

@momijiame Thanks for your quick fixes! Please take a look at some of my comments below:

(Review comments were left on python-package/lightgbm/__init__.py, python-package/lightgbm/engine.py, and tests/python_package_test/test_engine.py; all are resolved or outdated.)

- Add some clarifications to the documentation
- Rename CVBooster.append to make private
- Decrease iteration rounds of testing to save CI time
- Use CVBooster as root member of lgb
@momijiame (Contributor, Author)

@StrikerRUS Thank you so much for the feedback! I have addressed your comments.

@momijiame momijiame requested a review from StrikerRUS July 7, 2020 00:13
@jameslamb jameslamb removed their request for review July 7, 2020 18:26
@StrikerRUS (Collaborator) left a comment

Please add some more checks to the tests, and I believe we should document the best_iteration and boosters attributes to let users know about them. (You can simply Commit suggestion if you agree with it.)

(Review comments were left on python-package/lightgbm/engine.py and tests/python_package_test/test_engine.py; resolved or outdated.)

@momijiame (Contributor, Author)

Thank you for the great suggestions. I committed them.

@StrikerRUS (Collaborator) left a comment

@momijiame Thank you very much for this enhancement! LGTM!

I'd like to get one more review before merging (preferably from @matsuken92 or @henry0312).

@guolinke (Collaborator) left a comment

LGTM

@StrikerRUS (Collaborator)

Gently pinging @matsuken92, as you wanted to review the general idea of this PR.

@momijiame momijiame changed the title [python] add return_cvbooster flag to cv func and publish _CVBooster (#283,#2105) [python] add return_cvbooster flag to cv func and publish _CVBooster (#283,#2105,#1445) Jul 21, 2020
@momijiame (Contributor, Author)

I noticed that this PR also solves #1445, so I updated the title.
Users can extract feature importances from the fitted Boosters.

NOTE: It would be even more useful if plot_importance() supported CVBooster and/or List[Booster].
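
A hedged sketch of what this enables, reusing cv_results from the earlier example (averaging the per-fold importances is just one possible aggregation):

import numpy as np
import lightgbm as lgb

boosters = cv_results['cvbooster'].boosters

# Gain-based importance for every fold, then averaged across folds.
importances = np.array([bst.feature_importance(importance_type='gain')
                        for bst in boosters])
mean_importance = importances.mean(axis=0)

# plot_importance() currently takes a single Booster, so plot fold by fold
# (requires matplotlib).
for i, bst in enumerate(boosters):
    lgb.plot_importance(bst, title='Fold %d' % i)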

@mirekphd commented Sep 12, 2020

@momijiame: thank you for this much-needed feature!
I seem to have found a rather important omission: did we ensure that best_iteration is returned for each fold from CVBooster when using early stopping? Or did we perhaps assume somewhere that all folds stop at the same iteration?

After running lgbm_cv_dict = lgbm.cv(params=params_dict, ..., return_cvbooster=True), with params_dict containing a custom early_stopping_rounds value,
lgbm_cv_dict['cvbooster'].boosters[k].best_iteration seems to be missing (it is equal to -1 for all folds).

For example, the numbers of iterations (current / best) for the re-fitted CV models:

Fold 0: 154 / -1
Fold 1: 150 / -1
Fold 2: 155 / -1
Fold 3: 151 / -1
Fold 4: 155 / -1

CVBooster best iteration: 123

On the other hand, lgbm_cv_dict['cvbooster'].best_iteration is returned correctly (e.g. 123); it would be -1 if early_stopping_rounds were not specified, i.e. if early stopping were not used. The current/last/max iteration numbers also seem unaffected: lgbm_cv_dict['cvbooster'].boosters[k].current_iteration() and lgbm_cv_dict['cvbooster'].boosters[k].num_trees() are never -1 and are always equal to params_dict['num_boost_round'], i.e. the maximum number of iterations allowed when using early stopping.

Without the best iteration it is not possible to obtain per-fold predictions or feature importances (which rely on the iteration number). In fact, I suspect they may be incorrect now, because they are the same regardless of the value the iteration parameter is set to:

lgbm_cv_dict['cvbooster'].boosters[k].feature_importance()
# or
lgbm_cv_dict['cvbooster'].boosters[k].feature_importance(iteration=fold_best_iter_num)
# or
lgbm_cv_dict['cvbooster'].boosters[k].feature_importance(iteration=fold_cur_iter_num)

# where:
fold_best_iter_num = lgbm_cv_dict['cvbooster'].boosters[k].best_iteration
fold_cur_iter_num = lgbm_cv_dict['cvbooster'].boosters[k].current_iteration()

For individual folds it is also not possible to perform the usual argmax/argmin trick, because lgbm.cv still returns the evaluation history for the entire CV model, with metrics already averaged over all folds:

lgbm_best_mean_iter = np.argmax(lgbm_cv_dict[METRIC + '-mean']) + 1
print("\nBest mean iteration based on the evaluation history for re-fitted CV models: %d" % lgbm_best_mean_iter)

assert lgbm_best_mean_iter == lgbm_cv_dict['cvbooster'].best_iteration

@momijiame (Contributor, Author)

@mirekphd Thank you for your feedback.
I am not sure whether it is documented, but this is expected behavior. The cv() function has always updated the Boosters while keeping all folds synchronized internally: early stopping is triggered by the average metric of all folds. Therefore, best_iteration is equal in all folds. IMHO, this behavior makes cross-validation conservative.

We could consider setting the Boosters' best_iteration to that shared value, but that would break backwards compatibility; even before the return_cvbooster option was added, the technique of retrieving the CVBooster (_CVBooster) by using a callback already existed.

Also, if we want to train a Booster on each fold independently, we can simply use the train() function. The documentation could be improved on the points above, but I do not think the behavior should be changed.
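
For context, a rough sketch of the pre-existing callback technique mentioned above (it reuses params and train_set from the first example and relies on the fact that, inside cv(), the callback environment's model attribute is the internal CVBooster):

import lightgbm as lgb

class SaveCVBoosterCallback:
    """Keep a reference to the CVBooster that cv() builds internally."""
    def __init__(self):
        self.cvbooster = None

    def __call__(self, env):
        # env is the callback environment LightGBM passes at each iteration;
        # during cv() its `model` attribute is the CVBooster.
        self.cvbooster = env.model

save_cb = SaveCVBoosterCallback()
lgb.cv(params, train_set, num_boost_round=100, nfold=5, callbacks=[save_cb])
fold_boosters = save_cb.cvbooster.boosters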

@mirekphd commented Sep 14, 2020

Therefore, best_iteration is equal in all folds.

Not in this case :). Please note that we are talking only about early stopping here, where best_iteration is unknown a priori until each fold's model completes (only the maximum number of rounds can be specified by the user - we know this one, and the CVBooster params return it correctly). Only then - after each CV model completes - can we compute the average best iteration across all folds and return it as the CVBooster's "best iteration".

To see that each fold uses a different number of iterations, note that there is only one training pass per CV fold (as opposed to two passes), and look at my example above, where training stopped after between 150 and 155 rounds (as reported by .current_iteration() for each fold).

The information on the best_iteration for each fold is already available somewhere; we just need to expose it to the user, replacing those -1 values, just as we already return .current_iteration() (which is greater than or equal to best_iteration in this early-stopping case).

@momijiame suggested that lightgbm uses an average iteration number computed across all folds in the first modeling pass (or somehow avoids two passes by synchronizing between folds, so the average is known after a single pass). That would possibly be inefficient (if two passes are needed), but more importantly it could lead to lower predictive accuracy, with some folds underfitted and some overfitted, i.e. given too few or too many boosting rounds depending on their relation to the average. As far as I can remember, we have always used separate numbers of boosting rounds for each CV fold in our internal library, and this may explain its accuracy differences compared to lightgbm.cv().

@guolinke - could you find time for a few words of clarification here? Do all folds in cv() eventually share the same number of boosting rounds (i.e. for prediction purposes) when using early stopping, and is this seen as so beneficial that it is the only available implementation?

@guolinke (Collaborator)

@mirekphd
I think in both XGBoost and LightGBM, cv uses the average score from all folds for early stopping; therefore, best_iteration is the same in all folds.
I think this is more stable, since the average score is computed over all data samples, not just over the current fold.
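
To illustrate the point with toy numbers (this is not LightGBM's actual implementation): early stopping in cv() watches the metric averaged over folds, so a single best iteration is selected and shared by every fold.

import numpy as np

# One row per fold, one column per boosting round (made-up logloss values).
fold_logloss = np.array([
    [0.60, 0.55, 0.52, 0.53, 0.54],
    [0.61, 0.56, 0.50, 0.51, 0.52],
    [0.59, 0.54, 0.51, 0.50, 0.53],
])
mean_logloss = fold_logloss.mean(axis=0)        # what cv() tracks for early stopping
best_iteration = int(np.argmin(mean_logloss)) + 1
print(best_iteration)                           # 3: the same value for all folds

Per-fold predictions can then use that shared value, e.g. bst.predict(X_new, num_iteration=cvbooster.best_iteration).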
