[python] add return_cvbooster flag to cv func and publish _CVBooster (#283,#2105,#1445) #3204

Merged (9 commits) on Aug 2, 2020

Conversation

momijiame (Contributor)

This PR allows users to get the trained boosters from the cv function directly.
This feature is useful for ensemble techniques and OOF (out-of-fold) predictions.

It adds the changes proposed in the #283 discussion:

  • Add a return_cvbooster flag to the cv function
  • Rename _CVBooster to make it public
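
For reference, a minimal sketch of the new flow (the toy data, parameter values, and the X_new matrix are only illustrative assumptions; the return_cvbooster flag and the 'cvbooster' result key are the ones added by this PR):

import numpy as np
import lightgbm as lgb

# Toy data for illustration only; use your own features/labels in practice.
X = np.random.rand(500, 10)
y = np.random.randint(0, 2, size=500)
train_set = lgb.Dataset(X, label=y)
params = {'objective': 'binary', 'metric': 'binary_logloss', 'verbosity': -1}

cv_results = lgb.cv(params, train_set, num_boost_round=100, nfold=5,
                    return_cvbooster=True)

cvbooster = cv_results['cvbooster']   # the now-public CVBooster
boosters = cvbooster.boosters         # one trained Booster per fold

# Ensemble/OOF-style usage: e.g. average the per-fold predictions on new data.
X_new = np.random.rand(5, 10)
preds = np.mean([bst.predict(X_new) for bst in boosters], axis=0)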

@ghost commented Jul 3, 2020

CLA assistant check
All CLA requirements met.

@momijiame (Contributor, Author)

Hmm... at a glance, I don't think this change is what broke CI.

/__w/1/s/.ci/test.sh: line 103: pytest: command not found
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=6577&view=logs&j=02a2c3ba-81f8-54e3-0767-5d5adbb0daa9&t=720ee3fa-96d4-5b47-dbf4-01607b74ade2&l=4772

@StrikerRUS (Collaborator)

Hmm... at a glance, I don't think this change is what broke CI.

Random network error. Fixed. Sorry for the inconvenience.

@StrikerRUS (Collaborator)

@momijiame Please add the new public class to the documentation: https://github.com/microsoft/LightGBM/blob/master/docs/Python-API.rst.

@matsuken92 Can you please help to review this PR?

@momijiame (Contributor, Author)

@StrikerRUS Thank you for the quick response! I added CVBooster to the documentation.

@StrikerRUS (Collaborator) left a comment

@momijiame Thanks for your quick fixes! Please take a look at some of my comments below:

(Review comments were left on python-package/lightgbm/__init__.py, python-package/lightgbm/engine.py, and tests/python_package_test/test_engine.py; all are resolved or outdated.)

- Add some clarifications to the documentation
- Rename CVBooster.append to make private
- Decrease iteration rounds of testing to save CI time
- Use CVBooster as root member of lgb
@momijiame (Contributor, Author)

@StrikerRUS Thank you so much for the feedback! I have addressed your comments.

@momijiame momijiame requested a review from StrikerRUS July 7, 2020 00:13
@jameslamb jameslamb removed their request for review July 7, 2020 18:26
@StrikerRUS (Collaborator) left a comment

Please add some more checks to the tests, and I believe we should document the best_iteration and boosters attributes to let users know about them. (You can simply Commit suggestion if you agree with it.)

(Review comments were left on python-package/lightgbm/engine.py and tests/python_package_test/test_engine.py; resolved or outdated.)

@momijiame (Contributor, Author)

Thank you for the great suggestions. I committed them.

@StrikerRUS (Collaborator) left a comment

@momijiame Thank you very much for this enhancement! LGTM!

I'd like to get one more review before merging (preferably from @matsuken92 or @henry0312).

@guolinke (Collaborator) left a comment

LGTM

@StrikerRUS (Collaborator)

Gently pinging @matsuken92, as you wanted to review the general idea of this PR.

@momijiame momijiame changed the title [python] add return_cvbooster flag to cv func and publish _CVBooster (#283,#2105) [python] add return_cvbooster flag to cv func and publish _CVBooster (#283,#2105,#1445) Jul 21, 2020
@momijiame (Contributor, Author)

I noticed that this PR also solves #1445, so I updated the title.
Users can extract feature importances from the fitted Boosters.

NOTE: It would be even more useful if plot_importance() supported CVBooster and/or List[Booster].
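
A hedged sketch of what this enables, reusing cv_results from the earlier example (averaging the per-fold importances is just one possible aggregation):

import numpy as np
import lightgbm as lgb

boosters = cv_results['cvbooster'].boosters

# Gain-based importance for every fold, then averaged across folds.
importances = np.array([bst.feature_importance(importance_type='gain')
                        for bst in boosters])
mean_importance = importances.mean(axis=0)

# plot_importance() currently takes a single Booster, so plot fold by fold
# (requires matplotlib).
for i, bst in enumerate(boosters):
    lgb.plot_importance(bst, title='Fold %d' % i)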

@mirekphd commented Sep 12, 2020

@momijiame: thank you for this much-needed feature!
I seem to have found a rather important omission: did we ensure that best_iteration is returned for each fold from CVBooster when using early stopping? Or did we perhaps assume somewhere that all folds stop at the same iteration?

After running lgbm_cv_dict = lgbm.cv(params=params_dict, ..., return_cvbooster=True), with params_dict containing a custom early_stopping_rounds value,
lgbm_cv_dict['cvbooster'].boosters[k].best_iteration seems to be missing (it is equal to -1 for all folds).

For example, the numbers of iterations (current / best) for the re-fitted CV models:

Fold 0: 154 / -1
Fold 1: 150 / -1
Fold 2: 155 / -1
Fold 3: 151 / -1
Fold 4: 155 / -1

CVBooster best iteration: 123

On the other hand, lgbm_cv_dict['cvbooster'].best_iteration is returned correctly (e.g. 123); it would be -1 if early_stopping_rounds were not specified, i.e. if early stopping were not used. The current/last/max iteration numbers also seem unaffected: lgbm_cv_dict['cvbooster'].boosters[k].current_iteration() and lgbm_cv_dict['cvbooster'].boosters[k].num_trees() are never -1 and are always equal to params_dict['num_boost_round'], i.e. the maximum number of iterations allowed when using early stopping.

Without the best iteration it is not possible to obtain per-fold predictions or feature importances (which rely on the iteration number). In fact, I suspect they may be incorrect now, because they are the same regardless of the value the iteration parameter is set to:

lgbm_cv_dict['cvbooster'].boosters[k].feature_importance()
# or
lgbm_cv_dict['cvbooster'].boosters[k].feature_importance(iteration=fold_best_iter_num)
# or
lgbm_cv_dict['cvbooster'].boosters[k].feature_importance(iteration=fold_cur_iter_num)

# where:
fold_best_iter_num = lgbm_cv_dict['cvbooster'].boosters[k].best_iteration
fold_cur_iter_num = lgbm_cv_dict['cvbooster'].boosters[k].current_iteration()

For individual folds it is also not possible to perform the usual argmax/argmin trick, because lgbm.cv still returns the evaluation history for the entire CV model, with metrics already averaged over all folds:

lgbm_best_mean_iter = np.argmax(lgbm_cv_dict[METRIC + '-mean']) + 1
print("\nBest mean iteration based on the evaluation history for re-fitted CV models: %d" % lgbm_best_mean_iter)

assert lgbm_best_mean_iter == lgbm_cv_dict['cvbooster'].best_iteration

@momijiame (Contributor, Author)

@mirekphd Thank you for your feedback.
I am not sure whether it is documented, but this is expected behavior. The cv() function has always updated the Boosters while keeping all folds synchronized internally: early stopping is triggered by the average metric of all folds. Therefore, best_iteration is equal in all folds. IMHO, this behavior makes cross-validation conservative.

We could consider setting the Boosters' best_iteration to that shared value, but that would break backwards compatibility; even before the return_cvbooster option was added, the technique of retrieving the CVBooster (_CVBooster) by using a callback already existed.

Also, if we want to train a Booster on each fold independently, we can simply use the train() function. The documentation could be improved on the points above, but I do not think the behavior should be changed.
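
For context, a rough sketch of the pre-existing callback technique mentioned above (it reuses params and train_set from the first example and relies on the fact that, inside cv(), the callback environment's model attribute is the internal CVBooster):

import lightgbm as lgb

class SaveCVBoosterCallback:
    """Keep a reference to the CVBooster that cv() builds internally."""
    def __init__(self):
        self.cvbooster = None

    def __call__(self, env):
        # env is the callback environment LightGBM passes at each iteration;
        # during cv() its `model` attribute is the CVBooster.
        self.cvbooster = env.model

save_cb = SaveCVBoosterCallback()
lgb.cv(params, train_set, num_boost_round=100, nfold=5, callbacks=[save_cb])
fold_boosters = save_cb.cvbooster.boosters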

@mirekphd commented Sep 14, 2020

Therefore, best_iteration is equal in all folds.

Not in this case :). Please note that we are talking only about early stopping here, where best_iteration is unknown a priori until each fold's model completes (only the maximum number of rounds can be specified by the user - we know this one, and the CVBooster params return it correctly). Only then - after each CV model completes - can we compute the average best iteration across all folds and return it as the CVBooster's "best iteration".

To see that each fold uses a different number of iterations, note that there is only one training pass per CV fold (as opposed to two passes), and look at my example above, where training stopped after between 150 and 155 rounds (as reported by .current_iteration() for each fold).

The information on the best_iteration for each fold is already available somewhere; we just need to expose it to the user, replacing those -1 values, just as we already return .current_iteration() (which is greater than or equal to best_iteration in this early-stopping case).

@momijiame suggested that lightgbm uses an average iteration number computed across all folds in the first modeling pass (or somehow avoids two passes by synchronizing between folds, so the average is known after a single pass). That would possibly be inefficient (if two passes are needed), but more importantly it could lead to lower predictive accuracy, with some folds underfitted and some overfitted, i.e. given too few or too many boosting rounds depending on their relation to the average. As far as I can remember, we have always used separate numbers of boosting rounds for each CV fold in our internal library, and this may explain its accuracy differences compared to lightgbm.cv().

@guolinke - could you find time for a few words of clarification here? Do all folds in cv() eventually share the same number of boosting rounds (i.e. for prediction purposes) when using early stopping, and is this seen as so beneficial that it is the only available implementation?

@guolinke (Collaborator)

@mirekphd
I think in both XGBoost and LightGBM, cv uses the average score from all folds for early stopping; therefore, best_iteration is the same in all folds.
I think this is more stable, since the average score is computed over all data samples, not just over the current fold.
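
To illustrate the point with toy numbers (this is not LightGBM's actual implementation): early stopping in cv() watches the metric averaged over folds, so a single best iteration is selected and shared by every fold.

import numpy as np

# One row per fold, one column per boosting round (made-up logloss values).
fold_logloss = np.array([
    [0.60, 0.55, 0.52, 0.53, 0.54],
    [0.61, 0.56, 0.50, 0.51, 0.52],
    [0.59, 0.54, 0.51, 0.50, 0.53],
])
mean_logloss = fold_logloss.mean(axis=0)        # what cv() tracks for early stopping
best_iteration = int(np.argmin(mean_logloss)) + 1
print(best_iteration)                           # 3: the same value for all folds

Per-fold predictions can then use that shared value, e.g. bst.predict(X_new, num_iteration=cvbooster.best_iteration).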
