[enc][dask] Support training continuation. #11609
ed18017 to f011826
Evaluation. Fixes. Documents.
Co-authored-by: Copilot <[email protected]>
4de4b85 to b7910a6
Pull Request Overview
This PR adds support for training continuation in the Dask interface with categorical encoding. The main goal is to enable XGBoost to preserve categorical encodings when continuing training from a previous model, ensuring consistent categorical handling across training sessions.
- Adds a `run_recode` function and related testing infrastructure for categorical re-coding functionality
- Refactors the dispatch system to use `DispatchAny` consistently across CPU and CUDA implementations
- Updates DMatrix creation functions to accept model parameters for extracting reference categories
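
As a rough illustration of what re-coding means here, the sketch below is a toy in plain Python, not XGBoost's implementation. The helper names `build_recode_map` and `recode`, the sample categories, and the choice to raise on unseen categories are all assumptions made for the example. It shows how integer codes from a new batch could be translated into the encoding remembered from the first training session.

```python
# Illustrative only: a toy re-coder, not XGBoost's internal implementation.
# All names and data below are hypothetical.


def build_recode_map(model_categories, new_categories):
    """Map each code in the new data's encoding to the model's code."""
    model_index = {cat: code for code, cat in enumerate(model_categories)}
    mapping = []
    for cat in new_categories:
        if cat not in model_index:
            # Assumption for this sketch: unseen categories are an error.
            raise ValueError(f"Unseen category: {cat!r}")
        mapping.append(model_index[cat])
    return mapping


def recode(codes, mapping):
    """Translate integer codes from the new encoding into the model's encoding."""
    return [mapping[c] for c in codes]


# Categories remembered from the first training session.
model_categories = ["apple", "banana", "cherry"]
# A later batch whose dataframe orders its categories differently.
new_categories = ["cherry", "apple"]
new_codes = [0, 1, 1, 0]  # cherry, apple, apple, cherry in the new encoding

mapping = build_recode_map(model_categories, new_categories)
recoded = recode(new_codes, mapping)
```

Without such a translation step, code `0` would mean "cherry" in the new batch but "apple" to the existing model, so continued training would silently mix up categories.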
Reviewed Changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| python-package/xgboost/testing/dask.py | Adds run_recode function to test categorical re-coding with training continuation |
| python-package/xgboost/dask/data.py | Updates DMatrix creation functions to handle model categories and reference categories |
| src/data/proxy_dmatrix.h | Refactors dispatch system and adds support for reference categories |
| src/data/proxy_dmatrix.cuh | Updates CUDA dispatch functions to use consistent DispatchAny interface |
| tests/test_distributed/test_with_dask/test_with_dask.py | Adds test for re-coding functionality |
Comments suppressed due to low confidence (1)
src/data/proxy_dmatrix.h:88
- The parameter type changed from pass-by-value to pointer without updating the function documentation. This is a breaking change that should be clearly documented or the parameter should maintain consistent passing semantics.
void SetColumnar(StringView data);
cc @rongou.
doc/tutorials/categorical.rst
Outdated
Internally, XGBoost attempts to extract the categories from the dataframe inputs. For
inference (predict), the re-coding happens on the fly and there's no data copy (baring
from some internal transformations performed by the dataframe itself). For training
No need for "from" here.
doc/tutorials/categorical.rst
Outdated
Starting with 3.1, the *Python* interface can remember the encoding and perform recoding
during inference and training continuation when the input is a dataframe (`pandas`,
`cuDF`, `polars`, `pyarrow`, `modin`). The feature support focuses on basic usage. It has
some restrictions on the types of inputs that can be accepted. Firstly, category names
First
doc/tutorials/categorical.rst
Outdated
- integer, from 8-bit to 64-bit, both signed and unsigned are supported.
- 32-bit or 64-bit floating point

Other category types are not supported. Secondly, the input types must be strictly
Second
doc/tutorials/categorical.rst
Outdated
from some internal transformations performed by the dataframe itself). For training
continuation however, re-coding requires some extra steps if you are using the native
interface. The sklearn interface and the Dask interface can handle training continuation
automatically. Lastly, please note that using the re-coder with the native interface is
Last
Thank you for going through the doc. I fixed the grammar based on your comments.
Ref #11088