Member

@trivialfis trivialfis commented Aug 4, 2025

Ref #11088

  • Update document.
  • Unify the code dispatch between QDM and DM.
  • Support re-coding with Dask.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds support for training continuation in the Dask interface with categorical encoding. The main goal is to enable XGBoost to preserve categorical encodings when continuing training from a previous model, ensuring consistent categorical handling across training sessions.

  • Adds a run_recode function and related testing infrastructure for categorical re-coding functionality
  • Refactors the dispatch system to use DispatchAny consistently across CPU and CUDA implementations
  • Updates DMatrix creation functions to accept model parameters for extracting reference categories

Reviewed Changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| python-package/xgboost/testing/dask.py | Adds run_recode function to test categorical re-coding with training continuation |
| python-package/xgboost/dask/data.py | Updates DMatrix creation functions to handle model categories and reference categories |
| src/data/proxy_dmatrix.h | Refactors dispatch system and adds support for reference categories |
| src/data/proxy_dmatrix.cuh | Updates CUDA dispatch functions to use consistent DispatchAny interface |
| tests/test_distributed/test_with_dask/test_with_dask.py | Adds test for re-coding functionality |

Comments suppressed due to low confidence (1)

src/data/proxy_dmatrix.h:88

  • The parameter type changed from pass-by-value to pointer without updating the function documentation. This is a breaking change that should be clearly documented or the parameter should maintain consistent passing semantics.
  void SetColumnar(StringView data);

@trivialfis trivialfis changed the title from "[wip][enc][dask] Support training continuation." to "[enc][dask] Support training continuation." on Aug 6, 2025
Member Author

cc @rongou.

@trivialfis trivialfis marked this pull request as ready for review August 6, 2025 20:51
Internally, XGBoost attempts to extract the categories from the dataframe inputs. For
inference (predict), the re-coding happens on the fly and there's no data copy (baring
from some internal transformations performed by the dataframe itself). For training
Contributor

No need for from here.

Starting with 3.1, the *Python* interface can remember the encoding and perform recoding
during inference and training continuation when the input is a dataframe (`pandas`,
`cuDF`, `polars`, `pyarrow`, `modin`). The feature support focuses on basic usage. It has
some restrictions on the types of inputs that can be accepted. Firstly, category names
Contributor

First

- integer, from 8-bit to 64-bit, both signed and unsigned are supported.
- 32-bit or 64-bit floating point

Other category types are not supported. Secondly, the input types must be strictly
Contributor

Second

from some internal transformations performed by the dataframe itself). For training
continuation however, re-coding requires some extra steps if you are using the native
interface. The sklearn interface and the Dask interface can handle training continuation
automatically. Lastly, please note that using the re-coder with the native interface is
Contributor

Last

Member Author

Thank you for going through the doc. I fixed the grammar based on your comments.
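
For readers unfamiliar with the term, the re-coding discussed in the quoted documentation can be pictured with a small, library-free sketch. It is not XGBoost's implementation: the helper names `build_recode_map` and `recode`, the use of `-1` for missing values, and the choice to raise on unseen categories are all assumptions made for illustration. The idea is only that the category names remembered by a model define the reference codes, and the codes in new data are translated to them by name.

```python
from typing import Hashable, List, Sequence


def build_recode_map(
    model_categories: Sequence[Hashable],
    data_categories: Sequence[Hashable],
) -> List[int]:
    """Map each code used by the new data to the code stored with the model.

    Hypothetical helper: position i of the result is the reference code for
    the category that the new data encodes as i.
    """
    reference = {name: code for code, name in enumerate(model_categories)}
    mapping = []
    for name in data_categories:
        if name not in reference:
            # Assumption of this sketch: unseen categories are an error.
            raise ValueError(f"Unseen category: {name!r}")
        mapping.append(reference[name])
    return mapping


def recode(codes: Sequence[int], mapping: Sequence[int], missing: int = -1) -> List[int]:
    """Translate new-data codes into reference codes, leaving missing values alone."""
    return [missing if c == missing else mapping[c] for c in codes]


# Categories remembered from the first training session (illustrative data).
model_categories = ["apple", "banana", "cherry"]
# The new dataframe encodes the same names in a different order.
data_categories = ["cherry", "apple"]
codes = [0, 1, 1, -1, 0]

mapping = build_recode_map(model_categories, data_categories)
recoded = recode(codes, mapping)
```

Here "cherry" is code 0 in the new data but code 2 in the reference encoding, so every 0 becomes 2, while the missing marker passes through untouched.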

@trivialfis trivialfis merged commit f25f74d into dmlc:master Aug 7, 2025
60 of 63 checks passed
@trivialfis trivialfis deleted the enc-dask-doc branch August 7, 2025 10:12