[enc][dask] Support training continuation. #11609

trivialfis · 2025-08-04T18:15:33Z

- Update document.
- Unify the code dispatch between QDM and DM.
- Support re-coding with Dask.

Evaluation. Fixes. Documents.

Co-authored-by: Copilot <[email protected]>

Copilot

Pull Request Overview

This PR adds support for training continuation in the Dask interface with categorical encoding. The main goal is to enable XGBoost to preserve categorical encodings when continuing training from a previous model, ensuring consistent categorical handling across training sessions.

- Adds a `run_recode` function and related testing infrastructure for categorical re-coding functionality.
- Refactors the dispatch system to use `DispatchAny` consistently across CPU and CUDA implementations.
- Updates DMatrix creation functions to accept model parameters for extracting reference categories.
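To make the idea of re-coding against reference categories concrete, here is a minimal pure-Python sketch. It is not XGBoost's implementation: the `recode` helper, its arguments, the use of `-1` for missing values, and the choice to raise on unseen categories are all assumptions made for illustration. It only shows the general technique of remapping integer codes from a new batch onto the category order remembered from an earlier training session.

```python
from typing import Hashable, List, Sequence


def recode(
    codes: Sequence[int],
    new_categories: Sequence[Hashable],
    ref_categories: Sequence[Hashable],
) -> List[int]:
    """Map codes defined against `new_categories` onto `ref_categories`.

    Hypothetical helper for illustration only; negative codes are treated
    as missing values and passed through as -1 (an assumption of this sketch).
    """
    # Position of every category in the reference (previous model) encoding.
    ref_index = {cat: i for i, cat in enumerate(ref_categories)}

    # mapping[i] is the reference code for the i-th category of the new batch.
    mapping = []
    for cat in new_categories:
        if cat not in ref_index:
            # Assumption: categories unknown to the reference are rejected.
            raise ValueError(f"Unseen category: {cat!r}")
        mapping.append(ref_index[cat])

    return [mapping[c] if c >= 0 else -1 for c in codes]


# Encoding remembered from the first training session.
ref_categories = ["a", "b", "c"]
# The continuation batch lists its categories in a different order.
new_categories = ["c", "a"]
new_codes = [0, 1, 1, -1]  # "c", "a", "a", missing

print(recode(new_codes, new_categories, ref_categories))  # [2, 0, 0, -1]
```

Without such a remapping, code `0` would mean `"c"` in the new batch but `"a"` in the existing model, so the trees built during continuation would split on inconsistent category identities.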

Reviewed Changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| python-package/xgboost/testing/dask.py | Adds `run_recode` function to test categorical re-coding with training continuation |
| python-package/xgboost/dask/data.py | Updates DMatrix creation functions to handle model categories and reference categories |
| src/data/proxy_dmatrix.h | Refactors dispatch system and adds support for reference categories |
| src/data/proxy_dmatrix.cuh | Updates CUDA dispatch functions to use consistent `DispatchAny` interface |
| tests/test_distributed/test_with_dask/test_with_dask.py | Adds test for re-coding functionality |

Comments suppressed due to low confidence (1)

src/data/proxy_dmatrix.h:88

The parameter type changed from pass-by-value to pointer without updating the function documentation. This is a breaking change that should be clearly documented or the parameter should maintain consistent passing semantics.

  void SetColumnar(StringView data);

src/data/proxy_dmatrix.h

src/data/simple_dmatrix.cu

src/data/simple_dmatrix.cc

python-package/xgboost/dask/data.py

src/data/quantile_dmatrix.cc

trivialfis · 2025-08-06T20:51:07Z

cc @rongou .

rongou · 2025-08-06T21:12:28Z

doc/tutorials/categorical.rst

+
+Internally, XGBoost attempts to extract the categories from the dataframe inputs. For
+inference (predict), the re-coding happens on the fly and there's no data copy (baring
+from some internal transformations performed by the dataframe itself). For training


No need for "from" here.

rongou · 2025-08-06T21:13:54Z

doc/tutorials/categorical.rst

+Starting with 3.1, the *Python* interface can remember the encoding and perform recoding
+during inference and training continuation when the input is a dataframe (`pandas`,
+`cuDF`, `polars`, `pyarrow`, `modin`). The feature support focuses on basic usage. It has
+some restrictions on the types of inputs that can be accepted. Firstly, category names


rongou · 2025-08-06T21:14:02Z

doc/tutorials/categorical.rst

+- integer, from 8-bit to 64-bit, both signed and unsigned are supported.
+- 32-bit or 64-bit floating point
+
+Other category types are not supported. Secondly, the input types must be strictly


rongou · 2025-08-06T21:14:54Z

doc/tutorials/categorical.rst

+from some internal transformations performed by the dataframe itself). For training
+continuation however, re-coding requires some extra steps if you are using the native
+interface. The sklearn interface and the Dask interface can handle training continuation
+automatically. Lastly, please note that using the re-coder with the native interface is


Thank you for going through the doc. I fixed the grammar based on your comments.

trivialfis force-pushed the enc-dask-doc branch from ed18017 to f011826 Compare August 5, 2025 08:05

trivialfis requested a review from Copilot August 5, 2025 08:59


trivialfis and others added 14 commits August 6, 2025 15:18

[enc][dask] Support training continuation.

b95a65a

Evaluation. Fixes. Documents.

cleanup.

8263ce6

balance.

37f002a

sync.

25b4f70

Update python-package/xgboost/sklearn.py

2b78196

Co-authored-by: Copilot <[email protected]>

Notes.

54b787b

Fixes.

fd3c96f

partition

8bde103

Rename.

124dd49

Cleanup.

50e519b

shared ptr.

48dbcc3

Move.

823a215

Revert.

8e1b941

Cleanup.

b7910a6

trivialfis force-pushed the enc-dask-doc branch from 4de4b85 to b7910a6 Compare August 6, 2025 09:15

trivialfis requested a review from Copilot August 6, 2025 09:16

Copilot AI reviewed Aug 6, 2025

src/data/proxy_dmatrix.h

src/data/simple_dmatrix.cu

src/data/simple_dmatrix.cc

python-package/xgboost/dask/data.py

src/data/quantile_dmatrix.cc

trivialfis added 2 commits August 6, 2025 17:20

sklearn tag.

9b126b1

Revert.

8268def

trivialfis mentioned this pull request Aug 6, 2025

Auto encoding for categorical data during inference. #11088

Open

20 tasks

trivialfis added 4 commits August 6, 2025 18:21

swap.

8c72ba4

pointer wrapper.

b3b9c85

Revert.

20d0ddc

Test.

d3ec666

trivialfis changed the title ~~[wip][enc][dask] Support training continuation.~~ [enc][dask] Support training continuation. Aug 6, 2025

trivialfis added 2 commits August 7, 2025 04:30

wording.

a573d47

wording.

2e5b3e2

trivialfis marked this pull request as ready for review August 6, 2025 20:51

rongou approved these changes Aug 6, 2025


trivialfis added 2 commits August 7, 2025 06:00

Reviewer's comments.

5c0847f

Update the external memory tests.

1f3ff82

trivialfis merged commit f25f74d into dmlc:master Aug 7, 2025
60 of 63 checks passed

trivialfis deleted the enc-dask-doc branch August 7, 2025 10:12
