Commit feabaf6

trivialfis and Copilot authored
[doc] Cleanup C doc notations and enhance categorical docs. (#11774)
--------- Co-authored-by: Copilot <[email protected]>
1 parent 279c4cb commit feabaf6

2 files changed: +449 -425 lines changed

doc/tutorials/categorical.rst

Lines changed: 18 additions & 6 deletions
@@ -2,6 +2,12 @@
 Categorical Data
 ################
 
+**Contents**
+
+.. contents::
+  :backlinks: none
+  :local:
+
 Since version 1.5, XGBoost has support for categorical data. For numerical data, the
 split condition is defined as :math:`value < threshold`, while for categorical data the
 split is defined depending on whether partitioning or onehot encoding is used. For
@@ -140,7 +146,7 @@ Auto-recoding (Data Consistency)
 
 .. versionchanged:: 3.1
 
-Starting with XGBoost 3.1, the *Python* interface can perform automatic re-coding for
+Starting with XGBoost 3.1, the **Python** interface can perform automatic re-coding for
 new inputs.
 
 XGBoost accepts parameters to indicate which feature is considered categorical, either
@@ -158,11 +164,11 @@ the input and hence cannot store it in the model. The mapping usually happens in
 users' data engineering pipeline. To ensure the correct result from XGBoost, users need to
 keep the pipeline for transforming data consistent across training and testing data.
 
-Starting with 3.1, the *Python* interface can remember the encoding and perform recoding
+Starting with 3.1, the **Python** interface can remember the encoding and perform recoding
 during inference and training continuation when the input is a dataframe (`pandas`,
 `cuDF`, `polars`, `pyarrow`, `modin`). The feature support focuses on basic usage. It has
-some restrictions on the types of inputs that can be accepted. First, category names
-must have one of the following types:
+some restrictions on the types of inputs that can be accepted. First, category names must
+have one of the following types:
 
 - string
 - integer, from 8-bit to 64-bit, both signed and unsigned are supported.
@@ -224,9 +230,15 @@ of using the native interface:
 No extra step is required for using the scikit-learn interface as long as the inputs are
 dataframes. During training continuation, XGBoost will either extract the categories from
 the previous model or use the categories from the new training dataset if the input model
-doesn't have the information.
+doesn't have the information. As a side note, users can inspect the content of the
+categories by exporting it to arrow arrays. This interface is still experimental:
+
+.. code-block:: python
+
+  categories = booster.get_categories(export_to_arrow=True)
+  print(categories.to_arrow())
 
-For R, the auto-recoding is not yet supported as of 3.1. To provide an example:
+For **R**, the auto-recoding is not yet supported as of 3.1. To provide an example:
 
 .. code-block:: R
 
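Reader's note on the hunks above: the documentation says the Python interface can "remember the encoding and perform recoding" for new inputs. The sketch below illustrates the general idea of re-coding in plain Python. It is a conceptual illustration only, not XGBoost's implementation. The function name `recode`, its arguments, and the choice to raise an error on unseen categories are all assumptions made for this example.

```python
def recode(train_categories, new_categories, new_codes):
    """Translate integer codes from a new input's category order
    into the category order remembered from training.

    Hypothetical helper for illustration; not part of XGBoost.
    """
    # Position of each category name in the training-time encoding.
    train_index = {name: code for code, name in enumerate(train_categories)}

    # For every code used by the new input, look up the training code.
    mapping = []
    for name in new_categories:
        if name not in train_index:
            # Assumption of this sketch: unseen categories are rejected.
            raise ValueError(f"unseen category: {name!r}")
        mapping.append(train_index[name])

    return [mapping[code] for code in new_codes]


# Training data encoded "a" -> 0, "b" -> 1, "c" -> 2.
train_categories = ["a", "b", "c"]
# The new input happens to encode "c" -> 0, "a" -> 1.
new_categories = ["c", "a"]
new_codes = [0, 1, 0]  # i.e. "c", "a", "c"

recoded = recode(train_categories, new_categories, new_codes)
print(recoded)  # [2, 0, 2]
```

Without such a step, the code `0` would mean "a" during training and "c" during inference, which is why the surrounding text asks users of interfaces without auto-recoding to keep their encoding pipeline consistent.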
