@@ -2,6 +2,12 @@
 Categorical Data
 ################
 
+**Contents**
+
+.. contents::
+  :backlinks: none
+  :local:
+
 Since version 1.5, XGBoost has support for categorical data. For numerical data, the
 split condition is defined as :math:`value < threshold`, while for categorical data the
 split is defined depending on whether partitioning or onehot encoding is used. For
@@ -140,7 +146,7 @@ Auto-recoding (Data Consistency)
 
 .. versionchanged:: 3.1
 
-  Starting with XGBoost 3.1, the *Python* interface can perform automatic re-coding for
+  Starting with XGBoost 3.1, the **Python** interface can perform automatic re-coding for
   new inputs.
 
 XGBoost accepts parameters to indicate which feature is considered categorical, either
@@ -158,11 +164,11 @@ the input and hence cannot store it in the model. The mapping usually happens in
 users' data engineering pipeline. To ensure the correct result from XGBoost, users need to
 keep the pipeline for transforming data consistent across training and testing data.
 
-Starting with 3.1, the *Python* interface can remember the encoding and perform recoding
+Starting with 3.1, the **Python** interface can remember the encoding and perform recoding
 during inference and training continuation when the input is a dataframe (`pandas`,
 `cuDF`, `polars`, `pyarrow`, `modin`). The feature support focuses on basic usage. It has
-some restrictions on the types of inputs that can be accepted. First, category names
-must have one of the following types:
+some restrictions on the types of inputs that can be accepted. First, category names must
+have one of the following types:
 
 - string
 - integer, from 8-bit to 64-bit, both signed and unsigned are supported.
@@ -224,9 +230,15 @@ of using the native interface:
 No extra step is required for using the scikit-learn interface as long as the inputs are
 dataframes. During training continuation, XGBoost will either extract the categories from
 the previous model or use the categories from the new training dataset if the input model
-doesn't have the information.
+doesn't have the information. As a side note, users can inspect the content of the
+categories by exporting it to arrow arrays. This interface is still experimental:
+
+.. code-block:: python
+
+  categories = booster.get_categories(export_to_arrow=True)
+  print(categories.to_arrow())
 
-For R, the auto-recoding is not yet supported as of 3.1. To provide an example:
+For **R**, the auto-recoding is not yet supported as of 3.1. To provide an example:
 
 .. code-block:: R
 
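The "Auto-recoding (Data Consistency)" hunk says the category-to-code mapping must stay consistent between training and testing data. A minimal sketch of why that matters is given below, in plain Python with no XGBoost calls. The helpers `fit_categories` and `recode` are hypothetical names introduced for illustration, and mapping unseen categories to `None` is an assumption of this sketch. None of this is XGBoost's actual implementation.

```python
# Illustrative sketch only: hypothetical helpers, not XGBoost's implementation.

def fit_categories(values):
    """Assign each distinct category an integer code in first-seen order."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def recode(values, mapping):
    """Map values onto existing codes; unseen categories become None
    (an assumption made for this sketch)."""
    return [mapping.get(value) for value in values]


train = ["red", "green", "blue", "green"]
test = ["blue", "red", "purple"]

# Mapping learned once from the training data.
train_mapping = fit_categories(train)

# Inconsistent: encoding the test data independently gives "blue" code 0,
# although the training data used code 0 for "red".
naive_test_codes = recode(test, fit_categories(test))

# Consistent: reuse the training mapping so each category keeps its code.
consistent_test_codes = recode(test, train_mapping)

print(train_mapping)
print(naive_test_codes)
print(consistent_test_codes)
```

Reusing the stored training mapping is the step that, according to the diff, the Python interface performs automatically for dataframe inputs starting with 3.1. For other inputs, the user's own pipeline is still responsible for it.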