doc/tutorials/categorical.rst (87 additions & 16 deletions)
feature it's specified as ``"c"``. The Dask module in XGBoost has the same interface, and
:class:`dask.Array <dask.Array>` can also be used for categorical data. Lastly, the
sklearn interface :py:class:`~xgboost.XGBRegressor` has the same parameter.
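
As a minimal sketch (the values here are illustrative, not from this tutorial), a column
can be marked categorical through ``feature_types`` when the input is a plain array:

.. code-block:: python

   import numpy as np
   import xgboost as xgb

   # One numerical column ("q") and one column of pre-encoded category codes ("c").
   X = np.array([[1.5, 0.0], [2.0, 1.0], [0.5, 2.0]])
   Xy = xgb.DMatrix(
       X, label=[0.0, 1.0, 1.0], feature_types=["q", "c"], enable_categorical=True
   )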

.. _cat-recode:

********************************
Auto-recoding (Data Consistency)
********************************

.. versionchanged:: 3.1

   Starting with XGBoost 3.1, the *Python* interface can perform automatic re-coding for
   new inputs.

XGBoost accepts parameters to indicate which feature is considered categorical, either
through the ``dtypes`` of a dataframe or through the ``feature_types`` parameter. However,
except for the Python interface, XGBoost doesn't store the information about how
categories are encoded in the first place. For instance, given an encoding schema that
maps music genres to integer codes:
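
(A minimal sketch; the genre names and codes are made up for illustration.)

.. code-block:: python

   import pandas as pd

   # An ad-hoc encoding schema assigning an integer code to each genre.
   codes = {"rock": 0, "jazz": 1, "pop": 2}

   X_train = pd.DataFrame({"genre": ["jazz", "pop", "rock", "jazz"]})
   # Replace the names with their codes before feeding the data to XGBoost.
   X_train["genre"] = X_train["genre"].map(codes)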

Aside from the Python interface (R/Java/C, etc.), XGBoost doesn't know this mapping from
the input and hence cannot store it in the model. The mapping usually happens in the
users' data engineering pipeline. To ensure a correct result from XGBoost, users need to
keep the pipeline for transforming data consistent across training and testing data.
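
A sketch of the kind of error to watch out for (the data is made up for illustration):
fitting a second encoder on the test data silently produces a different schema.

.. code-block:: python

   import pandas as pd
   from sklearn.preprocessing import OrdinalEncoder

   X_train = pd.DataFrame({"genre": ["jazz", "pop", "rock"]})
   X_test = pd.DataFrame({"genre": ["pop", "rock"]})

   enc = OrdinalEncoder().fit(X_train)

   # Wrong: a fresh encoder maps pop->0, rock->1, disagreeing with the
   # training-time schema jazz->0, pop->1, rock->2.
   bad = OrdinalEncoder().fit_transform(X_test)

   # Right: reuse the encoder fitted on the training data.
   good = enc.transform(X_test)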

Starting with 3.1, the *Python* interface can remember the encoding and perform recoding
during inference and training continuation when the input is a dataframe (`pandas`,
`cuDF`, `polars`, `pyarrow`, `modin`). The feature focuses on basic usage and has some
restrictions on the types of inputs that can be accepted. First, category names must have
one of the following types:

- string
- integer, from 8-bit to 64-bit; both signed and unsigned are supported
- 32-bit or 64-bit floating point

Other category types are not supported. Second, the input types must be strictly
consistent. For example, XGBoost will raise an error if the categorical columns in the
training set are unsigned integers whereas the test dataset has signed integer columns. If
you have categories that are not one of the supported types, you need to perform the
re-coding using a pre-processing data transformer like the
:py:class:`sklearn.preprocessing.OrdinalEncoder`. See
:ref:`sphx_glr_python_examples_cat_pipeline.py` for a worked example using an ordinal
encoder. To clarify, the type here refers to the type of the name of categories (called
*categories* in pandas).
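
As a sketch of what auto-recoding enables (the data is made up for illustration), the
booster remembers the training-time categories, so a prediction input whose categories
appear in a different order is recoded automatically:

.. code-block:: python

   import pandas as pd
   import xgboost as xgb

   X = pd.DataFrame({"genre": pd.Categorical(["rock", "jazz", "pop", "jazz"])})
   y = [1.0, 0.0, 1.0, 0.0]
   reg = xgb.XGBRegressor(enable_categorical=True, n_estimators=4).fit(X, y)

   # The categories here are listed in a different order than during training;
   # with 3.1+ the Python interface recodes them to the training-time encoding.
   X_test = pd.DataFrame({"genre": pd.Categorical(["pop", "rock"])})
   reg.predict(X_test)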

No extra step is required for using the scikit-learn interface as long as the inputs are
dataframes. During training continuation, XGBoost will either extract the categories from
the previous model or use the categories from the new training dataset if the input model
doesn't have the information.
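
A sketch of training continuation under the assumptions above (``reg`` is the fitted
model from the previous snippet; the new data is made up):

.. code-block:: python

   import pandas as pd
   import xgboost as xgb

   # Continue training from `reg`; the categories stored in the previous model
   # are reused when recoding the new input.
   X_new = pd.DataFrame({"genre": pd.Categorical(["jazz", "pop"])})
   reg2 = xgb.XGBRegressor(enable_categorical=True, n_estimators=4).fit(
       X_new, [0.0, 1.0], xgb_model=reg
   )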

*************
Miscellaneous
*************

By default, XGBoost assumes input category codes are integers starting from 0 up to the
number of categories :math:`[0, n\_categories)`. However, users might provide inputs with
invalid values due to mistakes or missing values in the training dataset. These can be
negative values, integer values that cannot be accurately represented by 32-bit floating
point, or values larger than the actual number of unique categories. During training this
is validated, but for prediction it's treated the same as a not-chosen category for