Commit a8924d4

[enc][doc] Clarify R workaround. (#11650)
1 parent 35f1e85 commit a8924d4

1 file changed: +38 -0 lines changed

doc/tutorials/categorical.rst

Lines changed: 38 additions & 0 deletions
@@ -231,6 +231,44 @@ dataframes. During training continuation, XGBoost will either extract the catego
the previous model or use the categories from the new training dataset if the input model
doesn't have the information.

For R, auto-recoding is not yet supported as of 3.1. Consider the following example:

.. code-block:: R

    > f0 = factor(c("a", "b", "c"))
    > as.numeric(f0)
    [1] 1 2 3
    > f0
    [1] a b c
    Levels: a b c

In the above snippet, we have the mapping ``a -> 1, b -> 2, c -> 3``. Assume that the above
is the training data and that the next snippet is the test data:

.. code-block:: R

    > f1 = factor(c("a", "c"))
    > as.numeric(f1)
    [1] 1 2
    > f1
    [1] a c
    Levels: a c


Now we have ``a -> 1, c -> 2`` because ``b`` is absent from the test data, so the R factor
assigns different integer codes and the test-time encoding no longer matches the training
encoding. XGBoost cannot remember the original encoding for the R package. You will have
to encode the data explicitly during inference:

.. code-block:: R

    > f1 = factor(c("a", "c"), levels = c("a", "b", "c"))
    > f1
    [1] a c
    Levels: a b c
    > as.numeric(f1)
    [1] 1 3

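The same idea extends to whole data frames. The following is a minimal sketch, not part of
the XGBoost API: it records the levels of every factor column in the training data and
reapplies them at inference. The data frames ``train`` and ``test`` and the helper
``apply_levels`` are hypothetical names used only for illustration.

.. code-block:: R

    # Hypothetical training and test data; `f` is the categorical feature.
    train <- data.frame(f = factor(c("a", "b", "c")), x = c(1, 2, 3))
    test <- data.frame(f = c("a", "c"), x = c(4, 5))

    # Record the levels of every factor column seen during training.
    train_levels <- lapply(Filter(is.factor, train), levels)

    # Re-encode the matching columns of `df` with the recorded levels.
    apply_levels <- function(df, lvls) {
      for (name in names(lvls)) {
        df[[name]] <- factor(df[[name]], levels = lvls[[name]])
      }
      df
    }

    test_encoded <- apply_levels(test, train_levels)
    as.numeric(test_encoded$f)  # 1 3, matching the training encoding

Values that are absent from the recorded levels become ``NA`` after this re-encoding, so it
is worth checking for them before prediction.
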

*************
Miscellaneous
*************
