
Commit f25f74d

[enc][dask] Support training continuation. (#11609)
1 parent aea640d commit f25f74d

40 files changed: +612 −325 lines

demo/guide-python/cat_pipeline.py

Lines changed: 5 additions & 0 deletions
@@ -6,6 +6,11 @@
 training and inference. There are many ways to attain the same goal, this script can be
 used as a starting point.
 
+.. versionchanged:: 3.1
+
+    Starting with 3.1, users don't need this in most cases. See :ref:`cat-recode` for
+    more info.
+
 See Also
 --------
 - :doc:`Tutorial </tutorials/categorical>`
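For context, the demo this note refers to is built around fitting an ordinal encoder once and reusing it for all later data. A minimal sketch of that idea (hypothetical column and data, not the demo's actual code):

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    # Fit the encoder on the training data only.
    enc = OrdinalEncoder()
    X_train = pd.DataFrame({"genre": ["acoustic", "indie", "blues", "country"]})
    X_train_enc = pd.DataFrame(
        enc.fit_transform(X_train), columns=X_train.columns
    ).astype("category")

    # Reuse the *same* fitted encoder on test data to keep the encoding consistent.
    X_test = pd.DataFrame({"genre": ["blues", "acoustic"]})
    X_test_enc = pd.DataFrame(
        enc.transform(X_test), columns=X_test.columns
    ).astype("category")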

doc/python/python_api.rst

Lines changed: 8 additions & 0 deletions
@@ -206,6 +206,14 @@ Collective
 
 .. autofunction:: xgboost.collective.init
 
+.. autofunction:: xgboost.collective.finalize
+
+.. autofunction:: xgboost.collective.get_rank
+
+.. autofunction:: xgboost.collective.get_world_size
+
+.. autoclass:: xgboost.collective.CommunicatorContext
+
 .. automodule:: xgboost.tracker
 
 .. autoclass:: xgboost.tracker.RabitTracker
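The newly documented helpers compose roughly as follows; a sketch assuming a recent XGBoost and a local single-worker tracker (illustrative only, not part of this commit):

    import xgboost as xgb
    from xgboost.tracker import RabitTracker

    # Start a tracker for a one-worker communication group.
    tracker = RabitTracker(host_ip="127.0.0.1", n_workers=1)
    tracker.start()

    # CommunicatorContext calls collective.init() on entry and
    # collective.finalize() on exit.
    with xgb.collective.CommunicatorContext(**tracker.worker_args()):
        print(xgb.collective.get_rank(), xgb.collective.get_world_size())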

doc/tutorials/categorical.rst

Lines changed: 87 additions & 16 deletions
@@ -137,38 +137,109 @@ feature it's specified as ``"c"``. The Dask module in XGBoost has the same inte
 :class:`dask.Array <dask.Array>` can also be used for categorical data. Lastly, the
 sklearn interface :py:class:`~xgboost.XGBRegressor` has the same parameter.
 
-****************
-Data Consistency
-****************
+.. _cat-recode:
 
-XGBoost accepts parameters to indicate which feature is considered categorical, either through the ``dtypes`` of a dataframe or through the ``feature_types`` parameter. However, XGBoost by itself doesn't store information on how categories are encoded in the first place. For instance, given an encoding schema that maps music genres to integer codes:
+********************************
+Auto-recoding (Data Consistency)
+********************************
+
+.. versionchanged:: 3.1
+
+    Starting with XGBoost 3.1, the *Python* interface can perform automatic re-coding
+    for new inputs.
+
+XGBoost accepts parameters to indicate which feature is considered categorical, either
+through the ``dtypes`` of a dataframe or through the ``feature_types`` parameter.
+However, except for the Python interface, XGBoost doesn't store information about how
+categories are encoded in the first place. For instance, given an encoding schema that
+maps music genres to integer codes:
 
 .. code-block:: python
 
     {"acoustic": 0, "indie": 1, "blues": 2, "country": 3}
 
-XGBoost doesn't know this mapping from the input and hence cannot store it in the model. The mapping usually happens in the users' data engineering pipeline with column transformers like :py:class:`sklearn.preprocessing.OrdinalEncoder`. To make sure correct result from XGBoost, users need to keep the pipeline for transforming data consistent across training and testing data. One should watch out for errors like:
+Aside from the Python interface (R/Java/C, etc.), XGBoost doesn't know this mapping from
+the input and hence cannot store it in the model. The mapping usually happens in the
+users' data engineering pipeline. To ensure correct results from XGBoost, users need to
+keep the pipeline for transforming data consistent across training and testing data.
+
+Starting with 3.1, the *Python* interface can remember the encoding and perform
+re-coding during inference and training continuation when the input is a dataframe
+(`pandas`, `cuDF`, `polars`, `pyarrow`, `modin`). The feature focuses on basic usage and
+has some restrictions on the types of inputs that can be accepted. First, category names
+must have one of the following types:
+
+- string
+- integer, from 8-bit to 64-bit; both signed and unsigned are supported.
+- 32-bit or 64-bit floating point
+
+Other category types are not supported. Second, the input types must be strictly
+consistent. For example, XGBoost will raise an error if the categorical columns in the
+training set are unsigned integers whereas the test dataset has signed integer columns.
+If you have categories that are not one of the supported types, you need to perform the
+re-coding using a pre-processing data transformer like
+:py:class:`sklearn.preprocessing.OrdinalEncoder`. See
+:ref:`sphx_glr_python_examples_cat_pipeline.py` for a worked example using an ordinal
+encoder. To clarify, the type here refers to the type of the names of categories (called
+``Index`` in pandas):
+
+.. code-block:: python
+
+    # string type
+    {"acoustic": 0, "indie": 1, "blues": 2, "country": 3}
+    # integer type
+    {-1: 0, 1: 1, 3: 2, 7: 3}
+    # depending on the dataframe implementation, it can be signed or unsigned.
+    {5: 0, 1: 1, 3: 2, 7: 3}
+    # floating point type, both 32-bit and 64-bit are supported.
+    {-1.0: 0, 1.0: 1, 3.0: 2, 7.0: 3}
+
+Internally, XGBoost attempts to extract the categories from the dataframe inputs. For
+inference (predict), the re-coding happens on the fly and there's no data copy (barring
+some internal transformations performed by the dataframe itself). For training
+continuation however, re-coding requires some extra steps if you are using the native
+interface. The sklearn interface and the Dask interface can handle training continuation
+automatically. Lastly, please note that using the re-coder with the native interface is
+still experimental. It's ready for testing, but we want to observe the feature's usage
+for a period of time and might make some breaking changes if needed. The following is a
+snippet of using the native interface:
 
 .. code-block:: python
 
-    X_train["genre"] = X_train["genre"].astype("category")
-    reg = xgb.XGBRegressor(enable_categorical=True).fit(X_train, y_train)
+    import pandas as pd
+
+    import xgboost
+
+    # `X`, `y`, `X_new`, and `y_new` are placeholders for real dataframe inputs.
+    X = pd.DataFrame()
+    Xy = xgboost.QuantileDMatrix(X, y, enable_categorical=True)
+    booster = xgboost.train({}, Xy)
+
+    # XGBoost can handle re-coding for inference without user intervention.
+    X_new = pd.DataFrame()
+    booster.inplace_predict(X_new)
+
+    # Get the categories saved in the model for training continuation.
+    categories = booster.get_categories()
+    # Use the saved categories as a reference for re-coding. Training continuation
+    # requires a re-coded DMatrix; pass the categories as `feature_types`.
+    Xy_new = xgboost.QuantileDMatrix(
+        X_new, y_new, feature_types=categories, enable_categorical=True, ref=Xy
+    )
+    booster_1 = xgboost.train({}, Xy_new, xgb_model=booster)
 
-    # invalid encoding
-    X_test["genre"] = X_test["genre"].astype("category")
-    reg.predict(X_test)
 
-In the above snippet, training data and test data are encoded separately, resulting in two different encoding schemas and invalid prediction result. See :ref:`sphx_glr_python_examples_cat_pipeline.py` for a worked example using ordinal encoder.
+No extra step is required for using the scikit-learn interface as long as the inputs are
+dataframes. During training continuation, XGBoost will either extract the categories
+from the previous model or use the categories from the new training dataset if the input
+model doesn't have the information.
 
 *************
 Miscellaneous
 *************
 
-By default, XGBoost assumes input categories are integers starting from 0 till the number
-of categories :math:`[0, n\_categories)`. However, user might provide inputs with invalid
-values due to mistakes or missing values in training dataset. It can be negative value,
-integer values that can not be accurately represented by 32-bit floating point, or values
-that are larger than actual number of unique categories. During training this is
+By default, XGBoost assumes input category codes are integers starting from 0 up to the
+number of categories :math:`[0, n\_categories)`. However, users might provide inputs
+with invalid values due to mistakes or missing values in the training dataset. These can
+be negative values, integer values that cannot be accurately represented by 32-bit
+floating point, or values larger than the actual number of unique categories. During
+training this is
 validated, but for prediction it's treated the same as a not-chosen category for
 performance reasons.

python-package/xgboost/_data_utils.py

Lines changed: 5 additions & 1 deletion
@@ -683,7 +683,11 @@ def __del__(self) -> None:
 def get_ref_categories(
     feature_types: Optional[Union[FeatureTypes, Categories]],
 ) -> Tuple[Optional[FeatureTypes], Optional[Categories]]:
-    """Get the optional reference categories from the input."""
+    """Get the optional reference categories from the `feature_types`. This is used by
+    the various `DMatrix` types, where `feature_types` is reused for specifying the
+    reference categories.
+
+    """
     if isinstance(feature_types, Categories):
         ref_categories = feature_types
         feature_types = None
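The dual role of `feature_types` that this docstring describes can be sketched from the caller's side. This is a hedged illustration of the helper's apparent contract (a private module, internal API subject to change; the pass-through return for plain types is inferred from the snippet above, not shown in the diff):

    from xgboost._data_utils import get_ref_categories

    # Plain per-feature type strings pass through; no reference categories.
    ft, ref = get_ref_categories(["q", "c"])
    assert ft == ["q", "c"] and ref is None

    # A `Categories` object (e.g. from `booster.get_categories()`) would instead
    # be split out as the reference:
    #     ft, ref = get_ref_categories(categories)  # -> (None, categories)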

python-package/xgboost/collective.py

Lines changed: 9 additions & 7 deletions
@@ -37,7 +37,8 @@ class Config:
         See `dmlc_timeout` in :py:meth:`init`. This is only used for communicators, not
         the tracker. They are different parameters since the timeout for tracker limits
         only the time for starting and finalizing the communication group, whereas the
-        timeout for communicators limits the time used for collective operations.
+        timeout for communicators limits the time used for collective operations, like
+        :py:meth:`allreduce`.
 
     tracker_host_ip : See :py:class:`~xgboost.tracker.RabitTracker`.
 
@@ -94,7 +95,8 @@ def init(**args: _ArgVals) -> None:
     - federated_client_cert: Client certificate file path. Only needed for the SSL
       mode.
 
-    Use upper case for environment variables, use lower case for runtime configuration.
+    Use upper case for environment variables, use lower case for runtime
+    configuration.
 
     """
     _check_call(_LIB.XGCommunicatorInit(make_jcargs(**args)))
@@ -122,17 +124,17 @@ def get_world_size() -> int:
 
     Returns
     -------
-    n : int
+    n :
         Total number of processes.
     """
     ret = _LIB.XGCommunicatorGetWorldSize()
     return ret
 
 
-def is_distributed() -> int:
+def is_distributed() -> bool:
     """If the collective communicator is distributed."""
     is_dist = _LIB.XGCommunicatorIsDistributed()
-    return is_dist
+    return bool(is_dist)
 
 
 def communicator_print(msg: Any) -> None:
@@ -160,8 +162,8 @@ def get_processor_name() -> str:
 
     Returns
     -------
-    name : str
-        the name of processor(host)
+    name :
+        The name of the processor (host).
     """
     name_str = ctypes.c_char_p()
     _check_call(_LIB.XGCommunicatorGetProcessorName(ctypes.byref(name_str)))
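A small sanity-check sketch of the helpers touched above, run outside of any communication group. The single-process fallback values are an assumption based on the docs, not verified against this commit:

    import xgboost.collective as coll

    print(coll.is_distributed())      # now typed (and coerced) to bool; False here
    print(coll.get_world_size())      # expected 1 in a single-process setting
    print(coll.get_rank())            # expected 0
    print(coll.get_processor_name())  # the host name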

python-package/xgboost/core.py

Lines changed: 2 additions & 2 deletions
@@ -1361,12 +1361,12 @@ def get_categories(self, export_to_arrow: bool = False) -> Categories:
 
         .. warning::
 
-            This function is still working in progress.
+            This function is experimental.
 
         Parameters
         ----------
         export_to_arrow :
-            The returned container will contain a list to ``pyarrow`` arrays for the
+            The returned container will contain a list of ``pyarrow`` arrays for the
             categories. See the :py:meth:`~Categories.to_arrow` for more info.
 
         """

python-package/xgboost/dask/__init__.py

Lines changed: 11 additions & 41 deletions
@@ -89,6 +89,7 @@
 from packaging.version import parse as parse_version
 
 from .. import collective, config
+from .._data_utils import Categories
 from .._typing import FeatureNames, FeatureTypes, IterationRange
 from ..callback import TrainingCallback
 from ..collective import Config as CollConfig
@@ -122,7 +123,7 @@
 )
 from ..tracker import RabitTracker
 from ..training import train as worker_train
-from .data import _create_dmatrix, _create_quantile_dmatrix, no_group_split
+from .data import _get_dmatrices, no_group_split
 from .utils import get_address_from_user, get_n_threads
 
 _DaskCollection: TypeAlias = Union[da.Array, dd.DataFrame, dd.Series]
@@ -331,6 +332,10 @@ def __init__(
 
         self.feature_names = feature_names
         self.feature_types = feature_types
+        if isinstance(feature_types, Categories):
+            raise TypeError(
+                "The Dask interface can handle categories from DataFrame automatically."
+            )
         self.missing = missing if missing is not None else numpy.nan
         self.enable_categorical = enable_categorical
@@ -652,12 +657,6 @@ def _create_fn_args(self, worker_addr: str) -> Dict[str, Any]:
         return args
 
 
-def _dmatrix_from_list_of_parts(is_quantile: bool, **kwargs: Any) -> DMatrix:
-    if is_quantile:
-        return _create_quantile_dmatrix(**kwargs)
-    return _create_dmatrix(**kwargs)
-
-
 async def _get_rabit_args(
     client: "distributed.Client",
     n_workers: int,
@@ -735,37 +734,6 @@ async def _check_workers_are_alive(
         raise RuntimeError(f"Missing required workers: {missing_workers}")
 
 
-def _get_dmatrices(
-    train_ref: dict,
-    train_id: int,
-    *refs: dict,
-    evals_id: Sequence[int],
-    evals_name: Sequence[str],
-    n_threads: int,
-) -> Tuple[DMatrix, List[Tuple[DMatrix, str]]]:
-    # Create training DMatrix
-    Xy = _dmatrix_from_list_of_parts(**train_ref, nthread=n_threads)
-    # Create evaluation DMatrices
-    evals: List[Tuple[DMatrix, str]] = []
-    for i, ref in enumerate(refs):
-        # Same DMatrix as the training
-        if evals_id[i] == train_id:
-            evals.append((Xy, evals_name[i]))
-            continue
-        if ref.get("ref", None) is not None:
-            if ref["ref"] != train_id:
-                raise ValueError(
-                    "The training DMatrix should be used as a reference to evaluation"
-                    " `QuantileDMatrix`."
-                )
-            del ref["ref"]
-            eval_Xy = _dmatrix_from_list_of_parts(**ref, nthread=n_threads, ref=Xy)
-        else:
-            eval_Xy = _dmatrix_from_list_of_parts(**ref, nthread=n_threads)
-        evals.append((eval_Xy, evals_name[i]))
-    return Xy, evals
-
-
 async def _train_async(
     *,
     client: "distributed.Client",
@@ -817,6 +785,8 @@ def do_train(  # pylint: disable=too-many-positional-arguments
             evals_id=evals_id,
             evals_name=evals_name,
             n_threads=n_threads,
+            # We need the model for reference categories.
+            model=xgb_model,
         )
 
         booster = worker_train(
@@ -1934,7 +1904,7 @@ class DaskXGBRanker(XGBRankerMixIn, DaskScikitLearnBase):
     def __init__(
         self,
         *,
-        objective: str = "rank:pairwise",
+        objective: str = "rank:ndcg",
         allow_group_split: bool = False,
         coll_cfg: Optional[CollConfig] = None,
         **kwargs: Any,
@@ -2051,8 +2021,8 @@ def check_ser(
     ) -> TypeGuard[Optional[dd.Series]]:
         if not isinstance(qid, dd.Series) and qid is not None:
             raise TypeError(
-                f"When `allow_group_split` is set to False, {name} is required to be"
-                " a series."
+                f"When `allow_group_split` is set to False, {name} is required to "
+                "be a series."
             )
         return True
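Putting the commit's headline together, here is a hedged end-to-end sketch of Dask training continuation (local cluster, synthetic data; parameter names follow the public `xgboost.dask` API):

    import numpy as np
    import pandas as pd
    import xgboost as xgb
    from dask import dataframe as dd
    from distributed import Client, LocalCluster

    pdf_X = pd.DataFrame(
        {"genre": pd.Categorical(np.random.choice(["a", "b", "c"], size=256))}
    )
    pdf_y = pd.Series(np.random.normal(size=256))

    with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
        X = dd.from_pandas(pdf_X, npartitions=2)
        y = dd.from_pandas(pdf_y, npartitions=2)
        Xy = xgb.dask.DaskQuantileDMatrix(client, X, y, enable_categorical=True)
        out = xgb.dask.train(client, {"tree_method": "hist"}, Xy, num_boost_round=4)
        # Training continuation: reference categories come from the previous model,
        # which is what this commit enables.
        out = xgb.dask.train(
            client, {"tree_method": "hist"}, Xy,
            num_boost_round=4, xgb_model=out["booster"],
        )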
