Document the JSON schema for the categories container.

trivialfis · trivialfis · commit 54fa6248816a · 2025-11-03T01:43:04.000+08:00
diff --git a/doc/model_schema.rst b/doc/model_schema.rst
@@ -0,0 +1,8 @@
+:orphan:
+
+#################
+JSON Model Schema
+#################
+
+.. include:: ./model.schema
+   :code: json
diff --git a/doc/tutorials/saving_model.rst b/doc/tutorials/saving_model.rst
@@ -2,6 +2,12 @@
 Introduction to Model IO
 ########################
 
+**Contents**
+
+.. contents::
+  :backlinks: none
+  :local:
+
 Since 2.1.0, the default model format for XGBoost is the UBJSON format, the option is
 enabled for serializing models to file, serializing models to buffer, and for memory
 snapshot (pickle and alike).
@@ -233,6 +239,57 @@ to be loaded back to XGBoost.  The JSON version has a `schema
 <https://github.com/dmlc/xgboost/blob/master/doc/dump.schema>`__.  See next section for
 more info.
 
+**********
+Categories
+**********
+
+Since 3.1, the category encoding from a training dataframe is stored in the booster to
+provide test-time re-coding support. See :ref:`cat-recode` for more info about how the
+re-coder works. We will briefly explain the JSON format for the serialized category index.
+
+The categories are saved in a JSON object named "cats" under the gbtree model. It contains
+three keys:
+
+- feature_segments
+
+This is a CSR-like pointer array: the difference between adjacent entries is the number
+of categories for that feature. It starts with zero and ends with the total number of
+categories from all features. For example:
+
+.. code-block:: python
+
+    feature_segments = [0, 3, 3, 5]
+
+The ``feature_segments`` list represents a dataset with two categorical features and one
+numerical feature. The first feature contains three categories, the second feature is
+numerical and thus has no categories, and the last feature includes two categories.
+
+- sorted_idx
+
+This array stores the sorted indices (``argsort``) of categories across all features,
+segmented by the ``feature_segments``. Given a feature with categories: ``["b", "c",
+"a"]``, the sorted index is ``[2, 0, 1]``.
+
+- enc
+
+This is an array with a length equal to the number of features, storing all the categories
+in the same order as the input dataframe. The storage schema depends on whether the
+categories are strings (XGBoost also supports numerical categories, such as integers). For
+string categories, we use a schema similar to the Arrow format for a string array. The
+categories of each feature are represented by two arrays, namely ``offsets`` and
+``values``. The format is also similar to a CSR matrix. The ``values`` field is a
+``uint8`` array storing characters from all category names. Given a feature with three
+categories: ``["bb", "c", "a"]``, the ``values`` field is ``[98, 98, 99, 97]``. The
+``offsets`` array then segments ``values`` like a CSR pointer: ``[0, 2, 3, 4]``. We
+chose not to store the ``values`` as a JSON string to avoid handling special characters
+and string encoding. The string names are stored exactly as given by the dataframe.
+
+As for numerical categories, each ``enc`` entry contains two keys: ``type`` and ``values``.
+The ``type`` field is an integer ID that identifies the type of the categories, such as
+64-bit integers and 32-bit floating-point numbers (note that they are all f32 inside a
+decision tree). The exact mapping from the type to the integer ID is internal but stable.
+The ``values`` is an array storing all categories in a feature.
+
 ***********
 JSON Schema
 ***********
@@ -243,10 +300,7 @@ XGBoost.  Here is the JSON schema for the output model (not serialization, which
 be stable as noted above).  For an example of parsing XGBoost tree model, see
 ``/demo/json-model``.  Please notice the "weight_drop" field used in "dart" booster.
 XGBoost does not scale tree leaf directly, instead it saves the weights as a separated
-array.
-
-.. include:: ../model.schema
-   :code: json
+array. See :doc:`/model_schema`.
 
 
 *************
diff --git a/include/xgboost/json.h b/include/xgboost/json.h
@@ -30,7 +30,11 @@ class Value {
   }
 
  public:
-  /*!\brief Simplified implementation of LLVM RTTI. */
+  /**
+   * @brief Simplified implementation of LLVM RTTI.
+   *
+   * @note The integer ID must be kept stable.
+   */
   enum class ValueKind : std::int64_t {
     kString = 0,
     kNumber = 1,