Commit 54fa624

Document the JSON schema for the categories container.
1 parent f8f2705 commit 54fa624

3 files changed: 71 additions and 5 deletions.

doc/model_schema.rst

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+:orphan:
+
+#################
+JSON Model Schema
+#################
+
+.. include:: ./model.schema
+  :code: json

doc/tutorials/saving_model.rst

Lines changed: 58 additions & 4 deletions
@@ -2,6 +2,12 @@
 Introduction to Model IO
 ########################
 
+**Contents**
+
+.. contents::
+  :backlinks: none
+  :local:
+
 Since 2.1.0, the default model format for XGBoost is the UBJSON format, the option is
 enabled for serializing models to file, serializing models to buffer, and for memory
 snapshot (pickle and alike).
@@ -233,6 +239,57 @@ to be loaded back to XGBoost. The JSON version has a `schema
 <https://github.com/dmlc/xgboost/blob/master/doc/dump.schema>`__. See next section for
 more info.
 
+**********
+Categories
+**********
+
+Since 3.1, the categories encoding from a training dataframe is stored in the booster to
+provide test-time re-coding support, see :ref:`cat-recode` for more info about how the
+re-coder works. We will briefly explain the JSON format for the serialized category index.
+
+The categories are saved in a JSON object named "cats" under the gbtree model. It contains
+three keys:
+
+- feature_segments
+
+  This is a CSR-like pointer that stores the number of categories for each feature. It
+  starts with zero and ends with the total number of categories from all features. For
+  example:
+
+  .. code-block:: python
+
+    feature_segments = [0, 3, 3, 5]
+
+  The ``feature_segments`` list represents a dataset with two categorical features and one
+  numerical feature. The first feature contains three categories, the second feature is
+  numerical and thus has no categories, and the last feature includes two categories.
+
+- sorted_idx
+
+  This array stores the sorted indices (``argsort``) of categories across all features,
+  segmented by the ``feature_segments``. Given a feature with categories: ``["b", "c",
+  "a"]``, the sorted index is ``[2, 0, 1]``.
+
+- enc
+
+  This is an array with a length equal to the number of features, storing all the categories
+  in the same order as the input dataframe. The storage schema depends on whether the
+  categories are strings (XGBoost also supports numerical categories, such as integers). For
+  string categories, we use a schema similar to the arrow format for a string array. The
+  categories of each feature are represented by two arrays, namely ``offsets`` and
+  ``values``. The format is also similar to a CSR-matrix. The ``values`` field is a
+  ``uint8`` array storing characters from all category names. Given a feature with three
+  categories: ``["bb", "c", "a"]``, the ``values`` field is ``[98, 98, 99, 97]``. Then the
+  ``offsets`` segments the ``values`` array similar to a CSR pointer: ``[0, 2, 3, 4]``. We
+  chose to not store the ``values`` as a JSON string to avoid handling special characters
+  and string encoding. The string names are stored exactly as given by the dataframe.
+
+  As for numerical categories, the ``enc`` contains two keys: ``type`` and ``values``. The
+  ``type`` field is an integer ID that identifies the type of the categories, such as 64-bit
+  integers and 32-bit floating-point numbers (note that they are all f32 inside a decision
+  tree). The exact mapping from the type to the integer ID is internal but stable. The
+  ``values`` is an array storing all categories in a feature.
+
 ***********
 JSON Schema
 ***********
@@ -243,10 +300,7 @@ XGBoost. Here is the JSON schema for the output model (not serialization, which
 be stable as noted above). For an example of parsing XGBoost tree model, see
 ``/demo/json-model``. Please notice the "weight_drop" field used in "dart" booster.
 XGBoost does not scale tree leaf directly, instead it saves the weights as a separated
-array.
-
-.. include:: ../model.schema
-  :code: json
+array. See :doc:`/model_schema`.
 
 
 *************
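
As a rough illustration of the ``cats`` layout documented in the Categories section of this diff, the sketch below reads a ``feature_segments`` pointer and a string-typed ``enc`` entry, using the example values from the text. The helper names are hypothetical and not part of XGBoost. Decoding ``values`` as UTF-8 is an assumption, since the document only says the names are stored exactly as given by the dataframe.

```python
from typing import List


def categories_per_feature(feature_segments: List[int]) -> List[int]:
    """Return the number of categories of each feature.

    ``feature_segments`` is a CSR-like pointer, so the count for feature ``i``
    is the difference between entries ``i + 1`` and ``i``.
    """
    return [end - beg for beg, end in zip(feature_segments, feature_segments[1:])]


def decode_string_categories(offsets: List[int], values: List[int]) -> List[str]:
    """Rebuild the category names of one feature from ``offsets`` and ``values``.

    ``values`` holds the uint8 characters of all names back to back and
    ``offsets`` marks where each name starts and ends.
    Assumption: the bytes are valid UTF-8.
    """
    raw = bytes(values)
    return [raw[beg:end].decode("utf-8") for beg, end in zip(offsets, offsets[1:])]


# Example values taken from the documentation text.
counts = categories_per_feature([0, 3, 3, 5])
names = decode_string_categories([0, 2, 3, 4], [98, 98, 99, 97])

print(counts)  # [3, 0, 2]
print(names)   # ['bb', 'c', 'a']
```

A count of zero marks a feature without categories, which matches the numerical second feature in the documented example.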

include/xgboost/json.h

Lines changed: 5 additions & 1 deletion
@@ -30,7 +30,11 @@ class Value {
   }
 
  public:
-  /*!\brief Simplified implementation of LLVM RTTI. */
+  /**
+   * @brief Simplified implementation of LLVM RTTI.
+   *
+   * @note The integer ID must be kept stable.
+   */
   enum class ValueKind : std::int64_t {
     kString = 0,
     kNumber = 1,
