Commit 62b5b16

[doc] Split up the support matrix from the intro. (#11586)
1 parent aa497ae commit 62b5b16

4 files changed: +104 -126 lines changed

doc/python/data_input.rst

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+################################
+Supported Python data structures
+################################
+
+This page is a support matrix for various input types.
+
+.. _py-data:
+
+*******
+Markers
+*******
+
+- T: Supported.
+- F: Not supported.
+- NE: Invalid type for the use case. For instance, :py:class:`pandas.Series` can not be multi-target label.
+- NPA: Support with the help of numpy array.
+- AT: Support with the help of arrow table.
+- CPA: Support with the help of cupy array.
+- SciCSR: Support with the help of scipy sparse CSR :py:class:`scipy.sparse.csr_matrix`. The conversion to scipy CSR may or may not be possible. Raise a type error if conversion fails.
+- FF: We can look forward to having its support in recent future if requested.
+- empty: To be filled in.
+
+************
+Table Header
+************
+- `X` means predictor matrix.
+- Meta info: label, weight, etc.
+- Multi Label: 2-dim label for multi-target.
+- Others: Anything else that we don't list here explicitly including formats like `lil`, `dia`, `bsr`. XGBoost will try to convert it into scipy csr.
+
+**************
+Support Matrix
+**************
+
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| Name                    | DMatrix X | QuantileDMatrix X | Sklearn X | Meta Info | Inplace prediction | Multi Label |
++=========================+===========+===================+===========+===========+====================+=============+
+| numpy.ndarray           | T         | T                 | T         | T         | T                  | T           |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| scipy.sparse.csr        | T         | T                 | T         | NE        | T                  | F           |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| scipy.sparse.csc        | T         | F                 | T         | NE        | F                  | F           |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| scipy.sparse.coo        | SciCSR    | F                 | SciCSR    | NE        | F                  | F           |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| uri                     | T         | F                 | F         | F         | NE                 | F           |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| list                    | NPA       | NPA               | NPA       | NPA       | NPA                | T           |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| tuple                   | NPA       | NPA               | NPA       | NPA       | NPA                | T           |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| pandas.DataFrame        | NPA       | NPA               | NPA       | NPA       | NPA                | NPA         |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| pandas.Series           | NPA       | NPA               | NPA       | NPA       | NPA                | NE          |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| cudf.DataFrame          | T         | T                 | T         | T         | T                  | T           |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| cudf.Series             | T         | T                 | T         | T         | FF                 | NE          |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| cupy.ndarray            | T         | T                 | T         | T         | T                  | T           |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| torch.Tensor            | T         | T                 | T         | T         | T                  | T           |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| dlpack                  | CPA       | CPA               |           | CPA       | FF                 | FF          |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| modin.DataFrame         | NPA       | FF                | NPA       | NPA       | FF                 |             |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| modin.Series            | NPA       | FF                | NPA       | NPA       | FF                 |             |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| pyarrow.Table           | T         | T                 | T         | T         | T                  | T           |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| polars.DataFrame        | AT        | AT                | AT        | AT        | AT                 | AT          |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| polars.LazyFrame (WARN) | AT        | AT                | AT        | AT        | AT                 | AT          |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| polars.Series           | AT        | AT                | AT        | AT        | AT                 | NE          |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| _\_array\_\_            | NPA       | F                 | NPA       | NPA       | H                  |             |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+| Others                  | SciCSR    | F                 |           | F         | F                  |             |
++-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
+
+The polars ``LazyFrame.collect`` supports many configurations, ranging from the choice of
+query engine to type coercion. XGBoost simply uses the default parameter. Please run
+``collect`` to obtain the ``DataFrame`` before passing it into XGBoost for finer control
+over the behaviour.

doc/python/index.rst

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ Contents
   python_intro
   sklearn_estimator
   python_api
+  data_input
   callbacks
   examples/index
   dask-examples/index

doc/python/python_intro.rst

Lines changed: 10 additions & 119 deletions
@@ -32,9 +32,9 @@ To verify your installation, run the following in Python:
 
 Data Interface
 --------------
-The XGBoost Python module is able to load data from many different types of data format including both CPU and GPU data structures. For a complete list of supported data types, please reference the :ref:`py-data`. For a detailed description of text input formats, please visit :doc:`/tutorials/input_format`.
+The XGBoost Python module is able to load data from many different types of data format including both CPU and GPU data structures. For a comprehensive list of supported data types, please reference the :doc:`/python/data_input`. For a detailed description of text input formats, please visit :doc:`/tutorials/input_format`.
 
-The input data is stored in a :py:class:`DMatrix <xgboost.DMatrix>` object. For the sklearn estimator interface, a :py:class:`DMatrix` or a :py:class:`QuantileDMatrix` is created depending on the chosen algorithm and the input, see the sklearn API reference for details. We will illustrate some of the basic input types with the ``DMatrix`` here.
+The input data is stored in a :py:class:`DMatrix <xgboost.DMatrix>` object. For the sklearn estimator interface, a :py:class:`DMatrix` or a :py:class:`QuantileDMatrix` is created depending on the chosen algorithm and the input, see the sklearn API reference for details. We will illustrate some of the basic input types using the ``DMatrix`` here.
 
 * To load a NumPy array into :py:class:`DMatrix <xgboost.DMatrix>`:
 
@@ -59,11 +59,12 @@ The input data is stored in a :py:class:`DMatrix <xgboost.DMatrix>` object. For
     label = pandas.DataFrame(np.random.randint(2, size=4))
     dtrain = xgb.DMatrix(data, label=label)
 
-* Saving :py:class:`DMatrix <xgboost.DMatrix>` into a XGBoost binary file will make loading faster:
+* Saving :py:class:`DMatrix <xgboost.DMatrix>` into a XGBoost binary file:
 
   .. code-block:: python
 
-    dtrain = xgb.DMatrix('train.svm.txt?format=libsvm')
+    data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
+    label = np.random.randint(2, size=5)  # binary target
     dtrain.save_binary('train.buffer')
 
 * Missing values can be replaced by a default value in the :py:class:`DMatrix <xgboost.DMatrix>` constructor:
@@ -79,116 +80,6 @@ The input data is stored in a :py:class:`DMatrix <xgboost.DMatrix>` object. For
     w = np.random.rand(5, 1)
     dtrain = xgb.DMatrix(data, label=label, missing=np.NaN, weight=w)
 
-  When performing ranking tasks, the number of weights should be equal
-  to number of groups.
-
-* To load a LIBSVM text file or a XGBoost binary file into :py:class:`DMatrix <xgboost.DMatrix>`:
-
-  .. code-block:: python
-
-    dtrain = xgb.DMatrix('train.svm.txt?format=libsvm')
-    dtest = xgb.DMatrix('test.svm.buffer')
-
-  The parser in XGBoost has limited functionality. When using Python interface, it's
-  recommended to use sklearn ``load_svmlight_file`` or other similar utilites than
-  XGBoost's builtin parser.
-
-* To load a CSV file into :py:class:`DMatrix <xgboost.DMatrix>`:
-
-  .. code-block:: python
-
-    # label_column specifies the index of the column containing the true label
-    dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
-    dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
-
-  The parser in XGBoost has limited functionality. When using Python interface, it's
-  recommended to use pandas ``read_csv`` or other similar utilites than XGBoost's builtin
-  parser.
-
-.. _py-data:
-
-Supported data structures for various XGBoost functions
-=======================================================
-
-*******
-Markers
-*******
-
-- T: Supported.
-- F: Not supported.
-- NE: Invalid type for the use case. For instance, `pd.Series` can not be multi-target label.
-- NPA: Support with the help of numpy array.
-- AT: Support with the help of arrow table.
-- CPA: Support with the help of cupy array.
-- SciCSR: Support with the help of scripy sparse CSR. The conversion to scipy CSR may or may not be possible. Raise a type error if conversion fails.
-- FF: We can look forward to having its support in recent future if requested.
-- empty: To be filled in.
-
-************
-Table Header
-************
-- `X` means predictor matrix.
-- Meta info: label, weight, etc.
-- Multi Label: 2-dim label for multi-target.
-- Others: Anything else that we don't list here explicitly including formats like `lil`, `dia`, `bsr`. XGBoost will try to convert it into scipy csr.
-
-**************
-Support Matrix
-**************
-
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| Name                    | DMatrix X | QuantileDMatrix X | Sklearn X | Meta Info | Inplace prediction | Multi Label |
-+=========================+===========+===================+===========+===========+====================+=============+
-| numpy.ndarray           | T         | T                 | T         | T         | T                  | T           |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| scipy.sparse.csr        | T         | T                 | T         | NE        | T                  | F           |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| scipy.sparse.csc        | T         | F                 | T         | NE        | F                  | F           |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| scipy.sparse.coo        | SciCSR    | F                 | SciCSR    | NE        | F                  | F           |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| uri                     | T         | F                 | F         | F         | NE                 | F           |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| list                    | NPA       | NPA               | NPA       | NPA       | NPA                | T           |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| tuple                   | NPA       | NPA               | NPA       | NPA       | NPA                | T           |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| pandas.DataFrame        | NPA       | NPA               | NPA       | NPA       | NPA                | NPA         |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| pandas.Series           | NPA       | NPA               | NPA       | NPA       | NPA                | NE          |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| cudf.DataFrame          | T         | T                 | T         | T         | T                  | T           |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| cudf.Series             | T         | T                 | T         | T         | FF                 | NE          |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| cupy.ndarray            | T         | T                 | T         | T         | T                  | T           |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| torch.Tensor            | T         | T                 | T         | T         | T                  | T           |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| dlpack                  | CPA       | CPA               |           | CPA       | FF                 | FF          |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| modin.DataFrame         | NPA       | FF                | NPA       | NPA       | FF                 |             |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| modin.Series            | NPA       | FF                | NPA       | NPA       | FF                 |             |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| pyarrow.Table           | T         | T                 | T         | T         | T                  | T           |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| polars.DataFrame        | AT        | AT                | AT        | AT        | AT                 | AT          |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| polars.LazyFrame (WARN) | AT        | AT                | AT        | AT        | AT                 | AT          |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| polars.Series           | AT        | AT                | AT        | AT        | AT                 | NE          |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| _\_array\_\_            | NPA       | F                 | NPA       | NPA       | H                  |             |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-| Others                  | SciCSR    | F                 |           | F         | F                  |             |
-+-------------------------+-----------+-------------------+-----------+-----------+--------------------+-------------+
-
-The polars ``LazyFrame.collect`` supports many configurations, ranging from the choice of
-query engine to type coercion. XGBoost simply uses the default parameter. Please run
-``collect`` to obtain the ``DataFrame`` before passing it into XGBoost for finer control
-over the behaviour.
-
 Setting Parameters
 ------------------
 XGBoost can use either a list of pairs or a dictionary to set :doc:`parameters </parameter>`. For instance:
@@ -227,11 +118,11 @@ Training a model requires a parameter list and data set.
   num_round = 10
   bst = xgb.train(param, dtrain, num_round, evallist)
 
-After training, the model can be saved.
+After training, the model can be saved into ``JSON`` or ``UBJSON``:
 
 .. code-block:: python
 
-  bst.save_model('0001.model')
+  bst.save_model('model.ubj')
 
 The model and its feature map can also be dumped to a text file.
 
@@ -247,10 +138,10 @@ A saved model can be loaded as follows:
 .. code-block:: python
 
   bst = xgb.Booster({'nthread': 4})  # init model
-  bst.load_model('model.bin')  # load model data
+  bst.load_model('model.ubj')  # load model data
 
-Methods including `update` and `boost` from `xgboost.Booster` are designed for
-internal usage only. The wrapper function `xgboost.train` does some
+Methods including `update` and `boost` from :py:class:`xgboost.Booster` are designed for
+internal usage only. The wrapper function :py:class:`xgboost.train` does some
 pre-configuration including setting up caches and some other parameters.
 
 Early Stopping
