Commit ed8c363

apacheGH-40841: [Docs][C++][Python] Add initial documentation for RecordBatch::Tensor conversion (apache#40842)
### Rationale for this change

The work on the conversion from `Table`/`RecordBatch` to `Tensor` is progressing, and the documentation needs to cover it.

### What changes are included in this PR?

I propose to add

- a new page (`converting_recordbatch_to_tensor.rst`) in the `cpp/examples` section,
- a new section (Conversion of RecordBatch to Tensor) in `docs/source/python/data.rst`.

The content above will be updated as features are added in the future (row-major conversion, `Table::ToTensor`, DLPack support for the `Tensor` class, etc.).

### Are these changes tested?

It will be tested with the crossbow preview-docs job.

### Are there any user-facing changes?

No, just documentation.

* GitHub Issue: apache#40841

Lead-authored-by: AlenkaF <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
1 parent 50ca7a7 commit ed8c363

3 files changed: +99 -0 lines changed

docs/source/cpp/examples/converting_recordbatch_to_tensor.rst

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. default-domain:: cpp
.. highlight:: cpp

Conversion of ``RecordBatch`` to ``Tensor`` instances
=====================================================

Arrow provides a method to convert a ``RecordBatch`` object to a
two-dimensional ``Tensor``:

.. code::

   // Assumes batch points to a valid, populated RecordBatch.
   // ASSERT_OK_AND_ASSIGN and ASSERT_OK are Arrow's GoogleTest-based test macros.
   std::shared_ptr<RecordBatch> batch;

   ASSERT_OK_AND_ASSIGN(auto tensor, batch->ToTensor());
   ASSERT_OK(tensor->Validate());

The conversion supports signed and unsigned integer types and floating-point
types. If the ``RecordBatch`` contains null values, the conversion succeeds
only if the ``null_to_nan`` parameter is set to ``true``. In that case all
types are promoted to a floating-point data type.

.. code::

   std::shared_ptr<RecordBatch> batch;

   ASSERT_OK_AND_ASSIGN(auto tensor, batch->ToTensor(/*null_to_nan=*/true));
   ASSERT_OK(tensor->Validate());

Currently only column-major conversion is supported.

docs/source/cpp/examples/index.rst

Lines changed: 1 addition & 0 deletions
@@ -27,3 +27,4 @@ Examples
   dataset_skyhook_scan_example
   row_columnar_conversion
   std::tuple-like ranges to Arrow <tuple_range_conversion>
   Converting RecordBatch to Tensor <converting_recordbatch_to_tensor>

docs/source/python/data.rst

Lines changed: 52 additions & 0 deletions
@@ -560,3 +560,55 @@ schema without having to get any of the batches.::
   x: int64

It can also be sent between languages using the :ref:`C stream interface <c-stream-interface>`.

Conversion of RecordBatch to Tensor
-----------------------------------

Each array of the ``RecordBatch`` has its own contiguous memory that is not necessarily
adjacent to other arrays. Machine learning libraries often use a different memory
structure: a two-dimensional array (also called a 2-dim tensor or a matrix) that occupies
a single contiguous block of memory.

For this reason the method ``pyarrow.RecordBatch.to_tensor()`` is available
to efficiently convert tabular columnar data into a tensor.
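
The difference between the two memory layouts can be illustrated with a small
standard-library sketch. It is a conceptual illustration only, not how
``pyarrow`` implements the conversion; the helpers ``to_column_major`` and
``element`` are hypothetical names introduced here.

```python
from array import array

# Two hypothetical columns, each held in its own separate buffer
# (standing in for the arrays of a RecordBatch).
col_a = array("i", [1, 2, 3, 4, 5])
col_b = array("i", [10, 20, 30, 40, 50])


def to_column_major(columns):
    """Copy equally long columns back to back into one contiguous buffer."""
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all columns must have the same length")
    flat = array(columns[0].typecode)
    for col in columns:
        flat.extend(col)
    return flat, (n_rows, len(columns))


def element(flat, shape, row, col):
    """Look up the element at (row, col) in a column-major buffer."""
    n_rows, _n_cols = shape
    return flat[col * n_rows + row]


flat, shape = to_column_major([col_a, col_b])
print(flat.tolist())               # [1, 2, 3, 4, 5, 10, 20, 30, 40, 50]
print(shape)                       # (5, 2)
print(element(flat, shape, 2, 1))  # 30
```

In a column-major layout the values of each column stay next to each other, so
building the single block amounts to placing the columns one after another.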

Supported data types are signed and unsigned integer types and floating-point
types. Currently only column-major conversion is supported.

>>> import pyarrow as pa
>>> arr1 = [1, 2, 3, 4, 5]
>>> arr2 = [10, 20, 30, 40, 50]
>>> batch = pa.RecordBatch.from_arrays(
...     [
...         pa.array(arr1, type=pa.uint16()),
...         pa.array(arr2, type=pa.int16()),
...     ], ["a", "b"]
... )
>>> batch.to_tensor()
<pyarrow.Tensor>
type: int32
shape: (5, 2)
strides: (4, 20)
>>> batch.to_tensor().to_numpy()
array([[ 1, 10],
       [ 2, 20],
       [ 3, 30],
       [ 4, 40],
       [ 5, 50]], dtype=int32)
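
The ``strides`` of a column-major tensor follow from its shape and item size:
moving one row down advances by one item, and moving one column right skips a
whole column of items. The sketch below works this out for a tensor with 5 rows,
2 columns and 4-byte ``int32`` values. The function names are hypothetical and
the code is plain Python arithmetic, not a ``pyarrow`` API.

```python
def column_major_strides(shape, itemsize):
    """Byte strides of a column-major (Fortran-order) two-dimensional layout."""
    n_rows, _n_cols = shape
    # One step along a row index moves by a single item;
    # one step along a column index skips a whole column.
    return (itemsize, n_rows * itemsize)


def byte_offset(strides, row, col):
    """Byte offset of the element at (row, col) from the start of the buffer."""
    return row * strides[0] + col * strides[1]


strides = column_major_strides((5, 2), itemsize=4)
print(strides)                     # (4, 20)
print(byte_offset(strides, 2, 1))  # 28
```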

With ``null_to_nan`` set to ``True``, data containing nulls can also be
converted. The nulls are converted to ``NaN``:

>>> import pyarrow as pa
>>> batch = pa.record_batch(
...     [
...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
...     ], names=["a", "b"]
... )
>>> batch.to_tensor(null_to_nan=True).to_numpy()
array([[ 1., 10.],
       [ 2., 20.],
       [ 3., 30.],
       [ 4., 40.],
       [nan, nan]])
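
Integer types have no representation for ``NaN``, which is why a column with
nulls has to end up as floating point. A rough standard-library sketch of the
idea follows; ``nulls_to_nan`` is a hypothetical helper that operates on plain
Python lists and is not part of ``pyarrow``.

```python
import math


def nulls_to_nan(column):
    """Return the column as floats, replacing None (null) with NaN."""
    return [math.nan if value is None else float(value) for value in column]


col_a = [1, 2, 3, 4, None]
col_b = [10, 20, 30, 40, None]

converted = [nulls_to_nan(col) for col in (col_a, col_b)]
# Transpose the columns into rows for display.
rows = [list(row) for row in zip(*converted)]
for row in rows:
    print(row)
```

Every value in the result is a float, including those from the column that
originally held integers.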
