apacheGH-40841: [Docs][C++][Python] Add initial documentation for RecordBatch::Tensor conversion (apache#40842)

AlenkaF · jorisvandenbossche · web-flow · commit ed8c3630dbe2 · 2024-03-29T08:29:28.000+01:00
### Rationale for this change

The work on the conversion from `Table`/`RecordBatch` to `Tensor` is progressing and we have to make sure to add information to the documentation.

### What changes are included in this PR?

I propose to add

- new page (`converting_recordbatch_to_tensor.rst`) in the `cpp/examples` section,
- added section (Conversion of RecordBatch to Tensor) in the `docs/source/python/data.rst`

The content above would be updated as the features are added in the future (row-major conversion, `Table::ToTensor`, DLPack support for `Tensor` class, etc.)

### Are these changes tested?

It will be tested with the crossbow preview-docs job.

### Are there any user-facing changes?

No, just documentation.

* GitHub Issue: apache#40841

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
diff --git a/docs/source/cpp/examples/converting_recordbatch_to_tensor.rst b/docs/source/cpp/examples/converting_recordbatch_to_tensor.rst
@@ -0,0 +1,46 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+
+Conversion of ``RecordBatch`` to ``Tensor`` instances
+=====================================================
+
+Arrow provides a method to convert ``RecordBatch`` objects to a ``Tensor``
+with two dimensions:
+
+.. code::
+
+   std::shared_ptr<RecordBatch> batch;
+
+   ASSERT_OK_AND_ASSIGN(auto tensor, batch->ToTensor());
+   ASSERT_OK(tensor->Validate());
+
+The conversion supports signed and unsigned integer types and floating-point
+types. If the ``RecordBatch`` contains null values, the conversion succeeds
+only if the ``null_to_nan`` parameter is set to ``true``. In that case all
+types are promoted to a floating-point data type.
+
+.. code::
+
+   std::shared_ptr<RecordBatch> batch;
+
+   ASSERT_OK_AND_ASSIGN(auto tensor, batch->ToTensor(/*null_to_nan=*/true));
+   ASSERT_OK(tensor->Validate());
+
+Currently only column-major conversion is supported.
diff --git a/docs/source/cpp/examples/index.rst b/docs/source/cpp/examples/index.rst
@@ -27,3 +27,4 @@ Examples
    dataset_skyhook_scan_example
    row_columnar_conversion
    std::tuple-like ranges to Arrow <tuple_range_conversion>
+   Converting RecordBatch to Tensor <converting_recordbatch_to_tensor>
diff --git a/docs/source/python/data.rst b/docs/source/python/data.rst
@@ -560,3 +560,55 @@ schema without having to get any of the batches.::
    x: int64
 
 It can also be sent between languages using the :ref:`C stream interface <c-stream-interface>`.
+
+Conversion of RecordBatch to Tensor
+-----------------------------------
+
+Each array of the ``RecordBatch`` has its own contiguous memory that is not necessarily
+adjacent to other arrays. A different memory structure that is used in machine learning
+libraries is a two-dimensional array (also called a 2-dim tensor or a matrix) which takes
+only one contiguous block of memory.
+
+For this reason there is a function ``pyarrow.RecordBatch.to_tensor()`` available
+to efficiently convert tabular columnar data into a tensor.
+
+Data types supported in this conversion are signed and unsigned integer types
+and floating-point types. Currently only column-major conversion is supported.
+
+   >>> import pyarrow as pa
+   >>> arr1 = [1, 2, 3, 4, 5]
+   >>> arr2 = [10, 20, 30, 40, 50]
+   >>> batch = pa.RecordBatch.from_arrays(
+   ...     [
+   ...         pa.array(arr1, type=pa.uint16()),
+   ...         pa.array(arr2, type=pa.int16()),
+   ...     ], ["a", "b"]
+   ... )
+   >>> batch.to_tensor()
+   <pyarrow.Tensor>
+   type: int32
+   shape: (5, 2)
+   strides: (4, 20)
+   >>> batch.to_tensor().to_numpy()
+   array([[ 1, 10],
+          [ 2, 20],
+          [ 3, 30],
+          [ 4, 40],
+          [ 5, 50]], dtype=int32)
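
To make the column-major layout concrete, the following is a minimal pure-Python sketch of how two 5-element columns of 32-bit integers could be packed into one contiguous buffer and how the byte strides follow from the shape. It is an illustration only, not pyarrow's implementation; the helper names ``column_major_strides``, ``pack_column_major`` and ``element_at`` are hypothetical and exist only for this example.

```python
import struct


def column_major_strides(shape, itemsize):
    """Byte strides for a column-major (Fortran-order) array of the given shape."""
    strides = []
    step = itemsize
    for dim in shape:
        strides.append(step)
        step *= dim
    return tuple(strides)


def pack_column_major(columns, fmt="i"):
    """Copy equal-length columns back to back into one contiguous bytes buffer."""
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all columns must have the same length")
    flat = [value for col in columns for value in col]
    return struct.pack(f"<{len(flat)}{fmt}", *flat)


def element_at(buffer, strides, row, col, fmt="i"):
    """Read one element by computing its byte offset from the strides."""
    offset = row * strides[0] + col * strides[1]
    return struct.unpack_from(f"<{fmt}", buffer, offset)[0]


columns = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]]
shape = (len(columns[0]), len(columns))                       # (rows, columns)
strides = column_major_strides(shape, struct.calcsize("<i"))  # 4-byte items
buffer = pack_column_major(columns)

print(shape, strides)                      # (5, 2) (4, 20)
print(element_at(buffer, strides, 3, 1))   # 40
```

Moving one row down advances by one item (4 bytes), while moving one column to the right skips a whole column (5 items, 20 bytes). This is why each column of the batch can be copied into the tensor as a single contiguous run.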
+
+With ``null_to_nan`` set to ``True`` one can also convert data with nulls. They are
+converted to ``NaN`` and the result is promoted to a floating-point type:
+
+   >>> import pyarrow as pa
+   >>> batch = pa.record_batch(
+   ...     [
+   ...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
+   ...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
+   ...     ], names=["a", "b"]
+   ... )
+   >>> batch.to_tensor(null_to_nan=True).to_numpy()
+   array([[ 1., 10.],
+          [ 2., 20.],
+          [ 3., 30.],
+          [ 4., 40.],
+          [nan, nan]])
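
The promotion to a floating-point type is needed because integer types have no representation for ``NaN``. The idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not pyarrow's code path: the helpers ``nulls_to_nan`` and ``to_rows`` are hypothetical, ``None`` stands in for an Arrow null, and the ``ValueError`` is a placeholder for whatever error the library actually raises.

```python
import math


def nulls_to_nan(column):
    """Return the column as floats, with None (null) replaced by NaN."""
    return [math.nan if value is None else float(value) for value in column]


def to_rows(columns, null_to_nan=False):
    """Combine equal-length columns into a list of rows (a 2-dim structure)."""
    if null_to_nan:
        # Every column is promoted to float so that NaN can mark the nulls.
        columns = [nulls_to_nan(col) for col in columns]
    elif any(value is None for col in columns for value in col):
        # Placeholder error type, chosen for this sketch only.
        raise ValueError("columns contain nulls; pass null_to_nan=True")
    return [list(row) for row in zip(*columns)]


rows = to_rows([[1, 2, None], [10.0, 20.0, None]], null_to_nan=True)
print(rows)  # [[1.0, 10.0], [2.0, 20.0], [nan, nan]]
```

Without ``null_to_nan=True`` the sketch refuses to convert columns that contain nulls, mirroring the requirement described above.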