feat(python): Add array creation/building from buffers (#378)

The gist of this PR is that I'd like the ability to create arrays for testing without pyarrow so that nanoarrow's tests can run in more places. Other than building/running in odd corner-case environments, nanoarrow in R has been great at prototyping and/or creating test data (e.g., an array with a non-zero offset, an array with a rarely-used type). This is useful for both nanoarrow to test itself and perhaps others who might want to use nanoarrow in a similar way in Python.

This is a bit big...I did need to put all of it in one place to figure out what the end point was; however, I'm happy to split into smaller self-contained bits now that I know where I'm headed.

After this PR, we can create an array out-of-the-box from anything that supports the buffer protocol. Importantly, this includes numpy arrays so that you can do things like generate arrays with `n` random numbers.

```python
import nanoarrow as na
import numpy as np
```

```python
na.c_array_view(b"12345")
```

    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'uint8'
    - length: 5
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <uint8[5 b] 49 50 51 52 53>
    - dictionary: NULL
    - children[0]:

```python
na.c_array_view(np.array([1, 2, 3], np.int32))
```

```
<nanoarrow.c_lib.CArrayView>
- storage_type: 'int32'
- length: 3
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int32[12 b] 1 2 3>
- dictionary: NULL
- children[0]:
```

While not built in to the main `c_array()` constructor, we can also now assemble an array from buffers. This has been very useful in R and ensures that we can construct just about any array if we need to.
```python
array = na.c_array_from_buffers(
    na.struct([na.int32()]),
    length=3,
    buffers=[None],
    children=[
        na.c_array_from_buffers(
            na.int32(),
            length=3,
            buffers=[None, na.c_buffer([1, 2, 3], na.int32())]
        )
    ],
)

na.c_array_view(array)
```

    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'struct'
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[1]:
      - validity <bool[0 b] >
    - dictionary: NULL
    - children[1]:
      - <nanoarrow.c_lib.CArrayView>
        - storage_type: 'int32'
        - length: 3
        - offset: 0
        - null_count: 0
        - buffers[2]:
          - validity <bool[0 b] >
          - data <int32[12 b] 1 2 3>
        - dictionary: NULL
        - children[0]:

I also added the ability to construct a buffer from an iterable and wired that into the `c_array()` constructor, although this is probably not all that fast. It does, however, make it much easier to write tests (because many of them currently start with `na.c_array(pa.array([1, 2, 3]))`).

```python
na.c_array_view([1, 2, 3], na.int32())
```

    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int32'
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int32[12 b] 1 2 3>
    - dictionary: NULL
    - children[0]:

This allows creating an array from anything supported by the `struct` module, which means we can create some of the less frequently used types.

```python
na.c_array_view([1, 2, 3], na.float16())
```

    CBuffer(half_float[6 b] 1.0 2.0 3.0)

```python
na.c_array_view([(1, 2), (3, 4), (5, 6)], na.interval_day_time())
```

    CBuffer(interval_day_time[24 b] (1, 2) (3, 4) (5, 6))

Because it's mentally exhausting to bitpack buffers in my head and because Arrow uses them all the time, I also think it's mission-critical to be able to create bitmaps:

```python
na.c_buffer([True, False, True, True], na.bool())
```

    CBuffer(bool[1 b] 10110000)

This involved fixing some issues with the existing buffer view:

- The buffer view only ever saved a pointer to the device.
  This is a bit of a problem because even though the CPU device is static and lives forever, CUDA "device" objects will probably keep a CUDA context alive. Thus, we need a strong reference to the `CDevice` Python object (which ensures the underlying nanoarrow `Device*` remains valid).
- The buffer view only handled `BufferView` input where technically all it needs is a pointer and a length. This opens it up to represent other types of buffers than just something from nanoarrow (e.g., imported from dlpack or buffer protocol).

Implementing the buffer protocol as a consumer was done by wrapping the `ArrowBuffer` with a "deallocator" that holds the `Py_buffer` and ensures it is released. I still need to do some testing to ensure that it's actually released and that we're not leaking memory. This is how I do it in R and in geoarrow-c (Python) as well. Using the `ArrowBuffer` is helpful because the C-level array builder uses them to manage the memory and ensures they're all released when the array is released.

Implementing the build-from-iterable involved a few more things...notably, completing the "python struct format string" <-> "arrow data type" conversion. This allows the use of `struct.pack()` which takes care of things like half-float conversion and tuples of day, month, nano conversion.

I'm aware this could use a bit better documentation of the added classes/methods...I am assuming these will be internal for the time being but they definitely need a bit more than is currently there.

---------

Co-authored-by: Joris Van den Bossche <[email protected]>
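The bitmap and `struct.pack()` behavior described in the message above can be reproduced with the standard library alone. The following is a minimal sketch, not nanoarrow's implementation: `pack_bits` and `show_bits` are hypothetical helper names, and the format strings (`<3e`, `<ii`) are assumptions chosen to match the byte counts printed in the examples (6 and 24 bytes).

```python
import struct


def pack_bits(values):
    """Pack booleans into bytes, least-significant bit first (Arrow bitmap order)."""
    out = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def show_bits(bitmap):
    """Render each byte with bit 0 on the left, like the CBuffer repr above."""
    return "".join(format(byte, "08b")[::-1] for byte in bitmap)


# Four booleans fit in one byte; unused trailing bits stay zero.
bitmap = pack_bits([True, False, True, True])
bitmap_text = show_bits(bitmap)

# struct.pack() handles the half-float conversion ("e" is IEEE 754 binary16).
half_floats = struct.pack("<3e", 1.0, 2.0, 3.0)

# Assumed layout for a day/millisecond interval: two little-endian int32 values.
pairs = [(1, 2), (3, 4), (5, 6)]
intervals = b"".join(struct.pack("<ii", days, ms) for days, ms in pairs)
```

Here `bitmap_text` is `"10110000"`, `half_floats` is 6 bytes long, and `intervals` is 24 bytes long, which lines up with the sizes shown in the `CBuffer` output.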
apache · Feb 19, 2024 · 841c845
1 parent 4b6717f
commit 841c845
Showing 17 changed files with 1,988 additions and 332 deletions.
diff --git a/python/README.ipynb b/python/README.ipynb
@@ -118,7 +118,12 @@
        "- storage_type: 'decimal128'\n",
        "- decimal_bitwidth: 128\n",
        "- decimal_precision: 10\n",
-       "- decimal_scale: 3"
+       "- decimal_scale: 3\n",
+       "- dictionary_ordered: False\n",
+       "- map_keys_sorted: False\n",
+       "- nullable: True\n",
+       "- storage_type_id: 24\n",
+       "- type_id: 24"
       ]
      },
      "execution_count": 3,
@@ -195,7 +200,7 @@
        "- length: 4\n",
        "- offset: 0\n",
        "- null_count: 1\n",
-       "- buffers: (2939032895680, 2939032895616, 2939032895744)\n",
+       "- buffers: (3678035706048, 3678035705984, 3678035706112)\n",
        "- dictionary: NULL\n",
        "- children[0]:"
       ]
@@ -232,9 +237,9 @@
        "- offset: 0\n",
        "- null_count: 1\n",
        "- buffers[3]:\n",
-       "  - <bool validity[1 b] 11100000>\n",
-       "  - <int32 data_offset[20 b] 0 3 6 11 11>\n",
-       "  - <string data[11 b] b'onetwothree'>\n",
+       "  - validity <bool[1 b] 11100000>\n",
+       "  - data_offset <int32[20 b] 0 3 6 11 11>\n",
+       "  - data <string[11 b] b'onetwothree'>\n",
        "- dictionary: NULL\n",
        "- children[0]:"
       ]
@@ -297,20 +302,7 @@
      "data": {
       "text/plain": [
        "<nanoarrow.c_lib.CArrayStream>\n",
-       "- get_schema(): <nanoarrow.c_lib.CSchema struct>\n",
-       "  - format: '+s'\n",
-       "  - name: ''\n",
-       "  - flags: 0\n",
-       "  - metadata: NULL\n",
-       "  - dictionary: NULL\n",
-       "  - children[1]:\n",
-       "    'some_column': <nanoarrow.c_lib.CSchema int32>\n",
-       "      - format: 'i'\n",
-       "      - name: 'some_column'\n",
-       "      - flags: 2\n",
-       "      - metadata: NULL\n",
-       "      - dictionary: NULL\n",
-       "      - children[0]:"
+       "- get_schema(): struct<some_column: int32>"
       ]
      },
      "execution_count": 8,
@@ -343,7 +335,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "<nanoarrow.c_lib.CArray struct>\n",
+      "<nanoarrow.c_lib.CArray struct<some_column: int32>>\n",
       "- length: 3\n",
       "- offset: 0\n",
       "- null_count: 0\n",
@@ -354,7 +346,7 @@
       "    - length: 3\n",
       "    - offset: 0\n",
       "    - null_count: 0\n",
-      "    - buffers: (0, 2939033026688)\n",
+      "    - buffers: (0, 3678035837056)\n",
       "    - dictionary: NULL\n",
       "    - children[0]:\n"
      ]
@@ -382,20 +374,7 @@
      "data": {
       "text/plain": [
        "<nanoarrow.c_lib.CArrayStream>\n",
-       "- get_schema(): <nanoarrow.c_lib.CSchema struct>\n",
-       "  - format: '+s'\n",
-       "  - name: ''\n",
-       "  - flags: 0\n",
-       "  - metadata: NULL\n",
-       "  - dictionary: NULL\n",
-       "  - children[1]:\n",
-       "    'some_column': <nanoarrow.c_lib.CSchema int32>\n",
-       "      - format: 'i'\n",
-       "      - name: 'some_column'\n",
-       "      - flags: 2\n",
-       "      - metadata: NULL\n",
-       "      - dictionary: NULL\n",
-       "      - children[0]:"
+       "- get_schema(): struct<some_column: int32>"
       ]
      },
      "execution_count": 10,

diff --git a/python/README.md b/python/README.md
@@ -87,6 +87,11 @@ na.c_schema_view(schema)
     - decimal_bitwidth: 128
     - decimal_precision: 10
     - decimal_scale: 3
+    - dictionary_ordered: False
+    - map_keys_sorted: False
+    - nullable: True
+    - storage_type_id: 24
+    - type_id: 24
 
 
 
@@ -131,7 +136,7 @@ array
     - length: 4
     - offset: 0
     - null_count: 1
-    - buffers: (2939032895680, 2939032895616, 2939032895744)
+    - buffers: (3678035706048, 3678035705984, 3678035706112)
     - dictionary: NULL
     - children[0]:
 
@@ -153,9 +158,9 @@ na.c_array_view(array)
     - offset: 0
     - null_count: 1
     - buffers[3]:
-      - <bool validity[1 b] 11100000>
-      - <int32 data_offset[20 b] 0 3 6 11 11>
-      - <string data[11 b] b'onetwothree'>
+      - validity <bool[1 b] 11100000>
+      - data_offset <int32[20 b] 0 3 6 11 11>
+      - data <string[11 b] b'onetwothree'>
     - dictionary: NULL
     - children[0]:
 
@@ -194,20 +199,7 @@ array_stream
 
 
     <nanoarrow.c_lib.CArrayStream>
-    - get_schema(): <nanoarrow.c_lib.CSchema struct>
-      - format: '+s'
-      - name: ''
-      - flags: 0
-      - metadata: NULL
-      - dictionary: NULL
-      - children[1]:
-        'some_column': <nanoarrow.c_lib.CSchema int32>
-          - format: 'i'
-          - name: 'some_column'
-          - flags: 2
-          - metadata: NULL
-          - dictionary: NULL
-          - children[0]:
+    - get_schema(): struct<some_column: int32>
 
 
 
@@ -219,7 +211,7 @@ for array in array_stream:
     print(array)
 ```
 
-    <nanoarrow.c_lib.CArray struct>
+    <nanoarrow.c_lib.CArray struct<some_column: int32>>
     - length: 3
     - offset: 0
     - null_count: 0
@@ -230,7 +222,7 @@ for array in array_stream:
         - length: 3
         - offset: 0
         - null_count: 0
-        - buffers: (0, 2939033026688)
+        - buffers: (0, 3678035837056)
         - dictionary: NULL
         - children[0]:
 
@@ -248,20 +240,7 @@ array_stream
 
 
     <nanoarrow.c_lib.CArrayStream>
-    - get_schema(): <nanoarrow.c_lib.CSchema struct>
-      - format: '+s'
-      - name: ''
-      - flags: 0
-      - metadata: NULL
-      - dictionary: NULL
-      - children[1]:
-        'some_column': <nanoarrow.c_lib.CSchema int32>
-          - format: 'i'
-          - name: 'some_column'
-          - flags: 2
-          - metadata: NULL
-          - dictionary: NULL
-          - children[0]:
+    - get_schema(): struct<some_column: int32>
 
 
 

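The three buffers printed in the README output above (validity `11100000`, offsets `0 3 6 11 11`, data `b'onetwothree'`) follow the Arrow variable-length string layout. As a rough illustration of how they combine into values, here is a sketch in plain Python; `decode_utf8_array` is a hypothetical helper, not part of nanoarrow, and the element count of 4 is taken from the `length: 4` line shown for this array.

```python
def decode_utf8_array(validity, offsets, data, length):
    """Decode an Arrow-style string array from its three raw buffers."""
    out = []
    for i in range(length):
        # Validity bits are stored least-significant bit first within each byte.
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            # Element i spans data[offsets[i]:offsets[i + 1]].
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out


validity = bytes([0b00000111])  # displayed LSB-first as 11100000
offsets = [0, 3, 6, 11, 11]
data = b"onetwothree"

decoded = decode_utf8_array(validity, offsets, data, length=4)
```

With these inputs `decoded` is `['one', 'two', 'three', None]`, consistent with the `null_count: 1` reported in the output.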
diff --git a/python/bootstrap.py b/python/bootstrap.py
@@ -42,11 +42,13 @@ def generate_nanoarrow_pxd(self, file_in, file_out):
         # Replace NANOARROW_MAX_FIXED_BUFFERS with its value
         content = self.re_max_buffers.sub("3", content)
 
-        # Find types and function definitions
+        # Find typedefs, types, and function definitions
+        typedefs = self._find_typedefs(content)
         types = self._find_types(content)
         func_defs = self._find_func_defs(content)
 
         # Make corresponding cython definitions
+        typedefs_cython = [self._typdef_to_cython(t, "    ") for t in typedefs]
         types_cython = [self._type_to_cython(t, "    ") for t in types]
         func_defs_cython = [self._func_def_to_cython(d, "    ") for d in func_defs]
 
@@ -63,7 +65,6 @@ def generate_nanoarrow_pxd(self, file_in, file_out):
 
             # A few things we add in manually
             output.write(b"\n")
-            output.write(b"    ctypedef int ArrowErrorCode\n")
             output.write(b"    cdef int NANOARROW_OK\n")
             output.write(b"    cdef int NANOARROW_MAX_FIXED_BUFFERS\n")
             output.write(b"    cdef int ARROW_FLAG_DICTIONARY_ORDERED\n")
@@ -75,20 +76,26 @@ def generate_nanoarrow_pxd(self, file_in, file_out):
                 output.write(type.encode("UTF-8"))
                 output.write(b"\n\n")
 
+            for typedef in typedefs_cython:
+                output.write(typedef.encode("UTF-8"))
+                output.write(b"\n")
+            output.write(b"\n")
+
             for func_def in func_defs_cython:
                 output.write(func_def.encode("UTF-8"))
                 output.write(b"\n")
 
     def _define_regexes(self):
         self.re_comment = re.compile(r"\s*//[^\n]*")
         self.re_max_buffers = re.compile(r"NANOARROW_MAX_FIXED_BUFFERS")
+        self.re_typedef = re.compile(r"typedef(?P<typedef>[^;]+)")
         self.re_type = re.compile(
             r"(?P<type>struct|union|enum) (?P<name>Arrow[^ ]+) {(?P<body>[^}]*)}"
         )
         self.re_func_def = re.compile(
-            r"\n(static inline )?(?P<const>const )?(struct|enum )?"
+            r"\n(static inline )?(?P<const>const )?(struct |enum )?"
             r"(?P<return_type>[A-Za-z0-9_*]+) "
-            r"(?P<name>Arrow[A-Za-z]+)\((?P<args>[^\)]*)\);"
+            r"(?P<name>Arrow[A-Za-z0-9]+)\((?P<args>[^\)]*)\);"
         )
         self.re_tagged_type = re.compile(
             r"(?P<type>struct|union|enum) (?P<name>Arrow[A-Za-z]+)"
@@ -101,12 +108,20 @@ def _define_regexes(self):
     def _strip_comments(self, content):
         return self.re_comment.sub("", content)
 
+    def _find_typedefs(self, content):
+        return [m.groupdict() for m in self.re_typedef.finditer(content)]
+
     def _find_types(self, content):
         return [m.groupdict() for m in self.re_type.finditer(content)]
 
     def _find_func_defs(self, content):
         return [m.groupdict() for m in self.re_func_def.finditer(content)]
 
+    def _typdef_to_cython(self, t, indent=""):
+        typedef = t["typedef"]
+        typedef = self.re_tagged_type.sub(r"\2", typedef)
+        return f"{indent}ctypedef {typedef}"
+
     def _type_to_cython(self, t, indent=""):
         type = t["type"]
         name = t["name"]

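To see what the new typedef handling in `bootstrap.py` produces, the two regexes from the diff can be exercised on their own. This is a standalone sketch: the header fragment and the `ExampleCallback` name are made up for illustration, and `typedef_to_cython` is a free-function copy of the method added above rather than an import from the real script.

```python
import re

# Patterns copied from the diff above.
re_typedef = re.compile(r"typedef(?P<typedef>[^;]+)")
re_tagged_type = re.compile(r"(?P<type>struct|union|enum) (?P<name>Arrow[A-Za-z]+)")


def typedef_to_cython(t, indent=""):
    # Drop the struct/union/enum tag, keeping only the Arrow type name (group 2).
    typedef = re_tagged_type.sub(r"\2", t["typedef"])
    return f"{indent}ctypedef {typedef}"


# Hypothetical header fragment, for illustration only.
header = """
typedef int ArrowErrorCode;
typedef void (*ExampleCallback)(struct ArrowBuffer* buffer, int64_t size);
"""

typedefs = [m.groupdict() for m in re_typedef.finditer(header)]
lines = [typedef_to_cython(t) for t in typedefs]

for line in lines:
    print(line)
```

Because the captured group begins with the space that follows `typedef`, each generated line has two spaces after `ctypedef`; Cython treats the extra whitespace as insignificant.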
diff --git a/python/src/nanoarrow/__init__.py b/python/src/nanoarrow/__init__.py
@@ -28,9 +28,11 @@
 from nanoarrow.c_lib import (
     c_schema,
     c_array,
+    c_array_from_buffers,
     c_array_stream,
     c_schema_view,
     c_array_view,
+    c_buffer,
     allocate_c_schema,
     allocate_c_array,
     allocate_c_array_stream,
@@ -73,6 +75,7 @@
 )
 from nanoarrow._version import __version__  # noqa: F401
 
+# Helps Sphinx automatically populate an API reference section
 __all__ = [
     "Schema",
     "TimeUnit",
@@ -83,8 +86,10 @@
     "binary",
     "bool",
     "c_array",
+    "c_array_from_buffers",
     "c_array_stream",
     "c_array_view",
+    "c_buffer",
     "c_lib",
     "c_schema",
     "c_schema_view",