feat(python): Add array creation/building from buffers (#378)
The gist of this PR is that I'd like the ability to create arrays for
testing without pyarrow so that nanoarrow's tests can run in more
places. Other than building/running in odd corner-case environments,
nanoarrow in R has been great at prototyping and/or creating test data
(e.g., an array with a non-zero offset, an array with a rarely-used
type). This is useful for both nanoarrow to test itself and perhaps
others who might want to use nanoarrow in a similar way in Python.

This is a bit big...I did need to put all of it in one place to figure
out what the end point was; however, I'm happy to split into smaller
self-contained bits now that I know where I'm headed.

After this PR, we can create an array out-of-the-box from anything that
supports the buffer protocol. Importantly, this includes numpy arrays so
that you can do things like generate arrays with `n` random numbers.


```python
import nanoarrow as na
import numpy as np
```

```python
na.c_array_view(b"12345")
```




    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'uint8'
    - length: 5
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <uint8[5 b] 49 50 51 52 53>
    - dictionary: NULL
    - children[0]:


```python
na.c_array_view(np.array([1, 2, 3], np.int32))
```

```
<nanoarrow.c_lib.CArrayView>
- storage_type: 'int32'
- length: 3
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int32[12 b] 1 2 3>
- dictionary: NULL
- children[0]:
```
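
For readers less familiar with the buffer protocol, here is a minimal standard-library sketch (no nanoarrow or numpy involved) of the information an exporter hands to a consumer: a `struct`-style format string, an item size, and a total byte count. The variable names are illustrative only, and this is not how nanoarrow consumes buffers internally.

```python
import array

# array.array exports the buffer protocol, as do bytes and numpy arrays
values = array.array("i", [1, 2, 3])

# memoryview is the standard-library consumer of the buffer protocol
view = memoryview(values)

fmt = view.format         # struct-style format string for one element
itemsize = view.itemsize  # number of bytes per element
nbytes = view.nbytes      # total size of the exported memory

print(fmt, itemsize, nbytes, view.tolist())
```

On common platforms `"i"` is a 4-byte signed integer, so the three values occupy 12 bytes, which lines up with the `int32[12 b]` shown in the output above.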

While not built into the main `c_array()` constructor, we can also now
assemble an array from buffers. This has been very useful in R and
ensures that we can construct just about any array if we need to.


```python
array = na.c_array_from_buffers(
    na.struct([na.int32()]),
    length=3,
    buffers=[None],
    children=[
        na.c_array_from_buffers(
            na.int32(),
            length=3,
            buffers=[None, na.c_buffer([1, 2, 3], na.int32())]
        )
    ],
)

na.c_array_view(array)
```




    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'struct'
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[1]:
      - validity <bool[0 b] >
    - dictionary: NULL
    - children[1]:
      - <nanoarrow.c_lib.CArrayView>
        - storage_type: 'int32'
        - length: 3
        - offset: 0
        - null_count: 0
        - buffers[2]:
          - validity <bool[0 b] >
          - data <int32[12 b] 1 2 3>
        - dictionary: NULL
        - children[0]:



I also added the ability to construct a buffer from an iterable and
wired that into the `c_array()` constructor, although this is probably
not all that fast. It does, however, make it much easier to write tests
(because many of them currently start with
`na.c_array(pa.array([1, 2, 3]))`).


```python
na.c_array_view([1, 2, 3], na.int32())
```




    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int32'
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int32[12 b] 1 2 3>
    - dictionary: NULL
    - children[0]:



This allows creating an array from anything supported by the `struct`
module which means we can create some of the less frequently used types.


```python
na.c_buffer([1, 2, 3], na.float16())
```




    CBuffer(half_float[6 b] 1.0 2.0 3.0)




```python
na.c_buffer([(1, 2), (3, 4), (5, 6)], na.interval_day_time())
```




    CBuffer(interval_day_time[24 b] (1, 2) (3, 4) (5, 6))



Because it's mentally exhausting to bitpack buffers in my head and
because Arrow uses them all the time, I also think it's mission-critical
to be able to create bitmaps:


```python
na.c_buffer([True, False, True, True], na.bool())
```




    CBuffer(bool[1 b] 10110000)
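
To make the bit order concrete: Arrow bitmaps are packed least-significant bit first, so element 0 lands in bit 0 of the first byte. The sketch below uses only the standard library; `pack_bits()` and `format_bits()` are hypothetical helpers written for this illustration, not nanoarrow functions, and rendering each byte with element 0 on the left is an assumption inferred from the output above.

```python
def pack_bits(values):
    """Pack an iterable of booleans into bytes, least-significant bit first."""
    out = bytearray()
    for i, value in enumerate(values):
        if i % 8 == 0:
            out.append(0)  # start a new byte every 8 elements
        if value:
            out[-1] |= 1 << (i % 8)
    return bytes(out)


def format_bits(data):
    """Render each byte with element 0 on the left."""
    return " ".join(format(byte, "08b")[::-1] for byte in data)


packed = pack_bits([True, False, True, True])
print(packed.hex())         # 0d
print(format_bits(packed))  # 10110000
```

The unused trailing bits of the last byte are left as zeros here.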


This involved fixing some issues with the existing buffer view:

- The buffer view only ever saved a pointer to the device. This is a bit
of a problem because even though the CPU device is static and lives
forever, CUDA "device" objects will probably keep a CUDA context alive.
Thus, we need a strong reference to the `CDevice` Python object (which
ensures the underlying nanoarrow `Device*` remains valid).
- The buffer view only handled `BufferView` input where technically all
it needs is a pointer and a length. This opens it up to represent other
types of buffers than just something from nanoarrow (e.g., imported from
dlpack or buffer protocol).
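
Both points can be sketched with the standard library alone. `Device` and `BufferView` below are hypothetical stand-ins (not the classes added in this PR): the view stores only an address and a size, plus strong references to whatever keeps that address and the device valid.

```python
import ctypes
import weakref


class Device:
    """Hypothetical stand-in for a device object that owns some resource."""


class BufferView:
    """Hypothetical view: an address and a size, plus what keeps them valid."""

    def __init__(self, owner, device, addr, size_bytes):
        self._owner = owner    # strong reference keeps the memory alive
        self._device = device  # strong reference keeps the device alive
        self.addr = addr
        self.size_bytes = size_bytes

    def to_bytes(self):
        # Copy size_bytes bytes starting at addr
        return ctypes.string_at(self.addr, self.size_bytes)


source = bytearray(b"12345")
# A ctypes array sharing memory with source gives us a real address
c_source = (ctypes.c_char * len(source)).from_buffer(source)

device = Device()
device_alive = weakref.ref(device)
view = BufferView(c_source, device, ctypes.addressof(c_source), len(source))

# Dropping our own names does not invalidate anything the view needs
del device, c_source
print(device_alive() is not None, view.to_bytes())
```

If the view held only a raw pointer to the device, the `del` would have left it dangling.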

Implementing the buffer protocol as a consumer was done by wrapping the
`ArrowBuffer` with a "deallocator" that holds the `Py_buffer` and
ensures it is released. I still need to do some testing to ensure that
it's actually released and that we're not leaking memory. This is how I
do it in R and in geoarrow-c (Python) as well. Using the `ArrowBuffer`
is helpful because the C-level array builder uses them to manage the
memory and ensures they're all released when the array is released.
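
One way to observe whether an export was actually released, sketched with the standard library only: a `bytearray` refuses to resize while any buffer export is outstanding. `ReleasingHolder` is a hypothetical Python-level analogue of the deallocator described above, not the code in this PR.

```python
class ReleasingHolder:
    """Hypothetical analogue of a deallocator that owns an exported buffer."""

    def __init__(self, obj):
        # Taking a memoryview acquires an export, much like PyObject_GetBuffer()
        self._view = memoryview(obj)

    def release(self):
        # Much like PyBuffer_Release(); memoryview.release() may be called twice
        self._view.release()


source = bytearray(b"12345")
holder = ReleasingHolder(source)

try:
    source.append(54)  # resizing is refused while an export is outstanding
    resized_while_held = True
except BufferError:
    resized_while_held = False

holder.release()
source.append(54)  # succeeds once the export has been released

print(resized_while_held, bytes(source))
```

A test along these lines would fail with `BufferError` on the second `append()` if the release never happened.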

Implementing the build-from-iterable involved a few more things,
notably completing the "python struct format string" <-> "arrow data
type" conversion. This allows the use of `struct.pack()`, which takes
care of things like half-float conversion and packing tuples of day,
month, and nano values.
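
As a rough sketch of the idea (standard library only): once a type name maps to a `struct` format string, `struct` does the per-element conversion. The `FORMAT_FOR_TYPE` table and `pack_values()` are hypothetical and the format strings are assumptions for illustration, not nanoarrow's actual mapping.

```python
import struct

# Illustrative subset only; these format strings are assumptions
FORMAT_FOR_TYPE = {
    "int32": "i",
    "half_float": "e",
    "interval_day_time": "ii",         # (days, milliseconds)
    "interval_month_day_nano": "iiq",  # (months, days, nanoseconds)
}


def pack_values(type_name, values):
    """Pack an iterable of scalars or tuples into contiguous bytes."""
    # "=" selects standard sizes with no alignment padding
    packer = struct.Struct("=" + FORMAT_FOR_TYPE[type_name])
    out = bytearray()
    for value in values:
        # Wrap scalars so one code path handles scalars and tuples
        items = value if isinstance(value, tuple) else (value,)
        out += packer.pack(*items)
    return bytes(out)


halves = pack_values("half_float", [1.0, 2.0, 3.0])
intervals = pack_values("interval_day_time", [(1, 2), (3, 4), (5, 6)])
print(len(halves), len(intervals))  # 6 24
```

The byte counts agree with the `half_float[6 b]` and `interval_day_time[24 b]` buffers shown earlier.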

I'm aware this could use a bit better documentation of the added
classes/methods...I am assuming these will be internal for the time
being but they definitely need a bit more than is currently there.

---------

Co-authored-by: Joris Van den Bossche <[email protected]>
paleolimbot and jorisvandenbossche authored Feb 19, 2024
1 parent 4b6717f commit 841c845
Showing 17 changed files with 1,988 additions and 332 deletions.
49 changes: 14 additions & 35 deletions python/README.ipynb
@@ -118,7 +118,12 @@
"- storage_type: 'decimal128'\n",
"- decimal_bitwidth: 128\n",
"- decimal_precision: 10\n",
"- decimal_scale: 3"
"- decimal_scale: 3\n",
"- dictionary_ordered: False\n",
"- map_keys_sorted: False\n",
"- nullable: True\n",
"- storage_type_id: 24\n",
"- type_id: 24"
]
},
"execution_count": 3,
@@ -195,7 +200,7 @@
"- length: 4\n",
"- offset: 0\n",
"- null_count: 1\n",
"- buffers: (2939032895680, 2939032895616, 2939032895744)\n",
"- buffers: (3678035706048, 3678035705984, 3678035706112)\n",
"- dictionary: NULL\n",
"- children[0]:"
]
@@ -232,9 +237,9 @@
"- offset: 0\n",
"- null_count: 1\n",
"- buffers[3]:\n",
" - <bool validity[1 b] 11100000>\n",
" - <int32 data_offset[20 b] 0 3 6 11 11>\n",
" - <string data[11 b] b'onetwothree'>\n",
" - validity <bool[1 b] 11100000>\n",
" - data_offset <int32[20 b] 0 3 6 11 11>\n",
" - data <string[11 b] b'onetwothree'>\n",
"- dictionary: NULL\n",
"- children[0]:"
]
@@ -297,20 +302,7 @@
"data": {
"text/plain": [
"<nanoarrow.c_lib.CArrayStream>\n",
"- get_schema(): <nanoarrow.c_lib.CSchema struct>\n",
" - format: '+s'\n",
" - name: ''\n",
" - flags: 0\n",
" - metadata: NULL\n",
" - dictionary: NULL\n",
" - children[1]:\n",
" 'some_column': <nanoarrow.c_lib.CSchema int32>\n",
" - format: 'i'\n",
" - name: 'some_column'\n",
" - flags: 2\n",
" - metadata: NULL\n",
" - dictionary: NULL\n",
" - children[0]:"
"- get_schema(): struct<some_column: int32>"
]
},
"execution_count": 8,
@@ -343,7 +335,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"<nanoarrow.c_lib.CArray struct>\n",
"<nanoarrow.c_lib.CArray struct<some_column: int32>>\n",
"- length: 3\n",
"- offset: 0\n",
"- null_count: 0\n",
@@ -354,7 +346,7 @@
" - length: 3\n",
" - offset: 0\n",
" - null_count: 0\n",
" - buffers: (0, 2939033026688)\n",
" - buffers: (0, 3678035837056)\n",
" - dictionary: NULL\n",
" - children[0]:\n"
]
@@ -382,20 +374,7 @@
"data": {
"text/plain": [
"<nanoarrow.c_lib.CArrayStream>\n",
"- get_schema(): <nanoarrow.c_lib.CSchema struct>\n",
" - format: '+s'\n",
" - name: ''\n",
" - flags: 0\n",
" - metadata: NULL\n",
" - dictionary: NULL\n",
" - children[1]:\n",
" 'some_column': <nanoarrow.c_lib.CSchema int32>\n",
" - format: 'i'\n",
" - name: 'some_column'\n",
" - flags: 2\n",
" - metadata: NULL\n",
" - dictionary: NULL\n",
" - children[0]:"
"- get_schema(): struct<some_column: int32>"
]
},
"execution_count": 10,
47 changes: 13 additions & 34 deletions python/README.md
@@ -87,6 +87,11 @@ na.c_schema_view(schema)
- decimal_bitwidth: 128
- decimal_precision: 10
- decimal_scale: 3
- dictionary_ordered: False
- map_keys_sorted: False
- nullable: True
- storage_type_id: 24
- type_id: 24



@@ -131,7 +136,7 @@ array
- length: 4
- offset: 0
- null_count: 1
- buffers: (2939032895680, 2939032895616, 2939032895744)
- buffers: (3678035706048, 3678035705984, 3678035706112)
- dictionary: NULL
- children[0]:

@@ -153,9 +158,9 @@ na.c_array_view(array)
- offset: 0
- null_count: 1
- buffers[3]:
- <bool validity[1 b] 11100000>
- <int32 data_offset[20 b] 0 3 6 11 11>
- <string data[11 b] b'onetwothree'>
- validity <bool[1 b] 11100000>
- data_offset <int32[20 b] 0 3 6 11 11>
- data <string[11 b] b'onetwothree'>
- dictionary: NULL
- children[0]:

@@ -194,20 +199,7 @@ array_stream


<nanoarrow.c_lib.CArrayStream>
- get_schema(): <nanoarrow.c_lib.CSchema struct>
- format: '+s'
- name: ''
- flags: 0
- metadata: NULL
- dictionary: NULL
- children[1]:
'some_column': <nanoarrow.c_lib.CSchema int32>
- format: 'i'
- name: 'some_column'
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:
- get_schema(): struct<some_column: int32>



@@ -219,7 +211,7 @@ for array in array_stream:
print(array)
```

<nanoarrow.c_lib.CArray struct>
<nanoarrow.c_lib.CArray struct<some_column: int32>>
- length: 3
- offset: 0
- null_count: 0
@@ -230,7 +222,7 @@ for array in array_stream:
- length: 3
- offset: 0
- null_count: 0
- buffers: (0, 2939033026688)
- buffers: (0, 3678035837056)
- dictionary: NULL
- children[0]:

@@ -248,20 +240,7 @@ array_stream


<nanoarrow.c_lib.CArrayStream>
- get_schema(): <nanoarrow.c_lib.CSchema struct>
- format: '+s'
- name: ''
- flags: 0
- metadata: NULL
- dictionary: NULL
- children[1]:
'some_column': <nanoarrow.c_lib.CSchema int32>
- format: 'i'
- name: 'some_column'
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:
- get_schema(): struct<some_column: int32>



23 changes: 19 additions & 4 deletions python/bootstrap.py
@@ -42,11 +42,13 @@ def generate_nanoarrow_pxd(self, file_in, file_out):
# Replace NANOARROW_MAX_FIXED_BUFFERS with its value
content = self.re_max_buffers.sub("3", content)

# Find types and function definitions
# Find typedefs, types, and function definitions
typedefs = self._find_typedefs(content)
types = self._find_types(content)
func_defs = self._find_func_defs(content)

# Make corresponding cython definitions
typedefs_cython = [self._typdef_to_cython(t, " ") for t in typedefs]
types_cython = [self._type_to_cython(t, " ") for t in types]
func_defs_cython = [self._func_def_to_cython(d, " ") for d in func_defs]

@@ -63,7 +65,6 @@ def generate_nanoarrow_pxd(self, file_in, file_out):

# A few things we add in manually
output.write(b"\n")
output.write(b" ctypedef int ArrowErrorCode\n")
output.write(b" cdef int NANOARROW_OK\n")
output.write(b" cdef int NANOARROW_MAX_FIXED_BUFFERS\n")
output.write(b" cdef int ARROW_FLAG_DICTIONARY_ORDERED\n")
@@ -75,20 +76,26 @@ def generate_nanoarrow_pxd(self, file_in, file_out):
output.write(type.encode("UTF-8"))
output.write(b"\n\n")

for typedef in typedefs_cython:
output.write(typedef.encode("UTF-8"))
output.write(b"\n")
output.write(b"\n")

for func_def in func_defs_cython:
output.write(func_def.encode("UTF-8"))
output.write(b"\n")

def _define_regexes(self):
self.re_comment = re.compile(r"\s*//[^\n]*")
self.re_max_buffers = re.compile(r"NANOARROW_MAX_FIXED_BUFFERS")
self.re_typedef = re.compile(r"typedef(?P<typedef>[^;]+)")
self.re_type = re.compile(
r"(?P<type>struct|union|enum) (?P<name>Arrow[^ ]+) {(?P<body>[^}]*)}"
)
self.re_func_def = re.compile(
r"\n(static inline )?(?P<const>const )?(struct|enum )?"
r"\n(static inline )?(?P<const>const )?(struct |enum )?"
r"(?P<return_type>[A-Za-z0-9_*]+) "
r"(?P<name>Arrow[A-Za-z]+)\((?P<args>[^\)]*)\);"
r"(?P<name>Arrow[A-Za-z0-9]+)\((?P<args>[^\)]*)\);"
)
self.re_tagged_type = re.compile(
r"(?P<type>struct|union|enum) (?P<name>Arrow[A-Za-z]+)"
@@ -101,12 +108,20 @@ def _define_regexes(self):
def _strip_comments(self, content):
return self.re_comment.sub("", content)

def _find_typedefs(self, content):
return [m.groupdict() for m in self.re_typedef.finditer(content)]

def _find_types(self, content):
return [m.groupdict() for m in self.re_type.finditer(content)]

def _find_func_defs(self, content):
return [m.groupdict() for m in self.re_func_def.finditer(content)]

def _typdef_to_cython(self, t, indent=""):
typedef = t["typedef"]
typedef = self.re_tagged_type.sub(r"\2", typedef)
return f"{indent}ctypedef {typedef}"

def _type_to_cython(self, t, indent=""):
type = t["type"]
name = t["name"]
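
To see what the regex changes in `bootstrap.py` buy us, here is a small standalone sketch. The patterns are copied from the hunks above (for `re_tagged_type`, only the part visible in the diff); the header fragment is a simplified, partly hypothetical stand-in for `nanoarrow.h`, and `.strip()` is added here only for tidier output.

```python
import re

re_typedef = re.compile(r"typedef(?P<typedef>[^;]+)")
re_tagged_type = re.compile(r"(?P<type>struct|union|enum) (?P<name>Arrow[A-Za-z]+)")
old_func_def = re.compile(
    r"\n(static inline )?(?P<const>const )?(struct|enum )?"
    r"(?P<return_type>[A-Za-z0-9_*]+) "
    r"(?P<name>Arrow[A-Za-z]+)\((?P<args>[^\)]*)\);"
)
new_func_def = re.compile(
    r"\n(static inline )?(?P<const>const )?(struct |enum )?"
    r"(?P<return_type>[A-Za-z0-9_*]+) "
    r"(?P<name>Arrow[A-Za-z0-9]+)\((?P<args>[^\)]*)\);"
)

# Simplified stand-in for the header; the callback typedef is hypothetical
header = """
typedef int ArrowErrorCode;
typedef void (*ArrowHypotheticalCallback)(struct ArrowBuffer* buffer);
struct ArrowBufferAllocator ArrowBufferAllocatorDefault(void);
static inline void ArrowBitmapAppendInt8Unsafe(struct ArrowBitmap* bitmap);
"""

# typedefs: drop the struct/union/enum tag and emit a Cython ctypedef
ctypedefs = [
    "ctypedef " + re_tagged_type.sub(r"\2", m["typedef"]).strip()
    for m in re_typedef.finditer(header)
]

# function definitions: compare what the old and new patterns find
old_names = [m["name"] for m in old_func_def.finditer(header)]
new_names = [m["name"] for m in new_func_def.finditer(header)]

print(ctypedefs)
print(old_names)
print(new_names)
```

With this input the old pattern finds neither function: `(struct|enum )?` cannot consume the space after `struct`, and `Arrow[A-Za-z]+` stops at the digit in `Int8`. The new pattern finds both.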
5 changes: 5 additions & 0 deletions python/src/nanoarrow/__init__.py
@@ -28,9 +28,11 @@
from nanoarrow.c_lib import (
c_schema,
c_array,
c_array_from_buffers,
c_array_stream,
c_schema_view,
c_array_view,
c_buffer,
allocate_c_schema,
allocate_c_array,
allocate_c_array_stream,
@@ -73,6 +75,7 @@
)
from nanoarrow._version import __version__ # noqa: F401

# Helps Sphinx automatically populate an API reference section
__all__ = [
"Schema",
"TimeUnit",
@@ -83,8 +86,10 @@
"binary",
"bool",
"c_array",
"c_array_from_buffers",
"c_array_stream",
"c_array_view",
"c_buffer",
"c_lib",
"c_schema",
"c_schema_view",