feat(python): Add array creation/building from buffers (#378)
The gist of this PR is that I'd like the ability to create arrays for testing without pyarrow so that nanoarrow's tests can run in more places. Beyond building/running in odd corner-case environments, nanoarrow in R has been great for prototyping and/or creating test data (e.g., an array with a non-zero offset, an array with a rarely-used type). This is useful both for nanoarrow to test itself and perhaps for others who might want to use nanoarrow in a similar way in Python.

This is a bit big...I did need to put all of it in one place to figure out what the end point was; however, I'm happy to split it into smaller self-contained bits now that I know where I'm headed.

After this PR, we can create an array out-of-the-box from anything that supports the buffer protocol. Importantly, this includes numpy arrays so that you can do things like generate arrays with `n` random numbers.

```python
import nanoarrow as na
import numpy as np
```

```python
na.c_array_view(b"12345")
```

    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'uint8'
    - length: 5
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <uint8[5 b] 49 50 51 52 53>
    - dictionary: NULL
    - children[0]:

```python
na.c_array_view(np.array([1, 2, 3], np.int32))
```

```
<nanoarrow.c_lib.CArrayView>
- storage_type: 'int32'
- length: 3
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int32[12 b] 1 2 3>
- dictionary: NULL
- children[0]:
```

While not built into the main `c_array()` constructor, we can also now assemble an array from buffers. This has been very useful in R and ensures that we can construct just about any array if we need to.
```python
array = na.c_array_from_buffers(
    na.struct([na.int32()]),
    length=3,
    buffers=[None],
    children=[
        na.c_array_from_buffers(
            na.int32(),
            length=3,
            buffers=[None, na.c_buffer([1, 2, 3], na.int32())]
        )
    ],
)
na.c_array_view(array)
```

    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'struct'
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[1]:
      - validity <bool[0 b] >
    - dictionary: NULL
    - children[1]:
      - <nanoarrow.c_lib.CArrayView>
        - storage_type: 'int32'
        - length: 3
        - offset: 0
        - null_count: 0
        - buffers[2]:
          - validity <bool[0 b] >
          - data <int32[12 b] 1 2 3>
        - dictionary: NULL
        - children[0]:

I also added the ability to construct a buffer from an iterable and wired that into the `c_array()` constructor, although this is probably not all that fast. It does, however, make it much easier to write tests (because many of them currently start with `na.c_array(pa.array([1, 2, 3]))`).

```python
na.c_array_view([1, 2, 3], na.int32())
```

    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int32'
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int32[12 b] 1 2 3>
    - dictionary: NULL
    - children[0]:

This allows creating a buffer (and therefore an array) from anything supported by the `struct` module, which means we can create some of the less frequently used types.

```python
na.c_buffer([1, 2, 3], na.float16())
```

    CBuffer(half_float[6 b] 1.0 2.0 3.0)

```python
na.c_buffer([(1, 2), (3, 4), (5, 6)], na.interval_day_time())
```

    CBuffer(interval_day_time[24 b] (1, 2) (3, 4) (5, 6))

Because it's mentally exhausting to bitpack buffers in my head and because Arrow uses them all the time, I also think it's mission-critical to be able to create bitmaps:

```python
na.c_buffer([True, False, True, True], na.bool())
```

    CBuffer(bool[1 b] 10110000)

This involved fixing some issues with the existing buffer view:

- The buffer view only ever saved a pointer to the device.
  This is a bit of a problem because even though the CPU device is static and lives forever, CUDA "device" objects will probably keep a CUDA context alive. Thus, we need a strong reference to the `CDevice` Python object (which ensures the underlying nanoarrow `Device*` remains valid).
- The buffer view only handled `BufferView` input where technically all it needs is a pointer and a length. This opens it up to represent other types of buffers than just something from nanoarrow (e.g., imported from dlpack or buffer protocol).

Implementing the buffer protocol as a consumer was done by wrapping the `ArrowBuffer` with a "deallocator" that holds the `Py_buffer` and ensures it is released. I still need to do some testing to ensure that it's actually released and that we're not leaking memory. This is how I do it in R and in geoarrow-c (Python) as well. Using the `ArrowBuffer` is helpful because the C-level array builder uses them to manage the memory and ensures they're all released when the array is released.

Implementing the build-from-iterable involved a few more things...notably, completing the "python struct format string" <-> "arrow data type" conversion. This allows the use of `struct.pack()` which takes care of things like half-float conversion and tuples of day, month, nano conversion.

I'm aware this could use a bit better documentation of the added classes/methods...I am assuming these will be internal for the time being but they definitely need a bit more than is currently there.

---------

Co-authored-by: Joris Van den Bossche <[email protected]>
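
For readers who want to see the idea behind build-from-iterable without nanoarrow installed, here is a minimal standard-library sketch. It is not the code in this PR: `pack_values`, `pack_bitmap`, and `bits_lsb_first` are hypothetical helper names made up for illustration, and the sketch assumes little-endian layout with no padding. It shows how `struct` can produce the half-float and day/time interval bytes, and how booleans can be packed least-significant-bit first, which is the bit order Arrow uses for bitmaps.

```python
import struct


def pack_values(values, fmt):
    """Pack scalars or tuples into bytes (hypothetical helper, not nanoarrow API).

    `fmt` is the struct format for ONE element, e.g. "e" (half float)
    or "ii" (a pair of int32 values). "<" forces little-endian, no padding.
    """
    packer = struct.Struct("<" + fmt)
    out = bytearray()
    for value in values:
        # A tuple supplies one field per format character; a scalar is one field
        fields = value if isinstance(value, tuple) else (value,)
        out += packer.pack(*fields)
    return bytes(out)


def pack_bitmap(values):
    """Pack booleans into a bitmap, least-significant bit first (hypothetical helper)."""
    values = list(values)
    out = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def bits_lsb_first(data):
    """Render each byte with bit 0 on the left (hypothetical helper)."""
    return " ".join(format(byte, "08b")[::-1] for byte in data)


half_floats = pack_values([1.0, 2.0, 3.0], "e")
intervals = pack_values([(1, 2), (3, 4), (5, 6)], "ii")
bitmap = pack_bitmap([True, False, True, True])

print(len(half_floats))        # 6
print(len(intervals))          # 24
print(bits_lsb_first(bitmap))  # 10110000
```

The byte counts line up with the `half_float[6 b]` and `interval_day_time[24 b]` buffers shown earlier, and the rendered bitmap matches `10110000`. The stored byte itself is `0x0d`, because the first element lands in the lowest bit.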