Commit
feat(python): Add CArrayView -> Python conversion (#391)
This PR adds a framework for Python object creation from arrays and array streams, with implementations for most Arrow types. Notably, it includes implementations for nested types (struct, list, dictionary) to make sure that the framework won't have to be completely rewritten to accommodate them. A few types (decimal, datetime) aren't supported but should be reasonably easy to implement by wrapping existing iterator factories included in this PR.

None of these are exposed with `import nanoarrow as na` yet. I anticipate that the user-facing `nanoarrow.Array` and/or `nanoarrow.ArrayStream` will use the implementation here in their methods.

A few changes were required at a lower level to make this work:

- It is now possible to use nanoarrow's `ArrowBasicArrayStream` implementation to create a stream from a previously-resolved list of arrays. This makes it easier to test streams, since before we had no way to create them.
- The constructor for `c_array_stream()` now falls back on `c_array()` by wrapping it in a length-one stream. This makes it easier to write generic code that takes stream-like input (like the iterator).
- The `ArrowLayout` needed to be exposed to implement the fixed-size list implementation.
- I added tests for all the lower-level changes, which I did in dedicated files. Some of these tests overlap with existing tests in test_nanoarrow. At some point we should go through test_nanoarrow and separate the tests (or create an integration test section, since many of those early tests assumed pyarrow was available).

The implementation seems to be efficient given the constraint that assembling the iterators is currently done using Python code.

```python
import numpy as np
import pyarrow as pa
from nanoarrow import iterator

n = int(1e6)
n_cols = 10
arrays = [np.random.random(n) for _ in range(n_cols)]
batch = pa.record_batch(
    arrays,
    names=[f"col{i}" for i in range(n_cols)]
)

%timeit list(iterator.itertuples(batch))
#> 256 ms ± 4.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Just zipping the arrays
%timeit list(zip(*arrays))
#> 335 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# A few ways to do this from pyarrow
%timeit list(zip(*(col.to_pylist() for col in batch.columns)))
#> 1.99 s ± 52.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(zip(*(col.to_numpy() for col in batch.columns)))
#> 315 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Works if all columns are the same type (but rows are arrays, not tuples)
%timeit list(np.array(batch))
#> 131 ms ± 484 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Test some nested things
n = int(1e4)
n_cols = 10
big_list = [["a", "b", "c", "d", "e"]] * n
arrays = [big_list for _ in range(n_cols)]
batch = pa.record_batch(
    arrays,
    names=[f"col{i}" for i in range(n_cols)]
)

%timeit list(iterator.itertuples(batch))
#> 89.2 ms ± 756 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(zip(*(col.to_pylist() for col in batch.columns)))
#> 288 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
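The length-one stream fallback mentioned in the commit message can be illustrated without nanoarrow at all. The following is a minimal pure-Python sketch of the idea, not the code in this PR: `Batch`, `BatchStream`, `as_stream()`, and `iter_row_tuples()` are hypothetical stand-ins invented for illustration, and plain Python lists take the place of Arrow buffers.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple, Union


@dataclass
class Batch:
    """Hypothetical stand-in for a struct array: equal-length columns."""
    columns: List[Sequence]


@dataclass
class BatchStream:
    """Hypothetical stand-in for an array stream: a list of batches."""
    batches: List[Batch]


def as_stream(obj: Union[Batch, BatchStream]) -> BatchStream:
    # Same idea as the fallback described above: a single array is
    # wrapped in a stream that contains exactly one array.
    if isinstance(obj, BatchStream):
        return obj
    if isinstance(obj, Batch):
        return BatchStream([obj])
    raise TypeError(f"Can't convert {type(obj).__name__} to a stream")


def iter_row_tuples(obj: Union[Batch, BatchStream]) -> Iterator[Tuple]:
    # Generic code only has to handle the stream case.
    for batch in as_stream(obj).batches:
        # Zip the per-column sequences into one tuple per row.
        yield from zip(*batch.columns)


batch = Batch([[1, 2, 3], ["a", "b", "c"]])
rows_from_array = list(iter_row_tuples(batch))
rows_from_stream = list(iter_row_tuples(BatchStream([batch, batch])))
```

Because the array-to-stream conversion lives in one place, the row iterator needs no separate code path for single arrays, which is the convenience the commit message attributes to the `c_array_stream()` fallback.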
1 parent 7e601cc, commit 7cf50a3
Showing 11 changed files with 1,231 additions and 45 deletions.