refactor(r): Refactor ArrowArray(Stream) -> R Vector conversion #392

paleolimbot · 2024-02-26T21:58:58Z

The initial version of converting an ArrowArray (or a stream of them) to R is implemented in C and is difficult to understand. Beyond the verbosity of the C, the dispatch portion is implemented almost completely twice (once for a single array, once for an array stream). It is currently at the point where it is difficult for me, let alone an external contributor, to add features or fix bugs. Time to refactor!

This approach uses C++ classes and virtual method dispatch to handle the different types of vector conversions. This is similar to how the arrow R package does it, except that the arrow R package relies on the Arrow C++ converter infrastructure and heavy templating for dispatch. Here we use a switch() and eat the cost of a per-batch and per-column virtual method call.
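To make the shape of that design concrete, here is a minimal sketch of "one switch() to pick a builder, then virtual calls per batch". Every name in it (`VectorType`, `Batch`, `VctrBuilder`, `IntBuilder`, `DblBuilder`, `MakeBuilder`, `ConvertStream`, `ConvertArray`) is hypothetical and chosen for illustration; it does not use the nanoarrow C API or the R API, and the real builder classes in this PR may look quite different.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for "which R vector do we want" and "one ArrowArray".
enum class VectorType { kInt, kDbl };

struct Batch {
  std::vector<int64_t> values;  // assumption: every batch carries int64 values
};

// Base class: one virtual call per batch (and per column in a real converter).
class VctrBuilder {
 public:
  virtual ~VctrBuilder() = default;
  virtual void PushNext(const Batch& batch) = 0;
  virtual std::size_t size() const = 0;
};

class IntBuilder : public VctrBuilder {
 public:
  void PushNext(const Batch& batch) override {
    for (int64_t v : batch.values) {
      // Refuse a lossy conversion instead of silently truncating.
      if (v < std::numeric_limits<int32_t>::min() ||
          v > std::numeric_limits<int32_t>::max()) {
        throw std::range_error("value does not fit in a 32-bit integer");
      }
      out_.push_back(static_cast<int32_t>(v));
    }
  }
  std::size_t size() const override { return out_.size(); }

 private:
  std::vector<int32_t> out_;
};

class DblBuilder : public VctrBuilder {
 public:
  void PushNext(const Batch& batch) override {
    for (int64_t v : batch.values) out_.push_back(static_cast<double>(v));
  }
  std::size_t size() const override { return out_.size(); }

 private:
  std::vector<double> out_;
};

// The only place that dispatches on type: a plain switch().
std::unique_ptr<VctrBuilder> MakeBuilder(VectorType type) {
  switch (type) {
    case VectorType::kInt:
      return std::make_unique<IntBuilder>();
    case VectorType::kDbl:
      return std::make_unique<DblBuilder>();
  }
  throw std::invalid_argument("unsupported vector type");
}

// Stream conversion: returns the number of elements converted.
std::size_t ConvertStream(VectorType type, const std::vector<Batch>& batches) {
  std::unique_ptr<VctrBuilder> builder = MakeBuilder(type);
  for (const Batch& batch : batches) builder->PushNext(batch);
  return builder->size();
}

// A single array is just a stream of length one, so dispatch is written once.
std::size_t ConvertArray(VectorType type, const Batch& batch) {
  return ConvertStream(type, std::vector<Batch>{batch});
}
```

In this sketch the single-array path reuses the stream path, which is one way to avoid writing the dispatch twice; whether the PR ends up structured this way is not stated above.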

Work in progress!

codecov-commenter · 2024-02-27T00:53:29Z

Codecov Report

Attention: Patch coverage is 35.44474%, with 479 lines in your changes missing coverage. Please review.

Project coverage is 86.09%. Comparing base (c7a1236) to head (2afc7b8).

| Files | Patch % | Lines |
|---|---|---|
| r/src/vctr_builder_int.h | 2.98% | 65 Missing ⚠️ |
| r/src/vctr_builder_int64.h | 0.00% | 59 Missing ⚠️ |
| r/src/vctr_builder_dbl.h | 10.52% | 51 Missing ⚠️ |
| r/src/vctr_builder.cc | 73.21% | 45 Missing ⚠️ |
| r/src/vctr_builder_difftime.h | 37.50% | 40 Missing ⚠️ |
| r/src/vctr_builder_other.h | 11.11% | 40 Missing ⚠️ |
| r/src/vctr_builder_base.h | 30.35% | 39 Missing ⚠️ |
| r/src/vctr_builder_lgl.h | 5.26% | 36 Missing ⚠️ |
| r/src/vctr_builder_chr.h | 5.88% | 32 Missing ⚠️ |
| r/src/vctr_builder_blob.h | 4.00% | 24 Missing ⚠️ |

... and 5 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #392      +/-   ##
==========================================
- Coverage   88.74%   86.09%   -2.66%     
==========================================
  Files          81       97      +16     
  Lines       14398    15091     +693     
==========================================
+ Hits        12778    12993     +215     
- Misses       1620     2098     +478


…461)

This PR adds the `nanoarrow_vctr`, which is an R translation of the Python `Array` class in nanoarrow's Python bindings. It is implemented like an R `factor()` in the sense that under the hood it is a sequence of integers (`0:(array$length - 1)` at the beginning) with attributes that give those integers context.

This is implemented in such a way that it is "tacked on" to the existing conversions. The existing conversions do need a refactoring (#392), but that is a heavy change for this point in the release cycle. The only change needed to the existing conversion was a slight refactor of the "consume array stream" code so that it correctly gives each array in the stream its own R object to manage its lifecycle (before, each array was "materialized" and then immediately released because no previous conversion code required an ArrowArray to live beyond the conversion).

The motivation for this change is converting GeoArrow extension types. In the geoarrow package, we implement an efficient conversion from a stream of arrays to various types of R-spatial objects (e.g., sf); however, we really don't want to invoke the default conversion for those types because they have awful performance (e.g., the multipolygon would be a `list(list(list(data.frame)))`) and there's no need to invoke that number of R object conversions between the initial state (an arrow array) and the final state (an sfc column). The nanoarrow_vctr allows something like:

```r
df <- convert_array(some_array_containing_a_geoarrow_col)
st_as_sfc(df$geometry) # or s2::as_s2_geography(df$geometry), or something else
```

A side-effect of this change is that we have an escape hatch for conversions that are lossy or contain types with no R equivalent.

A quick demo:

``` r
library(nanoarrow)

arrays <- lapply(
  list(1:5, 6:10, 11:13),
  as_nanoarrow_array
)

# A vctr can be created from any stream
(vctr <- as_nanoarrow_vctr(basic_array_stream(arrays)))
#> <nanoarrow_vctr int32[13]>
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13

# Under the hood this is something like a factor() where levels are
# a list of arrays with cached offsets. This is like an Arrow ChunkedArray
str(vctr)
#> <nanoarrow_vctr int32[13]>
#> List of 3
#> $ :<nanoarrow_array int32[5]>
#> ..$ length : int 5
#> ..$ null_count: int 0
#> ..$ offset : int 0
#> ..$ buffers :List of 2
#> .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `1 2 3 4 5`
#> ..$ dictionary: NULL
#> ..$ children : list()
#> $ :<nanoarrow_array int32[5]>
#> ..$ length : int 5
#> ..$ null_count: int 0
#> ..$ offset : int 0
#> ..$ buffers :List of 2
#> .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `6 7 8 9 10`
#> ..$ dictionary: NULL
#> ..$ children : list()
#> $ :<nanoarrow_array int32[3]>
#> ..$ length : int 3
#> ..$ null_count: int 0
#> ..$ offset : int 0
#> ..$ buffers :List of 2
#> .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. ..$ :<nanoarrow_buffer data<int32>[3][12 b]> `11 12 13`
#> ..$ dictionary: NULL
#> ..$ children : list()

# vctrs can be sliced:
head(vctr)
#> <nanoarrow_vctr int32[6]>
#> [1] 1 2 3 4 5 6

# ...and can live in a data.frame
head(tibble::tibble(x = vctr))
#> # A tibble: 6 × 1
#> x
#> <nnrrw_vc>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6

# They can be used as zero-copy conversion targets
array <- as_nanoarrow_array(1:5)
convert_array(array, nanoarrow_vctr())
#> <nanoarrow_vctr int32[5]>
#> [1] 1 2 3 4 5

# ...also works in a nested ptype
array <- as_nanoarrow_array(data.frame(x = 1:5))
convert_array(array, tibble::tibble(x = nanoarrow_vctr()))
#> # A tibble: 5 × 1
#> x
#> <nnrrw_vc>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5

# For nested list output, it will give a slice of the original array for
# each list item
array <- as_nanoarrow_array(
  list(1:5, 6:10, NULL, 11:13),
  schema = na_list(na_int32())
)

(lst_of <- convert_array(array, vctrs::list_of(nanoarrow_vctr())))
#> <list_of<nanoarrow_vctr>[4]>
#> [[1]]
#> <nanoarrow_vctr int32[5]>
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> <nanoarrow_vctr int32[5]>
#> [1] 6 7 8 9 10
#>
#> [[3]]
#> NULL
#>
#> [[4]]
#> <nanoarrow_vctr int32[3]>
#> [1] 11 12 13

for (i in seq_along(lst_of)) {
  array <- attr(lst_of[[i]], "chunks")[[1]]
  cat(sprintf("offset: %d, length: %d\n", array$offset, array$length))
}
#> offset: 0, length: 5
#> offset: 5, length: 5
#> offset: 10, length: 3
```

<sup>Created on 2024-05-10 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup>
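
The "integers plus a list of chunks with cached offsets" layout described in the commit message above can be illustrated with a small C++ sketch of how a logical index might be mapped back to a chunk. This is only an assumption about one reasonable approach (a binary search over cumulative offsets); `CumulativeOffsets` and `ResolveChunk` are hypothetical names, not functions from nanoarrow.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Build cached offsets from chunk lengths, e.g. {5, 5, 3} -> {0, 5, 10, 13}.
// The result always has one more element than there are chunks.
std::vector<int64_t> CumulativeOffsets(const std::vector<int64_t>& chunk_lengths) {
  std::vector<int64_t> offsets{0};
  for (int64_t len : chunk_lengths) offsets.push_back(offsets.back() + len);
  return offsets;
}

// Map a zero-based logical index to (chunk index, index within that chunk).
std::pair<int64_t, int64_t> ResolveChunk(const std::vector<int64_t>& offsets,
                                         int64_t i) {
  if (offsets.empty() || i < 0 || i >= offsets.back()) {
    throw std::out_of_range("index is outside the vctr");
  }
  // First offset strictly greater than i; the chunk containing i is the one
  // just before it. Zero-length chunks are skipped naturally.
  auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
  int64_t chunk = static_cast<int64_t>(it - offsets.begin()) - 1;
  return {chunk, i - offsets[chunk]};
}
```

With the three chunks from the demo (lengths 5, 5 and 3), logical index 12 resolves to the last element of the third chunk.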

paleolimbot mentioned this pull request Mar 4, 2024

Disturbed layout in subgraph igraph/rigraph#775

Open

paleolimbot force-pushed the r-convert-refactor branch from b44feef to 6598171 Compare March 14, 2024 20:38

paleolimbot force-pushed the r-convert-refactor branch from 2afc7b8 to a749d8d Compare April 15, 2024 18:54

paleolimbot mentioned this pull request May 10, 2024

feat(r): Add experimental nanoarrow_vctr to wrap a list of arrays #461

Merged

paleolimbot mentioned this pull request Jun 10, 2024

[R] Convert arrow dictionary to R factor via as.data.frame.nanoarrow_array_stream()? #513

Open

paleolimbot added 23 commits July 4, 2024 04:40

rename nanoarow_cpp to reflect purpose

7a90fe1

start the vctr builder

6605e7b

sketch

d145da8

clean up dispatch

4a1983f

a little bit more dispatch

e1b6e42

maybe fix compile

b1e5d90

some dispatch

2bec5d3

start migrating infer_ptype

bbf46d7

with data.frame

3de01c7

with passing tests for infer

d28efa7

remove ptype bits

42b1b85

start to split out builder classes

15f2e17

split out classes into smaller files

04f56cd

first conversion

5e7df94

add an array view

c1e20ac

maybe have conversion errors

78d3d81

maybe some actual conversions

7e35f5d

fix can't convert

d269d8b

working for integers

a931c5b

shuffle

9e38a65

add double impl

2688e28

get logical conversion ported

24d301b

port int64 to new class

1782799

paleolimbot added 13 commits July 4, 2024 04:43

fix lossy convert

e907d6b

format

14b9ef5

wire up chr

86a681d

wire up blob

c9b8720

wire up date converter

cb88afe

wire up more converters

f439865

wire up posixct

90c8a0d

start on the call-into-r

dcd08a9

always route extension types and dictionaries through the Other route

a5a9b6f

sketch "other" support

25b7b88

pass optional ownership info

873fd8e

prototype method

16225e2

fix init

15b38b6

paleolimbot force-pushed the r-convert-refactor branch from a749d8d to 15b38b6 Compare July 4, 2024 07:44
