You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This PR adds bindings to nanoarrow's IPC reader from R. The entrypoint
for a user is `read_nanoarrow()`, which accepts raw vectors,
connections, and file paths (thin wrapper around connections). It also
fixes a number of compiler warnings in the IPC extension.
The implementation is not particularly complicated from the R side, but
the main drawback of adding IPC support is that the flatbuffers
implementation (flatcc) actively does not care about gcc compiler
warnings (whereas CRAN actively cares about them).
These are all slower than the arrow package, which has more tools at its
disposal to prevent copies.
``` r
library(arrow, warn.conflicts = FALSE)
library(nanoarrow)
# Basic read example
tf <- tempfile()
write_ipc_stream(dplyr::starwars, tf)
read_nanoarrow(tf) |> tibble::as_tibble()
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
#> 4 Darth V… 202 136 none white yellow 41.9 male mascu…
#> 5 Leia Or… 150 49 brown light brown 19 fema… femin…
#> 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
#> 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
#> 8 R5-D4 97 32 <NA> white, red red NA none mascu…
#> 9 Biggs D… 183 84 black light brown 24 male mascu…
#> 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list<chr>>,
#> # vehicles <list<chr>>, starships <list<chr>>
df_bigish <- nanoarrow:::vec_gen(data.frame(x = character()), n = 1e6)
write_ipc_stream(df_bigish, tf)
# Wrapper because mmap is apparently not passed through from read_ipc_stream()
# and this is pretty significant
read_ipc_stream_wrap <- function(f, ..., mmap) {
arrow::read_ipc_stream(
arrow:::make_readable_file(f, mmap = mmap, random_access = FALSE),
...
)
}
tf_raw <- brio::read_file_raw(tf)
# Slower than arrow for raw vector input because of C implementation,
# which doesn't currently share the global buffer (just shares buffers
# between columns within a single batch)
bench::mark(
nanoarrow = read_nanoarrow(tf_raw) |> collect_array_stream(),
arrow = read_ipc_stream(buffer(tf_raw), as_data_frame = FALSE),
check = FALSE
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 nanoarrow 1.27ms 1.84ms 439. 41.5KB 2.18
#> 2 arrow 509.26µs 528.65µs 1821. 3.6MB 79.3
# Slower than arrow, maybe because of C implementation, but definitely
# because it uses base::readBin() which necessiates an extra copy
bench::mark(
nanoarrow = read_nanoarrow(tf) |> collect_array_stream(),
arrow_mmap = read_ipc_stream_wrap(tf, mmap = TRUE, as_data_frame = FALSE),
arrow = read_ipc_stream_wrap(tf, mmap = FALSE, as_data_frame = FALSE),
check = FALSE
)
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 nanoarrow 5.18ms 5.66ms 174. 16.2MB 189.
#> 2 arrow_mmap 613.48µs 640.5µs 1526. 528.6KB 13.9
#> 3 arrow 2.18ms 2.84ms 339. 551.6KB 4.06
```
<sup>Created on 2024-02-19 with [reprex
v2.0.2](https://reprex.tidyverse.org)</sup>
0 commit comments