[Website]: Blog post about arrow-avro #712
base: main
Conversation
Preview URL: https://jecsand838.github.io/arrow-site If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview.
Amazing! Thank you @jecsand838 -- I will try and review this over the next day or two
This looks great @jecsand838 -- thank you 🙏
It will be a great announcement post. I left some small comments but nothing I think is required.
Please feel free to mark the PR ready for review when you think it is ready and I can give it another look.
`arrow-avro` is a Rust crate that reads and writes [Apache Avro](https://avro.apache.org/) data directly as Arrow `RecordBatch`es. It supports Avro Object Container Files (OCF), Single‑Object Encoding, and the Confluent Schema Registry wire format, with projection/evolution, tunable batch sizing, and an optional `StringViewArray` for faster strings. Its vectorized design reduces copies and cache misses, making both batch (files) and streaming (Kafka) pipelines simpler and faster.
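The OCF read path can be sketched in a few lines. This is an illustrative sketch rather than verbatim crate documentation: it assumes the `ReaderBuilder::new().build(..)` entry point described in the `arrow-avro` API docs, and the file name `events.avro` is hypothetical.

```rust
use std::fs::File;
use std::io::BufReader;

use arrow_avro::reader::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open an Avro Object Container File (hypothetical path).
    let file = File::open("events.avro")?;

    // Build a Reader that yields Arrow RecordBatches directly,
    // with no intermediate row-oriented materialization.
    let reader = ReaderBuilder::new().build(BufReader::new(file))?;

    for batch in reader {
        let batch = batch?;
        println!("read {} rows, {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}
```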
I think it might help to use the full name / link when first introducing Avro as not everyone might be familiar with it
Something like "Apache Avro" and link to https://avro.apache.org/
## Motivation
As a row‑oriented format, Avro is optimized for encoding one record at a time, while Apache Arrow is columnar, optimized for vectorized analytics. When Avro data is decoded record‑by‑record and then materialized into Arrow arrays, systems pay for extra allocations, branches, and cache‑unfriendly memory access (exactly the overhead Arrow's design tries to avoid). One example of a challenge resulting from this can be found in [DataFusion's Avro Datasource](https://github.com/apache/datafusion/tree/main/datafusion/datasource-avro). This row‑to‑column impedance mismatch caused by decoding Avro into Arrow shows up as unnecessary work in hot paths.
I wonder if we can also motivate the work with some explanation of the popularity of Avro (e.g. all the data written in Kafka, for example)
@alamb I agree, that's a good idea.
What do you think of this section?
### Why this matters
Apache Avro is a first‑class format across stream processors and cloud services:
- Confluent Schema Registry supports Avro across multiple languages and tooling.
- Apache Flink exposes an `avro-confluent` format for Kafka.
- AWS Lambda (June 2025) added native handling for Avro‑formatted Kafka events with Glue and Confluent Schema Registry integrations.
- Azure Event Hubs provides a Schema Registry with Avro support for Kafka‑compatible clients.
In short: Arrow users encounter Avro both on disk (OCF) and on the wire (Kafka). An Arrow‑first, vectorized reader/writer for OCF, Single‑Object, and Confluent framing removes a pervasive bottleneck and keeps pipelines columnar end‑to‑end.
This work is part of the ongoing arrow‑rs effort to implement first‑class Avro support in Rust. We'd love your feedback on real‑world use cases, workloads, and integrations. We also welcome contributions, whether that's issues, benchmarks, or PRs. To follow along or help, open an [issue on GitHub](https://github.com/apache/arrow-rs/issues) and/or track [Add Avro Support](https://github.com/apache/arrow-rs/issues/4886) in `apache/arrow-rs`.
If you have any questions about this blog post, please feel free to contact the author, [Connor Sanders](mailto:[email protected]).
If it is appropriate, it might also be worth adding an acknowledgment here for any support you may have had -- e.g. acknowledge Elastiflow, for example. The effort you and Nathaniel have put into this undertaking is pretty amazing.
That's a great callout, I'll make sure to add that in.
I just pushed up an Acknowledgments section. I 100% agree! Thank you for pointing this out.
Configuration is intentionally minimal but practical. For instance, the `ReaderBuilder` exposes knobs covering both batch file ingestion and streaming systems without forcing format‑specific code paths.
## Architecture & Technical Overview
At a high level, [`arrow-avro`](https://arrow.apache.org/rust/arrow_avro/index.html) splits cleanly into read and write paths built around Arrow `RecordBatch`es. The read side turns Avro (OCF files or framed byte streams) into Arrow arrays in batches, while the write side takes Arrow batches and produces OCF files or streaming frames. When you build an `AvroStreamWriter`, the framing (SOE or Confluent) is part of the stream output based on the configured fingerprint strategy; no separate framing step is required. The public API and module layout are intentionally small, so most applications only touch a builder, a reader/decoder, and (optionally) a schema store for schema evolution while streaming.
On the [read](https://arrow.apache.org/rust/arrow_avro/reader/index.html) path, everything starts with [`ReaderBuilder`](https://arrow.apache.org/rust/arrow_avro/reader/struct.ReaderBuilder.html). From a single builder you can create a [`Reader`](https://arrow.apache.org/rust/arrow_avro/reader/struct.Reader.html) for Object Container Files (OCF) or a streaming [`Decoder`](https://arrow.apache.org/rust/arrow_avro/reader/struct.Decoder.html) for Single‑Object/Confluent frames. The `Reader` pulls OCF blocks and yields Arrow `RecordBatch`es while the `Decoder` is push‑based, i.e. you feed bytes as they arrive and then call `flush` to drain completed batches. This design lets the same decode plan serve file and streaming use cases with minimal branching.
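A minimal sketch of that push-based flow, assuming `build_decoder`, `decode`, and `flush` as named in the crate's API docs (in a real pipeline the builder would also be configured with a schema store for the incoming writer schemas, which is elided here):

```rust
use arrow_avro::reader::ReaderBuilder;

// Push framed bytes (Single-Object or Confluent) into a Decoder and
// count the rows drained from completed RecordBatches.
fn drain_stream(frames: &[&[u8]]) -> Result<usize, Box<dyn std::error::Error>> {
    let mut decoder = ReaderBuilder::new().build_decoder()?;
    let mut rows = 0;
    for frame in frames {
        // Feed bytes as they arrive, e.g. from a Kafka consumer...
        decoder.decode(frame)?;
        // ...then call flush to drain any completed batches.
        while let Some(batch) = decoder.flush()? {
            rows += batch.num_rows();
        }
    }
    Ok(rows)
}
```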
It might help to describe what a "decode plan" is -- does this mean "decode state machine" or something?
I cleaned this up. The wording I chose for explaining how the same underlying decoder logic is shared between file and streaming use cases was poor.
This section compares those styles qualitatively and with medians from the Criterion benchmark runs that produced the violin plots below.
### Read performance (1M)
I know! I spun my wheels on this for a bit. I'll do some research and fix this tonight / tomorrow, even if it comes down to me remaking the Violin plots myself. I 100% agree with you.
I used an SVG editor to increase the font size and bolded the y axis labels. I think this helps quite a bit.
Let me know if you think it needs more though.
### Benchmark Median Time Results (Apple Silicon Mac)
| Case | apache-avro median | arrow-avro median | speedup | |
Co-authored-by: Andrew Lamb <[email protected]>
Part of the work to add first-class Avro support to arrow-rs is to tell people about it:
Closes: apache/arrow-rs#8428
Part of apache/arrow-rs#4886
@alamb Here's my first pass at the blog post. Sorry about it taking a bit longer than anticipated. Let me know what you think and I'm 100% down to collaborate on this. 😃