[Website]: Blog post about arrow-avro #712
base: main
Conversation
Preview URL: https://jecsand838.github.io/arrow-site If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview.
Amazing! Thank you @jecsand838 -- I will try and review this over the next day or two
This looks great @jecsand838 -- thank you 🙏
It will be a great announcement post. I left some small comments but nothing I think is required.
Please feel free to mark the PR ready for review when you think it is ready and I can give it another look.
`arrow-avro` is a Rust crate that reads and writes [Apache Avro](https://avro.apache.org/) data directly as Arrow `RecordBatch`es. It supports Avro Object Container Files (OCF), Single‑Object Encoding, and the Confluent Schema Registry wire format, with projection/evolution, tunable batch sizing, and an optional `StringViewArray` for faster strings. Its vectorized design reduces copies and cache misses, making both batch (files) and streaming (Kafka) pipelines simpler and faster.
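The OCF read path can be sketched in a few lines. This is an illustrative sketch rather than verbatim crate documentation: it assumes the `ReaderBuilder::new().build(..)` entry point described in the `arrow-avro` API docs, and the file name `events.avro` is hypothetical.

```rust
use std::fs::File;
use std::io::BufReader;

use arrow_avro::reader::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open an Avro Object Container File (hypothetical path).
    let file = File::open("events.avro")?;

    // Build a Reader that yields Arrow RecordBatches directly,
    // with no intermediate row-oriented materialization.
    let reader = ReaderBuilder::new().build(BufReader::new(file))?;

    for batch in reader {
        let batch = batch?;
        println!("read {} rows, {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}
```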
I think it might help to use the full name / link when first introducing Avro as not everyone might be familiar with it
Something like "Apache Avro" and link to https://avro.apache.org/
## Motivation
As a row‑oriented format, Avro is optimized for encoding one record at a time, while Apache Arrow is columnar, optimized for vectorized analytics. When Avro data is decoded record‑by‑record and then materialized into Arrow arrays, systems pay for extra allocations, branches, and cache‑unfriendly memory access (exactly the overhead Arrow's design tries to avoid). One example of a challenge resulting from this can be found in [DataFusion's Avro Datasource](https://github.com/apache/datafusion/tree/main/datafusion/datasource-avro). This row‑to‑column impedance mismatch caused by decoding Avro into Arrow shows up as unnecessary work in hot paths.
I wonder if we can also motivate the work with some explanation of the popularity of Avro (e.g. all the data written in Kafka, for example)
@alamb I agree, that's a good idea.
What do you think of this section?
### Why this matters
Apache Avro is a first‑class format across stream processors and cloud services:
- Confluent Schema Registry supports Avro across multiple languages and tooling.
- Apache Flink exposes an `avro-confluent` format for Kafka.
- AWS Lambda (June 2025) added native handling for Avro‑formatted Kafka events with Glue and Confluent Schema Registry integrations.
- Azure Event Hubs provides a Schema Registry with Avro support for Kafka‑compatible clients.
In short: Arrow users encounter Avro both on disk (OCF) and on the wire (Kafka). An Arrow‑first, vectorized reader/writer for OCF, Single‑Object, and Confluent framing removes a pervasive bottleneck and keeps pipelines columnar end‑to‑end.
This work is part of the ongoing arrow‑rs effort to implement first‑class Avro support in Rust. We'd love your feedback on real‑world use cases, workloads, and integrations. We also welcome contributions, whether that's issues, benchmarks, or PRs. To follow along or help, open an [issue on GitHub](https://github.com/apache/arrow-rs/issues) and/or track [Add Avro Support](https://github.com/apache/arrow-rs/issues/4886) in `apache/arrow-rs`.
If you have any questions about this blog post, please feel free to contact the author, [Connor Sanders](mailto:[email protected]).
If it is appropriate, it might also be worth adding an acknowledgment here for any support you may have had -- e.g. acknowledge Elastiflow, for example. The effort you and Nathaniel have put into this undertaking is pretty amazing.
That's a great callout, I'll make sure to add that in.
I just pushed up an Acknowledgments section. I 100% agree! Thank you for pointing this out.
Configuration is intentionally minimal but practical. For instance, the `ReaderBuilder` exposes knobs covering both batch file ingestion and streaming systems without forcing format‑specific code paths.
## Architecture & Technical Overview
At a high level, [`arrow-avro`](https://arrow.apache.org/rust/arrow_avro/index.html) splits cleanly into read and write paths built around Arrow `RecordBatch`es. The read side turns Avro (OCF files or framed byte streams) into Arrow arrays in batches, while the write side takes Arrow batches and produces OCF files or streaming frames. When you build an `AvroStreamWriter`, the framing (SOE or Confluent) is part of the stream output based on the configured fingerprint strategy; no separate framing step is required. The public API and module layout are intentionally small, so most applications only touch a builder, a reader/decoder, and (optionally) a schema store for schema evolution while streaming.
On the [read](https://arrow.apache.org/rust/arrow_avro/reader/index.html) path, everything starts with [`ReaderBuilder`](https://arrow.apache.org/rust/arrow_avro/reader/struct.ReaderBuilder.html). From a single builder you can create a [`Reader`](https://arrow.apache.org/rust/arrow_avro/reader/struct.Reader.html) for Object Container Files (OCF) or a streaming [`Decoder`](https://arrow.apache.org/rust/arrow_avro/reader/struct.Decoder.html) for Single‑Object/Confluent frames. The `Reader` pulls OCF blocks and yields Arrow `RecordBatch`es while the `Decoder` is push‑based, i.e. you feed bytes as they arrive and then call `flush` to drain completed batches. This design lets the same decode plan serve file and streaming use cases with minimal branching.
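A minimal sketch of that push-based flow, assuming `build_decoder`, `decode`, and `flush` as named in the crate's API docs (in a real pipeline the builder would also be configured with a schema store for the incoming writer schemas, which is elided here):

```rust
use arrow_avro::reader::ReaderBuilder;

// Push framed bytes (Single-Object or Confluent) into a Decoder and
// count the rows drained from completed RecordBatches.
fn drain_stream(frames: &[&[u8]]) -> Result<usize, Box<dyn std::error::Error>> {
    let mut decoder = ReaderBuilder::new().build_decoder()?;
    let mut rows = 0;
    for frame in frames {
        // Feed bytes as they arrive, e.g. from a Kafka consumer...
        decoder.decode(frame)?;
        // ...then call flush to drain any completed batches.
        while let Some(batch) = decoder.flush()? {
            rows += batch.num_rows();
        }
    }
    Ok(rows)
}
```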
It might help to describe what a "decode plan" is -- does this mean "decode state machine" or something?
I cleaned this up. The wording I chose for explaining how the same underlying decoder logic is shared between file and streaming use cases was poor.
This section compares those styles qualitatively and with medians from the Criterion benchmark runs that produced the violin plots below.
### Read performance (1M)
I know! I spun my wheels on this for a bit. I'll do some research and fix this tonight / tomorrow, even if it comes down to me remaking the Violin plots myself. I 100% agree with you.
I used an SVG editor to increase the font size and bolded the y axis labels. I think this helps quite a bit.
Let me know if you think it needs more though.
### Benchmark Median Time Results (Apple Silicon Mac)
| Case | apache-avro median | arrow-avro median | speedup | |
Co-authored-by: Andrew Lamb <[email protected]>
Part of the work to add first-class Avro support to arrow-rs is to tell people about it:
Closes: apache/arrow-rs#8428
Part of apache/arrow-rs#4886
@alamb Here's my first pass at the blog post. Sorry about it taking a bit longer than anticipated. Let me know what you think and I'm 100% down to collaborate on this. 😃