
Conversation

jecsand838

Part of the work to add first class Avro support to arrow-rs is to tell people about it:

Closes: apache/arrow-rs#8428
Part of apache/arrow-rs#4886

@alamb Here's my first pass at the blog post. Sorry it took a bit longer than anticipated. Let me know what you think and I'm 100% down to collaborate on this. 😃


github-actions bot commented Oct 6, 2025

Preview URL: https://jecsand838.github.io/arrow-site

If the preview URL doesn't work, you may have forgotten to configure your fork repository for previews.
See https://github.com/apache/arrow-site/blob/main/README.md#forks for how to configure it.

@alamb
Contributor

alamb commented Oct 6, 2025

Amazing! Thank you @jecsand838 -- I will try and review this over the next day or two

@jecsand838 jecsand838 marked this pull request as ready for review October 14, 2025 19:46
Contributor

@alamb alamb left a comment


This looks great @jecsand838 -- thank you 🙏

It will be a great announcement post. I left some small comments but nothing I think is required.

Please feel free to mark the PR ready for review when you think it is ready and I can give it another look.

{% endcomment %}
-->

`arrow-avro` is a Rust crate that reads and writes Avro data directly as Arrow `RecordBatch`es. It supports Avro Object Container Files (OCF), Single‑Object Encoding, and the Confluent Schema Registry wire format, with projection/evolution, tunable batch sizing, and an optional `StringViewArray` for faster strings. Its vectorized design reduces copies and cache misses, making both batch (files) and streaming (Kafka) pipelines simpler and faster.
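The file-reading half of that summary can be sketched in a few lines. This is a minimal sketch, assuming the `arrow-avro` crate's `ReaderBuilder` API as documented; the `data.avro` path is illustrative:

```rust
// Sketch: read an Avro Object Container File (OCF) into Arrow RecordBatches.
// Assumes the `arrow-avro` crate; the file path is a placeholder.
use std::fs::File;
use std::io::BufReader;

use arrow_avro::reader::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = BufReader::new(File::open("data.avro")?);
    // Building the reader parses the OCF header and maps the embedded
    // Avro schema to an Arrow schema.
    let reader = ReaderBuilder::new().build(file)?;
    // The reader iterates over Result<RecordBatch> items.
    for batch in reader {
        let batch = batch?;
        println!("rows: {}", batch.num_rows());
    }
    Ok(())
}
```

The same builder also produces the streaming decoder, so batch-file and wire-format code paths stay nearly identical.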
Contributor

I think it might help to use the full name / link when first introducing Avro as not everyone might be familiar with it

Something like "Apache Avro" and link to https://avro.apache.org/


## Motivation

As a row‑oriented format, Avro is optimized for encoding one record at a time, while Apache Arrow is columnar and optimized for vectorized analytics. When Avro data is decoded record by record and then materialized into Arrow arrays, systems pay for extra allocations, branches, and cache‑unfriendly memory access (exactly the overhead Arrow's design tries to avoid). One example of the challenges this creates can be found in [DataFusion's Avro Datasource](https://github.com/apache/datafusion/tree/main/datafusion/datasource-avro). This row‑to‑column impedance mismatch shows up as unnecessary work in hot paths.
Contributor

I wonder if we can also motivate the work with some explanation of the popularity of Avro (e.g. all the data written in Kafka, for example)

Author

@alamb I agree, that's a good idea.

What do you think of this section?

### Why this matters

Apache Avro is a first‑class format across stream processors and cloud services:
- Confluent Schema Registry supports Avro across multiple languages and tooling.
- Apache Flink exposes an `avro-confluent` format for Kafka.
- AWS Lambda (June 2025) added native handling for Avro‑formatted Kafka events with Glue and Confluent Schema Registry integrations.
- Azure Event Hubs provides a Schema Registry with Avro support for Kafka‑compatible clients.

In short: Arrow users encounter Avro both on disk (OCF) and on the wire (Kafka). An Arrow‑first, vectorized reader/writer for OCF, Single‑Object, and Confluent framing removes a pervasive bottleneck and keeps pipelines columnar end‑to‑end.


This work is part of the ongoing arrow-rs effort to implement first‑class Avro support in Rust. We'd love your feedback on real‑world use cases, workloads, and integrations, and we welcome contributions, whether issues, benchmarks, or PRs. To follow along or help, open an [issue on GitHub](https://github.com/apache/arrow-rs/issues) and/or track [Add Avro Support](https://github.com/apache/arrow-rs/issues/4886) in `apache/arrow-rs`.

If you have any questions about this blog post, please feel free to contact the author, [Connor Sanders](mailto:[email protected]).
Contributor

If it is appropriate, it might also be worth adding an acknowledgment here for any support you may have had -- e.g. acknowledge Elastiflow, for example. The effort you and Nathaniel have put into this undertaking is pretty amazing.

Author

That's a great callout, I'll make sure to add that in.

Author

I just pushed up an Acknowledgments section. I 100% agree! Thank you for pointing this out.


Configuration is intentionally minimal but practical. For instance, the `ReaderBuilder` exposes knobs covering both batch file ingestion and streaming systems without forcing format‑specific code paths.
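As a sketch of those knobs: the method names below (`with_batch_size`, `with_utf8_view`) follow the `arrow-avro` documentation at the time of writing, but treat this as illustrative rather than definitive:

```rust
// Sketch: tuning ReaderBuilder for both file ingestion and streaming.
// Method names are taken from the arrow-avro docs; values are illustrative.
use arrow_avro::reader::ReaderBuilder;

fn tuned_builder() -> ReaderBuilder {
    ReaderBuilder::new()
        // Cap the number of rows emitted per RecordBatch.
        .with_batch_size(8192)
        // Decode strings as StringViewArray for faster string handling.
        .with_utf8_view(true)
}
```

The same configured builder can then produce either an OCF `Reader` or a streaming `Decoder`, which is what keeps format‑specific code paths out of application code.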

## Architecture & Technical Overview
Contributor

I rendered this locally and it looks great:

Screenshot 2025-10-14 at 3 36 48 PM


At a high level, [`arrow-avro`](https://arrow.apache.org/rust/arrow_avro/index.html) splits cleanly into read and write paths built around Arrow `RecordBatch`es. The read side turns Avro (OCF files or framed byte streams) into Arrow arrays in batches, while the write side takes Arrow batches and produces OCF files or streaming frames. When you build an `AvroStreamWriter`, the framing (SOE or Confluent) is part of the stream output based on the configured fingerprint strategy; no separate framing step is required. The public API and module layout are intentionally small, so most applications only touch a builder, a reader/decoder, and (optionally) a schema store for schema evolution while streaming.

On the [read](https://arrow.apache.org/rust/arrow_avro/reader/index.html) path, everything starts with [`ReaderBuilder`](https://arrow.apache.org/rust/arrow_avro/reader/struct.ReaderBuilder.html). From a single builder you can create a [`Reader`](https://arrow.apache.org/rust/arrow_avro/reader/struct.Reader.html) for Object Container Files (OCF) or a streaming [`Decoder`](https://arrow.apache.org/rust/arrow_avro/reader/struct.Decoder.html) for Single‑Object/Confluent frames. The `Reader` pulls OCF blocks and yields Arrow `RecordBatch`es while the `Decoder` is push‑based, i.e. you feed bytes as they arrive and then call `flush` to drain completed batches. This design lets the same decode plan serve file and streaming use cases with minimal branching.
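The push-based flow described above could look roughly like the sketch below. The `decode`/`flush` pattern mirrors Arrow's other streaming decoders; the byte chunks stand in for, e.g., Kafka message payloads, and exact signatures should be checked against the `arrow-avro` docs:

```rust
// Sketch: push-based streaming with the arrow-avro Decoder.
// Feed bytes as they arrive, then flush completed RecordBatches.
use arrow_avro::reader::Decoder;

fn drain(decoder: &mut Decoder, chunks: &[&[u8]]) -> Result<(), Box<dyn std::error::Error>> {
    for chunk in chunks {
        let mut buf = *chunk;
        // `decode` consumes as much input as it can and reports how many
        // bytes it used; loop until the chunk is exhausted.
        while !buf.is_empty() {
            let consumed = decoder.decode(buf)?;
            buf = &buf[consumed..];
        }
        // `flush` drains any rows completed so far into a RecordBatch.
        if let Some(batch) = decoder.flush()? {
            println!("decoded {} rows", batch.num_rows());
        }
    }
    Ok(())
}
```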
Contributor

It might help to describe what a "decode plan" is -- does this mean "decode state machine" or something?

Author

I cleaned this up. The wording I chose to explain how the same underlying decoder logic is shared between file and streaming use cases was poor.


This section compares those styles qualitatively and with medians from the Criterion benchmark runs that produced the violin plots below.

### Read performance (1M)
Contributor

I found the y axis labels hard to read in these images - I wonder if we can give them more space / make them bigger somehow

Screenshot 2025-10-14 at 3 42 48 PM

Author

I know! I spun my wheels on this for a bit. I'll do some research and fix this tonight / tomorrow, even if it comes down to me remaking the Violin plots myself. I 100% agree with you.

Author

I used an SVG editor to increase the font size and bolded the y axis labels. I think this helps quite a bit.

Let me know if you think it needs more, though.


### Benchmark Median Time Results (Apple Silicon Mac)

| Case | apache-avro median | arrow-avro median | speedup |
Contributor
The reason will be displayed to describe this comment to others. Learn more.

Very impressive

Screenshot 2025-10-14 at 3 45 58 PM

@jecsand838 jecsand838 force-pushed the arrow-avro-blog-post branch 2 times, most recently from 9cd76cb to b27a097 Compare October 16, 2025 06:22
@jecsand838 jecsand838 force-pushed the arrow-avro-blog-post branch 5 times, most recently from 7d02b16 to 5ce75e5 Compare October 20, 2025 07:23
…ase and further refined the contents of the post.
@jecsand838 jecsand838 force-pushed the arrow-avro-blog-post branch from 5ce75e5 to 256f886 Compare October 21, 2025 05:54

Development

Successfully merging this pull request may close these issues.

Blog post about adding arrow-avro

2 participants