
[Website] Add post "How the Apache Arrow Format Accelerates Query Result Transfer" #569

Merged 23 commits into apache:main on Jan 10, 2025

Conversation

@ianmcook (Member) commented Jan 7, 2025

This adds the first in a series of posts that aim to demystify the use of Arrow as a data interchange format for databases and query engines. A Google Docs version is available here. The vector source file for the figures is here.

cc @lidavidm @zeroshade

@drin commented Jan 7, 2025

The intro of the blog post points to low ser/de overhead as a benefit of the Arrow format. I'm curious whether a reference exists (and can be, or will eventually be, added) that shows a similar comparison for Arrow vs. Parquet, mostly in the sense that storage sits in a mechanically similar spot (but the serialization and deserialization can have an arbitrarily large time gap between their executions).

I realize it's a bit of scope creep, but I think a comparison of ser/de time and compressed size would be really valuable to readers (and I think some naive numbers wouldn't be very time-consuming to get?)

@ianmcook (Member, Author) commented Jan 7, 2025

> The intro of the blog post points to low ser/de overhead as a benefit of the Arrow format. I'm curious whether a reference exists (and can be, or will eventually be, added) that shows a similar comparison for Arrow vs. Parquet, mostly in the sense that storage sits in a mechanically similar spot (but the serialization and deserialization can have an arbitrarily large time gap between their executions).

Thanks @drin. This is part of what the second post in the series will cover. It will describe why formats like Parquet and ORC are typically better than Arrow for archival storage (mostly because higher compression ratios mean lower cost to store for long periods, which easily outweighs the tradeoff of higher ser/de overheads).

> I realize it's a bit of scope creep, but I think a comparison of ser/de time and compressed size would be really valuable to readers (and I think some naive numbers wouldn't be very time-consuming to get?)

Agreed. I'd like to include something like this in the second post too, comparing time and size using Arrow IPC vs. Parquet, ORC, Avro, CSV, JSON. But there are so many different variables at play (network speed, CPU and memory specs, encoding and compression options, how optimized the implementation is, whether or not to aggressively downcast based on the range of values in the data, what column types to use in the example, ... ) that I expect it will be impossible to claim that any results we present are representative. So the main message might end up being "YMMV" and we will probably want to provide a repo with some tools that readers can use to experiment for themselves.
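
A minimal sketch of that kind of time-and-size measurement, restricted to Arrow IPC vs. Parquet with pyarrow, might look like the following. The synthetic table and the default writer options are illustrative assumptions made here, not what the post or the eventual repo will use, and as noted above the exact numbers are YMMV:

```python
# Rough sketch of a "measure it yourself" ser/de comparison, limited to
# Arrow IPC vs. Parquet. The synthetic table, default codec settings, and
# any timings on a given machine are illustrative only (YMMV).
import io
import time

import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({
    "id": pa.array(range(n), type=pa.int64()),
    "label": [f"item-{i % 1000}" for i in range(n)],
})

def to_arrow_ipc(tbl):
    sink = io.BytesIO()
    with ipc.new_stream(sink, tbl.schema) as writer:
        writer.write_table(tbl)
    return sink.getvalue()

def from_arrow_ipc(buf):
    return ipc.open_stream(buf).read_all()

def to_parquet(tbl):
    sink = io.BytesIO()
    pq.write_table(tbl, sink)
    return sink.getvalue()

def from_parquet(buf):
    return pq.read_table(io.BytesIO(buf))

for name, ser, de in [("Arrow IPC", to_arrow_ipc, from_arrow_ipc),
                      ("Parquet", to_parquet, from_parquet)]:
    t0 = time.perf_counter()
    payload = ser(table)
    t1 = time.perf_counter()
    de(payload)
    t2 = time.perf_counter()
    print(f"{name}: {len(payload) / 1e6:.1f} MB, "
          f"serialize {t1 - t0:.3f} s, deserialize {t2 - t1:.3f} s")
```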

@drin commented Jan 7, 2025

Great! Thanks, Ian! Looking forward to the posts. I'll give this post a deeper look soon, and I'd be happy to help with something like a cookbook repo for the examples you might build up over the course of the posts, if I can.


@kou (Member) left a comment


Great!
I'll translate this into Japanese when this is published.

Review comment on _posts/2025-01-07-arrow-result-transfer.md (outdated, resolved)
@telemenar commented

> The intro of the blog post points to low ser/de overhead as a benefit of the Arrow format. I'm curious whether a reference exists (and can be, or will eventually be, added) that shows a similar comparison for Arrow vs. Parquet, mostly in the sense that storage sits in a mechanically similar spot (but the serialization and deserialization can have an arbitrarily large time gap between their executions).

> Thanks @drin. This is part of what the second post in the series will cover. It will describe why formats like Parquet and ORC are typically better than Arrow for archival storage (mostly because higher compression ratios mean lower cost to store for long periods, which easily outweighs the tradeoff of higher ser/de overheads).

Another thing that feeds into this, beyond the storage benefits called out here, is that for archival storage, in addition to the cost aspect, you are generally doing serialization once and deserialization many times, which changes your tradeoffs. In the pure compression algorithm space, this might be the difference between choosing lz4 (wire) and zstd (archival).
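
As a rough illustration of that choice, both Arrow IPC and Parquet expose the codec as a writer option in pyarrow. The sketch below is only illustrative: the file names and table contents are made up, and it assumes a pyarrow build with lz4 and zstd support enabled:

```python
# Sketch: a fast, light codec for one-shot wire transfer vs. a higher-ratio
# codec for write-once/read-many archival storage. Assumes a pyarrow build
# with lz4 and zstd support; file names and data are illustrative.
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

table = pa.table({"x": list(range(100_000)), "y": ["a", "b", "c", "d"] * 25_000})

# Arrow IPC stream with LZ4 compression: cheap to compress and decompress,
# a reasonable default for sending query results over the network.
wire_options = ipc.IpcWriteOptions(compression="lz4")
with pa.OSFile("result.arrows", "wb") as sink:
    with ipc.new_stream(sink, table.schema, options=wire_options) as writer:
        writer.write_table(table)

# Parquet with zstd: more CPU spent once at write time in exchange for a
# smaller file that will be read (and paid for in storage) many times.
pq.write_table(table, "result.parquet", compression="zstd")
```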

@pitrou (Member) left a comment


This is looking great in general, well-explained with impressive examples.

I posted some comments below. Also, I think that maybe this should focus on the use of Arrow for in-memory analytics, not storage. Pitting Arrow files against Parquet is a bit misleading and contentious; I think it's better to present them as complementary.

(For example, reading a Parquet file from storage might be faster than reading the equivalent Arrow file if the storage is not super fast, because Parquet often has much better storage efficiency.)
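
To make that concrete, a back-of-the-envelope sketch along these lines could compare on-disk sizes and the implied fetch times; the pyarrow calls are real, but the repetitive table and the storage bandwidth below are assumptions made purely for illustration:

```python
# Sketch: the same table written as an uncompressed Arrow IPC file and as a
# Parquet file, comparing on-disk size and estimated fetch time over a slow
# link. The table and the bandwidth figure are illustrative assumptions.
import os

import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# A deliberately repetitive table, where Parquet's encodings pay off.
table = pa.table({"category": ["red", "green", "blue"] * 500_000})

feather.write_feather(table, "data.arrow", compression="uncompressed")
pq.write_table(table, "data.parquet")

assumed_bandwidth = 100e6 / 8  # pretend storage delivers ~100 Mbit/s

for path in ("data.arrow", "data.parquet"):
    size = os.path.getsize(path)
    print(f"{path}: {size / 1e6:.1f} MB, ~{size / assumed_bandwidth:.1f} s to fetch")
```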

Seven review comments on _posts/2025-01-10-arrow-result-transfer.md (outdated, resolved)
@ianmcook (Member, Author) commented Jan 9, 2025

> This is looking great in general, well-explained with impressive examples.

Thanks @pitrou!

> I posted some comments below. Also, I think that maybe this should focus on the use of Arrow for in-memory analytics, not storage. Pitting Arrow files against Parquet is a bit misleading and contentious; I think it's better to present them as complementary.
>
> (For example, reading a Parquet file from storage might be faster than reading the equivalent Arrow file if the storage is not super fast, because Parquet often has much better storage efficiency.)

I changed the language in a couple of places and expanded footnote 3 to help prevent readers from getting this idea.

@ianmcook merged commit 2788cf7 into apache:main on Jan 10, 2025
1 check passed
@alamb (Contributor) commented Jan 13, 2025

FWIW I read the rendered version https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/

It is nicely done. Great work 👏
