[Website] Add post "How the Apache Arrow Format Accelerates Query Result Transfer" #569
Conversation
The intro of the blog post points to ser/de as a benefit of the Arrow format. I'm curious whether a reference exists (and can be, or will eventually be, added) that shows a similar comparison for Arrow vs. Parquet, since storage sits in a mechanically similar spot (though serialization and deserialization can have an arbitrarily large time gap between them). I realize it's a bit of scope creep, but I think a comparison of ser/de time and compressed size would be really valuable to readers (and some naive numbers wouldn't be very time-consuming to get?)
Thanks @drin. This is part of what the second post in the series will cover. It will describe why formats like Parquet and ORC are typically better than Arrow for archival storage (mostly because higher compression ratios mean lower cost to store for long periods, which easily outweighs the tradeoff of higher ser/de overheads).
Agreed. I'd like to include something like this in the second post too, comparing time and size using Arrow IPC vs. Parquet, ORC, Avro, CSV, and JSON. But there are so many different variables at play (network speed; CPU and memory specs; encoding and compression options; how optimized the implementation is; whether or not to aggressively downcast based on the range of values in the data; which column types to use in the example; ...) that I expect it will be impossible to claim that any results we present are representative. So the main message might end up being "YMMV", and we will probably want to provide a repo with some tools that readers can use to experiment for themselves.
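A naive experiment along these lines might look like the sketch below (assuming pyarrow and numpy are available; the synthetic table, file paths, and default compression settings are all hypothetical choices, and results will vary widely for exactly the reasons listed above):

```python
# Naive benchmark sketch: Arrow IPC vs. Parquet write/read time and file
# size using pyarrow. The table, sizes, and paths are hypothetical; real
# results depend heavily on data shape, hardware, and settings.
import os
import time

import numpy as np
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

rng = np.random.default_rng(0)
n = 1_000_000
table = pa.table({
    "id": np.arange(n),
    "value": rng.normal(size=n),
    "category": rng.choice(["red", "green", "blue"], size=n),
})

def timed(fn):
    """Run fn() and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

# Arrow IPC file format (a.k.a. Feather V2): no compression applied here.
def write_ipc():
    with pa.OSFile("data.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

t_write_ipc = timed(write_ipc)
t_read_ipc = timed(lambda: ipc.open_file("data.arrow").read_all())

# Parquet with pyarrow's default compression (snappy).
t_write_pq = timed(lambda: pq.write_table(table, "data.parquet"))
t_read_pq = timed(lambda: pq.read_table("data.parquet"))

for name, path, tw, tr in [
    ("Arrow IPC", "data.arrow", t_write_ipc, t_read_ipc),
    ("Parquet", "data.parquet", t_write_pq, t_read_pq),
]:
    size_mb = os.path.getsize(path) / 1e6
    print(f"{name:9s}  write {tw:.3f}s  read {tr:.3f}s  size {size_mb:.1f} MB")
```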
Great! Thanks Ian! Looking forward to the posts. I'll give this post a deeper look soon, and I'd be happy to help with something like a cookbook repo for examples you might build up over the course of the posts, if I can.
Great!
I'll translate this into Japanese when this is published.
Co-authored-by: Sutou Kouhei <[email protected]>
Another thing that feeds into this, beyond the storage benefits called out here: for archival storage, in addition to the cost aspect, you are generally doing …
This is looking great in general, well-explained with impressive examples.
I posted some comments below. Also, I think this should perhaps focus on the use of Arrow for in-memory analytics, not storage. Pitting Arrow files against Parquet is a bit misleading and contentious; I think it's better to present them as complementary.
(For example, reading a Parquet file from storage might be faster than reading the equivalent Arrow file, if the storage is not super fast, because Parquet often has much better storage efficiency.)
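A back-of-envelope sketch of this point (all numbers purely illustrative, not measured): total read time is roughly bytes on disk divided by storage bandwidth plus decode time, so a smaller Parquet file can win on slow storage even with a higher decode cost.

```python
# Illustrative-only model: time_to_read ≈ file_size / bandwidth + decode_time.
# The sizes and decode cost below are made up to show the crossover point.
arrow_size_gb = 4.0     # hypothetical uncompressed Arrow IPC file
parquet_size_gb = 1.0   # hypothetical Parquet equivalent (~4x smaller)
parquet_decode_s = 4.0  # hypothetical extra CPU cost to decode Parquet

for bw_gb_s in (0.1, 1.0, 10.0):  # e.g. object store, SATA SSD, NVMe
    t_arrow = arrow_size_gb / bw_gb_s  # Arrow needs no decoding step
    t_parquet = parquet_size_gb / bw_gb_s + parquet_decode_s
    winner = "Parquet" if t_parquet < t_arrow else "Arrow"
    print(f"{bw_gb_s:4.1f} GB/s: Arrow {t_arrow:5.1f}s, "
          f"Parquet {t_parquet:5.1f}s -> {winner} wins")
```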
Thanks @pitrou!
I changed the language in a couple of places and expanded footnote 3 to help prevent readers from getting this idea.
FWIW, I read the rendered version at https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/. It is nicely done. Great work 👏
This adds the first in a series of posts that aim to demystify the use of Arrow as a data interchange format for databases and query engines. A Google Docs version is available here. The vector source file for the figures is here.
cc @lidavidm @zeroshade