[Website]: Blog post about new Rust Parquet Metadata parser #711
Preview URL: https://alamb.github.io/arrow-site
If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview.
FYI @etseidl I started writing a blog post about this work. Right now it is a brain dump and isn't ready for review, but I wanted to get it out of my mind.
I know this is just a rough draft, but I noticed a typo in one of the embedded images that I thought might get missed in review due to the size of the font. ❤️
Nice start! Thanks for doing this! ❤️
Co-authored-by: Ed Seidl <[email protected]>
…ite into alamb/new_parquet_metadata
I'm unable to access the "preview" link; the bot comment suggests that @alamb needs to enable some setting on his fork?
I tried to follow the directions, but it did not seem to work. I'll double check.
For some reason the publish to fork workflow was skipped: https://github.com/apache/arrow-site/actions/runs/18410526447/job/52461618507?pr=711

The workflow file that GitHub reports, https://github.com/apache/arrow-site/actions/runs/18410526447/workflow?pr=711, seems to be OK, but I am not a GitHub Actions expert:

    name: Deploy on fork
    if: >-
      github.event_name == 'push' &&
      github.repository != 'apache/arrow-site'
    needs: build
The preview on a fork is deployed by your fork's GitHub Actions, not by apache/arrow-site's GitHub Actions: https://github.com/alamb/arrow-site/actions/runs/18410524340

You need to do the following on https://github.com/alamb/arrow-site/settings/pages and https://github.com/alamb/arrow-site/settings/environments : https://github.com/apache/arrow-site/blob/main/README.md#forks
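To see why the deploy job is skipped for runs inside apache/arrow-site, the `if:` condition quoted above can be mirrored as a plain Rust predicate. This is only an illustration of the boolean logic: `should_deploy_on_fork` is a hypothetical name, and GitHub Actions evaluates the real expression itself.

```rust
/// Hypothetical stand-in for the workflow condition
/// `github.event_name == 'push' && github.repository != 'apache/arrow-site'`.
fn should_deploy_on_fork(event_name: &str, repository: &str) -> bool {
    // Both clauses must hold: the run was triggered by a push,
    // and it is running somewhere other than the upstream repository.
    event_name == "push" && repository != "apache/arrow-site"
}
```

Any run whose repository is `apache/arrow-site` fails the second clause regardless of the event, so the deploy can only happen in the fork's own Actions runs triggered by a push.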
Thank you @kou. I had previously tried to follow those instructions and could not get them to work for some reason. Here is the content of https://github.com/alamb/arrow-site/settings/environments : [image]
For some reason the branch protection rule is preventing it: https://github.com/alamb/arrow-site/actions/runs/18410524340 [image] I will look more carefully.
Update: I changed my branch protection rule to "no restrictions" [image]. Probably not the best approach, but now the preview link does work: https://alamb.github.io/arrow-site/blog/2025/10/08/rust-parquet-metadata/
This post is now ready for review. You can see a rendered preview here (thanks @kou!): https://alamb.github.io/arrow-site/blog/2025/10/08/rust-parquet-metadata/
Looking great. Thanks!
custom parser, we significantly sped up metadata parsing in the
[parquet] Rust crate, which is widely used in the [Apache Arrow] ecosystem.

This is the first open source effort (to my knowledge) to write a custom Thrift
cudf beat us to the punch by several years 😅. They have a cool functor-based parser: https://github.com/rapidsai/cudf/blob/branch-25.12/cpp/src/io/parquet/compact_protocol_reader.hpp
Great call -- fixed in e86b3e7
@alamb Oh, sorry that those instructions didn't work... It seems that I should update these instructions to use the "No restrictions" approach.
<img src="{{ site.baseurl }}/img/rust-parquet-metadata/original-pipeline.png" width="100%" class="img-responsive" alt="Original Parquet Parsing Pipeline" aria-hidden="true">
</div>

*Figure 6:* Two-step process to read Parquet metadata: A parser created with the
Not sure if this was mentioned before, but Figure 6 has the old feather logo. Don't know if that should be updated.
Thanks -- fixed in 3066161
Co-authored-by: Ed Seidl <[email protected]>
Co-authored-by: Ed Seidl <[email protected]>
…ite into alamb/new_parquet_metadata
approach. Please see the [final PR] for details of the level of effort involved.

[final PR]: https://github.com/apache/arrow-rs/pull/8530
[Jörn Horstmann]: https://github.com/jhorstmann
FYI @jhorstmann you have a shout out in this blog -- please let me know if you would like any changes
Thanks, looks good! I actually plan to do a presentation on the macros at an internal Rust meetup and will then also update the readme of the compact-thrift repository with more details. The details of how the macros work are probably out of scope for this blog post, but could be added to the arrow-rs code base later.
Nice -- note that @etseidl also wrote up a README here that is quite good:
https://github.com/apache/arrow-rs/blob/49d92fa163f61d677a971143c598ad9f020f8fec/parquet/THRIFT.md
Co-authored-by: Ed Seidl <[email protected]>
…ite into alamb/new_parquet_metadata
@etseidl I re-ran the numbers with the latest code from the arrow-rs main branch. Your optimization work is paying off -- we are now 3x faster than 56.2 😮 🤓 [image]
I was just re-running your benchmark branch on my workstation and was composing a message to the same effect 😁
I can't tell you how much I am currently grinning. You have basically achieved what @XiangpengHao predicted 2 years ago (2x-4x speedup).
Now we just need to implement the metadata index and parse the footer in parallel 🤣
2. It typically maps one-to-one with Thrift definitions, limiting
additional optimizations such as zero-copy parsing, field
skipping, and amortized memory allocation strategies.
3. Its API is very stable (hard to change), which is important for easy maintenance when a large number
It might be worth mentioning that arrow-rs already did some postprocessing on the generated code, and also included a custom implementation of the compact protocol api. That makes the step to a completely custom parser slightly smaller and less crazy :)
Added in 7621a3e
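For readers unfamiliar with the compact protocol discussed in this thread, the sketch below shows the kind of primitives a hand-written reader is built from: varint decoding, zigzag decoding, and a zero-copy read of a length-prefixed binary field, which also makes skipping such a field cheap. `CompactReader` and its method names are hypothetical and chosen for illustration; this is not the arrow-rs API and not the code from the PR.

```rust
/// Hypothetical minimal reader over a borrowed buffer (illustration only,
/// not the arrow-rs implementation).
struct CompactReader<'a> {
    buf: &'a [u8],
}

impl<'a> CompactReader<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf }
    }

    /// Read an unsigned LEB128 varint: 7 payload bits per byte,
    /// high bit set on every byte except the last.
    fn read_varint(&mut self) -> Option<u64> {
        let mut value = 0u64;
        let mut shift = 0u32;
        loop {
            let (&byte, rest) = self.buf.split_first()?; // None at end of input
            self.buf = rest;
            if shift >= 64 {
                return None; // malformed: too many continuation bytes
            }
            value |= u64::from(byte & 0x7f) << shift;
            if byte & 0x80 == 0 {
                return Some(value);
            }
            shift += 7;
        }
    }

    /// Read a zigzag-encoded signed integer (0 -> 0, 1 -> -1, 2 -> 1, ...).
    fn read_zigzag(&mut self) -> Option<i64> {
        let n = self.read_varint()?;
        Some(((n >> 1) as i64) ^ -((n & 1) as i64))
    }

    /// Read a length-prefixed binary/string field as a slice borrowed from
    /// the input buffer: no allocation and no copy.
    fn read_binary(&mut self) -> Option<&'a [u8]> {
        let len = self.read_varint()?;
        let buf: &'a [u8] = self.buf;
        if len > buf.len() as u64 {
            return None; // declared length runs past the end of the input
        }
        let (bytes, rest) = buf.split_at(len as usize);
        self.buf = rest;
        Some(bytes)
    }

    /// Skip a binary field without inspecting its contents.
    fn skip_binary(&mut self) -> Option<()> {
        self.read_binary().map(|_| ())
    }
}
```

Because `read_binary` returns a slice tied to the input lifetime `'a`, a caller can keep references into the original footer bytes instead of allocating a `Vec` or `String` per field, which is the kind of optimization the quoted list says a generated one-to-one parser makes hard.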
Co-authored-by: Ed Seidl <[email protected]>
…ite into alamb/new_parquet_metadata
My plan here is to wait for the arrow 57 release to be published (ETA in about 2 days), then rerun the benchmarks with the final released version, and then publish this blog. If anyone else would like more time to review, please just leave comments.
Part of the work to write a new metadata parser in Rust is to tell people about it:
Preview URL:
TODO: