Contributor

alamb commented Oct 2, 2025

Part of the work to write a new metadata parser in Rust is to tell people about it:

Preview URL:

TODO:

  • Rerun performance numbers with updated arrow-57 branch

github-actions bot commented Oct 2, 2025

Preview URL: https://alamb.github.io/arrow-site

If the preview URL doesn't work, you may have forgotten to configure your fork repository for previews.
See https://github.com/apache/arrow-site/blob/main/README.md#forks for how to configure it.

Contributor Author

alamb commented Oct 2, 2025

FYI @etseidl I started writing a blog post about this work. Right now it is a brain dump and isn't ready for review, but I wanted to get it out of my mind.

cannonpalms left a comment

I know this is just a rough draft, but I noticed a typo in one of the embedded images that I thought might get missed in review due to the size of the font. ❤️

Contributor

etseidl left a comment

Nice start! Thanks for doing this! ❤️

scovich commented Oct 10, 2025

I'm unable to access the "preview" link; the bot comment suggests that @alamb needs to enable some setting on his fork?

Contributor Author

alamb commented Oct 10, 2025

> I'm unable to access the "preview" link; the bot comment suggests that @alamb needs to enable some setting on his fork?

I tried to follow the directions, but it did not seem to work. I'll double-check.

Contributor Author

alamb commented Oct 10, 2025

> > I'm unable to access the "preview" link; the bot comment suggests that @alamb needs to enable some setting on his fork?
>
> I tried to follow the directions, but it did not seem to work. I'll double check

For some reason the publish-to-fork workflow was skipped: https://github.com/apache/arrow-site/actions/runs/18410526447/job/52461618507?pr=711

The workflow file that GitHub reports (https://github.com/apache/arrow-site/actions/runs/18410526447/workflow?pr=711) seems to be OK, but I am not a GitHub Actions expert:

    name: Deploy on fork
    if: >-
      github.event_name == 'push' &&
      github.repository != 'apache/arrow-site'
    needs: build

Member

kou commented Oct 11, 2025

The preview on a fork is deployed by your fork's GitHub Actions, not apache/arrow-site's GitHub Actions: https://github.com/alamb/arrow-site/actions/runs/18410524340

You need to do the following on https://github.com/alamb/arrow-site/settings/pages and https://github.com/alamb/arrow-site/settings/environments:

https://github.com/apache/arrow-site/blob/main/README.md#forks

  1. Enable GitHub Pages on your fork:
    1. Open https://github.com/${YOUR_GITHUB_ACCOUNT}/arrow-site/settings/pages
    2. Select "GitHub Actions" as "Source"
  2. Accept publishing GitHub Pages from all branches on your fork:
    1. Open https://github.com/${YOUR_GITHUB_ACCOUNT}/arrow-site/settings/environments
    2. Select the "github-pages" environment
      1. Change the default "Deployment branches and tags" rule:
      2. Press the "Edit" button
      3. Change the "Name pattern" to * from main or gh-pages

Contributor Author

alamb commented Oct 14, 2025

> The preview on fork is deployed on your fork's GitHub Actions not apache/arrow-site's GitHub Actions: https://github.com/alamb/arrow-site/actions/runs/18410524340

Thank you @kou

I had previously tried to follow those instructions and could not get them to work for some reason.

Here is the content of
https://github.com/alamb/arrow-site/settings/pages
Screenshot 2025-10-14 at 12 31 14 PM

https://github.com/alamb/arrow-site/settings/environments -->
https://github.com/alamb/arrow-site/settings/environments/9063572923/edit
Screenshot 2025-10-14 at 12 32 50 PM

Contributor Author

alamb commented Oct 14, 2025

For some reason the branch protection rule is preventing it: https://github.com/alamb/arrow-site/actions/runs/18410524340

Screenshot 2025-10-14 at 12 36 08 PM

I will look more carefully

Contributor Author

alamb commented Oct 14, 2025

Update: I changed my branch protection rule to "no restrictions"

Screenshot 2025-10-14 at 1 05 37 PM

Probably not the best approach, but now the preview link does work: https://alamb.github.io/arrow-site/blog/2025/10/08/rust-parquet-metadata/

alamb marked this pull request as ready for review October 14, 2025 19:33

Contributor Author

alamb commented Oct 14, 2025

This post is now ready for review

You can see a rendered preview here (thanks @kou!): https://alamb.github.io/arrow-site/blog/2025/10/08/rust-parquet-metadata/

Contributor

etseidl left a comment

Looking great. Thanks!

custom parser, we significantly sped up metadata parsing in the
[parquet] Rust crate, which is widely used in the [Apache Arrow] ecosystem.

This is the first open source effort (to my knowledge) to write a custom Thrift

Contributor

cudf beat us to the punch by several years 😅. They have a cool functor-based parser: https://github.com/rapidsai/cudf/blob/branch-25.12/cpp/src/io/parquet/compact_protocol_reader.hpp

Contributor Author

Great call -- fixed in e86b3e7

Member

kou commented Oct 14, 2025

@alamb Oh, sorry that those instructions didn't work... It seems that * doesn't match XXX/YYY branch names.

I'll update these instructions to the "No restriction" approach.
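
As a side note on why `*` fell short here: to the best of my understanding, GitHub documents these name patterns as following Ruby's `File.fnmatch` rules, where `*` does not match a `/`. The sketch below is a simplified Python stand-in for that behavior, not GitHub's actual matcher; `branch_matches` and the branch name `user/feature-branch` are hypothetical and only illustrate the rule.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified glob into a regex in which `*` stops at `/`.

    Assumption: only the `*` wildcard is supported; every other character
    is matched literally. This is an illustration, not GitHub's matcher.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")  # any run of characters except `/`
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def branch_matches(pattern: str, branch: str) -> bool:
    """Return True if the whole branch name matches the pattern."""
    return pattern_to_regex(pattern).fullmatch(branch) is not None


print(branch_matches("*", "main"))                   # True
print(branch_matches("*", "user/feature-branch"))    # False
print(branch_matches("*/*", "user/feature-branch"))  # True
```

Under this rule a pattern such as `*/*` would be needed for a branch name containing one slash, which is why dropping the restriction entirely is the simpler fix.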

<img src="{{ site.baseurl }}/img/rust-parquet-metadata/original-pipeline.png" width="100%" class="img-responsive" alt="Original Parquet Parsing Pipeline" aria-hidden="true">
</div>

*Figure 6:* Two-step process to read Parquet metadata: A parser created with the

Contributor

Not sure if this was mentioned before, but Figure 6 has the old feather logo. Don't know if that should be updated.

Contributor Author

Thanks -- fixed in 3066161

Contributor Author

alamb commented Oct 15, 2025

> @alamb Oh, sorry that those instructions didn't work... It seems that * doesn't match XXX/YYY branch names.
>
> I'll update these instructions to the "No restriction" approach.

Thank you for making it easier to see the preview of the sites @kou 🙏

approach. Please see the [final PR] for details of the level of effort involved.

[final PR]: https://github.com/apache/arrow-rs/pull/8530
[Jörn Horstmann]: https://github.com/jhorstmann

Contributor Author

FYI @jhorstmann you have a shout out in this blog -- please let me know if you would like any changes

Thanks, looks good! I actually plan to do a presentation on the macros at an internal Rust meetup and will then also update the readme of the compact-thrift repository with more details. The details of how the macros work are probably out of scope for this blog post, but could be added to the arrow-rs code base later.

Contributor Author

alamb commented Oct 15, 2025

@etseidl I re-ran the numbers with the latest code from the arrow-rs main branch. Your optimization work is paying off -- we are now 3x faster than 56.2 😮 🤓

results

Contributor

etseidl commented Oct 15, 2025

> @etseidl I re-ran the numbers with the latest code from the arrow-rs main branch. Your optimization work is paying off -- we are now 3x faster than 56.2 😮 🤓

I was just re-running your benchmark branch on my workstation and was composing a message to the same effect 😁

Contributor Author

alamb commented Oct 15, 2025

> > @etseidl I re-ran the numbers with the latest code from the arrow-rs main branch. Your optimization work is paying off -- we are now 3x faster than 56.2 😮 🤓
>
> I was just re-running your benchmark branch on my workstation and was composing a message the same effect 😁

I can't tell you how much I am currently grinning. You have basically achieved what @XiangpengHao predicted 2 years ago (2x-4x speedup).

Contributor

etseidl commented Oct 15, 2025

Now we just need to implement the metadata index and parse the footer in parallel 🤣

2. It typically maps one-to-one with Thrift definitions, limiting
additional optimizations such as zero-copy parsing, field
skipping, and amortized memory allocation strategies.
3. Its API is very stable (hard to change), which is important for easy maintenance when a large number

It might be worth mentioning that arrow-rs already did some postprocessing on the generated code, and also included a custom implementation of the compact protocol API. That makes the step to a completely custom parser slightly smaller and less crazy :)

Contributor Author

Added in 7621a3e

Contributor Author

alamb commented Oct 20, 2025

My plan here is to wait for the arrow 57 release to be published (ETA about 2 days), then rerun the benchmarks with the final released version, and then publish this blog post.

If anyone else would like more time to review, please just leave a comment.

Development

Successfully merging this pull request may close these issues.

Blog post about new rust Metadata Parser

6 participants