feat(parquet): add content defined chunking for arrow writer #9450

Draft
kszucs wants to merge 9 commits into apache:main from kszucs:content-defined-chunking

kszucs commented Feb 20, 2026

Which issue does this PR close?

  • Closes #NNN.

Rationale for this change

Rust implementation of apache/arrow#45360

Traditional Parquet writing splits data pages at fixed sizes, so a single inserted or deleted row shifts every subsequent page, and nearly every byte must be re-uploaded to content-addressable storage (CAS) systems. Content-defined chunking (CDC) instead determines page boundaries with a rolling gearhash over column values, so unchanged data produces identical pages across different writes, which reduces storage costs and upload times.
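
To make the idea concrete, here is a minimal sketch of gear-hash chunking over a plain byte stream. It is not the code in this PR: the PR hashes column values inside the Arrow writer, while this sketch hashes raw bytes. The names `splitmix64`, `gear_table`, `chunk`, `shared_chunks`, `pseudo_random_bytes`, and `DEMO_MASK`, the table contents, and the choice to keep the hash rolling across chunk boundaries are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Illustrative mask: a boundary candidate occurs when the top 6 bits of the
/// hash are zero, i.e. roughly once every 64 bytes. (Assumed value.)
const DEMO_MASK: u64 = 0xFC00_0000_0000_0000;

/// Small deterministic PRNG (splitmix64) so the sketch needs no external crates.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// One pseudo-random 64-bit value per possible byte value.
/// Hypothetical table; not the one used by parquet-cpp or this PR.
fn gear_table() -> [u64; 256] {
    let mut state = 0u64;
    let mut table = [0u64; 256];
    for entry in table.iter_mut() {
        *entry = splitmix64(&mut state);
    }
    table
}

/// Split `data` into chunks whose boundaries depend on content, not position.
/// A chunk ends when it has at least `min_size` bytes and the rolling hash
/// matches `mask`, or when it reaches `max_size` bytes.
fn chunk(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<&[u8]> {
    let table = gear_table();
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut hash: u64 = 0;
    for (i, &byte) in data.iter().enumerate() {
        // Gear hash update: older bytes are shifted out after 64 steps, so
        // the hash depends only on the most recent 64 bytes.
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        let len = i + 1 - start;
        if (len >= min_size && hash & mask == 0) || len >= max_size {
            chunks.push(&data[start..=i]);
            start = i + 1;
        }
    }
    if start < data.len() {
        chunks.push(&data[start..]); // trailing chunk may be shorter than min_size
    }
    chunks
}

/// Count how many chunks of `b` also occur somewhere in `a`.
fn shared_chunks(a: &[&[u8]], b: &[&[u8]]) -> usize {
    let seen: HashSet<&[u8]> = a.iter().copied().collect();
    b.iter().copied().filter(|c| seen.contains(c)).count()
}

/// Deterministic test data generator.
fn pseudo_random_bytes(n: usize, seed: u64) -> Vec<u8> {
    let mut state = seed;
    (0..n).map(|_| (splitmix64(&mut state) >> 56) as u8).collect()
}
```

Chunking a buffer, inserting one byte in the middle, chunking again, and passing both results to `shared_chunks` should show that most chunks are reused: every chunk before the edit is unchanged by construction, and chunks after it are expected to realign once both streams hit the same boundary candidate. With fixed-size splitting, every chunk after the edit would differ.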

See more details in https://huggingface.co/blog/parquet-cdc

The original C++ implementation: apache/arrow#45360

Evaluation tool: https://github.com/huggingface/dataset-dedupe-estimator, where I have already integrated this PR to verify that deduplication effectiveness is on par with parquet-cpp (lower is better):

image

What changes are included in this PR?

  • Content-defined chunker at parquet/src/column/chunker/
  • Arrow writer integration in ArrowColumnWriter
  • Writer properties via CdcOptions struct (min_chunk_size, max_chunk_size, norm_level)
  • ColumnDescriptor: added repeated_ancestor_def_level field for iterating over nested field values

Are these changes tested?

Yes — unit tests are located in cdc.rs and ported from the C++ implementation.

Are there any user-facing changes?

New experimental API, disabled by default — no behavior change for existing code:

// Simple toggle (256 KiB min, 1 MiB max, norm_level 0)
let props = WriterProperties::builder()
    .set_content_defined_chunking(true)
    .build();

// Explicit CDC parameters
let props = WriterProperties::builder()
    .set_cdc_options(CdcOptions {
        min_chunk_size: 128 * 1024,
        max_chunk_size: 512 * 1024,
        norm_level: 1,
    })
    .build();

github-actions bot added the parquet (Changes to the parquet crate) label Feb 20, 2026