-
Notifications
You must be signed in to change notification settings - Fork 2.5k
perf: New partitioned IO sink pipeline enabled for sink_parquet #25629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
perf: New partitioned IO sink pipeline enabled for sink_parquet #25629
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #25629 +/- ##
==========================================
- Coverage 79.67% 79.63% -0.04%
==========================================
Files 1743 1751 +8
Lines 240288 240969 +681
Branches 3038 3038
==========================================
+ Hits 191442 191890 +448
- Misses 48063 48296 +233
Partials 783 783 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
5adb8af to
ed62325
Compare
ed62325 to
acc8390
Compare
acc8390 to
0b6f7c4
Compare
0b6f7c4 to
11b0e5c
Compare
| serde = { workspace = true, optional = true } | ||
| serde_json = { workspace = true, optional = true } | ||
| slotmap = { workspace = true } | ||
| smallvec = { workspace = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this dependency worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do better actually
This PR introduces a new streaming partitioned IO sink pipeline. It is enabled by default when using
pl.PartitionByKeyandpl.PartitionMaxSizewithsink_parquet(with the exception of some unsupported parameters).Benchmarks
Test script
Benchmark comparison
Logs - Before
Logs - After
*finalize_flush_size: How much data was flushed from memory during finalize (lower is better)
*total_sink_opens: Total number of files opened.
*forced_sink_closes: Number of file closes performed to reclaim a file permit for opening a new file
Implementation overview
Files under
components/:partition_distributor.rs:partitioned_pipeline.rsand stores state for each partition:partition_morsel_sender)partition_morsel_sender.rs:partition_sink_starter.rsdyn FileProvider(already merged)dyn FileWriterStarter(refactor(rust): Add parquet file write pipeline for new IO sinks #25618)partitioner_pipeline.rs:partitioner.rs(can also pass-through in the case ofPartitionStrategy::FileSizepipeline_initialization/partition_by.rscontains logic on initializing and connecting the above componentsCompatibility
Note that some parameters are not yet supported and will fall back to the existing sink pipeline:
file_path_cbper_partition_sort_byfinish_callbackTodo
partition_distributor.rsL156 on codecov), this can be tested once we have the updatedpl.PartitionByAPI that lets us specifymax_rows_per_filefor a keyed partition strategy.