feat: introduce pluggable SpillFile trait and TempFileFactory for custom spill backends by pantShrey · Pull Request #21882 · apache/datafusion

pantShrey · 2026-04-27T18:59:54Z

Which issue does this PR close?

Closes Allow pluggable file backends in DiskManager and IPCStreamWriter to support non-OS file systems #21215, depends on refactor: Update SortMergeJoin to use async spill abstractions #22230

Rationale for this change

DataFusion’s spill infrastructure is tightly coupled to OS-level files, with no extension points for alternative storage backends. DiskManager cannot be customized for file creation, and IPCStreamWriter depends on OS file paths.
This prevents integration in environments where temporary storage must be managed by the host system. For example, Postgres extensions (e.g., ParadeDB) require spill files to go through BufFile APIs to respect temp_tablespaces, enforce temp_file_limit, and integrate with transaction-scoped cleanup. Since BufFile has no OS-visible path, it cannot work with the current design.
A secondary motivation raised by @alamb is supporting object storage backends (S3, GCS) for spilling, which require async IO and cannot use std::io::Write or std::io::Read.

What changes are included in this PR?

Introduced SpillFile, SpillWriter, and TempFileFactory traits to abstract spill file handling
Added DiskManagerMode::Custom to allow pluggable backends
Updated DiskManager to return Arc<dyn SpillFile> instead of OS-bound types
Refactored write path using SpillWriteAdapter to bridge sync Arrow writers with backend-agnostic writers
Refactored read path to use async streaming (Stream<Item = Result<Bytes>>) instead of blocking state machines
Updated spill-related components to operate on Arc<dyn SpillFile>
Migrated the Sort-Merge Join (SMJ) operator to use the async spill abstraction
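
To make the list above concrete, here is a minimal sketch of how a factory-based spill backend could plug in. The trait names come from this PR, but every signature below (`open_writer`, `read_all`, `create_temp_file`, `finish`) is a simplified assumption for illustration, and `MemSpillFile`, `MemFactory`, and `spill_roundtrip` are hypothetical names. The real traits in datafusion-execution may look different.

```rust
use std::io::{self, Write};
use std::sync::{Arc, Mutex};

// Hypothetical, simplified stand-ins for the PR's traits. The real
// signatures in datafusion-execution may differ.
pub trait SpillWriter: Write + Send {
    /// Finalize the file (e.g. complete an upload or release a resource).
    fn finish(self: Box<Self>) -> io::Result<()>;
}

pub trait SpillFile: Send + Sync {
    fn open_writer(&self) -> io::Result<Box<dyn SpillWriter>>;
    fn read_all(&self) -> io::Result<Vec<u8>>;
}

pub trait TempFileFactory: Send + Sync {
    fn create_temp_file(&self, description: &str) -> io::Result<Arc<dyn SpillFile>>;
}

/// Toy backend that keeps spill data in memory instead of an OS file.
#[derive(Default)]
pub struct MemSpillFile {
    data: Arc<Mutex<Vec<u8>>>,
}

struct MemWriter {
    data: Arc<Mutex<Vec<u8>>>,
}

impl Write for MemWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.data.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl SpillWriter for MemWriter {
    fn finish(self: Box<Self>) -> io::Result<()> {
        Ok(()) // nothing to finalize for an in-memory buffer
    }
}

impl SpillFile for MemSpillFile {
    fn open_writer(&self) -> io::Result<Box<dyn SpillWriter>> {
        Ok(Box::new(MemWriter { data: Arc::clone(&self.data) }))
    }
    fn read_all(&self) -> io::Result<Vec<u8>> {
        Ok(self.data.lock().unwrap().clone())
    }
}

pub struct MemFactory;

impl TempFileFactory for MemFactory {
    fn create_temp_file(&self, _description: &str) -> io::Result<Arc<dyn SpillFile>> {
        Ok(Arc::new(MemSpillFile::default()))
    }
}

/// Callers only see trait objects, never the concrete backend.
pub fn spill_roundtrip(factory: &dyn TempFileFactory, payload: &[u8]) -> io::Result<Vec<u8>> {
    let file = factory.create_temp_file("example")?;
    let mut writer = file.open_writer()?;
    writer.write_all(payload)?;
    writer.finish()?;
    file.read_all()
}
```

Calling `spill_roundtrip(&MemFactory, b"abc")` writes and reads back the payload without touching an OS path, which is the property a BufFile-style or object-store backend would rely on.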

Are these changes tested?

Yes. Existing spill tests cover the full read/write flow.

Fixed test_disk_usage_decreases_as_files_consumed by correcting a pre-existing off-by-one assumption in file rotation
Fixed test_preserve_order_with_spilling by just asserting spilling occurs (spill_count>0) and output batches are sorted

Are there any user-facing changes?

Yes, this introduces API changes:

Spill-related APIs now use Arc<dyn SpillFile> instead of RefCountedTempFile
New public traits: SpillFile, SpillWriter, TempFileFactory
Added DiskManagerMode::Custom for custom backends

Custom spill backends can now be implemented and plugged in via DiskManager.

pantShrey · 2026-04-27T19:03:02Z

@alamb I opened this draft PR to get early feedback on the architecture.

The first point is around the sync read path. I introduced open_sync_reader because SortMergeJoin currently has synchronous, blocking code paths that directly open files using paths and BufReader, instead of going through the spill abstractions. Converting this to fully async would significantly increase the scope of this PR.
- Does it make sense to keep this escape hatch for now and handle making these operators async in a follow-up PR?
The second point is regarding test failures. I have not modified the original 64 B limit in the tests because I wanted guidance here. Currently, the repartition test in mod.rs is failing, and it seems related to spilling not being triggered correctly. The new SpillWriteAdapter adds slight allocation overhead, which makes the original 64-byte memory limit too tight for the merge heap to initialize (~296 bytes needed). Bumping up the memory limit causes the test to not spill anymore. I believe increasing the test data size might solve the issue, but I am not sure.

I might be missing something here, so would really appreciate your guidance.

alamb · 2026-05-07T15:24:52Z

Thanks -- will try and look at this shortly

alamb · 2026-05-09T11:38:34Z

@alamb I opened this draft PR to get early feedback on the architecture.

The first point is around the sync read path. I introduced open_sync_reader because SortMergeJoin currently has synchronous, blocking code paths that directly open files using paths and BufReader, instead of going through the spill abstractions. Converting this to fully async would significantly increase the scope of this PR.

Does it make sense to keep this escape hatch for now and handle making these operators async in a follow-up PR?

Kind of, though it seems like accumulating technical debt as we'll have APIs that will not be needed once we complete the work for SortMergeJoin

What do you think about making a first PR to migrate SortMergeJoin to use the spill abstraction?

The second point is regarding test failures. I have not modified the original 64 B limit in the tests because I wanted guidance here. Currently, the repartition test in mod.rs is failing, and it seems related to spilling not being triggered correctly. The new SpillWriteAdapter adds slight allocation overhead, which makes the original 64-byte memory limit too tight for the merge heap to initialize (~296 bytes needed). Bumping up the memory limit causes the test to not spill anymore. I believe increasing the test data size might solve the issue, but I am not sure.

Makes sense to me

alamb

Thanks @pantShrey - I reviewed this and the basic idea looks good to me. I do think it would be nice to have a unified (async) IO abstraction rather than leaving some hook around for sync IO and making this API more complicated

alamb · 2026-05-07T21:14:19Z

    used_disk_space: Arc<AtomicU64>,
    /// Number of active temporary files created by this disk manager
    active_files_count: Arc<AtomicUsize>,
+    /// Custom Backend


A small nit: I think "custom" is a somewhat unnecessary term here. Perhaps this

factory: Option<Arc<dyn TempFileFactory>>,

or

temp_file_factory: Option<Arc<dyn TempFileFactory>>,

would be more consistent with the rest of the codebase

alamb · 2026-05-09T11:34:33Z

        .collect()
 }

+pub struct OsSpillWriter {


maybe "file spill writer"?

alamb · 2026-05-09T11:36:08Z

+/// Writer for spill file backends.
+/// Receives zero-copy `Bytes` payloads from the IPCStreamWriter adapter.
+pub trait SpillWriter: Send {
+    fn write(&mut self, data: Bytes) -> Result<()>;


This is pretty similar to https://doc.rust-lang.org/std/io/trait.Write.html 🤔

Yes, you are right. The reason I didn't use Write trait which uses &[u8] was for ownership reasons. Some backends might queue chunks to a background task (e.g., S3 multipart via a channel) and need to hold the data past the write() call's return. &[u8] can't express that, and it would force a second copy between the SpillWriteAdapter and the SpillWriter.
Also, the custom SpillWriter trait contains finish(), which maps perfectly to complete_multipart_upload for S3 and resource owner cleanup for Postgres.

This is all true -- however, I think that since the underlying IPC writer takes a std::io::Write, forcing all backends to use Bytes will likely require an extra unnecessary copy anyway (see comments below on SpillWriteAdapter).

If you use a std::io::Write-like interface here, backends that want to queue chunks can do so (by copying into Bytes buffers themselves).

Thus what I suggest is:

Change this to look more like std::io::Write:

fn write(&mut self, data: &[u8]) -> Result<()>;

Which will allow you to get rid of the write adapter
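
A minimal sketch of the point above, assuming a backend that hands chunks to a background worker: the writer accepts `&[u8]` like `std::io::Write` and makes its own owned copy before queueing. `QueueingWriter` is a hypothetical name, a `Vec<u8>` stands in for `Bytes`, and a thread that concatenates chunks stands in for something like a multipart uploader.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical writer that queues each chunk for a background worker.
pub struct QueueingWriter {
    tx: Sender<Vec<u8>>,
    worker: JoinHandle<Vec<u8>>,
}

impl QueueingWriter {
    pub fn new() -> Self {
        let (tx, rx) = mpsc::channel::<Vec<u8>>();
        // Stand-in for an uploader: concatenates every chunk it receives.
        let worker = thread::spawn(move || rx.into_iter().flatten().collect::<Vec<u8>>());
        Self { tx, worker }
    }

    /// Close the queue and wait for the worker, returning what it stored.
    pub fn finish(self) -> io::Result<Vec<u8>> {
        let QueueingWriter { tx, worker } = self;
        drop(tx); // closing the channel ends the worker's receive loop
        worker
            .join()
            .map_err(|_| io::Error::new(io::ErrorKind::Other, "worker panicked"))
    }
}

impl Write for QueueingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // The single copy: the borrowed slice becomes an owned chunk that
        // can outlive this call.
        self.tx
            .send(buf.to_vec())
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "worker stopped"))?;
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

Backends that write synchronously (a local file, for instance) pay no copy at all under this interface, which is the tradeoff being argued for here.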

pantShrey · 2026-05-10T13:03:43Z

@alamb Thank you so much for the review! I scoped out the SortMergeJoin migration today, specifically looking at bitwise_stream.rs and process_key_match_with_filter, to see what it would take.

Because SortMergeJoin currently reads from the spill file via a synchronous for loop inside a hand-rolled poll state machine, making the read path truly async requires a major rewrite. We can't just .await the stream, so we may need to store the SendableRecordBatchStream in the execution state and manually persist variables like matched_count across Poll::Pending yields.
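
To illustrate why the rewrite is invasive, here is a small sketch of a hand-rolled poll loop that must keep its progress in a struct field rather than a local variable. Everything in it is hypothetical: `FakeSpillStream` stands in for a `SendableRecordBatchStream`, batches are reduced to row counts, and `CountMatches` is not the SortMergeJoin code.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for an async spill stream: each slot is either a
/// ready batch (just a row count here) or `None`, meaning "not ready yet".
pub struct FakeSpillStream {
    slots: VecDeque<Option<usize>>,
}

impl FakeSpillStream {
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<usize>> {
        match self.slots.pop_front() {
            None => Poll::Ready(None), // stream exhausted
            Some(Some(rows)) => Poll::Ready(Some(rows)),
            Some(None) => {
                cx.waker().wake_by_ref(); // ask to be polled again
                Poll::Pending
            }
        }
    }
}

/// `matched_count` lives in the struct because `poll` returns and is
/// re-entered every time the stream is pending.
pub struct CountMatches {
    stream: FakeSpillStream,
    matched_count: usize,
}

impl Future for CountMatches {
    type Output = usize;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<usize> {
        loop {
            match self.stream.poll_next(cx) {
                Poll::Ready(Some(rows)) => self.matched_count += rows,
                Poll::Ready(None) => return Poll::Ready(self.matched_count),
                Poll::Pending => return Poll::Pending, // progress stays in `self`
            }
        }
    }
}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal driver: polls until ready, returning (total rows, poll count).
pub fn drive(slots: Vec<Option<usize>>) -> (usize, usize) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = CountMatches {
        stream: FakeSpillStream { slots: slots.into() },
        matched_count: 0,
    };
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(total) = Pin::new(&mut fut).poll(&mut cx) {
            return (total, polls);
        }
    }
}
```

With `drive(vec![Some(2), None, Some(3)])` the first poll stops at the pending slot with `matched_count == 2`, and the second poll resumes from that value instead of starting over.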

Because ParadeDB is hoping to unblock their Postgres integration next week, I'm worried a state machine rewrite of this scale will stall them.

Would you be open to merging this core abstraction first (with open_sync_reader marked as #[deprecated])? I can open a dedicated tracking issue for the SortMergeJoin async migration and tackle it as a fast follow-up PR.

I am happy to defer to your judgment if you feel the tech debt must be addressed first!

alamb · 2026-05-12T15:02:32Z

I am happy to defer to your judgment if you feel the tech debt must be addressed first!

How about we try it in parallel?

pantShrey · 2026-05-12T15:41:09Z

I am happy to defer to your judgment if you feel the tech debt must be addressed first!

How about we try it in parallel?

@alamb Sure, I have already started working on that locally while waiting for the response.

Also, I am actually still stuck on the test repartition::test::test_preserve_order_with_spilling.

The issue stems from the fact that RepartitionMerge now requires more memory than a RepartitionExec node. Memory is greedily allocated to RepartitionExec, which could have spilled, instead of to RepartitionMerge, which cannot spill.

I would really appreciate any guidance on this, am I missing something obvious here?

alamb · 2026-05-12T20:50:50Z

test_preserve_order_with_spilling

Sadly I am not familiar with this test so I don't have a lot to offer you

Maybe you can look at git history and see who introduced the test and maybe they might have some ideas

pantShrey · 2026-05-12T22:22:10Z

Hey @adriangb, Andrew suggested I reach out to you since you originally authored repartition::test::test_preserve_order_with_spilling. I'm currently hitting a wall with it while migrating the spilling architecture to async streams.

The test is currently stuck in a memory-accounting deadlock. Here’s what is happening:

If I set the memory pool limit tight enough to force a spill, RepartitionMerge panics during initialization. It needs to reserve some memory to set up its streams, but exhausts the pool before completing its unspillable setup.
However, if I increase the pool limit to give Merge enough headroom to initialize safely and then scale up the data volume to force overflow, the RepartitionExec producers greedily consume the additional memory first. This either ends up starving Merge again or allows the query to complete entirely in memory without triggering a spill.

I was able to trigger a spill once by setting the test memory limit to 608 B, but even that was not sufficient for the test to pass reliably.

Is there a correct or idiomatic way to configure this test (batch sizes, data volume, memory pool limits, etc.) to reliably force a RepartitionExec spill without violating the Merge operator’s baseline initialization overhead? Or am I approaching this incorrectly and missing something obvious?

I would really appreciate any guidance you could provide.

adriangb · 2026-05-13T04:00:56Z

IIRC that test was added when we added spilling to RepartitionExec. Conceptually the test is simple: if RepartitionExec is configured to preserve order and it spills, we need to make sure that spilling did not shuffle the data. The orchestration, however, is difficult: forcing a RepartitionExec to spill usually requires skewed upstream partition consumption rates. You could try to change the test to, e.g., use a GroupBy, or maybe we can use a RepartitionExec in isolation if we pull from the streams in the right way. I think the structure can be changed quite a bit as long as we preserve the semantic meaning of the test. I am not surprised that it is pretty fragile to changes.

pantShrey · 2026-05-13T16:26:55Z

@alamb I’ve addressed the nits and force-pushed the updates. Could you please trigger the CI and take another look when you have a moment? In the meantime, I am working on migrating SortMergeJoin to the new spill abstractions in parallel so that both can be reviewed quickly. Thank you again for your time!

pantShrey · 2026-05-13T16:27:50Z

@adriangb Thank you so much for the guidance! I updated the test to simply assert that a spill does occur (spill_count > 0) and that the batch output order remains perfectly sorted, rather than trying to force every single batch to spill. I hope this aligns with the semantic purpose you originally envisioned for the test. I really appreciate your help getting me unstuck here!

adriangb · 2026-05-13T16:35:57Z

@adriangb Thank you so much for the guidance! I updated the test to simply assert that a spill does occur (spill_count > 0) and that the batch output order remains perfectly sorted, rather than trying to force every single batch to spill. I hope this aligns with the semantic purpose you originally envisioned for the test. I really appreciate your help getting me unstuck here!

That makes sense to me.

github-actions · 2026-05-13T18:09:56Z

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details

     Cloning apache/main
    Building datafusion-execution v54.0.0 (current)
       Built [  58.524s] (current)
     Parsing datafusion-execution v54.0.0 (current)
      Parsed [   0.026s] (current)
    Building datafusion-execution v54.0.0 (baseline)
       Built [  28.136s] (baseline)
     Parsing datafusion-execution v54.0.0 (baseline)
      Parsed [   0.025s] (baseline)
    Checking datafusion-execution v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.326s] 223 checks: 220 pass, 3 fail, 0 warn, 30 skip

--- failure auto_trait_impl_removed: auto trait no longer implemented ---

Description:
A public type has stopped implementing one or more auto traits. This can break downstream code that depends on the traits being implemented.
        ref: https://doc.rust-lang.org/reference/special-types-and-traits.html#auto-traits
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/auto_trait_impl_removed.ron

Failed in:
  type DiskManagerBuilder is no longer UnwindSafe, in /home/runner/work/datafusion/datafusion/datafusion/execution/src/disk_manager.rs:38
  type DiskManagerBuilder is no longer RefUnwindSafe, in /home/runner/work/datafusion/datafusion/datafusion/execution/src/disk_manager.rs:38
  type DiskManager is no longer UnwindSafe, in /home/runner/work/datafusion/datafusion/datafusion/execution/src/disk_manager.rs:154
  type DiskManager is no longer UnwindSafe, in /home/runner/work/datafusion/datafusion/datafusion/execution/src/disk_manager.rs:154
  type DiskManagerMode is no longer UnwindSafe, in /home/runner/work/datafusion/datafusion/datafusion/execution/src/disk_manager.rs:123
  type DiskManagerMode is no longer RefUnwindSafe, in /home/runner/work/datafusion/datafusion/datafusion/execution/src/disk_manager.rs:123

--- failure enum_variant_added: enum variant added on exhaustive enum ---

Description:
A publicly-visible enum without #[non_exhaustive] has a new variant.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#enum-variant-new
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/enum_variant_added.ron

Failed in:
  variant DiskManagerMode:Custom in /home/runner/work/datafusion/datafusion/datafusion/execution/src/disk_manager.rs:135

--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/inherent_method_missing.ron

Failed in:
  RefCountedTempFile::update_disk_usage, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/60faab5cff28d1111f7f42ee5fabfd78ee7187b4/datafusion/execution/src/disk_manager.rs:334
  RefCountedTempFile::current_disk_usage, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/60faab5cff28d1111f7f42ee5fabfd78ee7187b4/datafusion/execution/src/disk_manager.rs:382

     Summary semver requires new major version: 3 major and 0 minor checks failed
    Finished [  88.560s] datafusion-execution
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  35.065s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.139s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  34.854s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.139s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.893s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  72.688s] datafusion-physical-plan

pantShrey · 2026-05-13T20:52:23Z

cargo-semver-checks flagged DiskManagerMode::Custom as a breaking change since the enum isn't #[non_exhaustive]. Happy to add it if preferred, but wanted to check first since it would affect downstream users matching on this enum.
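
A short sketch of what the attribute would mean for downstream code. The enum below is a hypothetical mirror: only `Custom` is taken from this PR, the other variant names are assumptions, and payloads are omitted. Note that `#[non_exhaustive]` is only enforced across crate boundaries, so within one file the wildcard arm is optional rather than required.

```rust
/// Hypothetical mirror of the enum; variants other than `Custom` are
/// illustrative assumptions and carry no payloads here.
#[non_exhaustive]
#[derive(Debug)]
pub enum DiskManagerMode {
    OsTmpDirectory,
    Disabled,
    Custom,
}

/// Code written before `Custom` existed keeps compiling because of the
/// wildcard arm, which downstream crates must include once the enum is
/// marked #[non_exhaustive].
pub fn describe(mode: &DiskManagerMode) -> &'static str {
    match mode {
        DiskManagerMode::OsTmpDirectory => "os temp dir",
        DiskManagerMode::Disabled => "disabled",
        _ => "other",
    }
}
```

The cost is the one raised above: adding the attribute is itself a breaking change for any downstream crate that currently matches exhaustively without a wildcard arm.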

pantShrey · 2026-05-14T20:08:44Z

Hey @alamb, quick update! While working on the SortMergeJoin async migration in parallel, I realised the changes were actually quite contained (~260 insertions, ~170 deletions). Rather than opening a second stacked PR and temporarily introducing the open_sync tech debt to main, I went ahead and rolled the refactor directly into this PR to keep things clean. I hope this approach is okay with you!

I believe the PR is now ready for review, so I've marked it as such. I'd appreciate another look whenever you have the time. Thank you!

alamb · 2026-05-15T14:42:00Z

Rather than opening a second stacked PR and temporarily introducing the open_sync tech debt to main, I went ahead and rolled the refactor directly into this PR to keep things clean. I hope this approach is okay with you!

Great -- can you please make a PR for just the SMJ refactor and then stack this PR on it?

alamb · 2026-05-15T14:42:23Z

That will make it easier / faster to review (I am not a SMJ expert so I can't really review that part efficiently)

pantShrey · 2026-05-19T14:16:36Z

Hey @alamb, quick update!

I've reworked both PRs to make them easier to review independently:

refactor: Update SortMergeJoin to use async spill abstractions #22230 (SMJ refactor) no longer depends on this PR. It now migrates SortMergeJoin to async spill abstractions while keeping the concrete RefCountedTempFile type, so it can be reviewed and merged standalone.
This PR now only does one focused thing: it introduces the SpillFile trait + TempFileFactory, swaps the internal type from RefCountedTempFile → Arc<dyn SpillFile>, and makes the related internal changes in the spill module. I've also removed the open_sync tech debt that was here before and added the skip validation in StreamDecoder.

The plan is for #22230 to merge first, then I'll rebase this on top of it. Would be grateful if you could take a look whenever you get the chance!

pantShrey · 2026-06-24T07:57:53Z

@alamb, extremely sorry for the delay in pushing the latest changes. While waiting on your guidance regarding the RecordBatch abstraction, I started working through the other review comments, but I ended up getting fairly stuck on a design decision around the writer adapter and spent more time on it than I expected.

I did make the writer-side changes you suggested: SpillWriter now extends std::io::Write, which allowed me to remove the original SpillWriteAdapter entirely.

However, removing the adapter exposed two issues.

The first is metrics tracking. The adapter was previously the place where I could observe the exact number of compressed IPC bytes written to the backend and update both the global disk usage accounting and the spilled_bytes metrics. This also removed the need for InProgressSpillFile to repeatedly call current_disk_usage() / update_disk_usage().

Without that interception point, I'm struggling to see how to accurately track spill metrics. Today the only estimate available at the RecordBatch level is derived from get_array_memory_size(), which reflects the in-memory Arrow representation rather than the final serialized IPC payload. Once IPC compression (LZ4/ZSTD) is enabled, that value can differ substantially from the actual bytes written to the backend. Additionally, Arrow's StreamWriter doesn't expose the number of bytes written per batch, so there doesn't seem to be a way to observe the final serialized size other than wrapping the std::io::Write boundary itself.

My current compromise is a very small tracking wrapper around SpillWriter that simply forwards std::io::Write calls while counting the exact serialized bytes written by Arrow. This keeps the metrics accurate for compressed spills and also allows backend-local quota tracking to remain at the write boundary rather than requiring every backend to implement current_disk_usage()-style APIs.
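
A minimal sketch of such a tracking wrapper, under stated assumptions: `CountingWriter` and `count_bytes` are hypothetical names, a shared `AtomicU64` stands in for the disk-usage and spilled_bytes accounting, and a `Vec<u8>` stands in for the backend writer. The wrapper in the PR may be structured differently.

```rust
use std::io::{self, Write};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical counting wrapper: forwards every write to `inner` and
/// records how many bytes the inner writer actually accepted.
pub struct CountingWriter<W: Write> {
    inner: W,
    bytes_written: Arc<AtomicU64>,
}

impl<W: Write> CountingWriter<W> {
    pub fn new(inner: W, bytes_written: Arc<AtomicU64>) -> Self {
        Self { inner, bytes_written }
    }

    pub fn into_inner(self) -> W {
        self.inner
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Count what was accepted, not what was offered: `write` may be partial.
        let n = self.inner.write(buf)?;
        self.bytes_written.fetch_add(n as u64, Ordering::Relaxed);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Writes the chunks through the wrapper and returns (bytes counted, data).
pub fn count_bytes(chunks: &[&[u8]]) -> io::Result<(u64, Vec<u8>)> {
    let counter = Arc::new(AtomicU64::new(0));
    let mut writer = CountingWriter::new(Vec::new(), Arc::clone(&counter));
    for chunk in chunks {
        writer.write_all(chunk)?;
    }
    writer.flush()?;
    Ok((counter.load(Ordering::Relaxed), writer.into_inner()))
}
```

Because the count is taken at the `Write` boundary, it reflects whatever the serializer emitted, compressed or not, without the backend having to report its own size.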

Do you think this approach is acceptable, or would you prefer that I revert to the explicit disk-usage tracking approach instead? My concern with the latter is that it becomes difficult for non-filesystem backends to implement efficiently and increases the backend API surface, while still only providing an approximation of the actual spill size.

The second issue is error propagation. Before this change, quota enforcement could return DataFusionError::ResourcesExhausted directly through the datafusion_common::Result path (drain() allowed skipping the io boundary). After moving the interface to std::io::Write, quota failures have to cross the std::io::Error boundary. As far as I can tell, that makes it difficult to preserve ResourcesExhausted semantics all the way up to callers such as the spill operators, which currently check for that specific error type.

However, I audited the codebase (e.g., row_hash.rs, nestedloopjoin.rs) to see if losing this specific enum variant actually breaks any control flow. From what I can see, ResourcesExhausted is exclusively caught by operators to handle memory limits (which triggers the fallback to start spilling). Conversely, hitting a disk limit during an active spill has no fallback -- it is a fatal error that simply aborts the query. Because of this, I removed the adapter's error-stashing workaround and allowed the disk quota failures to just bubble up as standard std::io::Errors with the descriptive text, since the end-user UX and control flow remain identical.
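
For reference, a sketch of what crossing the std::io::Error boundary looks like with only the standard library. `QuotaExceeded` and `QuotaWriter` are hypothetical types, not DataFusion ones, and this is not what the PR does: it shows that a typed error can ride inside an `io::Error` and be recovered by downcasting, provided every layer in between passes the `io::Error` through intact, which is not verified here for the Arrow writer.

```rust
use std::error::Error;
use std::fmt;
use std::io::{self, Write};

/// Hypothetical typed quota error; not a DataFusion type.
#[derive(Debug, PartialEq)]
pub struct QuotaExceeded {
    pub limit: usize,
}

impl fmt::Display for QuotaExceeded {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "spill quota of {} bytes exceeded", self.limit)
    }
}

impl Error for QuotaExceeded {}

/// Writer that rejects any write that would pass `limit` bytes in total.
pub struct QuotaWriter {
    pub used: usize,
    pub limit: usize,
}

impl Write for QuotaWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if self.used + buf.len() > self.limit {
            // The typed error is stored inside the io::Error as its payload.
            return Err(io::Error::new(
                io::ErrorKind::Other,
                QuotaExceeded { limit: self.limit },
            ));
        }
        self.used += buf.len();
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Recover the typed error on the far side of the io boundary, if present.
pub fn as_quota_error(err: &io::Error) -> Option<&QuotaExceeded> {
    err.get_ref()?.downcast_ref::<QuotaExceeded>()
}
```

Even without the downcast, the `Display` output of the `io::Error` is the payload's message, which matches the "descriptive text" behaviour described above.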

On the read path, I'd also like to gently push back on returning Box<dyn std::io::Read>, if that's okay. The original implementation used StreamReader<BufReader<File>>, which required spawn_blocking to avoid blocking the async executor. Avoiding that thread-pool dependency was the original motivation for the state machine. I replaced that path with Arrow's StreamDecoder fed by tokio::fs::File + ReaderStream, which keeps the read side fully async without requiring spawn_blocking.

You're absolutely right that this introduces an extra buffering step compared to StreamReader, so there is a real tradeoff there. My thinking was that the async behaviour was worth that cost, but I'm happy to discuss further if you feel differently.

I've also started experimenting locally with the RecordBatch-level abstraction you suggested, but I didn't want to go too far without first getting your direction. Given the above, would you prefer that I continue down the RecordBatch route and revisit these APIs as part of that refactor, or should I finish the current approach first and keep the RecordBatch abstraction as a follow-up?

alamb · 2026-06-24T21:17:10Z

My current compromise is a very small tracking wrapper around SpillWriter that simply forwards std::io::Write calls while counting the exact serialized bytes written by Arrow. This keeps the metrics accurate for compressed spills and also allows backend-local quota tracking to remain at the write boundary rather than requiring every backend to implement current_disk_usage()-style APIs.

This makes sense to me

alamb · 2026-06-24T21:35:03Z

From what I can see, ResourcesExhausted is exclusively caught by operators to handle memory limits (which triggers the fallback to start spilling). Conversely, hitting a disk limit during an active spill has no fallback -- it is a fatal error that simply aborts the query. Because of this, I removed the adapter's error-stashing workaround and allowed the disk quota failures to just bubble up as standard std::io::Errors with the descriptive text, since the end-user UX and control flow remain identical.

I suppose we could also just define something that looked like std::io::Write but had a different error type 🤮

alamb · 2026-06-24T21:35:35Z

run benchmark external_aggr smj sort_tpch spill_io

adriangbot · 2026-06-24T21:36:38Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4793874108-669-jrcrv 6.12.85+ #1 SMP Mon May 11 08:17:35 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing abstract-spill-file (024ec98) to 5fcc550 (merge-base) diff using: external_aggr
Results will be posted here when complete

adriangbot · 2026-06-24T21:37:49Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4793874108-670-f6jmk 6.12.85+ #1 SMP Mon May 11 08:17:35 UTC 2026 aarch64 GNU/Linux

Comparing abstract-spill-file (024ec98) to 5fcc550 (merge-base) diff using: smj
Results will be posted here when complete

alamb

I am sorry for the long turnaround time on this one @pantShrey -- but it is a fairly fundamental low level building block and I am trying to make sure it doesn't have any unforeseen negative impacts on others downstream.

As this change will require other users to update their code, we should add an upgrade guide entry. I took the liberty of pushing a commit with one.

As long as the benchmarks don't show any slowdown I think we are good to go

cc @2010YOUY01 and @mbutrovich -- perhaps you can have a final look at this API to see if it looks reasonable to you

alamb · 2026-06-24T21:20:11Z

 use datafusion_execution::SendableRecordBatchStream;
-use datafusion_execution::disk_manager::RefCountedTempFile;
 use datafusion_execution::runtime_env::RuntimeEnv;
+use datafusion_execution::spill_file::SpillFile;


it is nice that this function basically just changes RefCountedTempFile to Arc<dyn SpillFile> 👍

alamb · 2026-06-24T21:32:54Z

-        let batch = batch.slice(offset, length);
-        offset += batch.num_rows();
-        writer.write(&batch)?;
+/// A  wrapper that counts the exact compressed IPC bytes written by Arrow.


👍

This makes sense

adriangbot · 2026-06-24T21:38:17Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4793874108-672-kd9hg 6.12.85+ #1 SMP Mon May 11 08:17:35 UTC 2026 aarch64 GNU/Linux

Comparing abstract-spill-file (4904cd7) to 5fcc550 (merge-base) diff using: spill_io
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-06-24T21:38:28Z

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4793874108-671-fcdf7 6.12.85+ #1 SMP Mon May 11 08:17:35 UTC 2026 aarch64 GNU/Linux

Comparing abstract-spill-file (4904cd7) to 5fcc550 (merge-base) diff using: sort_tpch
Results will be posted here when complete

File an issue against this benchmark runner

alamb · 2026-06-24T21:46:12Z

The other thing I think would be helpful in crafting this API is an example of how to provide a user-defined spill manager. For example, it would be interesting to try to implement one that is backed by an ObjectStore to show how a user could use a remote source as a spill location.

I will try to make such an example using this API and report back.

Edit: the PR is here:

Add ObjectStore-backed TempFileFactor / spill example #23170

adriangbot · 2026-06-24T21:54:20Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and abstract-spill-file
--------------------
Benchmark sort_tpch1.json
--------------------
┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query ┃                               HEAD ┃                abstract-spill-file ┃    Change ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Q1    │  164.84 / 165.66 ±0.59 / 166.56 ms │  165.50 / 166.07 ±0.59 / 167.15 ms │ no change │
│ Q2    │  140.97 / 141.78 ±0.82 / 143.24 ms │  141.99 / 143.41 ±1.15 / 145.17 ms │ no change │
│ Q3    │  653.09 / 655.34 ±1.69 / 658.07 ms │  651.11 / 655.03 ±2.16 / 656.93 ms │ no change │
│ Q4    │  195.33 / 197.64 ±3.57 / 204.73 ms │  196.02 / 197.48 ±1.33 / 199.45 ms │ no change │
│ Q5    │  279.10 / 280.15 ±0.94 / 281.71 ms │  279.14 / 280.18 ±0.83 / 281.11 ms │ no change │
│ Q6    │  293.39 / 295.03 ±0.90 / 295.85 ms │  292.47 / 297.87 ±6.15 / 308.42 ms │ no change │
│ Q7    │  472.58 / 475.57 ±2.16 / 477.88 ms │  466.57 / 468.88 ±1.57 / 470.91 ms │ no change │
│ Q8    │  333.33 / 341.98 ±7.37 / 352.17 ms │  327.11 / 331.25 ±2.63 / 335.19 ms │ no change │
│ Q9    │ 347.86 / 362.05 ±11.58 / 375.08 ms │ 342.58 / 350.16 ±10.85 / 371.25 ms │ no change │
│ Q10   │ 500.05 / 519.53 ±20.74 / 546.77 ms │  484.92 / 493.79 ±6.00 / 501.49 ms │ no change │
│ Q11   │  246.90 / 256.73 ±9.84 / 274.39 ms │ 245.79 / 252.65 ±12.38 / 277.40 ms │ no change │
└───────┴────────────────────────────────────┴────────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                  │ 3691.47ms │
│ Total Time (abstract-spill-file)   │ 3636.79ms │
│ Average Time (HEAD)                │  335.59ms │
│ Average Time (abstract-spill-file) │  330.62ms │
│ Queries Faster                     │         0 │
│ Queries Slower                     │         0 │
│ Queries with No Change             │        11 │
│ Queries with Failure               │         0 │
└────────────────────────────────────┴───────────┘

Resource Usage

sort_tpch — base (merge-base)

Metric	Value
Wall time	20.0s
Peak memory	2.5 GiB
Avg memory	1.2 GiB
CPU user	66.2s
CPU sys	3.0s
Peak spill	0 B

sort_tpch — branch

Metric	Value
Wall time	20.0s
Peak memory	2.4 GiB
Avg memory	1.2 GiB
CPU user	66.7s
CPU sys	2.8s
Peak spill	0 B

File an issue against this benchmark runner

adriangbot · 2026-06-24T21:54:36Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and abstract-spill-file
--------------------
Benchmark smj.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                 HEAD ┃                  abstract-spill-file ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │          8.73 / 8.87 ±0.11 / 9.03 ms │          8.48 / 8.65 ±0.12 / 8.81 ms │     no change │
│ QQuery 2  │    173.51 / 175.16 ±1.62 / 177.82 ms │    172.88 / 177.16 ±4.98 / 183.72 ms │     no change │
│ QQuery 3  │    105.93 / 106.85 ±0.64 / 107.59 ms │    106.49 / 107.72 ±0.93 / 109.25 ms │     no change │
│ QQuery 4  │       28.39 / 28.58 ±0.13 / 28.75 ms │       28.65 / 28.85 ±0.17 / 29.12 ms │     no change │
│ QQuery 5  │       21.67 / 21.84 ±0.20 / 22.20 ms │       21.72 / 21.99 ±0.26 / 22.44 ms │     no change │
│ QQuery 6  │    169.58 / 175.41 ±5.59 / 185.10 ms │    175.53 / 178.79 ±3.86 / 185.76 ms │     no change │
│ QQuery 7  │    211.12 / 216.14 ±9.21 / 234.55 ms │    212.57 / 214.02 ±0.86 / 215.20 ms │     no change │
│ QQuery 8  │       20.17 / 20.42 ±0.16 / 20.56 ms │       20.36 / 20.63 ±0.21 / 20.94 ms │     no change │
│ QQuery 9  │    214.68 / 222.24 ±9.45 / 240.74 ms │    215.47 / 218.22 ±2.04 / 221.07 ms │     no change │
│ QQuery 10 │       69.59 / 75.10 ±4.96 / 84.24 ms │       72.36 / 75.92 ±3.20 / 80.88 ms │     no change │
│ QQuery 11 │       27.25 / 27.42 ±0.12 / 27.56 ms │       27.35 / 27.86 ±0.29 / 28.23 ms │     no change │
│ QQuery 12 │       70.02 / 72.16 ±1.55 / 73.84 ms │       65.35 / 69.86 ±2.64 / 73.45 ms │     no change │
│ QQuery 13 │    109.20 / 112.69 ±4.81 / 122.15 ms │    101.47 / 108.63 ±5.19 / 117.24 ms │     no change │
│ QQuery 14 │       70.41 / 72.19 ±1.18 / 73.75 ms │       70.46 / 71.34 ±0.95 / 73.05 ms │     no change │
│ QQuery 15 │       70.53 / 77.58 ±8.14 / 93.44 ms │       71.41 / 72.53 ±1.31 / 75.09 ms │ +1.07x faster │
│ QQuery 16 │       12.81 / 13.02 ±0.17 / 13.24 ms │       12.97 / 13.56 ±0.36 / 14.03 ms │     no change │
│ QQuery 17 │    148.51 / 149.57 ±0.56 / 150.08 ms │    150.55 / 151.94 ±1.30 / 154.27 ms │     no change │
│ QQuery 18 │    112.08 / 112.95 ±0.91 / 114.70 ms │    113.02 / 115.98 ±5.26 / 126.48 ms │     no change │
│ QQuery 19 │   387.10 / 534.74 ±73.94 / 578.14 ms │   568.72 / 579.96 ±10.29 / 598.18 ms │  1.08x slower │
│ QQuery 20 │ 1268.56 / 1275.26 ±5.76 / 1285.00 ms │ 1279.00 / 1284.84 ±4.46 / 1291.94 ms │     no change │
│ QQuery 21 │       95.50 / 96.89 ±1.15 / 98.98 ms │      95.48 / 98.93 ±5.87 / 110.61 ms │     no change │
│ QQuery 22 │    100.85 / 106.00 ±5.55 / 116.59 ms │    101.83 / 103.29 ±1.35 / 105.42 ms │     no change │
│ QQuery 23 │    109.82 / 110.55 ±0.64 / 111.45 ms │    110.50 / 111.26 ±0.71 / 112.45 ms │     no change │
│ QQuery 24 │       27.32 / 28.59 ±1.66 / 31.88 ms │       27.39 / 27.89 ±0.29 / 28.18 ms │     no change │
│ QQuery 25 │       71.68 / 73.67 ±2.27 / 78.07 ms │       67.95 / 71.68 ±2.44 / 75.49 ms │     no change │
│ QQuery 26 │    105.40 / 110.50 ±5.41 / 120.86 ms │    106.68 / 108.75 ±1.77 / 110.90 ms │     no change │
└───────────┴──────────────────────────────────────┴──────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                  │ 4024.37ms │
│ Total Time (abstract-spill-file)   │ 4070.25ms │
│ Average Time (HEAD)                │  154.78ms │
│ Average Time (abstract-spill-file) │  156.55ms │
│ Queries Faster                     │         1 │
│ Queries Slower                     │         1 │
│ Queries with No Change             │        24 │
│ Queries with Failure               │         0 │
└────────────────────────────────────┴───────────┘

Resource Usage

smj — base (merge-base)

Metric	Value
Wall time	25.0s
Peak memory	627.4 MiB
Avg memory	279.5 MiB
CPU user	169.1s
CPU sys	3.4s
Peak spill	0 B

smj — branch

Metric	Value
Wall time	25.0s
Peak memory	634.8 MiB
Avg memory	276.8 MiB
CPU user	167.4s
CPU sys	3.4s
Peak spill	0 B

File an issue against this benchmark runner

adriangbot · 2026-06-24T21:54:49Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                  HEAD                                   abstract-spill-file
-----                                  ----                                   -------------------
spill_compression/q16/lz4_frame        1.00     39.0±3.66ms        ? ?/sec    1.47     57.5±0.89ms        ? ?/sec
spill_compression/q16/uncompressed     1.00     36.3±7.70ms        ? ?/sec    2.21     80.2±1.54ms        ? ?/sec
spill_compression/q16/zstd             1.00     63.7±3.56ms        ? ?/sec    1.24     78.7±2.59ms        ? ?/sec
spill_compression/q2/lz4_frame         1.00     18.9±3.68ms        ? ?/sec    1.40     26.5±0.43ms        ? ?/sec
spill_compression/q2/uncompressed      1.00    19.1±11.60ms        ? ?/sec    1.91     36.5±0.73ms        ? ?/sec
spill_compression/q2/zstd              1.00    32.4±10.13ms        ? ?/sec    1.04     33.8±0.47ms        ? ?/sec
spill_compression/q20/lz4_frame        1.00     25.6±2.86ms        ? ?/sec    1.49     38.2±0.70ms        ? ?/sec
spill_compression/q20/uncompressed     1.00     23.5±3.94ms        ? ?/sec    2.06     48.4±0.85ms        ? ?/sec
spill_compression/q20/zstd             1.00     45.9±0.66ms        ? ?/sec    1.15     53.0±1.45ms        ? ?/sec
spill_compression/wide/lz4_frame       1.00     92.4±6.29ms        ? ?/sec    1.58    145.7±1.73ms        ? ?/sec
spill_compression/wide/uncompressed    1.00     99.8±5.27ms        ? ?/sec    2.17    216.7±3.17ms        ? ?/sec
spill_compression/wide/zstd            1.00    162.5±5.29ms        ? ?/sec    1.23    199.7±2.65ms        ? ?/sec
spill_io/StreamReader/read_100/        1.00     51.2±9.19ms        ? ?/sec    2.88    147.2±3.32ms        ? ?/sec

Resource Usage

spill_io — base (merge-base)

Metric	Value
Wall time	475.1s
Peak memory	243.3 MiB
Avg memory	35.7 MiB
CPU user	119.4s
CPU sys	44.8s
Peak spill	0 B

spill_io — branch

Metric	Value
Wall time	505.1s
Peak memory	335.5 MiB
Avg memory	65.3 MiB
CPU user	155.4s
CPU sys	94.8s
Peak spill	0 B

File an issue against this benchmark runner

alamb · 2026-06-24T21:57:12Z

🤔 hmm the spill_io benchmark looks pretty bad: #21882 (comment)

adriangbot · 2026-06-24T22:10:50Z

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and abstract-spill-file
--------------------
Benchmark external_aggr.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃                               HEAD ┃                abstract-spill-file ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Q1(64.0 MB)  │     56.10 / 59.52 ±2.89 / 64.35 ms │     58.65 / 62.52 ±3.23 / 68.39 ms │ 1.05x slower │
│ Q1(32.0 MB)  │     50.06 / 53.68 ±2.49 / 57.65 ms │     56.73 / 58.42 ±1.71 / 61.63 ms │ 1.09x slower │
│ Q1(16.0 MB)  │     51.58 / 55.33 ±2.55 / 58.88 ms │     60.85 / 62.24 ±0.72 / 62.82 ms │ 1.12x slower │
│ Q2(512.0 MB) │ 282.03 / 304.71 ±22.38 / 333.45 ms │ 269.78 / 299.49 ±27.33 / 347.85 ms │    no change │
│ Q2(256.0 MB) │ 248.60 / 263.29 ±16.62 / 293.08 ms │  260.72 / 267.83 ±6.19 / 277.68 ms │    no change │
│ Q2(128.0 MB) │ 245.61 / 269.83 ±40.40 / 350.46 ms │  264.78 / 269.21 ±4.26 / 277.30 ms │    no change │
│ Q2(64.0 MB)  │  248.06 / 251.73 ±2.85 / 255.35 ms │  281.25 / 284.44 ±2.43 / 286.85 ms │ 1.13x slower │
│ Q2(32.0 MB)  │ 319.21 / 365.31 ±71.36 / 504.54 ms │  365.67 / 371.74 ±6.17 / 383.59 ms │    no change │
└──────────────┴────────────────────────────────────┴────────────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                  │ 1623.40ms │
│ Total Time (abstract-spill-file)   │ 1675.90ms │
│ Average Time (HEAD)                │  202.93ms │
│ Average Time (abstract-spill-file) │  209.49ms │
│ Queries Faster                     │         0 │
│ Queries Slower                     │         4 │
│ Queries with No Change             │         4 │
│ Queries with Failure               │         0 │
└────────────────────────────────────┴───────────┘

Resource Usage

external_aggr — base (merge-base)

Metric	Value
Wall time	510.1s
Peak memory	630.4 MiB
Avg memory	11.2 MiB
CPU user	25.2s
CPU sys	3.7s
Peak spill	0 B

external_aggr — branch

Metric	Value
Wall time	510.1s
Peak memory	462.1 MiB
Avg memory	10.0 MiB
CPU user	33.8s
CPU sys	10.7s
Peak spill	0 B

File an issue against this benchmark runner

alamb · 2026-06-24T22:35:58Z

+
+/// Factory for creating spill files.
+pub trait TempFileFactory:
+    Send + Sync + std::panic::UnwindSafe + std::panic::RefUnwindSafe


Why does this need to be std::panic::UnwindSafe + std::panic::RefUnwindSafe? If there is a good reason, I think it should be documented on the trait as well.

I hit this while working on an example of using this API in

Add ObjectStore-backed TempFileFactor / spill example #23170

IIRC, it was a semver failure that forced me to add it.

Because DiskManager and DiskManagerBuilder now hold an Arc<dyn TempFileFactory>, they lost their auto-implemented UnwindSafe and RefUnwindSafe traits. This triggered a cargo-semver-checks failure because existing downstream code relies on DiskManager being unwind-safe:

Checking datafusion-execution v53.1.0 -> v53.1.0 (no change; assume patch)
--- failure auto_trait_impl_removed: auto trait no longer implemented ---
  type DiskManagerBuilder is no longer UnwindSafe
  type DiskManagerBuilder is no longer RefUnwindSafe
  type DiskManager is no longer UnwindSafe
  ...

I didn't have any other architectural reason for adding the bounds beyond satisfying the checker. Since this PR already requires an upgrade guide, would you prefer I remove these bounds and we just accept the semver break, or should I keep them and add a doc comment explaining why they are there?

I think we should accept the "semver" break, as this is a breaking API change anyway.

Sure, I have removed the trait bounds.
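
The auto-trait behaviour discussed in this thread can be reproduced with a small standalone sketch. The trait and struct names below are hypothetical stand-ins, not DataFusion's types; the sketch only relies on the standard library rule that `Arc<T>` is `UnwindSafe` when `T: RefUnwindSafe`, and that a trait object implements only the auto traits named in its bounds or supertraits.

```rust
use std::panic::{RefUnwindSafe, UnwindSafe};
use std::sync::Arc;

// Hypothetical factory traits: one without and one with the unwind-safety bounds.
trait PlainFactory: Send + Sync {}
trait BoundedFactory: Send + Sync + UnwindSafe + RefUnwindSafe {}

// Holding `Arc<dyn PlainFactory>` removes the auto-implemented UnwindSafe /
// RefUnwindSafe, because `dyn PlainFactory` promises neither.
#[allow(dead_code)]
struct PlainManager {
    factory: Arc<dyn PlainFactory>,
}

// With the supertrait bounds, the trait object (and so the struct) keeps both.
#[allow(dead_code)]
struct BoundedManager {
    factory: Arc<dyn BoundedFactory>,
}

// Compile-time check: this only type-checks for unwind-safe types.
fn assert_unwind_safe<T: UnwindSafe + RefUnwindSafe>() {}

fn main() {
    assert_unwind_safe::<BoundedManager>();
    // Uncommenting the next line is expected to fail to compile, which is
    // the same loss of auto traits that the semver check reported:
    // assert_unwind_safe::<PlainManager>();
}
```

The trade-off is that the bounds then apply to every implementor of the trait, which is why dropping them and accepting the break can be the less restrictive choice for backend authors.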

alamb · 2026-06-24T22:42:01Z

+}
+
+/// Writer for spill file backends.
+pub trait SpillWriter: std::io::Write + Send {


It was also strange that SpillWriter is a sync API, while the read stream API is async:

fn read_stream(&self) -> Result<Pin<Box<dyn Stream<Item = Result<Bytes>> + Send>>>;

I found this while working on an example showing how to write to a remote object store

Add ObjectStore-backed TempFileFactor / spill example #23170
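
To illustrate the asymmetry, here is one generic way a synchronous `std::io::Write` can feed a backend that does its real work elsewhere. This is a sketch under stated assumptions, not the PR's SpillWriteAdapter and not the approach taken in #23170: `ChannelWriter`, `spawn_uploader`, and `roundtrip` are hypothetical names, and a plain thread with an in-memory buffer stands in for an asynchronous upload task.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical sync writer: each `write` hands a copy of the buffer to a
/// channel instead of touching storage directly.
struct ChannelWriter {
    tx: Sender<Vec<u8>>,
}

impl Write for ChannelWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.tx
            .send(buf.to_vec())
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "uploader stopped"))?;
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        // Nothing is buffered locally; chunks are already queued.
        Ok(())
    }
}

/// Stand-in for a remote backend: drains the channel on another thread and
/// returns everything it received once all senders are dropped.
fn spawn_uploader(rx: Receiver<Vec<u8>>) -> JoinHandle<Vec<u8>> {
    thread::spawn(move || {
        let mut stored = Vec::new();
        for chunk in rx {
            stored.extend_from_slice(&chunk);
        }
        stored
    })
}

/// Writes `parts` through the sync writer and returns what the backend saw.
fn roundtrip(parts: &[&str]) -> Vec<u8> {
    let (tx, rx) = mpsc::channel();
    let uploader = spawn_uploader(rx);
    let mut writer = ChannelWriter { tx };
    for part in parts {
        writer.write_all(part.as_bytes()).expect("uploader is alive");
    }
    drop(writer); // closing the channel lets the uploader finish
    uploader.join().expect("uploader thread panicked")
}

fn main() {
    assert_eq!(roundtrip(&["spill-", "data"]), b"spill-data".to_vec());
}
```

A real adapter would also need backpressure (for example a bounded channel) and a way to surface backend errors to the writer, both of which this sketch leaves out.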


pantShrey · 2026-06-25T21:26:56Z

@alamb I spent some time looking into the spill_io regression.

From what I can tell, increasing the ReaderStream capacity somewhat restores the benchmark performance locally. I initially tried going up to 1 MB, but capacities around 256 KB started failing the SQL logic tests, so for now I've pushed 128 KB.

My current understanding is that the previous StreamReader + BufReader implementation, despite having an 8 KB buffer, would typically read an entire IPC frame without yielding. With the current Tokio async stream, once the initial 8 KB is consumed, the task yields repeatedly while reading the remainder of the frame. Combined with the "copy" into the decoder's scratch buffer, this seems to add noticeable overhead for multi-MB frames/batches.

tokio_util::io::ReaderStream::with_capacity(file, 128 * 1024)

I've pushed the change mainly so the CI benchmarks can run and to get your thoughts. Does this direction make sense, or should I make it configurable instead?
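
As a rough model of the reasoning above, the sketch below counts how many chunked reads it takes to consume one frame at a given buffer capacity. It is a simplification that uses an in-memory `Cursor` rather than Tokio, so each read stands in for a point at which an async stream could yield; the 4 MiB frame size and the function name `count_reads` are assumptions chosen for illustration.

```rust
use std::io::{Cursor, Read};

/// Counts the non-empty reads needed to consume a frame of `frame_len`
/// bytes through a buffer of `capacity` bytes.
fn count_reads(frame_len: usize, capacity: usize) -> usize {
    let mut source = Cursor::new(vec![0u8; frame_len]);
    let mut buf = vec![0u8; capacity];
    let mut reads = 0;
    loop {
        let n = source.read(&mut buf).expect("in-memory read cannot fail");
        if n == 0 {
            break; // end of frame
        }
        reads += 1;
    }
    reads
}

fn main() {
    let frame = 4 * 1024 * 1024; // assumed 4 MiB frame
    println!("8 KiB buffer:   {} reads", count_reads(frame, 8 * 1024));
    println!("128 KiB buffer: {} reads", count_reads(frame, 128 * 1024));
}
```

Under this model, a 4 MiB frame takes 512 reads at 8 KiB and 32 reads at 128 KiB, so a larger capacity reduces the number of potential yield points per frame by the same factor.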

github-actions Bot added the execution and physical-plan labels Apr 27, 2026

philippemnoel mentioned this pull request May 5, 2026

JOINs: M3: Add Support for Spilling to Disk paradedb/paradedb#4064

Open

tomz mentioned this pull request May 8, 2026

PROPOSAL Hash Join Spilling Proposal #17267

Open

alamb reviewed May 9, 2026

pantShrey force-pushed the abstract-spill-file branch from 2971e41 to de6697f Compare May 13, 2026 12:56

github-actions Bot added the auto detected api change label May 13, 2026

pantShrey force-pushed the abstract-spill-file branch from e31bff4 to 086632a Compare May 14, 2026 19:38

pantShrey marked this pull request as ready for review May 14, 2026 20:08

pantShrey mentioned this pull request May 15, 2026

refactor: Update SortMergeJoin to use async spill abstractions #22230

Merged

pantShrey marked this pull request as draft May 19, 2026 11:33

pantShrey force-pushed the abstract-spill-file branch from 915532b to 6954b55 Compare May 19, 2026 14:08

pantShrey added 2 commits June 24, 2026 13:17

Merge remote-tracking branch 'upstream/main' into abstract-spill-file

4d9dd6a

Merge remote-tracking branch 'upstream/main' into abstract-spill-file

420d69d

pantShrey force-pushed the abstract-spill-file branch from 7632429 to 420d69d Compare June 24, 2026 07:49

alamb added 2 commits June 24, 2026 17:28

Add upgrade guide

b6fdd7a

Merge remote-tracking branch 'apache/main' into abstract-spill-file

024ec98

github-actions Bot added the documentation label Jun 24, 2026

prettier

4904cd7

alamb approved these changes Jun 24, 2026

alamb mentioned this pull request Jun 24, 2026

Add ObjectStore-backed TempFileFactor / spill example #23170

Draft

alamb reviewed Jun 24, 2026

pantShrey added 4 commits June 26, 2026 00:49

Merge remote-tracking branch 'upstream/main' into abstract-spill-file

895ad6e

refactor: remove UnwindSafe and RefUnwindSafe trait bounds from TempF…

4942949

…ileFactory

perf: increase spill reader stream capacity to 1MB to reduce async ov…

e350919

…erhead

decrease spill reader stream capacity to 128KB

ae9ecc9

Conversation

pantShrey commented Apr 27, 2026 • edited

pantShrey commented Apr 27, 2026 • edited

alamb commented May 7, 2026

alamb commented May 9, 2026

alamb left a comment

pantShrey commented May 10, 2026

alamb commented May 12, 2026

pantShrey commented May 12, 2026 • edited

alamb commented May 12, 2026

pantShrey commented May 12, 2026

adriangb commented May 13, 2026

pantShrey commented May 13, 2026

pantShrey commented May 13, 2026

adriangb commented May 13, 2026

github-actions Bot commented May 13, 2026 • edited

pantShrey commented May 13, 2026

pantShrey commented May 14, 2026

alamb commented May 15, 2026

alamb commented May 15, 2026

pantShrey commented May 19, 2026

pantShrey commented Jun 24, 2026

alamb commented Jun 24, 2026

alamb commented Jun 24, 2026

alamb commented Jun 24, 2026

adriangbot commented Jun 24, 2026

adriangbot commented Jun 24, 2026

alamb left a comment
