Kept null-aware anti-join NULLs in the pushed dynamic filter. by mdashti · Pull Request #2 · paradedb/datafusion

mdashti · 2026-06-22T23:51:48Z

What

A null-aware anti join (NOT IN) wrongly returned rows when its inner side had a NULL. This PR makes the pushed-down dynamic filter keep the probe's NULL rows, so the result collapses to zero rows as it should.

Why

A hash join pushes a build-side filter (key IN build_keys) down to the probe scan. That's fine for most joins, but the probe's NULL is special for NOT IN: one NULL makes every comparison unknown, so the result has to be empty. The pushed filter dropped that NULL at the scan, before the join's null-aware check ran, so rows leaked through.
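The three-valued logic behind this can be sketched in plain Rust. This is only an illustration of SQL `NOT IN` semantics, not DataFusion code: the helper names `not_in` and `select_not_in` and the sample values are assumptions made for the example, with `None` standing in for NULL.

```rust
/// SQL three-valued `x NOT IN (set)`.
/// Returns Some(true), Some(false), or None for "unknown".
/// Hypothetical helper for illustration; not part of DataFusion.
fn not_in(x: Option<i64>, set: &[Option<i64>]) -> Option<bool> {
    if set.is_empty() {
        // NOT IN over an empty set is true, even for a NULL `x`.
        return Some(true);
    }
    // A NULL `x` compared against a non-empty set is unknown.
    let x = x?;
    if set.contains(&Some(x)) {
        // A definite match makes NOT IN false.
        return Some(false);
    }
    if set.iter().any(|v| v.is_none()) {
        // No match, but a NULL in the set makes the answer unknown.
        return None;
    }
    Some(true)
}

/// Keeps only the outer rows for which NOT IN is definitely true,
/// which is what a WHERE clause does with an unknown result.
fn select_not_in(outer: &[Option<i64>], inner: &[Option<i64>]) -> Vec<Option<i64>> {
    outer
        .iter()
        .copied()
        .filter(|x| not_in(*x, inner) == Some(true))
        .collect()
}
```

With assumed outer values 1, 2, 3 and an inner side of 2 and NULL, `select_not_in` returns nothing. If the NULL is removed from the inner side before the check, the same call returns 1 and 3, which is the shape of the leak described above.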

How

Threaded null_aware into SharedBuildAccumulator. When it's set, build_filter ORs probe_key IS NULL into the predicate before updating the dynamic filter. Non-NULL probe rows still get filtered, so the optimization stays. HashJoinExec already validates that a null-aware join has a single probe key.
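As a rough model of the change, the sketch below evaluates the pushed predicate against probe keys with and without the extra `IS NULL` branch. The `PushedFilter` struct and the `keeps` and `scan` functions are hypothetical stand-ins invented for this example; the real predicate is built from physical expressions inside `build_filter`, which is not reproduced here.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the dynamic filter pushed to the probe scan.
/// Not DataFusion's API; it only models the predicate's truth table.
struct PushedFilter {
    build_keys: HashSet<i64>,
    /// When true, the predicate is `key IN build_keys OR key IS NULL`.
    null_aware: bool,
}

impl PushedFilter {
    /// Returns true when the probe scan should keep the row.
    fn keeps(&self, probe_key: Option<i64>) -> bool {
        match probe_key {
            // Non-NULL keys are filtered by `key IN build_keys` either way.
            Some(k) => self.build_keys.contains(&k),
            // `NULL IN (...)` is unknown, so the row is dropped unless the
            // `OR key IS NULL` branch was added for a null-aware join.
            None => self.null_aware,
        }
    }
}

/// Applies the filter the way a row-level scan would.
fn scan(filter: &PushedFilter, probe: &[Option<i64>]) -> Vec<Option<i64>> {
    probe.iter().copied().filter(|k| filter.keeps(*k)).collect()
}
```

In this model, a probe side of 2, NULL, 7 against build keys 1, 2, 3 yields only 2 without the flag, and 2 plus the NULL with it. The non-matching key 7 is dropped in both cases, so the filter still prunes the scan.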

Tests

Added a parquet repro to null_aware_anti_join.slt. The existing cases use in-memory VALUES, whose scans never apply the pushed filter, so they passed despite the bug. The new one sets parquet.pushdown_filters = true so the filter runs row-level. Without the fix it returns 1, 3; with it, zero rows.

## Which issue does this PR close?

- Closes apache#21419

## Rationale for this change

The CSV scanner currently uses `calculate_range`, which issues two extra `get_opts` requests per byte range to find newline boundaries (one for the start boundary, one for the end boundary), plus one GET for the actual data. For a file split into 3 partitions, this results in 8 total object store requests.

apache#20823 solved this same problem for the JSON scanner by introducing `AlignedBoundaryStream`, which wraps the raw byte stream and lazily aligns to newline boundaries as data is read, eliminating the extra boundary-seeking requests entirely. This PR applies the same approach to CSV.

## What changes are included in this PR?

Based on the approach from apache#20823:

- Moved `AlignedBoundaryStream` from `datasource-json` to the shared `datasource` crate so it can be reused by both JSON and CSV scanners.
- Updated `CsvOpener` to use `AlignedBoundaryStream` instead of `calculate_range`, and removed `calculate_range` and `find_first_newline`, as they no longer had any callers.
- Updated tests to reflect the change.

Public API changes include:

- Removal of the `RangeCalculation` enum
- Removal of the `calculate_range` function
- `boundary_stream` (containing `AlignedBoundaryStream`) moved from `datafusion_datasource_json` to `datafusion_datasource`

## Are these changes tested?

Yes. The existing `AlignedBoundaryStream` unit tests (16 tests covering boundary alignment edge cases) were moved along with the implementation and continue to pass. The `query_csv_file_with_byte_range_partitions` snapshot test in `object_store_access.rs` has been updated to verify the new request pattern (4 requests instead of 8).

## Are there any user-facing changes?

No.
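The idea of aligning a byte range to newline boundaries can be illustrated on an in-memory buffer. This is a simplified sketch and not the `AlignedBoundaryStream` implementation: the functions `align` and `aligned_slice` are hypothetical, and the ownership rule (a line belongs to the range that contains its first byte) is an assumption chosen so that adjacent ranges cover the file without overlap.

```rust
/// Returns the first line start at or after `pos`, or `data.len()` if
/// there is none. A line starts at 0 or right after a `\n`.
/// Hypothetical helper; works on a full buffer instead of a stream.
fn align(data: &[u8], pos: usize) -> usize {
    if pos == 0 {
        return 0;
    }
    // A line starts at `p` when `data[p - 1]` is a newline, so search
    // for the first newline at index `pos - 1` or later.
    let from = (pos - 1).min(data.len());
    match data[from..].iter().position(|&b| b == b'\n') {
        Some(off) => from + off + 1,
        None => data.len(),
    }
}

/// Returns the whole lines owned by the raw byte range `start..end`.
/// Assumes `start <= end`; a range that owns no line start yields an
/// empty slice.
fn aligned_slice(data: &[u8], start: usize, end: usize) -> &[u8] {
    &data[align(data, start)..align(data, end)]
}
```

For a 15-byte buffer `"a,1\nbb,2\nccc,3\n"` split into the raw ranges 0..5, 5..10, and 10..15, the first range yields the first two lines, the second yields the third line, and the last yields nothing, so every line is read exactly once.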

philippemnoel · 2026-06-23T00:37:27Z

Is this different from what Mithun fixed? apache#22104

mdashti · 2026-06-23T00:48:55Z

@philippemnoel interesting! Didn't see that. It showed up in our tests in paradedb/paradedb#5363 . Should check if it's the same fix or not.

mdashti · 2026-06-23T00:52:38Z

@philippemnoel AFAI understand, PR apache#22104 is a different bug: it preserves null_aware on the logical JoinNode proto round-trip (a serialization drop). The current PR fixes the physical dynamic filter dropping the probe NULL at runtime. Different layers.

philippemnoel · 2026-06-23T01:06:03Z

Cool. Awesome then :)

The hash-join dynamic filter pushed `key IN build_keys` down to the probe scan for null-aware anti joins too. That drops the probe-side NULL, but `NOT IN` three-valued logic needs it to collapse the result to zero rows, so the join silently returned rows. OR `probe_key IS NULL` into the pushed predicate. Non-NULL probe rows still get filtered; only the NULL additionally survives.

Exercises the pushdown path the existing in-memory tests miss: parquet with row-level filtering, so the pushed dynamic filter actually drops rows. Without the fix `id NOT IN (SELECT eid ...)` returns 1 and 3 instead of zero rows.

mdashti · 2026-06-23T01:30:25Z

Superseded by the upstream PR apache#23104.

github-actions · 2026-06-23T01:44:32Z

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details

     Cloning apache/main
    Building datafusion v54.0.0 (current)
       Built [ 103.959s] (current)
     Parsing datafusion v54.0.0 (current)
      Parsed [   0.034s] (current)
    Building datafusion v54.0.0 (baseline)
       Built [ 100.176s] (baseline)
     Parsing datafusion v54.0.0 (baseline)
      Parsed [   0.035s] (baseline)
    Checking datafusion v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.607s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 206.814s] datafusion
    Building datafusion-datasource v54.0.0 (current)
       Built [  35.997s] (current)
     Parsing datafusion-datasource v54.0.0 (current)
      Parsed [   0.031s] (current)
    Building datafusion-datasource v54.0.0 (baseline)
       Built [  35.898s] (baseline)
     Parsing datafusion-datasource v54.0.0 (baseline)
      Parsed [   0.031s] (baseline)
    Checking datafusion-datasource v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.249s] 223 checks: 221 pass, 2 fail, 0 warn, 30 skip

--- failure enum_missing: pub enum removed or renamed ---

Description:
A publicly-visible enum cannot be imported by its prior path. A `pub use` may have been removed, or the enum itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/enum_missing.ron

Failed in:
  enum datafusion_datasource::RangeCalculation, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/340b07beb964ef4ccc14e3c5b2e76a96e456b41e/datafusion/datasource/src/mod.rs:427

--- failure function_missing: pub fn removed or renamed ---

Description:
A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/function_missing.ron

Failed in:
  function datafusion_datasource::calculate_range, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/340b07beb964ef4ccc14e3c5b2e76a96e456b41e/datafusion/datasource/src/mod.rs:442

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  73.776s] datafusion-datasource
    Building datafusion-datasource-csv v54.0.0 (current)
       Built [  36.232s] (current)
     Parsing datafusion-datasource-csv v54.0.0 (current)
      Parsed [   0.011s] (current)
    Building datafusion-datasource-csv v54.0.0 (baseline)
       Built [  36.052s] (baseline)
     Parsing datafusion-datasource-csv v54.0.0 (baseline)
      Parsed [   0.011s] (baseline)
    Checking datafusion-datasource-csv v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.111s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  73.823s] datafusion-datasource-csv
    Building datafusion-datasource-json v54.0.0 (current)
       Built [  36.388s] (current)
     Parsing datafusion-datasource-json v54.0.0 (current)
      Parsed [   0.011s] (current)
    Building datafusion-datasource-json v54.0.0 (baseline)
       Built [  35.742s] (baseline)
     Parsing datafusion-datasource-json v54.0.0 (baseline)
      Parsed [   0.013s] (baseline)
    Checking datafusion-datasource-json v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.089s] 223 checks: 220 pass, 3 fail, 0 warn, 30 skip

--- failure module_missing: pub module removed or renamed ---

Description:
A publicly-visible module cannot be imported by its prior path. A `pub use` may have been removed, or the module may have been renamed, removed, or made non-public.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/module_missing.ron

Failed in:
  mod datafusion_datasource_json::boundary_stream, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/340b07beb964ef4ccc14e3c5b2e76a96e456b41e/datafusion/datasource-json/src/boundary_stream.rs:18

--- failure pub_module_level_const_missing: pub module-level const is missing ---

Description:
A public const is missing or renamed
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/pub_module_level_const_missing.ron

Failed in:
  END_SCAN_LOOKAHEAD in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/340b07beb964ef4ccc14e3c5b2e76a96e456b41e/datafusion/datasource-json/src/boundary_stream.rs:36

--- failure struct_missing: pub struct removed or renamed ---

Description:
A publicly-visible struct cannot be imported by its prior path. A `pub use` may have been removed, or the struct itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/struct_missing.ron

Failed in:
  struct datafusion_datasource_json::boundary_stream::AlignedBoundaryStream, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/340b07beb964ef4ccc14e3c5b2e76a96e456b41e/datafusion/datasource-json/src/boundary_stream.rs:64

     Summary semver requires new major version: 3 major and 0 minor checks failed
    Finished [  73.705s] datafusion-datasource-json
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  34.795s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.125s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  35.110s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.134s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.592s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  72.235s] datafusion-physical-plan
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 174.747s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.021s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 173.951s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.023s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.088s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 352.566s] datafusion-sqllogictest

github-actions Bot added the physical-plan label Jun 22, 2026

mdashti force-pushed the moe/null-aware-dynamic-filter branch from d1369d1 to a71cb76 Compare June 23, 2026 00:28

github-actions Bot added the sqllogictest label Jun 23, 2026

mdashti added 2 commits June 22, 2026 18:29

mdashti force-pushed the moe/null-aware-dynamic-filter branch from 5162364 to f30d9bc Compare June 23, 2026 01:29

github-actions Bot added core datasource labels Jun 23, 2026

mdashti closed this Jun 23, 2026

philippemnoel deleted the moe/null-aware-dynamic-filter branch June 24, 2026 04:17

mdashti wants to merge 3 commits into main from moe/null-aware-dynamic-filter

3 participants
