Kept null-aware anti-join NULLs in the pushed dynamic filter. #2

Closed
mdashti wants to merge 3 commits into main from moe/null-aware-dynamic-filter

mdashti commented Jun 22, 2026

Collaborator

What

A null-aware anti join (NOT IN) wrongly returned rows when its inner side had a NULL. This PR makes the pushed-down dynamic filter keep the probe's NULL rows, so the result collapses to zero as it should.

Why

A hash join pushes a build-side filter (key IN build_keys) down to the probe scan. That's fine for most joins, but the probe's NULL is special for NOT IN: one NULL makes every comparison unknown, so the result has to be empty. The pushed filter dropped that NULL at the scan, before the join's null-aware check ran, so rows leaked through.
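
The three-valued logic behind this can be sketched in a few lines of Rust. This is an illustrative model only, not DataFusion code: `not_in` and `not_in_rows` are hypothetical helpers, `None` stands in for SQL NULL (and for UNKNOWN on output), and the sample values used in the note after the block are assumptions rather than the actual test data.

```rust
/// Evaluates `value NOT IN (subquery)` under SQL three-valued logic.
/// `None` stands for NULL on input and for UNKNOWN on output.
/// (Hypothetical helper for illustration; not part of DataFusion.)
fn not_in(value: Option<i64>, subquery: &[Option<i64>]) -> Option<bool> {
    // NOT IN over an empty set is TRUE, even when `value` is NULL.
    if subquery.is_empty() {
        return Some(true);
    }
    // A NULL value compared against a non-empty set is UNKNOWN.
    let v = match value {
        Some(v) => v,
        None => return None,
    };
    // A definite match makes NOT IN FALSE.
    if subquery.contains(&Some(v)) {
        return Some(false);
    }
    // No match, but a NULL in the set could still equal `v`: UNKNOWN.
    if subquery.contains(&None) {
        return None;
    }
    Some(true)
}

/// Keeps the outer rows for which the predicate is definitely TRUE,
/// which is how a WHERE clause treats UNKNOWN.
fn not_in_rows(outer: &[Option<i64>], subquery: &[Option<i64>]) -> Vec<Option<i64>> {
    outer
        .iter()
        .copied()
        .filter(|v| not_in(*v, subquery) == Some(true))
        .collect()
}
```

With outer values 1, 2, 3 and a subquery producing 2 and NULL, `not_in_rows` returns no rows. If the NULL is removed from the subquery side before the check runs, the same call returns 1 and 3, which is the shape of the leak described above.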

How

Threaded null_aware into SharedBuildAccumulator. When it's set, build_filter ORs probe_key IS NULL into the predicate before updating the dynamic filter. Non-NULL probe rows still get filtered, so the optimization stays. HashJoinExec already validates that a null-aware join has a single probe key.
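
A minimal sketch of the predicate change, assuming a single integer probe key. The functions `pushed_filter` and `scan_probe` are hypothetical stand-ins for illustration; the real logic lives in `SharedBuildAccumulator` and operates on physical expressions, not on plain values like this.

```rust
use std::collections::HashSet;

/// Models the dynamic filter evaluated at the probe scan.
/// Without `null_aware` it is `key IN build_keys`;
/// with it, `key IN build_keys OR key IS NULL`.
/// (Hypothetical model; not the DataFusion implementation.)
fn pushed_filter(probe_key: Option<i64>, build_keys: &HashSet<i64>, null_aware: bool) -> bool {
    match probe_key {
        // Non-NULL keys are filtered the same way in both modes.
        Some(k) => build_keys.contains(&k),
        // `NULL IN (...)` is never TRUE, so the row is dropped unless
        // the `IS NULL` arm was ORed in.
        None => null_aware,
    }
}

/// Applies the pushed filter row by row, as a scan with row-level
/// filtering would.
fn scan_probe(
    probe: &[Option<i64>],
    build_keys: &HashSet<i64>,
    null_aware: bool,
) -> Vec<Option<i64>> {
    probe
        .iter()
        .copied()
        .filter(|k| pushed_filter(*k, build_keys, null_aware))
        .collect()
}
```

For build keys {1, 2, 3} and probe rows 2, NULL, 9, the plain filter keeps only 2, while the null-aware variant keeps 2 and NULL. The non-matching 9 is dropped either way, so the pruning benefit is preserved.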

Tests

Added a parquet repro to null_aware_anti_join.slt. The existing cases use in-memory VALUES, whose scans never apply the pushed filter, so they passed despite the bug. The new one sets parquet.pushdown_filters = true so the filter runs row-level. Without the fix it returns 1, 3; with it, zero rows.

mdashti force-pushed the moe/null-aware-dynamic-filter branch from d1369d1 to a71cb76 on June 23, 2026 00:28
## Which issue does this PR close?

- Closes apache#21419

## Rationale for this change

The CSV scanner currently uses `calculate_range` which issues two extra
`get_opts` requests per byte range to find newline boundaries (one for
the start boundary, one for the end boundary), plus one GET for the
actual data. For a file split into 3 partitions, this results in 8 total
object store requests.

apache#20823 solved this same problem for the JSON scanner by introducing
`AlignedBoundaryStream`, which wraps the raw byte stream and lazily
aligns to newline boundaries as data is read, eliminating the extra
boundary-seeking requests entirely. This PR applies the same approach to
CSV.
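
The alignment idea can be illustrated on an in-memory buffer. This is a simplified sketch, not the `AlignedBoundaryStream` implementation: the real type wraps a byte stream and aligns lazily, whereas the hypothetical `next_line_start` and `aligned_range` below work on a slice. It assumes that a line belongs to the byte range containing its first byte, and it ignores newlines inside quoted CSV fields.

```rust
/// Returns the offset of the first line start at or after `pos`.
/// (Hypothetical helper for illustration.)
fn next_line_start(data: &[u8], pos: usize) -> usize {
    if pos == 0 {
        return 0;
    }
    if pos >= data.len() {
        return data.len();
    }
    // Scan from `pos - 1`, so that a newline immediately before `pos`
    // means a line starts exactly at `pos`.
    match data[pos - 1..].iter().position(|&b| b == b'\n') {
        Some(i) => pos + i, // (pos - 1 + i) is the newline; the line starts after it
        None => data.len(),
    }
}

/// Returns the complete lines owned by the raw byte range `[start, end)`.
/// Assumed convention: a line belongs to the range that contains its
/// first byte, so adjacent ranges never split or duplicate a line.
fn aligned_range(data: &[u8], start: usize, end: usize) -> &[u8] {
    let s = next_line_start(data, start);
    // Guard against an inverted range so the slice below cannot panic.
    let e = next_line_start(data, end.max(start));
    &data[s..e]
}
```

Splitting a 16-byte file `a,b\n1,2\n3,4\n5,6\n` into the raw ranges `[0, 5)`, `[5, 11)`, and `[11, 16)` yields `a,b\n1,2\n`, `3,4\n`, and `5,6\n`, which concatenate back to the original file with no row split or repeated.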

## What changes are included in this PR?

Based on the approach from apache#20823:

Moved `AlignedBoundaryStream` from `datasource-json` to the shared
`datasource` crate so it can be reused by both JSON and CSV scanners.
Updated `CsvOpener` to use it instead of `calculate_range`, and removed
`calculate_range` and `find_first_newline`, as they no longer had any
callers. Updated the tests to match.

Public API changes include:
- Removal of the `RangeCalculation` enum
- Removal of the `calculate_range` function
- `boundary_stream` (containing `AlignedBoundaryStream`) moved from
`datafusion_datasource_json` to `datafusion_datasource`

## Are these changes tested?

Yes. The existing `AlignedBoundaryStream` unit tests (16 tests covering
boundary alignment edge cases) were moved along with the implementation
and continue to pass. The `query_csv_file_with_byte_range_partitions`
snapshot test in `object_store_access.rs` has been updated to verify the
new request pattern (4 requests instead of 8).


## Are there any user-facing changes?

No.
@philippemnoel

Member

Is this different from what Mithun fixed? apache#22104

mdashti commented Jun 23, 2026

Collaborator Author

@philippemnoel interesting! I didn't see that. This showed up in our tests in paradedb/paradedb#5363. I should check whether it's the same fix.

mdashti commented Jun 23, 2026

Collaborator Author

@philippemnoel As far as I understand, PR apache#22104 fixes a different bug: it preserves null_aware on the logical JoinNode proto round-trip (the flag was dropped during serialization). The current PR fixes the physical dynamic filter dropping the probe NULL at runtime. Different layers.

@philippemnoel

Member

Cool. Awesome then :)

mdashti added 2 commits June 22, 2026 18:29
The hash-join dynamic filter pushed `key IN build_keys` down to the probe
scan for null-aware anti joins too. That drops the probe-side NULL, but
`NOT IN` three-valued logic needs it to collapse the result to zero rows,
so the join silently returned rows.

OR `probe_key IS NULL` into the pushed predicate. Non-NULL probe rows
still get filtered; only the NULL additionally survives.

Exercises the pushdown path the existing in-memory tests miss: parquet with
row-level filtering, so the pushed dynamic filter actually drops rows. Without
the fix `id NOT IN (SELECT eid ...)` returns 1 and 3 instead of zero rows.
mdashti force-pushed the moe/null-aware-dynamic-filter branch from 5162364 to f30d9bc on June 23, 2026 01:29

mdashti commented Jun 23, 2026

Collaborator Author

Superseded by the upstream PR apache#23104.

mdashti closed this Jun 23, 2026
@github-actions


Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v54.0.0 (current)
       Built [ 103.959s] (current)
     Parsing datafusion v54.0.0 (current)
      Parsed [   0.034s] (current)
    Building datafusion v54.0.0 (baseline)
       Built [ 100.176s] (baseline)
     Parsing datafusion v54.0.0 (baseline)
      Parsed [   0.035s] (baseline)
    Checking datafusion v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.607s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 206.814s] datafusion
    Building datafusion-datasource v54.0.0 (current)
       Built [  35.997s] (current)
     Parsing datafusion-datasource v54.0.0 (current)
      Parsed [   0.031s] (current)
    Building datafusion-datasource v54.0.0 (baseline)
       Built [  35.898s] (baseline)
     Parsing datafusion-datasource v54.0.0 (baseline)
      Parsed [   0.031s] (baseline)
    Checking datafusion-datasource v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.249s] 223 checks: 221 pass, 2 fail, 0 warn, 30 skip

--- failure enum_missing: pub enum removed or renamed ---

Description:
A publicly-visible enum cannot be imported by its prior path. A `pub use` may have been removed, or the enum itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/enum_missing.ron

Failed in:
  enum datafusion_datasource::RangeCalculation, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/340b07beb964ef4ccc14e3c5b2e76a96e456b41e/datafusion/datasource/src/mod.rs:427

--- failure function_missing: pub fn removed or renamed ---

Description:
A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/function_missing.ron

Failed in:
  function datafusion_datasource::calculate_range, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/340b07beb964ef4ccc14e3c5b2e76a96e456b41e/datafusion/datasource/src/mod.rs:442

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  73.776s] datafusion-datasource
    Building datafusion-datasource-csv v54.0.0 (current)
       Built [  36.232s] (current)
     Parsing datafusion-datasource-csv v54.0.0 (current)
      Parsed [   0.011s] (current)
    Building datafusion-datasource-csv v54.0.0 (baseline)
       Built [  36.052s] (baseline)
     Parsing datafusion-datasource-csv v54.0.0 (baseline)
      Parsed [   0.011s] (baseline)
    Checking datafusion-datasource-csv v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.111s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  73.823s] datafusion-datasource-csv
    Building datafusion-datasource-json v54.0.0 (current)
       Built [  36.388s] (current)
     Parsing datafusion-datasource-json v54.0.0 (current)
      Parsed [   0.011s] (current)
    Building datafusion-datasource-json v54.0.0 (baseline)
       Built [  35.742s] (baseline)
     Parsing datafusion-datasource-json v54.0.0 (baseline)
      Parsed [   0.013s] (baseline)
    Checking datafusion-datasource-json v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.089s] 223 checks: 220 pass, 3 fail, 0 warn, 30 skip

--- failure module_missing: pub module removed or renamed ---

Description:
A publicly-visible module cannot be imported by its prior path. A `pub use` may have been removed, or the module may have been renamed, removed, or made non-public.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/module_missing.ron

Failed in:
  mod datafusion_datasource_json::boundary_stream, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/340b07beb964ef4ccc14e3c5b2e76a96e456b41e/datafusion/datasource-json/src/boundary_stream.rs:18

--- failure pub_module_level_const_missing: pub module-level const is missing ---

Description:
A public const is missing or renamed
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/pub_module_level_const_missing.ron

Failed in:
  END_SCAN_LOOKAHEAD in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/340b07beb964ef4ccc14e3c5b2e76a96e456b41e/datafusion/datasource-json/src/boundary_stream.rs:36

--- failure struct_missing: pub struct removed or renamed ---

Description:
A publicly-visible struct cannot be imported by its prior path. A `pub use` may have been removed, or the struct itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/struct_missing.ron

Failed in:
  struct datafusion_datasource_json::boundary_stream::AlignedBoundaryStream, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/340b07beb964ef4ccc14e3c5b2e76a96e456b41e/datafusion/datasource-json/src/boundary_stream.rs:64

     Summary semver requires new major version: 3 major and 0 minor checks failed
    Finished [  73.705s] datafusion-datasource-json
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  34.795s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.125s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  35.110s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.134s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.592s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  72.235s] datafusion-physical-plan
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 174.747s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.021s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 173.951s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.023s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.088s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 352.566s] datafusion-sqllogictest

philippemnoel deleted the moe/null-aware-dynamic-filter branch on June 24, 2026 04:17