Kept null-aware anti-join NULLs in the pushed dynamic filter.#2
Closed
mdashti wants to merge 3 commits into
Closed
Kept null-aware anti-join NULLs in the pushed dynamic filter.#2mdashti wants to merge 3 commits into
mdashti wants to merge 3 commits into
Conversation
d1369d1 to
a71cb76
Compare
## Which issue does this PR close? - Closes apache#21419 ## Rationale for this change The CSV scanner currently uses `calculate_range` which issues two extra `get_opts` requests per byte range to find newline boundaries (one for the start boundary, one for the end boundary), plus one GET for the actual data. For a file split into 3 partitions, this results in 8 total object store requests. apache#20823 solved this same problem for the JSON scanner by introducing `AlignedBoundaryStream`, which wraps the raw byte stream and lazily aligns to newline boundaries as data is read, eliminating the extra boundary-seeking requests entirely. This PR applies the same approach to CSV. ## What changes are included in this PR? Based on the approach from apache#20823: Moved `AlignedBoundaryStream` from `datasource-json` to the shared `datasource` crate so it can be reused by both JSON and CSV scanners. Updated `CsvOpener` to use instead of `calculate_range`, and removed the `calculate_range` & `find_first_newline` as they no longer had any callers. Updated tests to reflect. Public API changes include: - Removal of the `RangeCalculation` enum - Removal of the `calculate_change` function - `boundary_stream` (containing `AlignedBoundaryStream`) moved from `datafusion_datasource_json` to `datafusion_datasource` ## Are these changes tested? Yes. The existing `AlignedBoundaryStream` unit tests (16 tests covering boundary alignment edge cases) were moved along with the implementation and continue to pass. The `query_csv_file_with_byte_range_partitions` snapshot test in `object_store_access.rs` has been updated to verify the new request pattern (4 requests instead of 8). ## Are there any user-facing changes? No.
Member
|
Is this different from what Mithun fixed? apache#22104 |
Collaborator
Author
|
@philippemnoel interesting! Didn't see that. It showed up in our tests in paradedb/paradedb#5363 . Should check if it's the same fix or not. |
Collaborator
Author
|
@philippemnoel AFAI understand, PR apache#22104 is a different bug: it preserves |
Member
|
Cool. Awesome then :) |
The hash-join dynamic filter pushed `key IN build_keys` down to the probe scan for null-aware anti joins too. That drops the probe-side NULL, but `NOT IN` three-valued logic needs it to collapse the result to zero rows, so the join silently returned rows. OR `probe_key IS NULL` into the pushed predicate. Non-NULL probe rows still get filtered; only the NULL additionally survives.
Exercises the pushdown path the existing in-memory tests miss: parquet with row-level filtering, so the pushed dynamic filter actually drops rows. Without the fix `id NOT IN (SELECT eid ...)` returns 1 and 3 instead of zero rows.
5162364 to
f30d9bc
Compare
Collaborator
Author
|
Superseded by the upstream PR apache#23104. |
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
A null-aware anti join (
NOT IN) wrongly returned rows when its inner side had a NULL. This PR makes the pushed-down dynamic filter keep the probe's NULL rows, so the result collapses to zero as it should.Why
A hash join pushes a build-side filter (
key IN build_keys) down to the probe scan. That's fine for most joins, but the probe's NULL is special forNOT IN: one NULL makes every comparison unknown, so the result has to be empty. The pushed filter dropped that NULL at the scan, before the join's null-aware check ran, so rows leaked through.How
Threaded
null_awareintoSharedBuildAccumulator. When it's set,build_filterORsprobe_key IS NULLinto the predicate before updating the dynamic filter. Non-NULL probe rows still get filtered, so the optimization stays.HashJoinExecalready validates that a null-aware join has a single probe key.Tests
Added a parquet repro to
null_aware_anti_join.slt. The existing cases use in-memoryVALUES, whose scans never apply the pushed filter, so they passed despite the bug. The new one setsparquet.pushdown_filters = trueso the filter runs row-level. Without the fix it returns1, 3; with it, zero rows.