Skip to content

refactor: make join projection pushdown schema-aware via ColumnIndex/…#23185

Open
Phoenix500526 wants to merge 2 commits into
apache:mainfrom
Phoenix500526:refactor/23010
Open

refactor: make join projection pushdown schema-aware via ColumnIndex/…#23185
Phoenix500526 wants to merge 2 commits into
apache:mainfrom
Phoenix500526:refactor/23010

Conversation

@Phoenix500526

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

try_pushdown_through_join (datafusion/physical-plan/src/projection.rs)
decided whether each projected join-output column belonged to the left or right
child by comparing its output index against the left child's field count
(join_table_borders) — i.e. it assumed the join output is a plain
left ++ right concatenation.

That assumption does not hold for all join types. A LeftMark / RightMark
join appends a synthetic mark boolean column that has JoinSide::None and
originates from neither child, so reasoning from output position can route the
mark column to the wrong child. This is the panic that #22902 worked around by
disabling projection pushdown for LeftMark / RightMark in HashJoinExec
and NestedLoopJoinExec — correct, but it left the helper reasoning from output
position rather than from the join's actual output-origin contract.

What changes are included in this PR?

  • Thread the join's output-origin metadata (&[ColumnIndex], already computed by
    build_join_schema) into try_pushdown_through_join.
  • Replace the output-position split (join_table_borders +
    index < left_field_count) with grouping by ColumnIndex.side:
    • Left / Right columns are collected per child using the child-relative
      ColumnIndex.index;
    • a JoinSide::None (synthetic, e.g. mark) column makes the pushdown decline
      (Ok(None)), so the projection is embedded into the join instead — no panic,
      no mis-routing.
  • Remove the LeftMark | RightMark bypass from
    HashJoinExec::try_swapping_with_projection and
    NestedLoopJoinExec::try_swapping_with_projection; they now call
    try_pushdown_through_join uniformly.
  • The shared helpers used by cross / symmetric-hash / sort-merge join pushdown
    (new_join_children, join_table_borders, join_allows_pushdown,
    update_join_on, update_join_filter) keep their signatures. The side-grouped
    path uses a new private new_join_children_from_groups and reuses
    update_join_on / update_join_filter with a zero column-index offset (the
    groups already carry child-relative indices).

Are these changes tested?

Yes.

  • New subquery.slt cases place a subset/reordering projection over a mark join
    — hash LeftMark (IN under OR), negated mark (NOT EXISTS under OR),
    and nested-loop mark (non-equi correlated) — asserting both the query result
    and the EXPLAIN plan shape.
  • Existing coverage is preserved: cargo test -p datafusion-physical-plan projection (incl. the filter-pushdown and join_table_borders unit tests),
    the join *.slt suite, and subquery.slt all pass.
  • cargo clippy -p datafusion-physical-plan -- -D warnings and cargo fmt --all
    are clean.

Are there any user-facing changes?

No behavior change: the produced physical plans are unchanged for the supported
cases (mark joins keep falling back to the same embedded-projection plan as
before, regular joins push down as before).

One signature change to a pub helper: try_pushdown_through_join gains a
column_indices: &[ColumnIndex] parameter (all in-tree callers are updated). If
the project treats this helper as part of the public API surface, please add the
api change label.

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) physical-plan Changes to the physical-plan crate labels Jun 25, 2026
…JoinSide

try_pushdown_through_join decided whether each projected join-output column
belonged to the left or right child by comparing its output index against the
left child's field count, assuming a `left ++ right` output schema. That is
wrong for joins whose output is not a plain concatenation -- mark joins append a
synthetic `mark` column (JoinSide::None) belonging to neither child -- which is
why apache#22902 had to bypass projection pushdown for LeftMark / RightMark.

Group the projected columns by the join's output-origin metadata
(ColumnIndex.side) instead of output position: Left/Right columns push to the
matching child via the child-relative ColumnIndex.index, and a synthetic
(JoinSide::None) column makes the pushdown decline so the projection is embedded
into the join instead -- no panic, no mis-routing. This lets the LeftMark /
RightMark bypass be removed from HashJoinExec and NestedLoopJoinExec.

The shared helpers used by cross / symmetric / sort-merge joins
(new_join_children, join_table_borders, update_join_on/filter) keep their
signatures; the side-grouped path uses a new new_join_children_from_groups and
reuses update_join_on/filter with a zero column-index offset. Mark joins still
fall back to embedding (pushing their child columns down is a follow-up).

Closes apache#23010

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  36.108s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.126s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  35.296s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.136s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.575s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure function_missing: pub fn removed or renamed ---

Description:
A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/function_missing.ron

Failed in:
  function datafusion_physical_plan::projection::try_pushdown_through_join, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/c519b008fd84f8f6bb5b5219ec0f7d2212d6fa60/datafusion/physical-plan/src/projection.rs:645

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  73.497s] datafusion-physical-plan
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 173.811s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.021s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 173.233s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.022s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.087s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 349.473s] datafusion-sqllogictest

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 25, 2026
It is a crate-internal helper, called only by HashJoinExec and
NestedLoopJoinExec within datafusion-physical-plan, is not re-exported,
and has no external consumers. Reducing its visibility keeps it off the
public API surface so future signature changes no longer trip
cargo-semver-checks.

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor: Make join projection pushdown schema-aware via ColumnIndex / JoinSide

1 participant