Skip to content

[improvement](be) Optimize nested loop join materialization#62956

Open
BiteTheDDDDt wants to merge 14 commits intoapache:masterfrom
BiteTheDDDDt:improve-nlj-lazy-materialization
Open

[improvement](be) Optimize nested loop join materialization#62956
BiteTheDDDDt wants to merge 14 commits intoapache:masterfrom
BiteTheDDDDt:improve-nlj-lazy-materialization

Conversation

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor

@BiteTheDDDDt BiteTheDDDDt commented Apr 30, 2026

This pull request introduces support for lazy materialization in Nested Loop Join (NLJ) operators, enabling the engine to selectively materialize only the necessary columns (slots) during query execution. This optimization can significantly reduce memory usage and improve performance, especially for wide tables or queries with many columns. The changes span both the backend (BE) and frontend (FE), including new APIs, data structures, and Thrift serialization.

Key changes:

Backend (BE) Lazy Materialization Support

  • Added new member variables and methods to NestedLoopJoinProbeOperatorX and NestedLoopJoinProbeLocalState to support lazy materialization, including tracking which columns are materialized and enabling/disabling lazy materialization features. (nested_loop_join_probe_operator.h) [1] [2] [3]
  • Updated logic for row and block processing to leverage lazy materialization, such as methods for advancing probe rows, generating blocks, and appending rows with default values. (nested_loop_join_probe_operator.h)

Frontend (FE) Materialized Slot Tracking

  • Introduced tracking of materialized slot IDs in NestedLoopJoinNode, including new data structures (materializedSlotIds, materializedSlotIdMap) and API methods to manage them. (NestedLoopJoinNode.java) [1] [2]
  • Updated the plan translator to determine and propagate which slots should be materialized for NLJ nodes, including handling for mark join slots and project nodes. (PhysicalPlanTranslator.java) [1] [2] [3]

Thrift & Explain Integration

  • Extended the Thrift definition (PlanNodes.thrift) to include an optional materialized_slot_ids field for TNestedLoopJoinNode, allowing the FE to communicate materialization requirements to the BE.
  • Enhanced explain output for NLJ nodes to show which slot IDs are being materialized, aiding debugging and observability. (NestedLoopJoinNode.java)

Code Quality & Safety

  • Improved header includes and type safety in the BE codebase, and added additional checks for row descriptor pointers to prevent null dereferences. (nested_loop_join_probe_operator.h) [1] [2] [3] [4]

References:

SET enable_fallback_to_original_planner = false;
SET disable_join_reorder = true;
SET runtime_filter_mode = OFF;
SET enable_runtime_filter_prune = false;
SET enable_profile = true;
SET profile_level = 2;
SET parallel_pipeline_task_num = 1;

DROP TABLE IF EXISTS nlj_lazy_perf2_probe;
CREATE TABLE nlj_lazy_perf2_probe (
    id BIGINT NOT NULL,
    k INT NOT NULL,
    g INT NOT NULL,
    v1 BIGINT NOT NULL,
    v2 BIGINT NOT NULL,
    v3 BIGINT NOT NULL,
    v4 BIGINT NOT NULL,
    s1 VARCHAR(256) NOT NULL,
    s2 VARCHAR(256) NOT NULL,
    s3 VARCHAR(256) NOT NULL,
    s4 VARCHAR(256) NOT NULL
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 16
PROPERTIES ("replication_num" = "1");

INSERT INTO nlj_lazy_perf2_probe
SELECT
    number AS id,
    CAST(number % 100000 AS INT) AS k,
    CAST(number % 1000 AS INT) AS g,
    number + 1 AS v1,
    number + 2 AS v2,
    number + 3 AS v3,
    number + 4 AS v4,
    CONCAT(REPEAT('p1_', 40), CAST(number AS STRING)) AS s1,
    CONCAT(REPEAT('p2_', 40), CAST(number AS STRING)) AS s2,
    CONCAT(REPEAT('p3_', 40), CAST(number AS STRING)) AS s3,
    CONCAT(REPEAT('p4_', 40), CAST(number AS STRING)) AS s4
FROM numbers("number" = "1000000");

DROP TABLE IF EXISTS nlj_lazy_perf2_build;
CREATE TABLE nlj_lazy_perf2_build (
    bid BIGINT NOT NULL,
    lo INT NOT NULL,
    hi INT NOT NULL,
    g_lo INT NOT NULL,
    g_hi INT NOT NULL,
    v1 BIGINT NOT NULL,
    v2 BIGINT NOT NULL,
    s1 VARCHAR(256) NOT NULL,
    s2 VARCHAR(256) NOT NULL
)
DUPLICATE KEY(bid)
DISTRIBUTED BY HASH(bid) BUCKETS 1
PROPERTIES ("replication_num" = "1");

INSERT INTO nlj_lazy_perf2_build
SELECT
    number AS bid,
    CAST(number * 1000 AS INT) AS lo,
    CAST(number * 1000 + 1 AS INT) AS hi,
    0 AS g_lo,
    1 AS g_hi,
    number + 11 AS v1,
    number + 12 AS v2,
    CONCAT(REPEAT('b1_', 40), CAST(number AS STRING)) AS s1,
    CONCAT(REPEAT('b2_', 40), CAST(number AS STRING)) AS s2
FROM numbers("number" = "100");

ANALYZE TABLE nlj_lazy_perf2_probe WITH SYNC;
ANALYZE TABLE nlj_lazy_perf2_build WITH SYNC;

-- =============================================================================
-- Part A: delayed materialization effect
--
-- ON predicate uses only p.k, p.g, b.lo, b.hi, b.g_lo, b.g_hi.
-- Final aggregate uses wide payload strings, so payload columns cannot be
-- globally pruned by FE. They can only be delayed until after ON filtering.
-- About 100 of 100,000 keys are selected; each selected key has 10 rows.
-- Expected matched rows: about 1,000.
-- CRC32() is used instead of LENGTH() so string data must be read, not only
-- varchar offset subcolumns.
-- =============================================================================
-- A1. Main delayed-materialization benchmark.
SELECT 
       'A1_lazy_payload_after_selective_on' AS case_name,
       COUNT(*) AS matched_rows,
       SUM(CRC32(p.s1) + CRC32(p.s2) + CRC32(p.s3) + CRC32(p.s4)
           + CRC32(b.s1) + CRC32(b.s2)) AS payload_checksum
FROM nlj_lazy_perf2_probe p
JOIN nlj_lazy_perf2_build b
    ON p.k >= b.lo
   AND p.k < b.hi
   AND p.g >= b.g_lo
   AND p.g < b.g_hi;
5.89 sec -> 0.20 sec

-- A2. Same selective ON, but with additional numeric payload. This keeps the
-- shape close to real projection/aggregate workloads.
SELECT 
       'A2_lazy_mixed_payload_after_selective_on' AS case_name,
       COUNT(*) AS matched_rows,
       SUM(p.v1 + p.v2 + p.v3 + p.v4 + b.v1 + b.v2
           + CRC32(p.s1) + CRC32(p.s2) + CRC32(p.s3) + CRC32(p.s4)
           + CRC32(b.s1) + CRC32(b.s2)) AS payload_checksum
FROM nlj_lazy_perf2_probe p
JOIN nlj_lazy_perf2_build b
    ON p.k >= b.lo
   AND p.k < b.hi
   AND p.g >= b.g_lo
   AND p.g < b.g_hi;
6.05 sec -> 0.24 sec
-- =============================================================================
-- Part B: const/eval-block effect; no post-join delayed materialization space
--
-- B1 uses only narrow ON columns and COUNT(*). There is no payload to
-- materialize after ON, so this stresses placeholder/default rows and fixed-side
-- eval columns over 100M candidate pairs.
--
-- B2 adds string payload into the ON predicate but still only outputs COUNT(*).
-- These strings are needed for predicate evaluation, not delayed output
-- materialization. This stresses eval-block ColumnConst behavior for payload
-- expressions without giving delayed materialization a post-filter payload win.
-- =============================================================================

-- B1. All 100M candidate pairs pass, output is COUNT(*).
SELECT 
       'B1_const_count_all_match' AS case_name,
       COUNT(*) AS joined_rows
FROM nlj_lazy_perf2_probe p
JOIN nlj_lazy_perf2_build b
    ON p.k + b.lo >= 0;
0.30 sec -> 0.21 sec

-- B2. All 100M candidate pairs pass, payload strings participate only in ON.
SELECT 
       'B2_const_string_predicate_all_match' AS case_name,
       COUNT(*) AS joined_rows
FROM nlj_lazy_perf2_probe p
JOIN nlj_lazy_perf2_build b
    ON p.k + b.lo >= 0
   AND CRC32(p.s1) + CRC32(b.s1) >= 0
   AND CRC32(p.s2) + CRC32(b.s2) >= 0;
0.60 sec-> 0.51 sec

Copilot AI review requested due to automatic review settings April 30, 2026 02:05
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds FE-planned materialized-slot information for nested loop joins and updates the BE nested loop join probe operator to support delaying payload materialization until after non-equi join conjunct evaluation, reducing unnecessary row expansion/copy for projection-pruned queries (e.g. COUNT(*)).

Changes:

  • Extend TNestedLoopJoinNode thrift with optional materialized_slot_ids to drive BE materialization behavior.
  • FE: Track/map intermediate slots and emit materialized_slot_ids, including explicitly-empty lists for constant/COUNT projections.
  • BE: Implement lazy join-block construction for inner/cross nested loop joins by evaluating join conjuncts using a lightweight eval block and only materializing required payload columns.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
gensrc/thrift/PlanNodes.thrift Adds materialized_slot_ids to nested loop join plan node thrift.
fe/fe-core/src/main/java/org/apache/doris/planner/NestedLoopJoinNode.java Stores and serializes materialized slot ids (and explain output).
fe/fe-core/src/main/java/org/apache/doris/nereids/glue/translator/PhysicalPlanTranslator.java Computes materialized slots for NLJ intermediate tuple; prunes them based on project requirements.
be/src/exec/operator/nested_loop_join_probe_operator.h Declares lazy-materialization helper methods and new state fields.
be/src/exec/operator/nested_loop_join_probe_operator.cpp Implements lazy join generation paths and enables them under specific conditions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1995 to +1996
Map<ExprId, SlotReference> materializedSlotReferenceMap = Maps.newHashMap(outputSlotReferenceMap);
nestedLoopJoinNode.enableMaterializedSlotIds();
!_is_mark_join &&
(_join_op == TJoinOp::INNER_JOIN || _join_op == TJoinOp::CROSS_JOIN) &&
conjuncts().empty() && !projections().empty() &&
&projections_row_desc() == &intermediate_row_desc();
Comment on lines +1053 to +1055
// Slots that need to be materialized after join conjunct evaluation.
// If this field is not set, BE keeps the legacy behavior.
// If this field is set to an empty list, no payload slot needs materialization.
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review result: request changes. I found a blocking nested-loop-join correctness regression, and the new correctness-critical lazy materialization path has no dedicated test coverage in this PR.

Critical checkpoint conclusions:

  • Goal/test proof: the goal is to avoid materializing pruned nested-loop-join payload columns before join-conjunct evaluation. The implementation does not fully preserve column invariants, and no regression/unit test proves the new lazy path.
  • Scope: the actual GitHub PR is focused on nested-loop-join materialization and the FE/BE thrift plumbing.
  • Concurrency: no new shared concurrent state issue found in the actual PR diff.
  • Lifecycle/static initialization: no special lifecycle or cross-TU static initialization issue found.
  • Configuration: no config item added.
  • Compatibility: the new thrift field is optional and BE has a legacy fallback when absent; no mixed-version issue found there.
  • Parallel paths: the change handles FE translation plus BE execution for nested-loop join, but the BE mark-join null append path regresses from the base implementation.
  • Special checks: no new special conditional check issue found.
  • Test coverage/results: insufficient; this kernel execution change needs tests for projection-pruned inner/cross nested-loop joins, filtered join conjuncts, zero materialized slots, and nullable/outer/mark-adjacent paths.
  • Observability: no new observability requirement identified.
  • Transaction/persistence/data writes: not applicable.
  • FE/BE variables: materialized_slot_ids is passed through FE/thrift/BE, but the BE output column handling has a size mismatch bug.
  • Performance: intended optimization is plausible, but correctness must be fixed and covered first.
  • User focus: no additional user-provided review focus was supplied.

}
if (_has_materialized_slot_ids) {
for (const auto slot_id : _materialized_slot_ids) {
const int column_id = intermediate_row_desc().get_column_id(slot_id);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regresses the base behavior from resizing by _probe_side_process_count to resizing by 1. The nested nullable column above inserts _probe_side_process_count probe rows, so when more than one row is appended the ColumnNullable nested column and null map have different sizes. That can crash or corrupt later block processing/serialization. Please keep the null map size aligned with the inserted range.

Suggested change
const int column_id = intermediate_row_desc().get_column_id(slot_id);
.resize_fill(origin_sz + _probe_side_process_count, 0);

}

if (!_matched_rows_done && !_need_more_input_data) {
if (use_generate_block_base_build()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enables a new execution path that bypasses the existing full materialization plus _do_filtering_and_update_visited_flags() flow for inner/cross nested-loop joins, but the PR does not add any BE unit or regression coverage for it. This is a kernel execution correctness change; please add tests covering at least projection-pruned joins with non-equi join conjuncts, all-slots-pruned cases such as COUNT(*), and nullable columns so the lazy placeholder/materialization logic is exercised.

HappenLee
HappenLee previously approved these changes May 6, 2026
Copy link
Copy Markdown
Contributor

@HappenLee HappenLee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

PR approved by at least one committer and no changes requested.

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 6, 2026
BiteTheDDDDt and others added 2 commits May 6, 2026 12:53
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Nested loop join materialized full intermediate payload columns before evaluating non-equi join conjuncts, making COUNT and projection-pruned joins pay unnecessary copy/filter cost.

### Release note

Improve nested loop join execution by delaying payload materialization for projection-pruned inner/cross joins.

### Check List (For Author)

- Test: Build and style checks
    - ./generated-source.sh
    - build-support/clang-format.sh be/src/exec/operator/nested_loop_join_probe_operator.cpp be/src/exec/operator/nested_loop_join_probe_operator.h && build-support/check-format.sh
    - cd fe && mvn checkstyle:check -pl fe-core -DskipTests
    - ./build.sh --fe -j8
    - ./build.sh --be -j8
    - build-support/run-clang-tidy.sh (blocked by pre-existing readability-function-size/readability-function-cognitive-complexity diagnostics in nested_loop_join_probe_operator.cpp and jni-util.h static_assert diagnostics)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oins

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Extend nested loop join lazy materialization beyond inner/cross joins by tracking probe-side match state and finalizing left outer/semi/anti rows without eagerly materializing unused payload columns.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - ./run-be-ut.sh --run --filter='*NLJ*' -j8

    - build-support/clang-format.sh

    - build-support/check-format.sh

    - git diff --check

    - build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN (blocked by existing be/src/util/jni-util.h static_assert diagnostics after changed-line warnings were addressed)

- Behavior changed: Yes. LEFT OUTER/LEFT SEMI/LEFT ANTI nested loop joins can use the lazy materialization path when the existing lazy-materialization gates are satisfied.

- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BiteTheDDDDt BiteTheDDDDt force-pushed the improve-nlj-lazy-materialization branch from 6933d37 to 25a2d28 Compare May 6, 2026 06:31
@github-actions github-actions Bot removed the approved Indicates a PR has been approved by one committer. label May 6, 2026
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

/review

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional correctness issue in the new lazy nested-loop-join path. Existing review threads already cover the missing tests, nullable null-map sizing, IDL comment clarity, BE guard clarity, and the FE map allocation, so I did not duplicate those.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to reduce nested-loop-join materialization by passing required slot IDs from FE and lazily materializing in BE. The approach is plausible, but the current left-side finalization state machine can drop output rows; existing coverage also does not prove this path.
  • Scope/focus: The change is focused on NLJ materialization, but it introduces a large alternate execution path with separate row-advancement semantics.
  • Concurrency/lifecycle: No new shared-state concurrency or special lifecycle issue found beyond the existing operator local-state flow.
  • Config/compatibility: No new config. Thrift field is optional, so old FE keeps legacy behavior; mixed-version fallback appears intended.
  • Parallel paths: Hash join has separate pruning logic; this PR intentionally handles NLJ. The lazy NLJ path must still preserve legacy outer/semi/anti finalization semantics.
  • Conditional checks: The lazy enablement conditions are already covered by an existing comment; no duplicate raised.
  • Test coverage: Existing comments already request coverage. The issue below should be covered by a regression such as LEFT OUTER/LEFT ANTI nested-loop join with an empty build side and enough probe rows to cross a batch boundary.
  • Observability/performance: No additional blocking observability issue found.
  • Transaction/persistence/data-write: Not applicable.
  • FE/BE variable passing: The new thrift field is passed on the Nereids NLJ path; no additional passing gap found.

User focus: No additional user-provided review focus was specified.

return Status::OK();
}
if (_join_block.rows() >= state->batch_size()) {
return Status::OK();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning OK here when the output batch is already full lets _advance_lazy_probe_row() continue to _reset_with_next_probe_row() and move _probe_block_pos forward without emitting the required finalized row. A concrete case is LEFT OUTER JOIN or LEFT ANTI JOIN with an empty build side: _current_build_pos == build_blocks.size() remains true, so _generate_lazy_block_base_probe() repeatedly finalizes probe rows; after one append fills state->batch_size(), the next probe row hits this early return, then line 342 advances past it, dropping the null/default output row. Please make finalization report that the current probe row was not consumed, or otherwise stop before advancing when the row cannot be appended.

BiteTheDDDDt and others added 7 commits May 6, 2026 15:48
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Remove the redundant nested loop join old-version flag from the lazy gate and allow lazy materialization when post-join conjuncts are present, relying on FE materialized slot ids to cover post-join filter inputs.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - ./run-be-ut.sh --run --filter='*NLJ*' -j8

    - build-support/clang-format.sh

    - build-support/check-format.sh

    - git diff --check

    - build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN (blocked by existing be/src/util/jni-util.h static_assert diagnostics)

- Behavior changed: Yes. NLJ lazy materialization can now remain enabled when post-join conjuncts exist and required slots are provided by FE.

- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Remove fallback-style descriptor handling from the nested loop join probe operator now that nested-loop plan descriptors are expected to exist, and use explicit assertions for the invariant.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - ./run-be-ut.sh --run --filter='*NLJ*' -j8

    - build-support/clang-format.sh

    - build-support/check-format.sh

    - git diff --check

    - build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN (blocked by existing be/src/util/jni-util.h static_assert diagnostics)

- Behavior changed: No

- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Introduce lazy output-policy helpers for nested loop join and support build-side visited tracking/finalization so RIGHT and FULL nested loop joins can use lazy materialization under the existing gates.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - ./run-be-ut.sh --run --filter='*NLJ*' -j8

    - build-support/clang-format.sh

    - build-support/check-format.sh

    - git diff --check

    - build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN (blocked by existing be/src/util/jni-util.h static_assert diagnostics)

- Behavior changed: Yes. RIGHT OUTER/RIGHT SEMI/RIGHT ANTI/FULL OUTER nested loop joins can use lazy materialization when the existing lazy-materialization gates are satisfied.

- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Extend nested loop join lazy materialization to additional finalization cases. NULL-aware left anti join can now use probe-side lazy finalization, simple mark left semi/anti joins without mark conjuncts can lazily produce mark output, and build-side finalization now keeps a row cursor so RIGHT/FULL batches do not overrun batch size or duplicate rows when resuming inside a build block.

### Release note

None

### Check List (For Author)

- Test: Unit Test / Manual test
    - Unit Test: ./run-be-ut.sh --run --filter='*NLJ*' -j8
    - Manual test: build-support/clang-format.sh; build-support/check-format.sh; git diff --check
    - Manual test: build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN was run but blocked by existing be/src/util/jni-util.h static_assert(false) diagnostics
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Complete lazy materialization support for left semi/anti mark nested loop joins. The lazy path now keeps tri-state mark state for each probe row, evaluates NLJ mark join conjuncts in the lazy eval block, materializes nullable mark output correctly, and ensures Nereids always sends mark join conjuncts for nested loop join plans.

### Release note

None

### Check List (For Author)

- Test: Unit Test / Manual test
    - Unit Test: ./run-be-ut.sh --run --filter='*NLJ*' -j8
    - Manual test: build-support/clang-format.sh; build-support/check-format.sh; git diff --check
    - Manual test: cd fe && mvn checkstyle:check -pl fe-core -q
    - Manual test: build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN was run but blocked by existing be/src/util/jni-util.h static_assert(false) diagnostics
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add a session variable to control Nested Loop Join lazy materialization. When enable_nested_loop_join_lazy_materialization is false, Nereids does not send materialized_slot_ids to BE, so BE keeps the existing legacy materialization path.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Manual test: git diff --check
    - Manual test: cd fe && mvn checkstyle:check -pl fe-core -q
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Remove the session variable switch for Nested Loop Join lazy materialization. The switch only controlled FE materialized_slot_ids transmission and could not fully isolate the branch's BE lazy materialization changes, so keeping it would be misleading.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Manual test: git diff --check --cached
    - Manual test: cd fe && mvn checkstyle:check -pl fe-core -q
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

/review

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional blocking compatibility issue. The lazy nested-loop join optimization still has existing unresolved correctness/test-coverage threads, and this PR also removes the descriptor fallback needed for old FE plans during rolling upgrade.

Critical checkpoint conclusions:

  • Goal/test: The goal is to lazily materialize nested-loop join payload slots. The implementation is not yet proven safe because existing correctness bugs and missing coverage remain, and this review found a separate rolling-upgrade failure.
  • Scope/clarity: The change is broad for a hot execution path and includes FE/BE/thrift protocol behavior; it is not a small local refactor.
  • Concurrency: No new shared-thread synchronization primitive was introduced in the reviewed diff; existing build/probe shared-state usage appears to follow the current operator model.
  • Lifecycle/static initialization: No new static initialization-order issue found.
  • Configuration: No new config item.
  • Compatibility: Blocking issue found: old FE plans without nested_loop_join_node no longer have a safe row-descriptor fallback in BE.
  • Parallel code paths: Hash join has similar materialized-slot pruning concepts; nested-loop join differs enough that the new BE path needs its own complete coverage.
  • Special conditions: Lazy enablement has several non-obvious conditions; existing review threads already cover clarity/comment concerns.
  • Test coverage: Existing thread already requests regression/unit coverage for the new lazy path; I did not find additional tests in this diff.
  • Test results: Not verified locally in this review.
  • Observability: No new critical observability gap beyond the correctness issues.
  • Transaction/persistence/data writes: Not applicable.
  • FE-BE variable passing: New thrift field is passed for Nereids nested-loop joins, but compatibility with old FE is broken as noted inline.
  • Performance: Existing threads cover the main performance/extra-allocation concern; no additional blocking performance issue found.

User focus: no additional user-provided review focus was supplied.

@@ -232,7 +269,8 @@ class NestedLoopJoinProbeOperatorX final
Status push(RuntimeState* state, Block* input_block, bool eos) const override;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This removes the old-FE fallback for nested-loop/cross join plans. When a new BE receives a plan from an old FE, tnode.__isset.nested_loop_join_node is false, so JoinProbeOperatorX leaves _intermediate_row_desc and _output_row_desc unset (see the legacy branch in join_probe_operator.cpp). Before this PR _old_version_flag made intermediate_row_desc() and row_desc() return _row_descriptor; now the same rolling-upgrade plan hits this DORIS_CHECK (or the _output_row_desc check below) and the query fails/crashes. Please preserve the legacy descriptor fallback or initialize these descriptors for the old-plan case before removing _old_version_flag.

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary:

I reviewed the PR end-to-end with the Doris code-review checklist. The change adds lazy materialization for nested-loop joins, wires FE slot-id transmission, updates thrift, and adds a regression suite. There is still one additional determinism issue in the new FE serialization path, and the existing inline threads already contain several blocking correctness/compatibility issues that should remain addressed before merge.

Critical checkpoint conclusions:

  • Goal/test coverage: The goal is clear and there is now a regression suite covering several lazy-materialization join variants, but existing review context still identifies missing/insufficient coverage for projection-pruned/non-equi/all-slots-pruned/nullable paths and correctness bugs in lazy finalization.
  • Scope/focus: The PR is mostly focused on nested-loop join lazy materialization, but it introduces a broad new BE execution path across many join variants.
  • Concurrency: No new threads, dependencies, or shared mutable cross-task state were introduced in this PR path beyond existing shared build/probe state; no new lock-order issue found.
  • Lifecycle/static initialization: No new static/global initializer dependency found.
  • Configuration: No new config item.
  • Compatibility: Existing review thread already flags a rolling-upgrade old-FE/new-BE descriptor fallback regression; I did not duplicate that inline.
  • Parallel code paths: Hash join already has materialized slot pruning semantics; this PR adds the nested-loop path. I found a new deterministic serialization issue in the nested-loop path.
  • Special conditional checks: Existing review threads already flag non-obvious descriptor checks and lazy finalization behavior.
  • Test result correctness: I did not run the regression suite in this runner; review is static.
  • Observability: No new critical observability gap found beyond plan/explain determinism.
  • Transaction/persistence/data writes: Not applicable.
  • FE-BE variables: New materialized_slot_ids is passed FE to BE, but its order is nondeterministic because FE stores it in a HashSet before serialization.
  • Performance: Lazy materialization reduces avoidable materialization; no additional hot-path issue found beyond existing comments.

User focus: No additional user-provided focus was present.

@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

BiteTheDDDDt and others added 2 commits May 9, 2026 16:13
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Reduce unnecessary lazy nested loop join work by avoiding mark-state initialization for joins that do not finalize mark results, short-circuiting terminal TRUE mark rows, and simplifying filter-result handling after conjunct evaluation.

### Release note

None

### Check List (For Author)

- Test: Unit Test and Regression test
    - ./run-be-ut.sh --run --filter='*NLJ*' -j8
    - ./run-regression-test.sh --run -d query_p0/join -s test_nestedloop_lazy_materialization
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Once a lazy MARK LEFT SEMI/ANTI probe row reaches TRUE, remaining build blocks cannot change the mark result. Advance the build-block cursor to the end so the probe row is finalized without dispatching additional lazy join work.

### Release note

None

### Check List (For Author)

- Test: Unit Test and Regression test
    - ./run-be-ut.sh --run --filter='*NLJ*' -j8
    - ./run-regression-test.sh --run -d query_p0/join -s test_nestedloop_lazy_materialization
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@BiteTheDDDDt
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.00% (1844/2364)
Line Coverage 64.64% (32997/51047)
Region Coverage 65.22% (16376/25109)
Branch Coverage 55.83% (8747/15668)

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29956 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a47b2b9deb1ffca1a162b678d6c50f0a0da20f8f, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17689	3919	3883	3883
q2	q3	10691	897	614	614
q4	4693	466	338	338
q5	7774	1371	1171	1171
q6	333	174	143	143
q7	938	959	767	767
q8	10990	1467	1244	1244
q9	7333	5456	5413	5413
q10	6310	2087	1803	1803
q11	469	282	255	255
q12	633	430	297	297
q13	18112	3346	2702	2702
q14	291	284	265	265
q15	q16	911	871	788	788
q17	991	1098	749	749
q18	6547	5742	5638	5638
q19	1183	1295	1232	1232
q20	575	464	310	310
q21	4551	2343	2002	2002
q22	453	378	342	342
Total cold run time: 101467 ms
Total hot run time: 29956 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4874	4686	4789	4686
q2	q3	4697	4792	4183	4183
q4	2130	2207	1427	1427
q5	5033	5040	5314	5040
q6	204	168	139	139
q7	2100	1814	1641	1641
q8	3397	3114	3104	3104
q9	8461	8493	8453	8453
q10	4462	4527	4259	4259
q11	598	418	388	388
q12	726	751	528	528
q13	3269	3582	3149	3149
q14	307	304	276	276
q15	q16	750	788	716	716
q17	1345	1320	1358	1320
q18	8104	7347	6992	6992
q19	1181	1167	1142	1142
q20	2231	2220	1930	1930
q21	6167	5482	4951	4951
q22	560	492	409	409
Total cold run time: 60596 ms
Total hot run time: 54733 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 171653 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a47b2b9deb1ffca1a162b678d6c50f0a0da20f8f, data reload: false

query5	4327	642	516	516
query6	353	229	196	196
query7	4346	556	310	310
query8	339	228	215	215
query9	8815	4109	4054	4054
query10	460	363	308	308
query11	5725	2410	2281	2281
query12	185	134	133	133
query13	1302	597	435	435
query14	6576	5417	5114	5114
query14_1	4383	4393	4369	4369
query15	218	208	185	185
query16	1031	488	481	481
query17	1165	780	655	655
query18	2738	496	369	369
query19	225	217	172	172
query20	141	146	139	139
query21	217	143	121	121
query22	13626	14769	14363	14363
query23	17439	16609	16277	16277
query23_1	16371	16298	16138	16138
query24	7421	1746	1332	1332
query24_1	1341	1327	1347	1327
query25	562	478	418	418
query26	1328	317	169	169
query27	2691	635	343	343
query28	4296	1984	1945	1945
query29	1004	667	523	523
query30	306	240	187	187
query31	1100	1047	940	940
query32	83	73	71	71
query33	526	337	290	290
query34	1187	1153	641	641
query35	759	787	665	665
query36	1296	1350	1219	1219
query37	153	105	85	85
query38	3213	3137	3099	3099
query39	921	925	899	899
query39_1	872	876	879	876
query40	228	158	137	137
query41	63	61	59	59
query42	112	107	108	107
query43	327	328	288	288
query44	
query45	210	203	194	194
query46	1108	1177	735	735
query47	2342	2333	2240	2240
query48	403	421	301	301
query49	635	513	440	440
query50	701	291	210	210
query51	4237	4320	4185	4185
query52	103	104	96	96
query53	269	276	205	205
query54	327	268	256	256
query55	97	94	85	85
query56	314	306	305	305
query57	1435	1425	1340	1340
query58	299	271	275	271
query59	1537	1668	1404	1404
query60	338	340	327	327
query61	160	161	157	157
query62	667	611	560	560
query63	248	204	204	204
query64	2395	828	684	684
query65	
query66	1697	516	382	382
query67	30047	30005	29797	29797
query68	
query69	462	331	306	306
query70	983	1007	1028	1007
query71	304	285	269	269
query72	2938	2775	2458	2458
query73	869	772	433	433
query74	5028	4924	4721	4721
query75	2756	2653	2329	2329
query76	2304	1132	762	762
query77	417	430	352	352
query78	13064	13001	12373	12373
query79	1508	954	785	785
query80	751	582	480	480
query81	471	283	241	241
query82	1333	167	126	126
query83	347	281	254	254
query84	266	143	112	112
query85	869	513	459	459
query86	401	322	314	314
query87	3393	3355	3247	3247
query88	3508	2675	2657	2657
query89	441	388	338	338
query90	1877	187	179	179
query91	176	173	140	140
query92	80	80	74	74
query93	958	957	551	551
query94	568	366	310	310
query95	676	389	443	389
query96	1092	782	366	366
query97	2709	2699	2582	2582
query98	242	232	221	221
query99	1131	1132	994	994
Total cold run time: 254046 ms
Total hot run time: 171653 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 90.70% (507/559) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.64% (27814/37769)
Line Coverage 57.52% (300999/523255)
Region Coverage 54.56% (250078/458368)
Branch Coverage 56.20% (108403/192885)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants