
[fix](multi-catalog) fixes some issues caused by the data_lake_reader refactoring #62306 and legacy issues #62821

Open
hubgeter wants to merge 1 commit into
apache:master from
hubgeter:fix_refactor_error

Conversation

@hubgeter
Contributor

@hubgeter hubgeter commented Apr 24, 2026

What problem does this PR solve?

Related PR: #62306

Problem Summary:
This PR fixes some issues caused by the refactoring #62306 and legacy issues:

  1. For Iceberg/Paimon tables, partition values must be passed from table metadata for each split. Relying solely on information parsed from file paths to obtain partition values is unreliable, especially for tables migrated from Hive.

  2. The condition cache conflicts with CountReader and lazy runtime filters (Lazy RF); see the comments in be/src/exec/scan/file_scanner.cpp for details.

  3. PR #62306 ([refactoring](multi-catalog) data_lake_reader refactoring) omitted handling of Iceberg name_mapping.
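Point 1 above can be sketched as follows. `FileSplit`, `metadataPartitionValues`, and the field names here are hypothetical stand-ins for illustration, not Doris's actual classes: the idea is that each split carries partition values resolved from Iceberg/Paimon table metadata, so the reader never needs to parse them out of the file path (which a Hive-migrated table's data files may not even contain).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: each split carries partition values taken from
// table metadata instead of re-deriving them from the file path.
class FileSplit {
    final String path;
    // partition column name -> value, resolved from Iceberg/Paimon metadata
    final Map<String, String> metadataPartitionValues;

    FileSplit(String path, Map<String, String> metadataPartitionValues) {
        this.path = path;
        this.metadataPartitionValues = metadataPartitionValues;
    }
}

public class SplitPartitionDemo {
    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("dt", "2026-04-24");
        // The path has no dt=... segment (typical for a table migrated from
        // Hive), yet the partition column can still be materialized.
        FileSplit split = new FileSplit("/warehouse/t/data/part-0.parquet", values);
        System.out.println(split.metadataPartitionValues.get("dt"));
    }
}
```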

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Apr 24, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hubgeter hubgeter marked this pull request as draft April 24, 2026 10:32
@hubgeter
Contributor Author

run buildall

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 28.12% (124/441) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.39% (26776/37508)
Line Coverage 53.73% (279564/520283)
Region Coverage 47.08% (214867/456418)
Branch Coverage 50.36% (97288/193169)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 37.04% (70/189) 🎉
Increment coverage report
Complete coverage report

@hubgeter
Contributor Author

run buildall

@hubgeter
Contributor Author

/review

@github-actions
Contributor

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24932133823

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/91) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 76.29% (428/561) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.71% (27641/37499)
Line Coverage 57.45% (298996/520472)
Region Coverage 54.60% (248566/455285)
Branch Coverage 56.17% (107684/191696)

@hubgeter hubgeter force-pushed the fix_refactor_error branch from e4393e5 to ab89427 Compare April 26, 2026 15:03
@hubgeter
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 27.78% (45/162) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 14.72% (34/231) 🎉
Increment coverage report
Complete coverage report

@hubgeter hubgeter force-pushed the fix_refactor_error branch 2 times, most recently from 8e8c7c1 to c2cf480 Compare April 27, 2026 09:40
@hubgeter
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 41.38% (72/174) 🎉
Increment coverage report
Complete coverage report

@hubgeter
Contributor Author

run buildall

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 81.40% (442/543) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.72% (27657/37516)
Line Coverage 57.49% (299336/520697)
Region Coverage 54.59% (248725/455608)
Branch Coverage 56.23% (107845/191798)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 19.10% (34/178) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 41.38% (72/174) 🎉
Increment coverage report
Complete coverage report

@hubgeter hubgeter force-pushed the fix_refactor_error branch 2 times, most recently from a2ca7ab to 39c2537 Compare April 29, 2026 06:42
@hubgeter
Contributor Author

run buildall

@hubgeter hubgeter changed the title fix refactor error [fix](multi-catalog) fixes some issues caused by the data_lake_reader refactoring #62306 and legacy issues Apr 29, 2026
@hubgeter hubgeter force-pushed the fix_refactor_error branch from 39c2537 to 9dfaed5 Compare April 29, 2026 07:21
@hubgeter
Contributor Author

run buildall

@hubgeter hubgeter marked this pull request as ready for review April 29, 2026 07:21
@hubgeter hubgeter force-pushed the fix_refactor_error branch from 9dfaed5 to 67b3ec3 Compare April 30, 2026 09:49
@hubgeter
Contributor Author

run buildall

@Gabriel39
Contributor

/review

Contributor

@github-actions github-actions Bot left a comment


I found two issues that should be fixed before merging. The main correctness concern is that Paimon partition keys are now exposed as path partition keys using the original Paimon key names, while Doris external table columns are normalized to lower case; mixed-case partition columns can therefore stop being treated as partition columns and fail to materialize metadata partition values. There is also a regression-test portability issue from hard-coding a specific EMR HDFS namenode instead of using the configured test environment.

Critical checkpoint conclusions:

  • Goal: the PR addresses metadata partition values for Iceberg/Paimon, condition-cache safety, and Iceberg name mapping, with added unit/regression coverage, but the Paimon mixed-case partition path is not fully correct.
  • Scope: the changes are generally focused, though the new regression test contains environment-specific configuration.
  • Concurrency/lifecycle: no new concurrency or non-obvious lifecycle issue found in the reviewed paths.
  • Config/compatibility: no new Doris config or storage-format incompatibility found; FE/BE path metadata remains within existing thrift fields.
  • Parallel paths: Iceberg mostly lowercases identity partition keys, but Paimon does not, which creates the distinct issue noted inline.
  • Tests: coverage was added for lowercase migrated partitions, but not for mixed-case Paimon partition column names, and the new Paimon p2 suite may not run outside the author environment.
  • Observability/performance: no additional blocking issue found beyond the correctness/test-portability concerns.
  • User focus: no additional user-provided review focus was supplied.

return Collections.emptyList();
}
return new ArrayList<>(source.getPaimonTable().partitionKeys());
}
Contributor


This returns Paimon's original partition key names, but Doris external Paimon columns are normalized to lower case when building the schema (PaimonExternalTable.initSchema). For a table with a mixed-case partition key such as Dt, classifyColumn() will compare the lower-case slot name (dt) with partitionKeys containing Dt, so the partition column is classified as REGULAR. Later setPaimonParams() also writes columns_from_path_keys as Dt, which BE compares against lower-case ColumnDescriptor.name, so the metadata partition value is not filled. This regresses exactly the metadata-partition path this PR is adding for any migrated Paimon table whose partition column casing is not already lower-case. Please normalize the returned keys, and the keys stored by PaimonUtil.getPartitionInfoMap, to the same lower-case names used in Doris schema, and add a test with a mixed-case partition column.
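The normalization the reviewer asks for can be sketched as below. `PartitionKeyNormalizer` is a hypothetical illustration, not Doris code: since Doris lower-cases external Paimon column names when building the schema, partition keys returned from Paimon metadata must be lower-cased the same way before they are compared against slot names or written to columns_from_path_keys.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical sketch of the suggested fix: normalize Paimon partition key
// names to the lower-case form Doris uses for external table columns, so a
// mixed-case key like "Dt" still matches the Doris column "dt".
public class PartitionKeyNormalizer {
    static List<String> normalize(List<String> paimonPartitionKeys) {
        return paimonPartitionKeys.stream()
                .map(k -> k.toLowerCase(Locale.ROOT)) // locale-stable lower-casing
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "Dt" and "Region" from Paimon metadata match Doris's "dt"/"region".
        System.out.println(normalize(List.of("Dt", "Region")));
    }
}
```

Using `Locale.ROOT` avoids locale-dependent surprises (for example the Turkish dotless-i) when the comparison must agree byte-for-byte with names stored on the BE side.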

"type" = "paimon",
"paimon.catalog.type" = "hms",
"warehouse" = "hdfs://master-1-1.c-a212282673679a24.cn-beijing.emr.aliyuncs.com:9000/user/hive/warehouse/",
'hive.version' = '3.1.3',
Contributor


This hard-codes one EMR cluster's HDFS namenode into the regression test. The same suite already gates on enableExternalEmrTest and reads emrCatalogCommonProp, so in other regression environments this catalog will point at an unreachable host even when the EMR/HMS properties are configured correctly. Please derive the warehouse from the regression config (for example the same external/env property used by the EMR catalog setup) instead of committing an environment-specific hostname.
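A minimal sketch of what "derive the warehouse from the regression config" could look like. The property name `externalEnvWarehouse` is illustrative only, not an actual Doris regression-test key; the point is to fail (or skip) cleanly when the environment is not configured instead of pointing at a hard-coded, unreachable namenode.

```java
import java.util.Properties;

// Hypothetical sketch: resolve the warehouse URI from test configuration
// rather than committing one EMR cluster's hostname into the suite.
public class WarehouseConfigDemo {
    static String resolveWarehouse(Properties conf) {
        String warehouse = conf.getProperty("externalEnvWarehouse");
        if (warehouse == null || warehouse.isEmpty()) {
            // Mirrors the suite's enableExternalEmrTest gating: bail out
            // instead of creating a catalog that targets an unreachable host.
            throw new IllegalStateException("warehouse not configured; skip suite");
        }
        return warehouse;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("externalEnvWarehouse", "hdfs://test-env:9000/user/hive/warehouse/");
        System.out.println(resolveWarehouse(conf));
    }
}
```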

