[fix](multi-catalog) fixes some issues caused by the data_lake_reader refactoring #62306 and legacy issues#62821
[fix](multi-catalog) fixes some issues caused by the data_lake_reader refactoring #62306 and legacy issues#62821hubgeter wants to merge 1 commit into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
|
/review |
|
OpenCode automated review failed and did not complete. Error: Review step was failure (possibly timeout or cancelled) Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
FE Regression Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
e4393e5 to
ab89427
Compare
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
8e8c7c1 to
c2cf480
Compare
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
FE UT Coverage ReportIncrement line coverage |
a2ca7ab to
39c2537
Compare
|
run buildall |
39c2537 to
9dfaed5
Compare
|
run buildall |
9dfaed5 to
67b3ec3
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found two issues that should be fixed before merging. The main correctness concern is that Paimon partition keys are now exposed as path partition keys using the original Paimon key names, while Doris external table columns are normalized to lower case; mixed-case partition columns can therefore stop being treated as partition columns and fail to materialize metadata partition values. There is also a regression-test portability issue from hard-coding a specific EMR HDFS namenode instead of using the configured test environment.
Critical checkpoint conclusions: Goal: the PR addresses metadata partition values for Iceberg/Paimon, condition-cache safety, and Iceberg name mapping, with added unit/regression coverage, but the Paimon mixed-case partition path is not fully correct. Scope: the changes are generally focused, though the new regression test contains environment-specific configuration. Concurrency/lifecycle: no new concurrency or non-obvious lifecycle issue found in the reviewed paths. Config/compatibility: no new Doris config or storage-format incompatibility found; FE/BE path metadata remains within existing thrift fields. Parallel paths: Iceberg mostly lowercases identity partition keys, but Paimon does not, which creates the distinct issue noted inline. Tests: coverage was added for lowercase migrated partitions, but not for mixed-case Paimon partition column names and the new Paimon p2 suite may not run outside the author environment. Observability/performance: no additional blocking issue found beyond the correctness/test-portability concerns. User focus: no additional user-provided review focus was supplied.
| return Collections.emptyList(); | ||
| } | ||
| return new ArrayList<>(source.getPaimonTable().partitionKeys()); | ||
| } |
There was a problem hiding this comment.
This returns Paimon's original partition key names, but Doris external Paimon columns are normalized to lower case when building the schema (PaimonExternalTable.initSchema). For a table with a mixed-case partition key such as Dt, classifyColumn() will compare the lower-case slot name (dt) with partitionKeys containing Dt, so the partition column is classified as REGULAR. Later setPaimonParams() also writes columns_from_path_keys as Dt, which BE compares against lower-case ColumnDescriptor.name, so the metadata partition value is not filled. This regresses exactly the metadata-partition path this PR is adding for any migrated Paimon table whose partition column casing is not already lower-case. Please normalize the returned keys, and the keys stored by PaimonUtil.getPartitionInfoMap, to the same lower-case names used in Doris schema, and add a test with a mixed-case partition column.
| "type" = "paimon", | ||
| "paimon.catalog.type" = "hms", | ||
| "warehouse" = "hdfs://master-1-1.c-a212282673679a24.cn-beijing.emr.aliyuncs.com:9000/user/hive/warehouse/", | ||
| 'hive.version' = '3.1.3', |
There was a problem hiding this comment.
This hard-codes one EMR cluster's HDFS namenode into the regression test. The same suite already gates on enableExternalEmrTest and reads emrCatalogCommonProp, so in other regression environments this catalog will point at an unreachable host even when the EMR/HMS properties are configured correctly. Please derive the warehouse from the regression config (for example the same external/env property used by the EMR catalog setup) instead of committing an environment-specific hostname.
What problem does this PR solve?
Related PR: #62306
Problem Summary:
This PR fixes some issues caused by the refactoring #62306 and legacy issues:
For Iceberg/Paimon systems, it's necessary to pass metadata partition values for each split. Simply relying on information from files to obtain partition values is unreliable, especially for tables migrated from Hive.
Condition cache conflicts with CountReader and Lazy RF; see comments in
be/src/exec/scan/file_scanner.cppfor details.PR [refactoring](multi-catalog)data_lake_reader_refactoring. #62306 omitted handling of Iceberg name_mapping.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)