[feature](be) Support expression zonemap pruning#63389

Open
mrhhsg wants to merge 11 commits into apache:master from mrhhsg:zonemap

Conversation

@mrhhsg

@mrhhsg mrhhsg commented May 19, 2026

Member

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Support expression-based ZoneMap pruning for internal and Parquet scan paths. Supported expressions include comparisons, IN/NOT IN, IS NULL/IS NOT NULL, and starts_with. The implementation adds segment/page/row-group pruning, conservative fallback semantics, and profile counters, with BE unit and SQL regression coverage. CHAR string range statistics are supported through the trimmed logical bounds produced by ZoneMap deserialization, so CHAR predicates can participate in expression ZoneMap pruning.

This follow-up also removes duplicated capability checks between can_evaluate_zonemap_filter and evaluate_zonemap_filter. Capability checks now focus on expression shape and NULL-literal cases, while storage/runtime datatype invariants are asserted with DORIS_CHECK in the evaluation path.
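
As an illustration of the tri-state contract described above, here is a minimal sketch of ZoneMap evaluation for comparisons and NULL predicates. All names (`IntZoneMap`, `eval_comparison`, `ZoneMapResult`) are hypothetical and are not taken from the Doris code; the sketch only assumes that a ZoneMap carries min/max bounds plus has-null and has-not-null flags, and that only a definite "no match" may skip data.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical names throughout; this is not the Doris API.
enum class ZoneMapResult { kNoMatch, kMayMatch, kUnsupported };

struct IntZoneMap {
    std::optional<int64_t> min;  // empty when range statistics are unusable
    std::optional<int64_t> max;
    bool has_null = false;
    bool has_not_null = false;
};

enum class CmpOp { kEq, kLt, kLe, kGt, kGe };

// Returns kNoMatch only when no non-null value in [min, max] can satisfy
// `col <op> literal`. NULL rows never satisfy a comparison.
inline ZoneMapResult eval_comparison(const IntZoneMap& zm, CmpOp op, int64_t literal) {
    if (!zm.has_not_null) return ZoneMapResult::kNoMatch;        // every row is NULL
    if (!zm.min || !zm.max) return ZoneMapResult::kUnsupported;  // conservative fallback
    assert(*zm.min <= *zm.max);  // invariant check, in the spirit of DORIS_CHECK
    bool may_match = true;
    switch (op) {
    case CmpOp::kEq: may_match = literal >= *zm.min && literal <= *zm.max; break;
    case CmpOp::kLt: may_match = *zm.min < literal; break;
    case CmpOp::kLe: may_match = *zm.min <= literal; break;
    case CmpOp::kGt: may_match = *zm.max > literal; break;
    case CmpOp::kGe: may_match = *zm.max >= literal; break;
    }
    return may_match ? ZoneMapResult::kMayMatch : ZoneMapResult::kNoMatch;
}

inline ZoneMapResult eval_is_null(const IntZoneMap& zm) {
    return zm.has_null ? ZoneMapResult::kMayMatch : ZoneMapResult::kNoMatch;
}

inline ZoneMapResult eval_is_not_null(const IntZoneMap& zm) {
    return zm.has_not_null ? ZoneMapResult::kMayMatch : ZoneMapResult::kNoMatch;
}

// Only a definite kNoMatch allows the segment/page/row group to be skipped.
inline bool can_skip(ZoneMapResult r) { return r == ZoneMapResult::kNoMatch; }
```

With bounds `[10, 20]`, `col > 20` yields `kNoMatch` and the data can be skipped, while missing bounds yield `kUnsupported` and the data is kept.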

Release note

Support expression-based ZoneMap pruning for scan predicates to skip data using ZoneMap statistics.

Check List (For Author)

  • Test: Unit Test / Regression test / Format check
    • DORIS_HOME=<repo> ninja -C be/ut_build_ASAN Exprs Storage Exec Format doris_be_test
    • DORIS_HOME=<repo> ninja -C be/ut_build_ASAN Exprs doris_be_test
    • ./run-be-ut.sh --run --filter=SegmentFilterHelpersTest.*
    • ./run-be-ut.sh --run --filter=ScanNormalizePredicateTest.*:RuntimeFilterConsumerHelperTest.*:VRuntimeFilterWrapperSamplingTest.*:ScannerLateArrivalRFTest.*
    • ./run-be-ut.sh --run --filter=ParquetExprTest.test_expr_zonemap_*
    • ./run-be-ut.sh --run --filter=ExprZonemapFilterTest.*
    • ./run-be-ut.sh --run --filter='ExprZonemapFilterTest.FunctionStringStartsWithZonemapUsesPrefixRange:function_string_test.function_starts_with_test'
    • ./run-be-ut.sh --run --filter='ExprZonemapFilterTest.*:ParquetExprTest.test_expr_zonemap_*'
    • ninja -C be/ut_build_ASAN src/storage/CMakeFiles/Storage.dir/segment/segment.cpp.o src/storage/CMakeFiles/Storage.dir/segment/segment_iterator.cpp.o
    • ./run-be-ut.sh --run --filter=ParquetExprTest.test_expr_zonemap_page_filter_keeps_unsupported_results_and_counts_stats
    • doris-local-regression.sh all -d inverted_index_p0 -s test_index_range_in_select
    • ./run-regression-test.sh --conf <local-regression-conf> --run -d query_p0/expr_zonemap -s test_expr_zonemap_pruning
    • ./run-regression-test.sh --conf <local-regression-conf> --run -d query_p1/expr_zonemap -s test_expr_zonemap_pruning_p1
    • doris-local-regression --network 10.26.20.3/24 all -d query_p0/expr_zonemap -s test_expr_zonemap_pruning
    • doris-local-regression --network 10.26.20.3/24 all -d query_p1/expr_zonemap -s test_expr_zonemap_pruning_p1
    • build-support/clang-format.sh
    • build-support/check-format.sh
    • doris-local-regression --network 10.26.20.3/24 run -d query_p0/expr_zonemap -s test_expr_zonemap_pruning
    • doris-local-regression --network 10.26.20.3/24 run -d query_p1/expr_zonemap -s test_expr_zonemap_pruning_p1
    • git diff --check
  • Behavior changed: Yes. Supported scan predicates, including CHAR string predicates, can now prune data with expression ZoneMap evaluation.
  • Does this need documentation: No
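
To make the string cases concrete, the following is a rough sketch of how a `starts_with` predicate can be reduced to a range check against string ZoneMap bounds, and how CHAR bounds might be trimmed first. The helper names are hypothetical, and the assumption that CHAR values are padded with trailing `'\0'` bytes is mine; the PR only states that deserialization produces trimmed logical bounds.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Exclusive upper bound for all strings that start with `prefix`, or nullopt
// when no finite bound exists (empty prefix, or every byte is 0xFF).
inline std::optional<std::string> prefix_upper_bound(std::string prefix) {
    while (!prefix.empty() && static_cast<unsigned char>(prefix.back()) == 0xFF) {
        prefix.pop_back();
    }
    if (prefix.empty()) return std::nullopt;
    prefix.back() = static_cast<char>(static_cast<unsigned char>(prefix.back()) + 1);
    return prefix;
}

struct StringZoneMap {
    std::string min;
    std::string max;
};

// True when some value in [min, max] could start with `prefix`.
inline bool starts_with_may_match(const StringZoneMap& zm, const std::string& prefix) {
    assert(zm.min <= zm.max);
    if (zm.max < prefix) return false;  // every value sorts before the prefix range
    const auto upper = prefix_upper_bound(prefix);
    if (upper && zm.min >= *upper) return false;  // every value sorts after it
    return true;
}

// Assumed CHAR handling: drop trailing '\0' padding to get the logical value.
inline std::string trim_char_padding(std::string value) {
    while (!value.empty() && value.back() == '\0') value.pop_back();
    return value;
}
```

For bounds `["apple", "banana"]`, the prefix `"a"` may match, while `"aa"` and `"c"` cannot, so a page with those bounds could be skipped for the latter two.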

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@mrhhsg mrhhsg force-pushed the zonemap branch 4 times, most recently from 12e23d5 to 645db81 Compare May 19, 2026 08:52
@mrhhsg

mrhhsg commented May 19, 2026

Member Author

/review

@mrhhsg

mrhhsg commented May 19, 2026

Member Author

/review

@mrhhsg

mrhhsg commented May 19, 2026

Member Author

run buildall

@mrhhsg

mrhhsg commented May 19, 2026

Member Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 32343 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit fdb02e11fe807a3a965e4543dac6f6f3fed0539c, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17858	4064	4005	4005
q2	q3	10769	1418	819	819
q4	4701	473	344	344
q5	7694	2268	2176	2176
q6	383	179	147	147
q7	974	789	651	651
q8	9382	1669	1735	1669
q9	7014	4966	4932	4932
q10	6436	2204	1810	1810
q11	507	353	319	319
q12	716	567	435	435
q13	18103	3380	2813	2813
q14	269	260	239	239
q15	q16	823	773	719	719
q17	1015	952	1016	952
q18	6980	5912	5720	5720
q19	1217	1399	1161	1161
q20	840	674	526	526
q21	6082	2786	2572	2572
q22	469	389	334	334
Total cold run time: 102232 ms
Total hot run time: 32343 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4694	4852	4584	4584
q2	q3	4887	5246	4609	4609
q4	2139	2244	1471	1471
q5	4829	4702	4666	4666
q6	226	179	128	128
q7	1863	1814	1471	1471
q8	2210	1934	2011	1934
q9	7278	7248	7293	7248
q10	4479	4394	3969	3969
q11	541	386	365	365
q12	721	717	530	530
q13	3070	3367	2774	2774
q14	276	282	252	252
q15	q16	691	693	614	614
q17	1293	1249	1249	1249
q18	7579	7014	6781	6781
q19	1108	1100	1091	1091
q20	2247	2235	1961	1961
q21	5362	4649	4554	4554
q22	529	468	431	431
Total cold run time: 56022 ms
Total hot run time: 50682 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 171489 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fdb02e11fe807a3a965e4543dac6f6f3fed0539c, data reload: false

query5	4348	679	536	536
query6	347	251	229	229
query7	4285	615	331	331
query8	334	262	251	251
query9	8829	4102	4100	4100
query10	469	365	314	314
query11	5797	2427	2254	2254
query12	184	131	127	127
query13	1306	639	448	448
query14	6014	5365	5063	5063
query14_1	4359	4429	4312	4312
query15	211	208	185	185
query16	1017	448	426	426
query17	1098	723	569	569
query18	2430	479	345	345
query19	230	202	165	165
query20	143	134	133	133
query21	229	157	136	136
query22	13726	13564	13638	13564
query23	17209	16311	16005	16005
query23_1	16231	16064	16092	16064
query24	7543	1761	1302	1302
query24_1	1299	1312	1296	1296
query25	561	465	420	420
query26	1355	362	218	218
query27	2619	584	366	366
query28	4461	1972	1979	1972
query29	1002	668	500	500
query30	329	269	235	235
query31	1142	1092	973	973
query32	87	77	73	73
query33	539	364	292	292
query34	1150	1170	662	662
query35	780	782	671	671
query36	1340	1350	1183	1183
query37	154	104	92	92
query38	3233	3124	3042	3042
query39	937	956	905	905
query39_1	888	876	876	876
query40	271	191	170	170
query41	65	64	62	62
query42	116	115	123	115
query43	321	331	298	298
query44	
query45	209	201	199	199
query46	1062	1199	722	722
query47	2375	2365	2237	2237
query48	404	425	298	298
query49	623	495	403	403
query50	1041	412	331	331
query51	4319	4307	4192	4192
query52	114	114	102	102
query53	273	303	223	223
query54	334	297	272	272
query55	95	92	88	88
query56	424	384	367	367
query57	1416	1388	1324	1324
query58	342	314	310	310
query59	1611	1648	1401	1401
query60	345	351	336	336
query61	157	179	180	179
query62	675	630	567	567
query63	255	220	230	220
query64	2503	826	651	651
query65	
query66	1756	524	411	411
query67	30097	30063	29843	29843
query68	
query69	471	342	303	303
query70	1070	988	953	953
query71	398	392	362	362
query72	2990	2712	2402	2402
query73	858	751	442	442
query74	5152	4916	4782	4782
query75	2785	2669	2334	2334
query76	2303	1174	823	823
query77	412	439	353	353
query78	12136	12261	11633	11633
query79	1297	1041	760	760
query80	699	636	566	566
query81	456	340	298	298
query82	246	166	125	125
query83	327	329	305	305
query84	294	166	134	134
query85	1004	710	662	662
query86	373	357	310	310
query87	3449	3388	3243	3243
query88	3609	2721	2719	2719
query89	470	423	378	378
query90	2176	211	212	211
query91	230	266	176	176
query92	82	80	76	76
query93	1493	1516	957	957
query94	557	339	329	329
query95	771	587	468	468
query96	1024	808	336	336
query97	2709	2672	2605	2605
query98	246	229	237	229
query99	1100	1121	967	967
Total cold run time: 253610 ms
Total hot run time: 171489 ms

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 77.76% (790/1016) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.90% (27312/37984)
Line Coverage 55.28% (291191/526769)
Region Coverage 52.38% (242822/463573)
Branch Coverage 53.63% (104569/194980)

@mrhhsg mrhhsg force-pushed the zonemap branch 3 times, most recently from 5e0a42a to 149b65b Compare May 20, 2026 07:36
@mrhhsg

mrhhsg commented May 20, 2026

Member Author

/review

@github-actions github-actions Bot left a comment

Contributor

I found one blocking issue in the new Parquet row-group expression zonemap pruning path. The overall feature direction is focused and has BE/unit/regression coverage, but this path can crash for complex columns before safely falling back.

Critical checkpoints:

  • Goal/test: The PR implements expression-based zonemap pruning for internal and Parquet scans and adds targeted tests, but complex Parquet predicate coverage is missing.
  • Scope/focus: The changed feature code is mostly localized to scan predicate/zonemap evaluation; the issue is a missing guard in one new path.
  • Concurrency/lifecycle: No new shared mutable lifecycle or locking issue found in the reviewed paths.
  • Config: New mutable configs are present and read dynamically; no restart-only mismatch found.
  • Compatibility/protocol: No new FE-BE protocol compatibility issue found.
  • Parallel paths: The page-index Parquet path already guards complex physical columns; the row-group path needs the same fallback.
  • Tests: Existing tests cover scalar pruning, but not complex Parquet columns with expression zonemap conjuncts.
  • Data correctness/stability: The inline issue can lead to out-of-bounds access/crash instead of conservative fallback.
  • Observability/performance: Counters were added; no blocking observability issue found.

User focus points: No additional user-provided review focus was specified.
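
A minimal sketch of the kind of guard this review asks for in the row-group path, assuming hypothetical types (`FieldInfo`, `ColumnChunkStats`) rather than the actual reader classes: complex parent columns and out-of-range indices return "no statistics", which the caller treats as "keep the row group".

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-ins for Parquet schema and statistics structures.
struct ColumnChunkStats {
    int64_t min;
    int64_t max;
};

struct FieldInfo {
    bool is_complex = false;         // struct/array/map parent column
    int physical_column_index = -1;  // -1 when there is no single leaf column
};

// Returns statistics usable for pruning, or nullopt to signal the
// conservative fallback: the row group must not be pruned by this conjunct.
inline std::optional<ColumnChunkStats> stats_for_pruning(
        const FieldInfo& field, const std::vector<ColumnChunkStats>& row_group_stats) {
    if (field.is_complex) return std::nullopt;
    if (field.physical_column_index < 0) return std::nullopt;
    const auto idx = static_cast<std::size_t>(field.physical_column_index);
    if (idx >= row_group_stats.size()) return std::nullopt;  // never index out of bounds
    return row_group_stats[idx];
}
```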

Comment thread be/src/format/parquet/vparquet_reader.cpp
@mrhhsg

mrhhsg commented May 20, 2026

Member Author

/review

@github-actions github-actions Bot left a comment

Contributor

Automated review summary:

No additional blocking issues found in the current PR state. I also checked the existing inline thread about Parquet row-group expr-zonemap handling for complex parent columns and did not re-raise it.

Critical checkpoint conclusions:

  • Goal/test coverage: The PR adds expression-based zonemap pruning for OLAP and Parquet paths, with BE unit tests and regression coverage for comparisons, IN, null predicates, starts_with, page/segment pruning, and the complex-parent Parquet fallback case.
  • Scope/focus: The implementation is focused on expression zonemap evaluation and wiring into existing pruning paths.
  • Concurrency/lifecycle: No new shared mutable state or lock-order risk found in the reviewed paths. Late runtime filters remain a missed optimization for already-initialized storage pruning, not a correctness issue.
  • Configuration: New expr zonemap configs are mutable BE configs; reviewed call sites check them before applying pruning.
  • Compatibility/storage correctness: No storage-format or FE-BE protocol compatibility issue found. Pruning remains conservative on unsupported expressions/types.
  • Parallel paths: OLAP segment/page and Parquet row-group/page paths were reviewed; page pruning intentionally handles only single-slot conjunct groups, while unsupported/multi-slot cases fall back conservatively.
  • Error handling/memory: Status-returning calls in the new paths are checked, and no untracked large allocation or ownership issue stood out.
  • Performance/observability: New profile/stat counters cover filtered segments/pages and fallback reasons; no obvious hot-path anti-pattern requiring a blocking comment found.

User focus: No additional user-provided review focus was specified.
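
The page-pruning behavior summarized above (only single-slot conjuncts prune, results are intersected, everything else keeps all rows) could be sketched as follows. The `Conjunct` struct and the precomputed `kept_ranges` are illustrative assumptions; in the real code the kept ranges would come from evaluating each page's ZoneMap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using RowRange = std::pair<int64_t, int64_t>;  // half-open [first, second)

// Intersects two sorted, non-overlapping range lists.
inline std::vector<RowRange> intersect(const std::vector<RowRange>& a,
                                       const std::vector<RowRange>& b) {
    std::vector<RowRange> out;
    std::size_t i = 0;
    std::size_t j = 0;
    while (i < a.size() && j < b.size()) {
        const int64_t lo = std::max(a[i].first, b[j].first);
        const int64_t hi = std::min(a[i].second, b[j].second);
        if (lo < hi) out.emplace_back(lo, hi);
        // Advance whichever range ends first.
        if (a[i].second < b[j].second) {
            ++i;
        } else {
            ++j;
        }
    }
    return out;
}

struct Conjunct {
    std::vector<int> slot_ids;          // slots referenced by the expression
    std::vector<RowRange> kept_ranges;  // rows of pages whose ZoneMap may match
};

// Starts from all rows and narrows only with single-slot conjuncts.
inline std::vector<RowRange> prune_pages(const std::vector<Conjunct>& conjuncts,
                                         int64_t num_rows) {
    assert(num_rows >= 0);
    std::vector<RowRange> result;
    result.emplace_back(int64_t{0}, num_rows);
    for (const auto& conjunct : conjuncts) {
        if (conjunct.slot_ids.size() != 1) continue;  // multi-slot: conservative fallback
        result = intersect(result, conjunct.kept_ranges);
    }
    return result;
}
```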

@mrhhsg

mrhhsg commented May 20, 2026

Member Author

run buildall

@mrhhsg

mrhhsg commented May 20, 2026

Member Author

run external

@hello-stephen

Contributor
TPC-H: Total hot run time: 32367 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b1a1b0efd8d26a90634283b0e201f5cfc2d7ae5c, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17643	4038	4072	4038
q2	q3	10837	1485	834	834
q4	4698	495	352	352
q5	7795	2328	2147	2147
q6	390	187	148	148
q7	1002	813	653	653
q8	9414	1787	1755	1755
q9	7160	4937	4939	4937
q10	6434	2145	1830	1830
q11	496	347	338	338
q12	740	567	444	444
q13	18133	3570	2822	2822
q14	269	257	238	238
q15	q16	835	785	717	717
q17	958	974	961	961
q18	7074	5968	5688	5688
q19	1165	1408	1138	1138
q20	843	677	595	595
q21	5833	2681	2419	2419
q22	444	368	313	313
Total cold run time: 102163 ms
Total hot run time: 32367 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4319	4214	4156	4156
q2	q3	4625	4999	4410	4410
q4	2271	2394	1539	1539
q5	4506	4703	5060	4703
q6	279	206	148	148
q7	2058	1870	1804	1804
q8	2515	2307	2223	2223
q9	7982	7786	7980	7786
q10	4712	4596	4065	4065
q11	633	446	426	426
q12	757	741	546	546
q13	3429	3789	3078	3078
q14	318	299	283	283
q15	q16	738	772	682	682
q17	1392	1379	1349	1349
q18	8339	7405	6992	6992
q19	1151	1100	1134	1100
q20	2288	2250	1975	1975
q21	5526	4938	4792	4792
q22	541	498	438	438
Total cold run time: 58379 ms
Total hot run time: 52495 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 170946 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b1a1b0efd8d26a90634283b0e201f5cfc2d7ae5c, data reload: false

query5	4342	657	525	525
query6	348	255	233	233
query7	4325	600	331	331
query8	343	237	223	223
query9	8850	4019	4042	4019
query10	470	377	337	337
query11	5815	2430	2192	2192
query12	185	132	152	132
query13	1301	613	384	384
query14	5969	5350	5058	5058
query14_1	4341	4316	4325	4316
query15	214	207	180	180
query16	1052	450	440	440
query17	1128	726	579	579
query18	2524	476	341	341
query19	224	217	194	194
query20	147	134	135	134
query21	233	162	146	146
query22	13663	13546	13410	13410
query23	17244	16368	16132	16132
query23_1	16163	16324	16160	16160
query24	7494	1779	1329	1329
query24_1	1320	1334	1321	1321
query25	591	506	450	450
query26	1332	376	219	219
query27	2585	604	385	385
query28	4455	2005	2004	2004
query29	1024	649	531	531
query30	328	271	228	228
query31	1124	1101	966	966
query32	90	79	75	75
query33	561	374	314	314
query34	1175	1126	646	646
query35	796	807	673	673
query36	1334	1351	1158	1158
query37	161	105	91	91
query38	3205	3149	3062	3062
query39	937	925	900	900
query39_1	890	900	879	879
query40	269	193	176	176
query41	71	70	70	70
query42	127	120	117	117
query43	335	326	289	289
query44	
query45	222	213	199	199
query46	1075	1188	705	705
query47	2352	2387	2206	2206
query48	419	422	301	301
query49	649	499	406	406
query50	1033	422	331	331
query51	4379	4371	4192	4192
query52	114	117	104	104
query53	277	297	222	222
query54	331	325	287	287
query55	96	95	91	91
query56	371	387	380	380
query57	1408	1409	1348	1348
query58	344	321	308	308
query59	1568	1634	1417	1417
query60	392	354	336	336
query61	156	154	155	154
query62	671	615	562	562
query63	253	225	222	222
query64	2424	793	628	628
query65	
query66	1744	515	398	398
query67	29985	29281	29884	29281
query68	
query69	475	346	317	317
query70	1068	968	984	968
query71	416	350	356	350
query72	3023	2691	2469	2469
query73	847	725	425	425
query74	5088	4893	4718	4718
query75	2722	2639	2288	2288
query76	2300	1177	797	797
query77	408	410	346	346
query78	12167	12051	11662	11662
query79	1519	1052	738	738
query80	728	612	532	532
query81	473	328	290	290
query82	1439	158	134	134
query83	399	333	299	299
query84	290	150	129	129
query85	944	655	573	573
query86	389	325	317	317
query87	3504	3357	3221	3221
query88	3566	2686	2679	2679
query89	477	416	362	362
query90	1994	198	196	196
query91	210	199	167	167
query92	82	80	74	74
query93	1629	1495	954	954
query94	538	359	329	329
query95	799	598	473	473
query96	1050	764	373	373
query97	2694	2710	2560	2560
query98	243	238	237	237
query99	1162	1099	990	990
Total cold run time: 254754 ms
Total hot run time: 170946 ms

@mrhhsg

mrhhsg commented May 21, 2026

Member Author

run external

};

inline void record_unsupported_zonemap_filter(const ZoneMapEvalContext& ctx) {
    ++ctx.stats.unsupported_expr_count;

Contributor

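What exactly does unsupported_expr_count count? Shouldn't this be the number of ZoneMaps that could not be used? Why give it this name, and why add a separate function for it, when all it does is a single increment?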

Comment thread be/src/exprs/function/function_string.cpp
mrhhsg added 7 commits June 11, 2026 16:09
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Support expression-based ZoneMap pruning for internal and Parquet scan paths. Supported expressions include comparisons, IN/NOT IN, IS NULL/IS NOT NULL, and starts_with. The implementation adds segment/page/row-group pruning, conservative fallback semantics, and profile counters, with BE unit and SQL regression coverage. CHAR string range statistics are supported through the trimmed logical bounds produced by ZoneMap deserialization, so CHAR predicates can participate in expression ZoneMap pruning.

This also removes duplicated capability checks between `can_evaluate_zonemap_filter` and `evaluate_zonemap_filter`. Capability checks now focus on expression shape and NULL-literal cases, while storage/runtime datatype invariants are asserted with `DORIS_CHECK` in the evaluation path.

### Release note

Support expression-based ZoneMap pruning for scan predicates to skip data using ZoneMap statistics.

### Check List (For Author)

- Test: Unit Test / Regression test / Format check
    - DORIS_HOME=<repo> ninja -C be/ut_build_ASAN Exprs Storage Exec Format doris_be_test
    - DORIS_HOME=<repo> ninja -C be/ut_build_ASAN Exprs doris_be_test
    - ./run-be-ut.sh --run --filter=SegmentFilterHelpersTest.*
    - ./run-be-ut.sh --run --filter=ScanNormalizePredicateTest.*:RuntimeFilterConsumerHelperTest.*:VRuntimeFilterWrapperSamplingTest.*:ScannerLateArrivalRFTest.*
    - ./run-be-ut.sh --run --filter=ParquetExprTest.test_expr_zonemap_*
    - ./run-be-ut.sh --run --filter=ExprZonemapFilterTest.*
    - ./run-be-ut.sh --run --filter=ParquetExprTest.test_expr_zonemap_page_filter_keeps_unsupported_results_and_counts_stats
    - doris-local-regression.sh all -d inverted_index_p0 -s test_index_range_in_select
    - ./run-regression-test.sh --conf <local-regression-conf> --run -d query_p0/expr_zonemap -s test_expr_zonemap_pruning
    - ./run-regression-test.sh --conf <local-regression-conf> --run -d query_p1/expr_zonemap -s test_expr_zonemap_pruning_p1
    - doris-local-regression --network 10.26.20.3/24 all -d query_p0/expr_zonemap -s test_expr_zonemap_pruning
    - doris-local-regression --network 10.26.20.3/24 all -d query_p1/expr_zonemap -s test_expr_zonemap_pruning_p1
    - build-support/clang-format.sh
    - build-support/check-format.sh
    - git diff --check
- Behavior changed: Yes. Supported scan predicates, including CHAR string predicates, can now prune data with expression ZoneMap evaluation.
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Refine expression ZoneMap pruning checks by keeping expression shape and NULL-literal gating in can_evaluate_zonemap_filter, while relying on DORIS_CHECK for runtime datatype invariants. Rename the runtime fallback statistic from unsupported expressions to unusable ZoneMap evaluations so the profile reflects cases where an evaluable expression cannot use the current ZoneMap or context, such as missing statistics or unusable range metadata. Add focused unit-test coverage for the updated can/evaluate contract and the renamed statistic.

### Release note

None

### Check List (For Author)

- Test: Unit Test / Format check
    - build-support/clang-format.sh
    - git diff --check
    - ./run-be-ut.sh --run --filter=ExprZonemapFilterTest.ComparisonZonemapHandlesNullAndUnsupportedInputs:ExprZonemapFilterTest.StartsWithZonemapUsesPrefixRange
    - ./run-be-ut.sh --run --filter=ExprZonemapFilterTest.ComparisonZonemapHandlesNullAndUnsupportedInputs:ExprZonemapFilterTest.StartsWithZonemapUsesPrefixRange:ExprZonemapFilterTest.CompoundPredicateEvaluatesChildrenForZonemap:ExprZonemapFilterTest.ExprContextZonemapEvaluationShortCircuitsOnNoMatch:ParquetExprTest.test_expr_zonemap_page_filter_prunes_pages_and_intersects_ranges:ParquetExprTest.test_expr_zonemap_page_filter_keeps_unsupported_results_and_counts_stats:ParquetExprTest.test_expr_zonemap_row_group_filter_skips_complex_parent_column:ParquetExprTest.test_expr_zonemap_row_group_filter_skips_type_mismatch_column
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Remove unrelated newline-only changes from runtime_filter_consumer_helper files so the expression ZoneMap merge request only includes relevant code and tests.

### Release note

None

### Check List (For Author)

- Test: No need to test (newline-only cleanup)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Expression zonemap evaluation used a `fetch_zone_map` helper only to forward to `ZoneMapEvalContext::zone_map` and update unsupported-evaluation statistics on a missing zonemap. This made the statistics update happen at the fetch site instead of the decision site that turns the missing zonemap into an unsupported filter result. Remove the redundant helper, call `ctx.zone_map` directly, and update the unsupported statistics when the evaluator actually returns `kUnsupported` for the missing zonemap.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - `./run-be-ut.sh --run --filter='ExprZonemapFilterTest.ComparisonZonemapHandlesNullAndUnsupportedInputs:ExprZonemapFilterTest.StartsWithZonemapUsesPrefixRange:ExprZonemapFilterTest.NullZonemapUsesNullFlagsOnly:ExprZonemapFilterTest.CompoundPredicateEvaluatesChildrenForZonemap:ExprZonemapFilterTest.ExprContextZonemapEvaluationShortCircuitsOnNoMatch:ParquetExprTest.test_expr_zonemap_page_filter_keeps_unsupported_results_and_counts_stats'`
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Expression zonemap evaluation updated unusable-evaluation statistics inside the helper that fetches a compatible slot type. This was inconsistent with the rest of the evaluator flow where unsupported statistics are updated at the branch that returns `kUnsupported`. Move the missing-slot-type statistics update to the `kUnsupported` return sites for comparison, starts_with, and IN zonemap evaluation, and add unit coverage to ensure the count is incremented exactly once.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - `./run-be-ut.sh --run --filter='ExprZonemapFilterTest.MissingSlotTypeCountsUnsupportedZonemapEvalOnce:ExprZonemapFilterTest.ComparisonZonemapHandlesNullAndUnsupportedInputs:ExprZonemapFilterTest.StartsWithZonemapUsesPrefixRange:ParquetExprTest.test_expr_zonemap_page_filter_keeps_unsupported_results_and_counts_stats'`
- Behavior changed: No
- Does this need documentation: No
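
The two cleanups above share one idea: the lookup has no side effects, and the statistic is bumped at the branch that actually returns `kUnsupported`, so it is incremented exactly once per unusable evaluation. A small sketch under that reading, with hypothetical names (`EvalContext`, `unusable_zonemap_eval_count`) that are not taken from the patch:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

enum class ZoneMapResult { kNoMatch, kMayMatch, kUnsupported };

struct EvalStats {
    int64_t unusable_zonemap_eval_count = 0;  // hypothetical counter name
};

struct SlotZoneMap {
    int64_t min;
    int64_t max;
};

struct EvalContext {
    std::optional<SlotZoneMap> zone_map;  // plain lookup result, no side effects
    EvalStats* stats;
};

// Evaluates `col > literal`. The counter is updated only where the evaluator
// decides to return kUnsupported, not where the ZoneMap is fetched.
inline ZoneMapResult eval_greater_than(const EvalContext& ctx, int64_t literal) {
    assert(ctx.stats != nullptr);
    if (!ctx.zone_map) {
        ++ctx.stats->unusable_zonemap_eval_count;
        return ZoneMapResult::kUnsupported;
    }
    return ctx.zone_map->max > literal ? ZoneMapResult::kMayMatch : ZoneMapResult::kNoMatch;
}
```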
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The starts_with zonemap evaluator used a private helper that only wrapped `Field::create_field<TYPE_STRING>`. Remove the redundant helper and construct the string fields directly at the two call sites to keep the evaluator simpler.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - `./run-be-ut.sh --run --filter='ExprZonemapFilterTest.StartsWithZonemapUsesPrefixRange:ExprZonemapFilterTest.CharZonemapUsesTrimmedLogicalBounds:ExprZonemapFilterTest.MissingSlotTypeCountsUnsupportedZonemapEvalOnce'`
- Behavior changed: No
- Does this need documentation: No
- Require exact non-string type match in data_types_compatible
- Move NULL literal check before DORIS_CHECK in comparison can_evaluate
- Remove unused zonemap materialize sync members from VDirectInPredicate
- Add DORIS_CHECK for type compatibility in VInPredicate materialize path

Issue Number: None
@mrhhsg

mrhhsg commented Jun 11, 2026

Member Author

/review

@github-actions github-actions Bot left a comment

Contributor

I found one blocking issue: the new regression suites verify deterministic SQL results with Groovy assertions and omit generated .out files. That bypasses the standard regression golden-output mechanism for a kernel feature, so the tests should be converted before this is ready.

Critical checkpoint conclusions:

  • Goal and coverage: the PR implements expression ZoneMap pruning for OLAP and Parquet and adds BE/unit coverage plus regression suites, but the regression coverage does not yet satisfy Doris regression-test requirements because deterministic SQL results are not captured through generated outputs.
  • Scope: the current-head changes are cohesive around expression ZoneMap pruning; I did not find unrelated code changes requiring a separate review comment.
  • Concurrency and lifecycle: no new locks or static lifecycle risks were found in the reviewed paths. The existing shallow-clone/rebind race is already covered by an existing inline thread, so I did not duplicate it.
  • Config and compatibility: the new FE session variable is forwarded through optional TQueryOptions and defaults to enabled when absent; I found no storage-format or rolling-upgrade compatibility blocker.
  • Parallel paths: OLAP segment/page and Parquet row-group/page paths were reviewed. The already-known concerns in the existing threads were treated as prior context and not re-raised.
  • Tests and results: BE and Parquet unit tests exist, but the new regression suites need qt_sql/order_qt cases and generated .out files for deterministic results.
  • Observability: counters are present for filtered segments/pages and IN/unsupported evaluations; no additional observability blocker found.
  • Transactions, persistence, data writes, and security: not applicable to this PR's changed behavior; no security review was requested.
  • Performance: unsupported-expression gating is present in the hot page-zonemap paths; I found no additional concrete performance blocker beyond the existing review threads.
  • User focus: no additional user-provided review focus was supplied.

def matchedRows = sql """
SELECT COUNT(*) FROM test_expr_zonemap_pruning WHERE starts_with(v, 'a')
"""
assertEquals(4096L, matchedRows[0][0] as long)

Contributor

The deterministic query results in this new regression suite are checked with Groovy assertEquals, and the PR does not add a generated .out file under regression-test/data/.... Doris regression tests should record deterministic SQL results with qt_sql/order_qt (or explicit ORDER BY) and commit the generated output; direct assertions should be reserved for checks that cannot be represented in golden output, such as profile counters. Please convert these result checks to qt_... cases and add the generated .out file.

SELECT '${notNullToken}', COUNT(*) FROM test_expr_zonemap_pruning_p1
WHERE nullable_v IS NOT NULL OR starts_with(v, 'never_')
"""
assertEquals(2048L, notNullRows[0][1] as long)

Contributor

Same issue in this P1 suite: deterministic SQL result checks use Groovy assertEquals, and the PR does not add a generated .out file. Please use qt_sql/order_qt for the expected result queries and commit the generated output, leaving only the profile-counter checks as Groovy assertions.

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Page-level expression ZoneMap pruning uses reader-schema column ordinals to fetch page ZoneMaps. For layouts where storage expression slot ordinals need rebinding to the reader schema, VExprContext::clone() does not deep-copy the expression tree yet, so rebinding can mutate shared slot refs and make later page-level pruning read the wrong column's ZoneMap. This commit reuses the existing storage expression slot/schema layout guard for page-level expression ZoneMap pruning, conservatively skipping the optimization when the mapping is uncertain, and removes an unused include left by earlier cleanup.
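
The hazard described here can be reproduced in miniature. The sketch below uses hypothetical stand-ins (`SlotRef`, `ExprContext`, `rebind`, `layout_is_identity`) and is not the Doris implementation: a shallow clone shares the slot node, so rebinding the clone silently changes the original, and a layout guard that only permits the identity mapping avoids the problem by skipping the optimization otherwise.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins for an expression tree and its context.
struct SlotRef {
    int column_id;  // ordinal used to look up the page ZoneMap
};

struct ExprContext {
    std::shared_ptr<SlotRef> root;
    // Shallow clone: the copy shares the same SlotRef node.
    ExprContext clone() const { return *this; }
};

// Rewrites the slot ordinal in place through a storage-to-reader mapping.
inline void rebind(ExprContext& ctx, const std::vector<int>& storage_to_reader) {
    ctx.root->column_id =
            storage_to_reader.at(static_cast<std::size_t>(ctx.root->column_id));
}

// Guard: true only when every storage ordinal already equals its reader
// ordinal, so rebinding would be a no-op and cannot disturb a shared tree.
inline bool layout_is_identity(const std::vector<int>& storage_to_reader) {
    for (std::size_t i = 0; i < storage_to_reader.size(); ++i) {
        if (storage_to_reader[i] != static_cast<int>(i)) return false;
    }
    return true;
}
```

When `layout_is_identity` returns false, the conservative choice in this sketch is to skip page-level expression ZoneMap pruning rather than rebind a possibly shared tree.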

### Release note

None

### Check List (For Author)

- Test: Unit Test / Compile check / Format check
    - git diff --check
    - ninja -C be/ut_build_ASAN src/storage/CMakeFiles/Storage.dir/segment/segment.cpp.o src/storage/CMakeFiles/Storage.dir/segment/segment_iterator.cpp.o
    - ./run-be-ut.sh --run --filter='ExprZonemapFilterTest.*'
- Behavior changed: No
- Does this need documentation: No
@mrhhsg

mrhhsg commented Jun 11, 2026

Member Author

/review

@github-actions github-actions Bot left a comment

Contributor

Automated review completed.

I did not find any additional distinct issue beyond the already-open inline review threads, so I did not add duplicate inline comments. The existing threads about regression result files/assert-based regression checks and type/rebinding safety remain part of the review context.

Part 1.3 checkpoint conclusions:

  • Goal/test: The PR implements expression zonemap pruning for OLAP segment/page paths and Parquet row-group/page paths, with conservative fallback for unsupported predicates. BE unit tests and regression suites cover the main comparison/IN/null/starts_with paths; existing review threads already cover the regression-test result-file issue.
  • Scope: The change is focused on expression zonemap evaluation, scan integration, counters, the FE session variable, and targeted tests. I did not see unrelated behavior mixed into the PR.
  • Concurrency/lifecycle: I did not find new shared mutable cross-thread state. Per-reader/per-segment contexts are built or cloned for scan execution; existing index/page-loading synchronization is reused. No new static initialization or ownership-cycle risk stood out.
  • Config/compatibility: The new FE session variable is wired to optional thrift query options, and BE treats an absent field as enabled, which is compatible with mixed versions. No storage format or transaction persistence change is introduced.
  • Parallel paths/conditions: OLAP and Parquet row-group/page paths are both handled. Guard conditions for schema rebinding, missing stats, complex columns, type mismatch, null-aware runtime filters, and synthetic dictionary predicates are conservative and documented.
  • Testing/results: I did not run the full test suite in this review pass; git diff --check is clean. No generated regression .out files are present, which is already raised in existing inline comments.
  • Observability: New filtered-row/group/page and unusable-evaluation counters are exposed in OLAP and Parquet profiles; I did not see noisy new logging.
  • Transactions/writes: The PR affects read-time pruning only and does not modify transaction, delete-bitmap, EditLog, or data-write paths.
  • FE-BE variable passing: The session variable is added to SessionVariable.toThrift() and the thrift query option; no missing BE consumer path was found in this pass.
  • Performance: The implementation avoids evaluating unsupported expressions where possible and keeps point-check thresholds for large IN lists. No additional performance issue distinct from existing review threads was found.
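The mixed-version behavior noted in the Config/compatibility point above (an absent optional field is treated as enabled) can be sketched as follows. `QueryOptions` here is a hypothetical model built on `std::optional`; the real thrift-generated struct and its accessors are not reproduced.

```cpp
#include <optional>

// Hypothetical model of an optional query option. An older FE that does not
// know the option simply never sets the field.
struct QueryOptions {
    std::optional<bool> enable_expr_zonemap_filter;
};

// Treat "field absent" as enabled, so a new BE keeps pruning when it talks
// to an FE that predates the session variable.
bool expr_zonemap_filter_enabled(const QueryOptions& opts) {
    return opts.enable_expr_zonemap_filter.value_or(true);
}
```

Only an explicit `false` from the FE disables the feature in this sketch.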

User focus: review_focus.txt contains no additional user-provided focus points; no focus-specific issue was found.

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: The expression ZoneMap scan profile counters are easy to confuse because some counters describe pruning results while others describe evaluation fallback paths. Add comments near the Olap scan counter members to document the meaning of filtered segment/page, unusable evaluation, IN point-check, and IN range-only counters.
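One plausible reading of the IN point-check and IN range-only counters is sketched below, under stated assumptions: small IN lists are checked value by value against the zone range, while large lists are reduced to their own minimum and maximum. The names `IntZoneMap`, `eval_in`, and the threshold `kPointCheckLimit` are hypothetical, and NULL handling is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class ZoneMapResult { NoMatch, MayMatch };

struct IntZoneMap {
    std::int64_t min;
    std::int64_t max;
};

// Hypothetical threshold; the real value is not stated in this thread.
constexpr std::size_t kPointCheckLimit = 4;

// Evaluates `col IN (values)` against one zone. `used_point_check` reports
// which mode ran, mirroring the idea of separate profile counters.
ZoneMapResult eval_in(const IntZoneMap& zone, const std::vector<std::int64_t>& values,
                      bool* used_point_check) {
    *used_point_check = false;
    if (values.empty()) {
        return ZoneMapResult::NoMatch;  // an empty list can never match
    }
    if (values.size() <= kPointCheckLimit) {
        // Point check: precise, costs one range test per literal.
        *used_point_check = true;
        const bool any_inside =
                std::any_of(values.begin(), values.end(), [&zone](std::int64_t v) {
                    return v >= zone.min && v <= zone.max;
                });
        return any_inside ? ZoneMapResult::MayMatch : ZoneMapResult::NoMatch;
    }
    // Range only: cheaper for long lists, but may keep zones that a point
    // check would have pruned.
    const auto mm = std::minmax_element(values.begin(), values.end());
    const bool overlaps = *mm.second >= zone.min && *mm.first <= zone.max;
    return overlaps ? ZoneMapResult::MayMatch : ZoneMapResult::NoMatch;
}
```

Both modes are conservative: they return `NoMatch` only when no listed value can fall inside the zone.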

### Release note

None

### Check List (For Author)

- Test: Unit Test / Regression test / Format check
    - `build-support/clang-format.sh`
    - `git diff --check`
    - `./run-be-ut.sh --run --filter='ExprZonemapFilterTest.*:ParquetExprTest.test_expr_zonemap_*'`
    - `doris-local-regression --network 10.26.20.3/24 run -d query_p0/expr_zonemap -s test_expr_zonemap_pruning`
    - `doris-local-regression --network 10.26.20.3/24 run -d query_p1/expr_zonemap -s test_expr_zonemap_pruning_p1`
- Behavior changed: No
- Does this need documentation: No
@mrhhsg

mrhhsg commented Jun 11, 2026

Member Author

/review

@github-actions github-actions Bot left a comment

Contributor

No additional non-duplicate blocking issues found in the current cumulative GitHub PR diff. I reviewed the existing inline threads first and did not re-submit concerns that were already raised there.

Critical checkpoint conclusions:

  • Goal/test proof: The PR adds BE unit coverage and regression suites for expression ZoneMap pruning. The regression golden-output concern is already covered by an existing thread, so I did not duplicate it.
  • Scope/focus: The actual GitHub PR diff is focused on expression ZoneMap pruning, scan/profile counters, the FE session variable, thrift plumbing, and related tests.
  • Correctness/null/type semantics: The evaluator returns NoMatch only when min/max or null flags prove that a filter cannot match, and it keeps unsupported cases as may-match. Existing threads already cover the remaining type-compatibility/assertion concerns.
  • Storage and format paths: OLAP segment/page pruning and Parquet row-group/page pruning were checked for conservative fallback behavior. Existing threads already cover page-index overhead/counter and schema-layout concerns.
  • Concurrency/lifecycle: I did not find a new concurrency or object-lifetime issue in the current diff.
  • Config/FE-BE compatibility: enable_expr_zonemap_filter is optional in thrift, defaults to enabled, and is guarded in BE for unset query options.
  • Performance/observability: New counters are plumbed through OLAP and Parquet profiles; I did not find a new distinct performance blocker beyond the already-known page-index discussion.
  • Tests/results: New tests exercise OLAP, Parquet, runtime-filter, and expression semantics. Existing review context already flags the regression-result style issue.
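The correctness rule in the checkpoint list above (return NoMatch only when min/max or null flags prove that nothing can match, otherwise may-match) can be sketched for integer columns as follows. The `ZoneMap`, `Op`, and `evaluate` names are assumptions made for this illustration and do not mirror the PR's actual interfaces.

```cpp
#include <cstdint>

enum class Prune { NoMatch, MayMatch };
enum class Op { Eq, Lt, Gt, IsNull, IsNotNull, Unsupported };

// Hypothetical zone statistics: value range plus null flags.
struct ZoneMap {
    std::int64_t min;
    std::int64_t max;
    bool has_null;
    bool has_not_null;
};

// Evaluates `col <op> literal` against the zone statistics.
Prune evaluate(const ZoneMap& z, Op op, std::int64_t literal) {
    switch (op) {
    case Op::IsNull:
        return z.has_null ? Prune::MayMatch : Prune::NoMatch;
    case Op::IsNotNull:
        return z.has_not_null ? Prune::MayMatch : Prune::NoMatch;
    case Op::Eq:
    case Op::Lt:
    case Op::Gt: {
        // A comparison never evaluates to true for NULL, so an all-NULL zone is skipped.
        if (!z.has_not_null) {
            return Prune::NoMatch;
        }
        const bool impossible = (op == Op::Eq && (literal < z.min || literal > z.max)) ||
                                (op == Op::Lt && z.min >= literal) ||
                                (op == Op::Gt && z.max <= literal);
        return impossible ? Prune::NoMatch : Prune::MayMatch;
    }
    case Op::Unsupported:
        break;
    }
    // Anything the evaluator does not understand must be kept.
    return Prune::MayMatch;
}
```

The default at the end is the important design choice: an unknown expression can cost extra reads, but it can never drop rows that should have been returned.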

User focus: review_focus.txt contains no additional user-provided focus points.

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: FunctionStringStartsWith adds starts_with-specific expression zonemap pruning, but the unit test did not explicitly cover the new function override. This adds coverage for the factory-registered starts_with function, supported and non-prunable literal shapes, prefix lower and upper range pruning, NULL-only zonemaps, and the no-next-prefix max-byte edge.
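The prefix lower and upper range pruning and the no-next-prefix edge mentioned above follow a general technique that can be sketched as below. This is an illustration under assumptions, not the code in FunctionStringStartsWith: `next_prefix` and `can_skip_starts_with` are hypothetical names, and strings are compared byte-wise as unsigned values.

```cpp
#include <optional>
#include <string>

// Returns the smallest string that is greater than every string starting
// with `prefix`, or nullopt when no such bound exists (empty prefix, or a
// prefix made only of 0xFF bytes).
std::optional<std::string> next_prefix(std::string prefix) {
    while (!prefix.empty() && static_cast<unsigned char>(prefix.back()) == 0xFF) {
        prefix.pop_back();  // a trailing 0xFF cannot be incremented
    }
    if (prefix.empty()) {
        return std::nullopt;
    }
    prefix.back() = static_cast<char>(static_cast<unsigned char>(prefix.back()) + 1);
    return prefix;
}

// True when no value in [zone_min, zone_max] can start with `prefix`.
// Matching values v satisfy prefix <= v < next_prefix(prefix).
bool can_skip_starts_with(const std::string& zone_min, const std::string& zone_max,
                          const std::string& prefix) {
    if (zone_max < prefix) {
        return true;  // the whole zone sorts below the prefix
    }
    const std::optional<std::string> upper = next_prefix(prefix);
    if (upper.has_value() && zone_min >= *upper) {
        return true;  // the whole zone sorts at or above the exclusive upper bound
    }
    return false;  // ranges overlap, or no upper bound exists: keep the zone
}
```

When `next_prefix` has no result, only the lower-bound test applies, which keeps the check conservative for prefixes that end in the maximum byte.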

### Release note

None

### Check List (For Author)

- Test: Unit Test / Format check
    - `build-support/clang-format.sh`
    - `git diff --check`
    - `./run-be-ut.sh --run --filter='ExprZonemapFilterTest.FunctionStringStartsWithZonemapUsesPrefixRange:function_string_test.function_starts_with_test'`
    - `./run-be-ut.sh --run --filter='ExprZonemapFilterTest.*'`
- Behavior changed: No
- Does this need documentation: No
@mrhhsg

mrhhsg commented Jun 11, 2026

Member Author

/review

@github-actions github-actions Bot left a comment

Contributor

Automated review completed against the net PR diff (43 files) and the existing review threads.

Requesting changes. I did not add new inline comments because the blocking points I still see are already covered by existing threads:

  • The new P0/P1 regression suites still check deterministic SQL results with Groovy assertEquals and do not add generated .out files. Doris regression-test standards require deterministic SQL results to be represented with qt_sql/order_qt plus generated output; keep Groovy assertions for profile/counter checks only.
  • The comparison zonemap capability path still advertises support and then relies on a datatype-compatibility DORIS_CHECK in the evaluation path. If an incompatible BE-materialized slot/literal comparison reaches this path, the scanner aborts instead of conservatively returning may-match/unsupported; this remains covered by the existing comparison thread.
  • The earlier OLAP/Parquet concerns around unsupported page-index reads, complex Parquet columns, and page-level row-group counters appear addressed in the current net diff.

Critical checkpoint conclusions:

  • Goal/test: The expr-zonemap pruning goal is clear and has BE unit tests plus regression suites, but the regression outputs are not yet in Doris golden-file form.
  • Scope/focus: The net PR is focused on expr-zonemap pruning; the cloud/broker changes visible in a raw base-tip diff are not part of the GitHub PR diff and were not reviewed as PR changes.
  • Concurrency/lifecycle: No new thread/lifetime issue found beyond the existing expression clone/rebinding discussions; current OLAP segment/page paths now skip reader-schema mismatch layouts.
  • Config/protocol: enable_expr_zonemap_filter is added as an FE session variable and forwarded through TQueryOptions; BE defaults to enabled when the field is absent, preserving mixed-version behavior.
  • Compatibility/storage: No storage-format change found; reviewed OLAP and Parquet paths fall back conservatively for missing or unusable stats.
  • Parallel paths: OLAP segment/page and Parquet row-group/page paths are both wired.
  • Observability/performance: Counters are added; the existing page-index hot-path concerns appear addressed in the current diff.
  • Transactions/data writes/security: Not applicable to the net PR.
  • User focus: review_focus.txt contains no additional user-provided focus points.

### What problem does this PR solve?

Issue Number: None

Related PR: apache#63389

Problem Summary: This change keeps expression zonemap pruning helper logic focused and reusable. It centralizes single-slot zonemap candidate detection, scopes HybridSet literal materialization to VDirectInPredicate, removes redundant defensive handling after can_evaluate_zonemap_filter has already validated the expression shape, asserts the segment zonemap invariant at the ColumnReader accessor boundary, and drops an unrelated runtime filter unit-test change from the PR diff.

### Release note

None

### Check List (For Author)

- Test: Unit Test; Code style
    - `./build-support/clang-format.sh be/src/exprs/expr_zonemap_filter.cpp be/src/exprs/expr_zonemap_filter.h be/src/exprs/vdirect_in_predicate.h be/src/exprs/vexpr.cpp be/src/exprs/vexpr.h be/src/exprs/vin_predicate.cpp be/src/storage/segment/column_reader.cpp be/src/storage/segment/segment.cpp be/src/storage/segment/segment_iterator.cpp be/test/exec/runtime_filter/runtime_filter_consumer_helper_test.cpp`
    - `git diff --check`
    - `git diff --cached --check`
    - `./run-be-ut.sh --run --filter='ExprZonemapFilterTest.*:ParquetExprTest.test_expr_zonemap_*'`
- Behavior changed: No
- Does this need documentation: No
@mrhhsg

mrhhsg commented Jun 11, 2026

Member Author

/review

@github-actions github-actions Bot left a comment

Contributor

I reviewed the current final diff against the existing inline review threads and did not find additional distinct inline issues to add without duplicating already-known review context.

Requesting changes because the current PR still appears to contain known blocking items from existing threads:

  • be/src/exprs/function/functions_comparison.h: comparison_zonemap_detail::can_evaluate() still uses DORIS_CHECK for slot/literal type compatibility. A pushed-down incompatible materialized literal can abort instead of conservatively disabling expr-zonemap evaluation.
  • regression-test/suites/query_p0/expr_zonemap/test_expr_zonemap_pruning.groovy and regression-test/suites/query_p1/expr_zonemap/test_expr_zonemap_pruning_p1.groovy: the new regression suites still use Groovy assertEquals for deterministic SQL results and do not add generated .out files. Doris regression cases should use qt_sql / order_qt with committed output for deterministic results.
  • Please also resolve or explicitly confirm the existing Parquet unsupported-expression page-index thread. The current implementation avoids the column-index read in some cases, but still parses offset indexes before computing _has_expr_zonemap_page_filter.
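The first blocking item contrasts an assertion on type compatibility with a conservative fallback. A minimal sketch of the fallback shape follows; `TypeId`, `can_evaluate_comparison`, and `equality_may_match` are hypothetical names for this example and are not the functions in functions_comparison.h.

```cpp
#include <cstdint>

// Hypothetical type tags for the slot and the materialized literal.
enum class TypeId { Int32, Int64, String };

// Capability check that reports a mismatch instead of asserting on it.
bool can_evaluate_comparison(TypeId slot_type, TypeId literal_type) {
    return slot_type == literal_type;
}

// Equality pruning with a conservative fallback: when the types are not
// compatible the zone is kept (may-match) rather than aborting the scanner.
bool equality_may_match(TypeId slot_type, TypeId literal_type, std::int64_t zone_min,
                        std::int64_t zone_max, std::int64_t literal) {
    if (!can_evaluate_comparison(slot_type, literal_type)) {
        return true;  // unsupported: fall back to reading the data
    }
    return literal >= zone_min && literal <= zone_max;
}
```

With this shape an incompatible comparison only loses the pruning opportunity; the query still returns correct results through the normal row-level filter.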

Checkpoint conclusions:

  • Goal and tests: expression zonemap pruning is implemented for BE expression, OLAP, and Parquet paths with unit/regression coverage, but the regression proof is incomplete while golden outputs are missing.
  • Scope: the final GitHub PR file list is focused on expression-zonemap pruning and its tests; I did not find unrelated final-diff files.
  • Concurrency and lifecycle: OLAP segment/page expr-zonemap now skips layouts needing rebind, addressing the earlier shared shallow-clone false-prune path; no new locks or threads were introduced.
  • Config and protocol: the FE session variable and BE TQueryOptions plumbing are present with default-on behavior; I did not find a storage format change.
  • Parallel paths: OLAP segment/page and Parquet row-group/page pruning paths are covered; ORC only gets lazy-context propagation.
  • Conditional checks: existing type-mismatch handling in comparison zonemap evaluation remains blocking because it asserts instead of returning unsupported.
  • Test results: I did not run the full Doris build or regression suite during review.
  • Observability: profile counters were added for filtered segments/pages, unusable evaluations, and IN-check modes.
  • Performance: supported-expression gating is present, but the existing Parquet page-index concern should be resolved or confirmed.
  • User focus: no additional user-provided review focus was supplied in .code-review.Ms1ioT/review_focus.txt.

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 65.18% (685/1051) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.96% (21177/39245) |
| Line Coverage | 37.72% (202257/536209) |
| Region Coverage | 33.71% (158479/470141) |
| Branch Coverage | 34.79% (69520/199839) |

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 88.84% (931/1048) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 69.43% (26616/38333) |
| Line Coverage | 52.63% (280583/533089) |
| Region Coverage | 48.99% (231402/472309) |
| Branch Coverage | 50.18% (100406/200102) |

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 6.67% (2/30) 🎉
Increment coverage report
Complete coverage report



3 participants