[refactor](inverted-index) replace void* query_value with typed param interface #62960
csun5285 wants to merge 7 commits into apache:master
a842dba to 5fee30d
run buildall
5fee30d to 636dac4
run buildall
/review
Automated review completed.
Critical checkpoint conclusions:
- Goal/test proof: The PR replaces inverted-index query values from raw pointers/typed wrapper objects with Field, removes the obsolete TypeInfo::set_to_min/max path, and adds expanded BE unit coverage for BKD range queries and type-limit sentinels. The changed tests are aligned with the new reader interface.
- Scope/focus: The implementation is mostly focused on the query-value refactor and shared key-encoding helper reuse; no unrelated GitHub PR files were present in the actual PR diff.
- Concurrency/lifecycle: No new concurrent mutable state or lock ordering changes were introduced. InvertedIndexParam now owns the Field, which removes the previous raw pointer lifetime risk.
- Config/compatibility: No new config, persisted format, edit log, storage format, or FE/BE protocol compatibility changes were introduced.
- Parallel paths: The main inverted-index call sites in expression pushdown, storage predicates, match/multi-match, IP, arrays, and tests were updated to pass Field; I did not find a remaining live raw query_value path for inverted index readers.
- Error handling: New status-returning BKD encoding dispatch reports unsupported field types instead of silently continuing. Existing CLucene boundaries remain unchanged.
- Data correctness: BKD min/max sentinel handling now uses type_limit on storage/key types, with added specializations for decimal12_t and uint24_t; I did not find an additional correctness regression not already covered by existing review threads.
- Memory/observability/performance: No large new allocations, ownership cycles, or observability gaps stood out. Moving from raw pointers to owned Field values is a small per-query copy but removes unsafe lifetime coupling.
- Test coverage: BE unit tests were expanded for BKD-supported primitive types and sentinel regressions. No regression-test result files were added, which is acceptable for this internal BE refactor.
- User focus: No additional user-provided review focus was supplied.
Existing review context was checked first; I did not duplicate the existing comments about type_limit storage-type specialization or the old KeyCoder explanation.
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
run buildall
TPC-H: Total hot run time: 29387 ms
TPC-DS: Total hot run time: 169793 ms
BE UT Coverage Report: Increment line coverage, Increment coverage report
BE Regression && UT Coverage Report: Increment line coverage, Increment coverage report
| // DECIMALV2 storage. Largest representable DecimalV2 value (18 digits . 9 digits).
| template <>
| struct type_limit<decimal12_t> {
I think this definition may be problematic. The earlier definitions are all tied to compute-layer values, but the two you added appear to be storage-layer values.
| return doris::Status::InternalError("unsupported BKD field type {}", static_cast<int>(ft));
| }
|
| static doris::Status encode_bkd_max_ascending(doris::FieldType ft, const doris::KeyCoder* coder,
Types such as decimal and datetime carry a scale. How are min and max represented in that case?
This encoding does not need to consider scale, because it encodes the underlying int64 or int128. Scale is only a scaling factor, and it is the same for every value in a given column, so it can be ignored.
datetime is unaffected as well: its scale only denotes fractional-second precision, and the min and max at any precision are 000000 and 999999, so ordering comparisons are not affected.
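The scale argument in this thread can be illustrated with a minimal sketch. It assumes a big-endian encoding with the sign bit flipped, which is one common way to make byte-wise comparison agree with signed numeric order; `encode_ascending` and `key_less` are hypothetical helpers for illustration, not Doris's `KeyCoder`.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper (not Doris's KeyCoder): encode a signed 64-bit value so
// that byte-wise comparison of the results matches numeric order.
std::string encode_ascending(int64_t v) {
    // Flip the sign bit so negative values sort before non-negative ones.
    uint64_t u = static_cast<uint64_t>(v) ^ (uint64_t{1} << 63);
    std::string out(8, '\0');
    // Fill from the last byte backwards so the most significant byte ends up first.
    for (int i = 7; i >= 0; --i) {
        out[i] = static_cast<char>(u & 0xFF);
        u >>= 8;
    }
    return out;
}

// Byte-wise "less than" on two 8-byte keys produced by encode_ascending.
bool key_less(const std::string& a, const std::string& b) {
    return std::memcmp(a.data(), b.data(), 8) < 0;
}
```

With a fixed column scale of 2, the decimals 12.34 and 12.35 are stored as the unscaled integers 1234 and 1235, so comparing their encoded keys orders them the same way as comparing the decimals; the scale never enters the encoding.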
| SCOPED_RAW_TIMER(&context->stats->inverted_index_query_timer);
|
| - std::string search_str = *reinterpret_cast<const std::string*>(query_value);
| + std::string search_str = query_value.get<PrimitiveType::TYPE_STRING>();
typename PrimitiveTypeTraits::CppType& Field::get() { Please change the definition of this function in Field: add a check inside it that throws an exception when t != type.
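The kind of check requested here could look like the following sketch. `PrimType`, `CppTypeOf`, and `TaggedValue` are hypothetical stand-ins invented for illustration; they are not the Doris `PrimitiveType`, `PrimitiveTypeTraits`, or `Field` definitions, and the exception type is likewise an assumption.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-ins; not the Doris definitions.
enum class PrimType { INT64, STRING };

template <PrimType T> struct CppTypeOf;
template <> struct CppTypeOf<PrimType::INT64> { using type = int64_t; };
template <> struct CppTypeOf<PrimType::STRING> { using type = std::string; };

class TaggedValue {
public:
    static TaggedValue from_int(int64_t v) { return TaggedValue(PrimType::INT64, v); }
    static TaggedValue from_string(std::string s) {
        return TaggedValue(PrimType::STRING, std::move(s));
    }

    // Typed accessor: throws instead of reinterpreting the payload when the
    // requested type does not match the stored tag.
    template <PrimType T>
    const typename CppTypeOf<T>::type& get() const {
        if (T != _type) {
            throw std::logic_error("TaggedValue::get: requested type does not match stored type");
        }
        return std::get<typename CppTypeOf<T>::type>(_storage);
    }

private:
    TaggedValue(PrimType t, std::variant<int64_t, std::string> s)
            : _type(t), _storage(std::move(s)) {}

    PrimType _type;
    std::variant<int64_t, std::string> _storage;
};
```

Calling `get<PrimType::INT64>()` on a value built with `from_string` throws `std::logic_error` rather than silently reading the wrong representation, which is the failure mode the old `reinterpret_cast` path could not detect.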
| const T* value = (const T*)(iter->get_value());
| - RETURN_IF_ERROR(InvertedIndexQueryParamFactory::create_query_value<Type>(
| - value, query_param));
| + field_value = Field::create_field<Type>(*value);
In the in-list predicate, is string the only type that is not a compute-layer type? Are the others, such as date, all compute-layer types?
Or, more broadly: do predicates always evaluate on compute-layer values?
For example, the date type, string padding, and so on.
Predicates always evaluate on compute-layer values; columns that are read are loaded into columnxxx, which is a compute-layer column.
| // Convert a Field value to its storage representation (via PrimitiveTypeConvertor)
| // and full-encode it as a byte-comparable ascending key via KeyCoder.
| template <PrimitiveType PT>
| inline void full_encode_field_as_key(const Field& f, const KeyCoder* coder, std::string* buf) {
How are string padding and the scale of decimal and datetime types handled?
- RowCursor handles string padding specially. The index does not, because the index treats \0 as the end of the value, so padding is meaningless there.
- The scale of decimal types does not need to be considered: scale is only a scaling factor and is identical within a column, so comparing the raw int values is enough.
- datetime is unaffected as well: its scale only denotes fractional-second precision, and the lexicographic order of the stored int is equivalent to chronological order, so comparisons are not affected.
… interface
Production query path no longer carries a const void* + reinterpret_cast
through InvertedIndexReader::query / try_query. Three classes with
distinct responsibilities replace the old conflated InvertedIndexQueryParamFactory:
* InvertedIndexQueryParam: abstract value interface; readers pull the
  value via typed virtuals (get_string / encode_ascending /
  encode_min_ascending / encode_max_ascending).
* TypedInvertedIndexQueryParam<PT>: concrete typed value; the
  numeric/date/decimal/IP specialisation implements the encode_*
  virtuals using type_limit<>; the string specialisation implements
  get_string only.
* InvertedIndexQueryParamFactory: static-only namespace class that maps
  FE values onto the correct TypedInvertedIndexQueryParam<PT>; no
  instances, no inheritance.
BkdIndexReader::construct_bkd_query_value drops the std::vector<char>
tmp scratch buffer and the _type_info->set_to_min/max calls used to
synthesize +/-infinity sentinels for half-bounded range queries. The
sentinel is now produced directly by the typed query value
(encode_min_ascending / encode_max_ascending), so only inverted-index
supported types ever need to know how to emit a min/max.
With BKD, previously the only consumer, no longer calling it, the entire TypeInfo::set_to_min/max
API surface is removed: TypeInfo virtuals, ScalarTypeInfo storage,
List/Map/Struct DCHECK-fail overrides, every FieldTypeTraits<...>
specialization, the OLAP_FIELD_TYPE_CHAR static function pointer in
types.cpp, Field::set_to_min/max wrappers, and the CharField/VarcharField
/StringField overrides. Corresponding storage_types_test cases are removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
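The half-bounded range handling described in this commit message can be sketched as follows. This is an illustration only: `make_range` is a hypothetical helper, and `std::numeric_limits` stands in for Doris's `type_limit<>`, which is an assumption made to keep the example self-contained.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <utility>

// Hypothetical helper: build a closed range [lo, hi] over a numeric key type.
template <typename T>
std::pair<T, T> make_range(std::optional<T> lower, std::optional<T> upper) {
    // A missing bound is replaced by the type's own limit, which acts as the
    // -infinity / +infinity sentinel for a half-bounded query.
    T lo = lower.value_or(std::numeric_limits<T>::lowest());
    T hi = upper.value_or(std::numeric_limits<T>::max());
    return {lo, hi};
}
```

A predicate such as `x >= 10` on a 32-bit column would then become the range `[10, INT32_MAX]`, with the upper sentinel coming from the value type itself rather than from a separate `set_to_max` call on type metadata.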
Drop the InvertedIndexQueryParam / StringQueryParam / NumericQueryParam hierarchy and the InvertedIndexQueryParamFactory. All IndexReader::query and try_query methods now take a const Field& directly.
BkdIndexReader performs the Field -> KeyCoder dispatch internally via a macro-expanded switch on FieldType, using CppTypeTraits<FT>::CppType as the encoding type (which already handles DATETIME's signed/unsigned distinction).
Removes ~200 lines of factory plus the param hierarchy, eliminates the runtime dynamic_cast in BkdIndexReader::query, and pushes type dispatch from predicate-construction time to query time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…encode_field
inverted_index_query_param.h has no remaining includers; the matching test file is a stub. Remove both.
Also drop the FieldType template parameter from bkd_encode_field: the storage value's bytes are already correct for KeyCoder's compile-time CppType, so the explicit key_t conversion was unnecessary. bkd_encode_min/max still need the CppTypeTraits<FT>::CppType for the right type_limit sentinel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both RowCursor::_encode_field and BKD's encode_bkd_field_ascending did the same Field -> storage value -> KeyCoder dispatch with their own copy of the (FieldType, PrimitiveType) table. Extract the conversion helper and the dispatch X-macro into storage/field_key_encoder.h so both call sites share one source of truth.
- field.h: expose StorageField::key_coder() for callers that already have a KeyCoder-shaped helper.
- field_key_encoder.h: new header with full_encode_field_as_key / encode_field_as_key templates plus DORIS_APPLY_FOR_KEY_ENCODABLE_NON_STRING_TYPES X-macro.
- row_cursor.cpp: 19 hand-written cases collapse into one macro expansion; encode_non_string_field<PT> wrapper removed.
- inverted_index_reader.cpp: drops local bkd_encode_field<PT> and BKD_TYPE_CASES; the three encode_bkd_*_ascending functions reuse the shared macro.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
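For readers unfamiliar with the X-macro pattern this commit relies on, here is a minimal sketch of how one type table can drive several `switch` statements. The `Tag` enum, the `APPLY_FOR_ENCODABLE_TYPES` macro, and both lookup functions are hypothetical; the real `DORIS_APPLY_FOR_KEY_ENCODABLE_NON_STRING_TYPES` table is not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical type tags; the real table pairs FieldType with PrimitiveType.
enum class Tag { I32, I64, F64 };

// Single source of truth: one row per supported (tag, C++ type) pair.
#define APPLY_FOR_ENCODABLE_TYPES(M) \
    M(Tag::I32, int32_t)             \
    M(Tag::I64, int64_t)             \
    M(Tag::F64, double)

template <typename T>
std::size_t encoded_width() {
    return sizeof(T);
}

// Call site 1: width lookup. Returns 0 for tags missing from the table.
std::size_t width_of(Tag t) {
    switch (t) {
#define CASE(TAG, CPP) \
    case TAG:          \
        return encoded_width<CPP>();
        APPLY_FOR_ENCODABLE_TYPES(CASE)
#undef CASE
    }
    return 0;
}

// Call site 2: name lookup reuses the same table with a different expansion.
std::string name_of(Tag t) {
    switch (t) {
#define CASE(TAG, CPP) \
    case TAG:          \
        return #CPP;
        APPLY_FOR_ENCODABLE_TYPES(CASE)
#undef CASE
    }
    return "unsupported";
}
```

Adding a type means adding one row to the table; every call site that expands the macro picks it up, which is the "one source of truth" property the commit message describes.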
The Field-to-key encoding helpers and the dispatch X-macro fit naturally next to KeyCoder rather than in a stand-alone header, since they are thin wrappers around KeyCoder calls. Inline them into storage/key_coder.h and remove storage/field_key_encoder.h.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…match indexed type
`Field::get<PT>()` DCHECKs that the Field's primitive type tag equals `PT`, but predicates like `arr_col = []` reach `encode_bkd_field_ascending` via `FunctionComparison<EqualsOp>` with the entire const ARRAY literal as the query Field, so `actual = TYPE_ARRAY` while the BKD index records the inner scalar (e.g. IPV4). Under ASAN the assert aborts the BE with "requested IPV4, actual ARRAY". Before the void*->Field refactor, the old factory rejected non-scalar types via NotSupported and the engine fell back; this defense was lost when the typed dispatch moved into BKD.
Validate the Field type before dispatching to `full_encode_field_as_key<PT>` and return INVERTED_INDEX_EVALUATE_SKIPPED on mismatch so `SegmentIterator::_apply_index_expr` downgrades to scalar evaluation instead of crashing on the assert. Scalar predicates (`int_col = 1`, `array_contains(int_arr, 2)`) keep matching as before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
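The validate-then-skip behaviour described in this commit can be sketched roughly as below. Everything here is hypothetical and simplified: `EvalStatus` stands in for the Doris status code, `QueryValue` stands in for `Field`, and the "encoding" is a placeholder rather than a real key encoding.

```cpp
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Hypothetical status codes; the commit returns INVERTED_INDEX_EVALUATE_SKIPPED.
enum class EvalStatus { OK, SKIPPED };

// Hypothetical query value: either a scalar or a whole array literal.
using QueryValue = std::variant<int64_t, std::vector<int64_t>>;

// Encode only when the value holds the scalar type the index stores.
// On mismatch, report SKIPPED so the caller can fall back to scalar evaluation
// instead of hitting an assertion in a typed accessor.
EvalStatus encode_if_scalar(const QueryValue& v, std::string* out) {
    const int64_t* scalar = std::get_if<int64_t>(&v);
    if (scalar == nullptr) {
        return EvalStatus::SKIPPED;
    }
    *out = std::to_string(*scalar); // placeholder for the real key encoding
    return EvalStatus::OK;
}
```

The design point is that the type check happens before the typed accessor is reached, so a mismatched value produces a recoverable status rather than an abort.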
…pe_limit
bkd_encode_min/max are now templated by PrimitiveType instead of FieldType. The +/- infinity sentinel is taken from type_limit<CppType> in the compute layer (e.g. DecimalV2Value::get_min_decimal, VecDateTimeValue::datetime_min_value) and projected onto the storage POD via PrimitiveTypeConvertor<PT>::to_storage_field_type. This single-sources every limit constant: DecimalV2 bounds live only on DecimalV2Value, DATE bounds only on VecDateTimeValue.
The two storage-layer type_limit<> specialisations added for decimal12_t and uint24_t in the previous PR are no longer required and are removed along with their includes and the half-bounded-BKD comment block. core/type_limit.h is now exclusively a compute-layer header.
Tests: the two sanity-probe tests that asserted on the deleted specialisations are removed; verify_bkd_range_queries (one TEST_F per BKD-supported PT) still exercises the same +/- infinity codepath end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12f6e41 to 0a391ce
What problem does this PR solve?
A typed query-param interface replaces the void*:
produces the correct +inf byte string.
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)