[feature-wip](inverted index) Introduce SPIMI V4 inverted index storage format#63633
airborne12 wants to merge 67 commits into
I found blocking correctness issues in the V4 SPIMI spill path and validation gaps that can expose unsupported V4 tables.
Critical checkpoint conclusions:
- Goal/test proof: the PR wires V4 SPIMI write/read and adds broad tests, but the tests do not cover the production multi-spill case with absolute row ids and include a timing assertion that is likely flaky in CI.
- Scope/focus: mostly focused on V4 SPIMI, but the new p0 latency benchmark is not a stable functional test.
- Concurrency/lifecycle: no new shared concurrent state issue identified in the reviewed path; cache-reader lifetime changes appear to use existing searcher cache patterns.
- Configuration/compatibility: V4 is exposed through table properties/config and thrift/proto, but FE validation still permits unsupported V4 index shapes that BE later rejects during load.
- Parallel paths: cloud/non-cloud schema propagation has V4 branches, but unsupported parser/array paths need consistent FE rejection.
- Data correctness: multi-spill merging can corrupt posting doc ids, and flush can persist a stale doc_count for the triggering row. These affect query correctness for larger V4 fulltext segments.
- Test coverage/results: there are many unit/regression tests, but the missing absolute-doc-id multi-spill coverage lets the main spill corruption escape; the latency regression should not be p0 threshold-based.
- Observability/performance: no additional observability blocker found beyond the functional issues above.
User focus: no additional user-provided focus points were present.
    // Apply doc_id offset and append.
    for (auto& d : docs) {
        d.doc_id += offset;
    }
This offset corrupts production V4 spills. InvertedIndexColumnWriter::add_values() appends the segment-level _rid into SpimiPostingBuffer, and SpillManager::FlushBuffer() emits those doc ids as-is, so spill inputs already contain absolute row ids. When a large segment crosses the SPIMI memory budget more than once, the second and later spills get the running offset added again here, shifting postings beyond their real rows (often beyond total_doc_count) and causing MATCH queries to return wrong or missing rows. The current merger tests build artificial inputs with local doc ids, so they do not cover the actual writer->spill contract. Please either make spill buffers localize doc ids before emitting, or remove this offset for V4 spill inputs and add an end-to-end test with multiple spills using absolute _rid values.
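A minimal sketch of the double shift, assuming a simplified spill that carries one term's doc ids. The Spill struct, the per-spill doc_count, and both merge functions are hypothetical stand-ins for illustration, not the PR's SegmentMerger.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one spill's postings of a single term.
struct Spill {
    std::vector<int32_t> doc_ids;  // absolute segment-level row ids
    int32_t doc_count;             // rows covered by this spill (assumed per-spill)
};

// Offset-based merge: only correct when inputs hold local 0-based ids.
std::vector<int32_t> MergeWithOffset(const std::vector<Spill>& spills) {
    std::vector<int32_t> out;
    int32_t running = 0;
    for (const Spill& s : spills) {
        for (int32_t d : s.doc_ids) out.push_back(d + running);
        running += s.doc_count;
    }
    return out;
}

// Absolute-id contract: concatenate the already-ordered runs unchanged.
std::vector<int32_t> MergeVerbatim(const std::vector<Spill>& spills) {
    std::vector<int32_t> out;
    for (const Spill& s : spills) {
        for (int32_t d : s.doc_ids) {
            assert(out.empty() || d > out.back());  // ids must stay strictly ascending
            out.push_back(d);
        }
    }
    return out;
}
```

With two spills holding rows {0, 2} and {3, 5}, the offset merge moves the second spill's ids to 6 and 8, past the six rows that exist, while the verbatim merge keeps them at 3 and 5.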
    // and continue accepting tokens.
    if (_spimi_writer->ShouldFlush()) {
        _spimi_writer->FlushPending(_spimi_doc_count);
    }
The doc count passed to the spill is stale for the row that triggered the flush. At this point the current row's tokens have already been appended with doc id _rid, but _spimi_doc_count is only advanced below this block. If this is the first row past the budget, the spill manifest can advertise a doc_count that excludes a doc whose postings are present. That stale count is then used by SegmentMerger for offsets/metadata. Update _spimi_doc_count before FlushPending() (or pass static_cast<int32_t>(_rid) + 1) so each spill's metadata covers all postings it contains.
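The ordering problem can be reproduced with a toy loop. This is only a sketch: SpillManifest and FlushAtRow are hypothetical names, and the real writer tracks far more state than a single counter.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of what one spill advertises.
struct SpillManifest {
    int32_t max_doc_id;  // highest doc id whose postings are in the spill
    int32_t doc_count;   // doc count recorded in the spill metadata
};

// Simulates rows 0..trigger_rid, with the flush firing on row trigger_rid.
// advance_before_flush selects whether the counter is bumped before or
// after the flush check.
SpillManifest FlushAtRow(int32_t trigger_rid, bool advance_before_flush) {
    assert(trigger_rid >= 0);
    int32_t doc_count = 0;
    SpillManifest manifest{-1, 0};
    for (int32_t rid = 0; rid <= trigger_rid; ++rid) {
        manifest.max_doc_id = rid;  // this row's tokens are appended first
        if (advance_before_flush) doc_count = rid + 1;
        if (rid == trigger_rid) manifest.doc_count = doc_count;  // the flush
        if (!advance_before_flush) doc_count = rid + 1;
    }
    return manifest;
}
```

A consistent manifest satisfies doc_count > max_doc_id; the late-update ordering breaks that for the triggering row.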
    @@ -1275,11 +1277,15 @@ public static TInvertedIndexFileStorageFormat analyzeInvertedIndexFileStorageFor
            return TInvertedIndexFileStorageFormat.V2;
        } else if (invertedIndexFileStorageFormat.equalsIgnoreCase("v3")) {
            return TInvertedIndexFileStorageFormat.V3;
        } else if (invertedIndexFileStorageFormat.equalsIgnoreCase("v4")) {
            return TInvertedIndexFileStorageFormat.V4;
        } else if (invertedIndexFileStorageFormat.equalsIgnoreCase("default")) {
Exposing explicit V4 here needs matching FE validation for the supported index shapes. The BE writer now rejects V4 when should_analyzer is false and also rejects array string indexes, but InvertedIndexUtil.checkInvertedIndexParser() still accepts parser none and array-with-parser-none for the same table property. That lets CREATE TABLE ... PROPERTIES("inverted_index_storage_format"="V4") succeed for keyword/array inverted indexes, and the first load then fails in BE writer init. Please reject unsupported V4 parser/type combinations during analysis, or implement those paths end-to-end.
    // on top of the actual reader latency. A real reader regression
    // would shift the median far past this cap.
    assertTrue(ratio < 2.0,
            "${tag}: V4 median ${v4.median} us / V2 median ${v2.median} us = ${ratio} " +
A p0 regression suite should not fail on wall-clock query timing. This median ratio includes planner/executor startup, cache state, BE scheduling, network/runner noise, and unrelated concurrent load; on shared CI it can exceed 2x without a functional regression, especially with only 9 retained samples. This will create flaky failures for unrelated changes. Please keep this as logging/manual benchmark coverage, move it out of p0/per-PR execution, or replace the assertion with deterministic functional checks.
TPC-H: Total hot run time: 31767 ms
TPC-DS: Total hot run time: 172092 ms
…h() catch-all
Addresses two review findings on the SPIMI V4 write/read path:
1. query_term_enum.cpp Cursor::ReadVInt/ReadVLong left-shifted `b << shift` with no bound on `shift`. A crafted/corrupt .tis with >=5 (VInt) / >=10 (VLong) continuation bytes drives shift past the operand width (>=32 / >=64), which is undefined behavior. Mirror the existing `shift >= 32U/64U` guards from term_dict_reader.cpp; throw CLuceneError(CL_ERR_IO), which the query path already converts to INVERTED_INDEX_FILE_CORRUPTED. Adds a corrupt-input regression test (CorruptVIntShiftOverflowThrowsNotUB).
2. InvertedIndexColumnWriter::finish() caught only CLuceneError and doris::Exception around the SPIMI emit. SpimiIndexWriter::Finish() rethrows the original exception type (e.g. a std::bad_alloc), so an unlisted type escaped past the FINALLY block, leaking the seven SPIMI IndexOutputs and skipping the eptr->Status conversion. Add a catch-all that funnels any exception through the same error_context path so cleanup runs uniformly and the FINALLY macro converts it to an INVERTED_INDEX_CLUCENE_ERROR Status.
Verified locally under ASAN+DCHECKs: 252 passed / 0 failed (5 env-gated skips).
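A minimal sketch of the bounded VInt decode from item 1, assuming a plain byte vector as input. This ReadVInt is a free-standing illustration, not the PR's Cursor::ReadVInt, and it throws std::runtime_error where the PR throws CLuceneError(CL_ERR_IO).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Decodes one VInt: 7 payload bits per byte, high bit set means "more bytes".
// The shift is checked before every use so a corrupt stream cannot push it
// to 32 or beyond, which would be undefined behavior for a 32-bit operand.
uint32_t ReadVInt(const std::vector<uint8_t>& buf, size_t& pos) {
    assert(pos <= buf.size());
    uint32_t value = 0;
    uint32_t shift = 0;
    while (true) {
        if (pos >= buf.size()) throw std::runtime_error("truncated VInt");
        if (shift >= 32U) throw std::runtime_error("VInt has too many continuation bytes");
        const uint8_t b = buf[pos++];
        value |= static_cast<uint32_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) return value;
        shift += 7;
    }
}
```

Shifts of 0, 7, 14, 21 and 28 are accepted, so a sixth byte is rejected before any shift by 35 happens.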
TPC-H: Total hot run time: 28938 ms
TPC-DS: Total hot run time: 170535 ms
SegmentMerger::Merge applied a per-segment doc_id offset (sum of prior doc_counts), assuming local 0-based inputs. But the write path appends the absolute _rid and FreqProxEncoder::StartTerm resets _last_doc to 0, so each spill segment already holds GLOBAL absolute doc_ids that never overlap across inputs. The offset double-shifted every doc after the first spill, silently corrupting the merged segment's row mapping and pushing doc_ids past total_doc_count for any segment that crossed the 256MiB budget or spilled under memory pressure. Existing multi-input merger tests fed local-0-based segments with local doc_counts (a self-consistent combination) so the bug was invisible. The merge now concatenates the already-ordered absolute-id runs verbatim (no offset); removes the unused MergeCursor; adds MultiSpillAbsoluteDocIdsArePreserved reproducing the production contract (SpillManager double spill + cumulative doc_count + read-back); and rewrites the multi-input tests to construct absolute, non-overlapping doc_ids.
V4 inverted index segments are not byte-compatible with CLucene's indexCompaction/mergeTerms; feeding a V4 .idx to it reads past EOF and SIGSEGVs the whole BE during background index compaction. Skip byte-level index compaction for V4 tablets in construct_index_compaction_columns; the index columns then fall through to the normal compaction rewrite path (segment_writer rebuilds the SPIMI index from merged column data, skip_inverted_index stays false), which is correct and crash-safe. Native SPIMI segment merging during compaction is a follow-up.
…ock heap alloc pack_bits allocated a heap vector per 128-value block (a large segment emits millions of blocks); select_patch copied + fully std::sort'd the block only to read one order statistic (the exception threshold); EncodeBlock's low[] was another per-block heap vector. Replace all three with stack buffers (count<=128, width<=32 bound the payload to <=520 B) and nth_element for the single quantile read. Output is byte-identical — WindowFrameEncoderTest.ByteIdentityGolden, the PFOR differential test, and segment roundtrip all stay green — while dropping 3-5 malloc/free per block on the .frq/.prx encode hot path.
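The sort-to-nth_element change can be illustrated as follows. This is a sketch under the stated block bound of 128 values; KthSmallest and kBlockSize are illustrative names, not the PR's select_patch.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kBlockSize = 128;  // assumed upper bound on values per block

// Returns the k-th smallest value (0-based) of values[0..count) using a
// stack scratch buffer and a partial selection instead of a full sort.
uint32_t KthSmallest(const uint32_t* values, size_t count, size_t k) {
    assert(count > 0 && count <= kBlockSize && k < count);
    std::array<uint32_t, kBlockSize> scratch;  // lives on the stack, no malloc
    std::copy(values, values + count, scratch.begin());
    std::nth_element(scratch.begin(), scratch.begin() + k, scratch.begin() + count);
    return scratch[k];
}
```

std::nth_element runs in linear time on average and only guarantees the element at position k, which is all a single order statistic needs.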
…ort, direct buffer calls
Three byte-identical write-path wins (ByteIdentityGolden + roundtrip + V4 writer/merger tests green, 127 pass):
* HashTerm: replace per-byte FNV-1a (16-22% of V4 wall-clock per the flame graph) with an 8-byte-at-a-time rapidhash-style mum mix. The per-instance seed still seeds the state and is folded by every mum step (C5 DoS resistance) and the hash stays fully deterministic; it only indexes the in-memory slot table (terms emit in lexicographic arena order) so on-disk bytes are unchanged.
* Term sort: precompute a big-endian 8-byte prefix key so std::sort resolves most comparisons with one integer compare instead of two ArenaTermAt lookups + a memcmp; a first-8-byte tie falls back to the full arena compare. Sort order, hence emitted bytes, unchanged.
* add_values V4 loop: fetch the posting buffer once and call Append/Saturated directly, dropping a cross-TU facade hop per token (BE builds without LTO); Saturated then inlines to a single load.
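A sketch of the prefix-key term sort, assuming terms are held as std::string rather than in an arena. PrefixKey and SortByPrefixKey are hypothetical names for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Packs the first 8 bytes of a term into a big-endian integer, zero-padded,
// so integer order matches byte-wise lexicographic order on that prefix.
uint64_t PrefixKey(const std::string& term) {
    uint64_t key = 0;
    for (size_t i = 0; i < 8; ++i) {
        const uint8_t byte = i < term.size() ? static_cast<uint8_t>(term[i]) : 0;
        key = (key << 8) | byte;
    }
    return key;
}

// Sorts terms by comparing the precomputed key first; only a key tie falls
// back to the full string comparison.
std::vector<std::string> SortByPrefixKey(std::vector<std::string> terms) {
    std::vector<std::pair<uint64_t, std::string>> keyed;
    keyed.reserve(terms.size());
    for (std::string& t : terms) {
        const uint64_t key = PrefixKey(t);
        keyed.emplace_back(key, std::move(t));
    }
    std::sort(keyed.begin(), keyed.end(), [](const auto& a, const auto& b) {
        if (a.first != b.first) return a.first < b.first;  // one integer compare
        return a.second < b.second;                        // tie: full compare
    });
    std::vector<std::string> out;
    out.reserve(keyed.size());
    for (auto& kv : keyed) out.push_back(std::move(kv.second));
    assert(std::is_sorted(out.begin(), out.end()));
    return out;
}
```

The fallback matters for terms that share their first 8 bytes and for short terms whose zero padding collides with real zero bytes.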
Disabling .frq ZSTD gives +17~38% write throughput on short/medium-doc corpora (V4 reaches parity with V2): the adaptive-W search no longer compresses every candidate framing. Disk cost is bounded — 73-81% of the ZSTD saving lives in .prx, which stays compressed — so most corpora are flat or smaller; httplogs-class short ASCII pays +10.7% .idx. The .prx framing is already decoupled from the .frq gate, so a raw .frq no longer fragments .prx into incompressible windows. ByteIdentityGolden now also pins frq_zstd_enable=true (alongside the existing zstd_min pin) so it keeps locking the full-ZSTD output regardless of the production default. Decision confirmed by the user after reviewing the throughput/disk trade-off.
With spill-to-disk (spilled bytes are freed to a node-local tmp file, no resident accumulation), a lower budget sharply cuts the large-document write peak — wiki goes 546->259 MB, below V2's 288 MB — for only +0.75% .idx (more spills make the k-way merge re-encode split terms slightly less optimally). Short documents never cross the budget and are unaffected (no spill, peak ~= buffer). memory_budget_test asserts per-record/arena bounds independent of the budget value, so it stays green. Decision confirmed by the user after reviewing the memory/disk trade-off.
…spills MergeSingleInput already byte-copies a single phrase-on spill's .tis/.tii/.frq/.prx and rebuilds only metadata. The omit_term_freq_and_positions guard that excluded DOCS_ONLY single spills was a leftover from before the spill omit flag was lockstepped with the output: a DOCS_ONLY spill is written omit=true (doc-only .frq, empty .prx) in exactly the format the output advertises, so the byte-copy is equally valid — it copies the empty .prx verbatim and rebuilds .fnm with has_prox=false. Dropping the guard lets a DOCS_ONLY segment flushed exactly once skip the decode/re-encode merge (~40-50% of that path's write cost). Renamed the test to fast-path and asserted byte-copy equivalence (output .tis/.frq == input, empty .prx). ASAN: SegmentMergerTest + roundtrip 40 pass / 0 fail.
The in-memory prox chain stored each occurrence's ABSOLUTE position. For long documents an absolute position needs 2-3 VInt bytes while the intra-doc gap is usually 1, so the chain was larger than necessary (the prox pool is one of the largest buckets in compact mode). Store the within-doc delta instead (Lucene-style): last_pos resets to 0 at each doc boundary, the first position in a doc is stored as itself and the rest as gaps. Both decode paths (EmitFromCompactDirect, DecodeCompactTerm/DecodeTermToRecords) prefix-sum the deltas back to absolute before re-encoding, so the on-disk .prx — and the records[] consumed by the norms/materialized path — stay byte-identical. Modular subtraction round-trips any order, matching the doc-delta scheme. DOCS_ONLY is unaffected (no prox chain). Cuts phrase-on peak write memory ~10-20% on long-doc corpora. ASAN: segment roundtrip (positions prefix-sum) + posting-buffer decode (records.position 1,3,5,9) + ByteIdentityGolden + writer/merger, 127 pass.
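The within-doc delta scheme and its prefix-sum decode can be sketched like this, assuming positions are grouped per document and per-doc frequencies are known at decode time. DocPositions, EncodeWithinDocDeltas and DecodeToAbsolute are illustrative names, not the PR's functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical per-document occurrence list for one term.
struct DocPositions {
    uint32_t doc_id;
    std::vector<uint32_t> positions;
};

// Stores each position as the gap from the previous one in the same doc;
// last_pos resets to 0 at every doc boundary.
std::vector<uint32_t> EncodeWithinDocDeltas(const std::vector<DocPositions>& docs) {
    std::vector<uint32_t> deltas;
    for (const DocPositions& d : docs) {
        uint32_t last_pos = 0;
        for (uint32_t p : d.positions) {
            deltas.push_back(p - last_pos);  // unsigned subtraction wraps modulo 2^32
            last_pos = p;
        }
    }
    return deltas;
}

// Prefix-sums the deltas back to absolute positions, restarting per doc.
std::vector<uint32_t> DecodeToAbsolute(const std::vector<uint32_t>& deltas,
                                       const std::vector<uint32_t>& freqs) {
    assert(std::accumulate(freqs.begin(), freqs.end(), size_t{0}) == deltas.size());
    std::vector<uint32_t> positions;
    positions.reserve(deltas.size());
    size_t i = 0;
    for (uint32_t f : freqs) {
        uint32_t pos = 0;
        for (uint32_t k = 0; k < f; ++k) {
            pos += deltas[i++];
            positions.push_back(pos);
        }
    }
    return positions;
}
```

Because both directions use unsigned arithmetic, an out-of-order pair such as 9 then 3 still round-trips.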
… sort
Two byte-identical follow-ups to the write hot-path series (ByteIdentityGolden + roundtrip + merger/writer suites, 176 pass):
* EncodeBlock declared 'values may be modified' but the implementation is read-only (max scan + masked copy into a stack low[]); the non-const signature forced EncodePforPart to scratch-copy every 128-value sub-block before encoding, and EncodeBlockToBytes to copy its whole input. Make the parameter const and feed slices of the staging arrays directly.
* The flat-mode counting-sort stage-1 (distinct text_refs) used the same double-ArenaTermAt+memcmp comparator the compact path had; give it the same big-endian 8-byte prefix key so most comparisons are one integer compare.
…ake it production-shaped
config::init's default-application loop aborts (returns false) on the first field whose default needs an unresolvable env substitution — custom_config_dir = "${DORIS_HOME}/conf" — so without DORIS_HOME every config registered after it (ALL inverted_index_* knobs) stayed zero-initialized: prx_zstd=0, zstd_min_window_bytes=0, prx_window_docs=0 (pathologically finest .prx windows), ram_dir=0, and spill_check_interval_rows=0, which clamps to 1 and made V4 run its expensive memory-watermark gate EVERY ROW (production: every 512). The benchmark had been measuring a config unlike production on both axes that matter most for the V2/V4 comparison. The swallowed init failure is now a LOG(FATAL), DORIS_HOME is provided, and a [BENCH-CONFIG] provenance line logs every measurement-affecting knob per run.
Also production-shapes the harness against cluster-E2E discrepancies: per-dataset row caps (SPIMI_*_ROWS; wikipedia ignores the global SPIMI_BENCH_ROWS so a 200K short-doc cap can't inflate it into 5 GB segments), Doris per-alloc memory tracking ON by default (SPIMI_TRACK_MEM=0 restores the old untracked mode), opt-in write-time searcher-cache warmup (SPIMI_WARMUP_CACHE=1, matching CI/cluster confs), production data-page-sized add_values batches (4096, was 32), and concurrent 16/32-thread V2-vs-V4 cases on the real textbench/weibo corpora — the E2E-shaped headline metric (the historical cluster 'V4 1.5x slower' was allocator contention invisible to single-thread benchmarks).
… pos offsets, analytic raw-W search
Four byte-identical encoder wins, validated by ByteIdentityGolden + roundtrip + merger/writer suites (128 pass) AND a benchmark idx byte-equality check (all six corpus/mode idx sizes exactly unchanged):
* SLIM terms (df < skip_interval, the vocabulary long tail) now write their pure-VInt block straight to the block sink instead of staging through _frq_term_buf and copying at FinishTerm; the slim block is never ZSTD-wrapped, so staging bought nothing. The 0-level skip tail writes no bytes and slim's skip_pointer return is discarded, so output is unchanged.
* ComputeNumberOfSkipLevels short-circuits df < skip_interval (floor(log ratio) is provably 0), saving two std::log calls per slim term; FlushFrqBlock/FlushProxBlock construct their faststring compression scratch inside the compression branch so slim/tiny terms never pay it.
* The windowed encoder records each doc's position-byte offset at StartDoc (when the boundary is known for free) and passes it to WindowFrameEncoder::Encode, which previously recovered doc boundaries by re-scanning the ENTIRE position VInt stream byte-by-byte per term; .prx is the largest stream, so that was a full extra pass over most of the index. The redundant _win_pos_counts vector (always == _win_freqs) is gone; the offset-less scan survives as the fallback for direct Encode callers (tests).
* With .frq ZSTD disabled (the production default), every window payload is the raw tuple, so the adaptive-W search's candidate sizes are pure arithmetic over unit part lengths. AnalyticRawFrqSize mirrors MeasureAndCacheFrq's accounting term for term; the search now computes candidate sizes analytically and composes ONLY the chosen framing, removing the baseline + per-candidate full-term byte copies. A per-term DCHECK cross-checks analytic == measured for the composed framing (held across the whole ASAN suite), and the benchmark idx equality confirms the W decision is bit-identical end to end.
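A sketch of the skip-level short-circuit, assuming the Lucene-style formula floor(log(df) / log(skip_interval)) capped at a maximum level count. The signature and the max_skip_levels parameter are assumptions; the PR's ComputeNumberOfSkipLevels may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Number of skip-list levels for a term with document frequency df.
// For df < skip_interval the log ratio is below 1, so its floor is 0 and
// both std::log calls can be skipped.
int ComputeNumberOfSkipLevels(uint32_t df, uint32_t skip_interval, int max_skip_levels) {
    assert(skip_interval > 1 && max_skip_levels >= 0);
    if (df < skip_interval) return 0;  // slim term: no skip data at all
    const double ratio = std::log(static_cast<double>(df)) /
                         std::log(static_cast<double>(skip_interval));
    return std::min(static_cast<int>(std::floor(ratio)), max_skip_levels);
}
```

The early return also covers df == 0, where std::log would otherwise return negative infinity.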
… contract
MergerInlineOutputMatchesExternal still built its two merge inputs with LOCAL per-input doc ids ({1,2} each), relying on the per-segment offset the merger no longer applies (the offset was removed when the multi-spill doc-id corruption was fixed: production inputs carry GLOBAL absolute ids). Under the absolute contract the two inputs overlapped, tripping StartDoc's strictly-ascending DCHECK. Rebuild the inputs as successive slices of one absolute stream ({1,2} then {4,5}, cumulative doc_counts); the reference segment's expectations were already expressed in global ids and are unchanged. This was the one multi-input merge fixture outside spill_segment_merger_test.cpp that the doc-id fix missed.
…copy fast path) For a slim phrase-on term (df < skip_interval — the long tail that dominates a real vocabulary), the posting buffer's freq chain bytes ARE the on-disk slim .frq block (per-doc docCode VInts: same values in the same order; VInt64/VInt of the same value encode identically and direct-emit input is monotonic), and since the buffer stores within-doc position deltas, the prox chain bytes ARE the raw .prx payload FlushProxBlock builds. EmitFromCompactDirect now copies each chain with one memcpy per slice (ByteSliceReader::AppendRemainingTo) and emits it pre-encoded (FreqProxEncoder::EmitSlimTermPreEncoded, sharing FlushProxRaw for the .prx mode-byte + ZSTD policy), replacing the per-occurrence VInt decode -> encoder state machine -> re-encode replay. DOCS_ONLY stays on the replay (its chain carries freq codes the bare-delta on-disk format omits). Byte-identity: the existing CompactDirectEmitMatchesRecordsPathVInt A/B already locks the non-inline branch; the new CompactDirectEmitMatchesRecordsPathV4Inline locks the V4 (windowed + inline) configuration across the inline (df=3), slim-boundary (511) and windowed (600) cases against the records-path replay. Full SPIMI-wide ASAN suite 486 pass / 0 fail; benchmark idx sizes byte-equal on all six corpus/mode cells.
…ms chain-copy
The DOCS_ONLY posting chain stored phrase-shaped docCode VInts (delta<<1|flag [+freq]) that the emit only decoded to throw the freq away and re-encode bare deltas. Store the bare doc-delta directly — written EAGERLY at doc open (no deferred close; FinalizeBlocks skips omit buffers), one entry per doc, which is byte-for-byte the on-disk DOCS_ONLY slim block. Slim omit terms (df < skip_interval) then take the same chain-copy fast path as phrase-on terms (EmitSlimTermPreEncoded grows omit support: no prox bytes, prox_pointer=0); PFOR omit terms replay doc_count bare deltas. The chain also shrinks in memory (no freq codes), trimming the DOCS_ONLY buffer.
Format ownership: the chain format follows the BUFFER's omit flag, the on-disk format follows the WRITER's — chain-copy and the bare-delta replay are gated on the flags AGREEING. The one legal mixed combination (omit writer over a phrase buffer, which tests use to emit DOCS_ONLY from a generic buffer) keeps the docCode decode+re-encode replay; the OmitTfapByteNeutral A/B pair caught the first cut keying this off the writer flag alone.
Omit records semantics: DecodeCompactTerm/DecodeTermToRecords now yield one record per DOC for omit buffers (per-occurrence multiplicity is not recoverable from a bare-delta chain, and nothing downstream needs it — the omit emit ignores freq). Norms are the only per-occurrence consumer, and omit fields never write norms in V2/CLucene either; EmitSegment grows a DCHECK so a future norms-over-omit caller fails loudly.
Validated: SPIMI-wide ASAN suite 486 pass / 0 fail, including the OmitTfapByteNeutral{VInt,Pfor} byte-equality oracles, OmitTfapDirectEmitMatchesRecordsPath, and the DOCS_ONLY roundtrip.
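The two chain encodings contrasted in this commit can be sketched side by side, assuming the Lucene 2.x docCode layout (delta<<1, low bit set when freq is 1, otherwise followed by the freq). Posting, WriteVInt, EncodeDocCodes and EncodeBareDeltas are illustrative helpers, not the PR's code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Posting {
    uint32_t doc_id;
    uint32_t freq;
};

// Appends v as a VInt: 7 bits per byte, high bit marks continuation.
void WriteVInt(std::vector<uint8_t>& out, uint32_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>(v | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// Phrase-shaped chain: docCode = delta<<1 | (freq == 1), plus freq if > 1.
std::vector<uint8_t> EncodeDocCodes(const std::vector<Posting>& postings) {
    std::vector<uint8_t> out;
    uint32_t last_doc = 0;
    for (const Posting& p : postings) {
        assert(p.freq >= 1 && p.doc_id >= last_doc);
        const uint32_t delta = p.doc_id - last_doc;
        if (p.freq == 1) {
            WriteVInt(out, (delta << 1) | 1U);
        } else {
            WriteVInt(out, delta << 1);
            WriteVInt(out, p.freq);
        }
        last_doc = p.doc_id;
    }
    return out;
}

// DOCS_ONLY chain: one bare doc delta per document, no freq information.
std::vector<uint8_t> EncodeBareDeltas(const std::vector<Posting>& postings) {
    std::vector<uint8_t> out;
    uint32_t last_doc = 0;
    for (const Posting& p : postings) {
        assert(p.doc_id >= last_doc);
        WriteVInt(out, p.doc_id - last_doc);
        last_doc = p.doc_id;
    }
    return out;
}
```

For docs 3 (freq 1) and 7 (freq 2), the docCode chain takes three bytes and the bare-delta chain two, which is the in-memory saving the commit describes for omit buffers.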
…, last-term swap Three byte-identical term-dictionary wins, one per term across the whole vocabulary: Utf8ToWideInto fills a reused member wstring instead of heap-allocating one per Add/AddInline; the front-coded suffix is staged once (AppendSCharsFromWide, sharing one EncodeSChar core with WriteSCharsFromWide so the encodings cannot drift) and emitted with a single bulk WriteBytes instead of a virtual WriteByte per encoded byte; and the .tis last-term update swaps the scratch instead of copying the wstring (the .tii boundary entries, 1 in 128, still copy and never reach the swap branch). SPIMI-wide ASAN suite 486 pass / 0 fail.
…d emit) The windowed-term replay decoded every position from the prox chain (prefix-summing within-doc deltas to absolutes) only for AddPosition to recompute the SAME deltas and re-encode the SAME LEB128 bytes into the windowed position buffer. Since the chain bytes and the rebuilt buffer are byte-identical, EmitFromCompactDirect now copies the whole prox chain once (one memcpy per slice), decodes only the per-doc docCode entries (df values — cheap relative to occurrences), recovers each doc's position byte offset with a continuation-bit scan (no value decode), and hands everything to FreqProxEncoder::EmitWindowedTermPreDecoded — which produces exactly FinishTermWindowed's output via the same WindowFrameEncoder::Encode call. High-frequency terms (wiki-class head words, millions of occurrences) skip the per-occurrence decode + re-encode entirely. Gated on V4 windowed + phrase-on with a phrase BUFFER (the chain must carry positions); the omit-writer-over-phrase-buffer combination keeps the replay. SPIMI-wide ASAN suite 486 pass / 0 fail, including the V4Inline direct-vs-records A/B's windowed (df=600) case.
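The continuation-bit scan can be sketched as below, assuming the prox chain is a flat VInt byte stream and per-doc frequencies are already decoded. DocPositionOffsets is a hypothetical helper, and it throws std::runtime_error on a short stream purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Returns the byte offset at which each doc's positions start inside a
// VInt-encoded position stream. Only the high (continuation) bit of each
// byte is inspected; no position value is decoded.
std::vector<size_t> DocPositionOffsets(const std::vector<uint8_t>& prox,
                                       const std::vector<uint32_t>& freqs) {
    std::vector<size_t> offsets;
    offsets.reserve(freqs.size());
    size_t pos = 0;
    for (uint32_t freq : freqs) {
        offsets.push_back(pos);
        uint32_t remaining = freq;  // VInts still to skip for this doc
        while (remaining > 0) {
            if (pos >= prox.size()) {
                throw std::runtime_error("prox stream shorter than freqs imply");
            }
            // A byte with the high bit clear terminates one VInt.
            if ((prox[pos++] & 0x80) == 0) --remaining;
        }
    }
    assert(offsets.size() == freqs.size());
    return offsets;
}
```

Each byte is touched once, so the scan costs one pass over the chain regardless of how large the encoded values are.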
TPC-H: Total hot run time: 29465 ms
TPC-DS: Total hot run time: 168328 ms
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Introduces a new inverted index storage format V4 powered by SPIMI (Single-Pass In-Memory Indexing), replacing the CLucene IndexWriter on the write path for analyzed (fulltext) string columns.
Why
CLucene's IndexWriter accumulates per-token Posting linked-list nodes plus a term hash table plus a char[] interning pool. On Doris fulltext columns this dominates BE memory during write and shows up in OOM kills on large segments. The encoding is byte-equivalent to Lucene 2.x, but the in-memory representation is the cost. SPIMI keeps a flat (term_id, doc_id, position) record array plus a single intern arena, then sorts and emits the same Lucene 2.x sibling files (.tis/.tii/.frq/.prx/.fnm/segments_N) on finish(). The on-disk format is unchanged; only the writer's working memory shape changes.
Measured impact (SPIMI_BENCH=1, ~614 K occurrences/segment)
[Results table not preserved; its last row was .idx on-disk size.]
Repetitive vocab is the architectural trade-off region: V4's compact-mode VInt-delta stream scales per-occurrence while CLucene's Posting struct scales per-unique-term. Absolute memory in this regime is sub-MB on both sides, so the percentage swing has no production impact. Storage-size delta on repetitive vocab is the documented PFOR header cost.
What's in this PR
be/src/storage/index/inverted/spimi/): SpimiPostingBuffer (flat record + arena + intern map with hybrid compact-mode VInt-delta migration), SegmentWriter, TermDictWriter, FieldInfosWriter, SegmentInfosWriter, PFOR encoder for high-doc-freq postings, ByteOutput family abstracting CLucene's IndexOutput.
SpimiQueryIndexReader, SpimiTermDocsReader, SpimiProxReader, SpimiTermEnum, SpimiSearcherBuilder; SpimiFulltextIndexReader is the Doris-side adapter (overrides type() -> SPIMI_FULLTEXT so the searcher cache routes correctly).
column_reader.cpp dispatch: V4 storage format → SPIMI reader; V1/V2/V3 unchanged.
EmitSegment post-flush self-validation: ValidateClosedSegmentByteCounts re-queries on-disk file lengths after close and throws INVERTED_INDEX_FILE_CORRUPTED on mismatch; this guards against the async-S3 partial-flush class of bugs that single-node tests can't see.
be/test/storage/index/inverted/spimi/ plus extended tests under be/test/storage/segment/:
SPIMI_THROW_CORRUPT site (segments_N / .frq / .prx / PFOR / .tis-.tii readers)
fault-injection case
.idx byte parity)
randomized V2/V4 alternation + full distribution report
repetitive workloads
(InvertedIndexReaderTest.SpimiV2V4QueryLatencyBenchmark) using the corrected SpimiFulltextIndexReader::create_shared dispatch
SPIMI_BENCH env-var tier: default UT runs use 12 K occurrences (fast regression guard); SPIMI_BENCH=1 scales to ~614 K, SPIMI_BENCH=large scales to ~6 M for full-segment stress. Keeps headline benchmark numbers reproducible without ballooning every UT pass.
inverted_index_p0/storage_format/test_storage_format_v4: V2 vs V4 black-box parity across MATCH_ANY / MATCH_ALL / MATCH_PHRASE / MATCH_PHRASE_PREFIX / MATCH_REGEXP, NULL/empty handling, and the support_phrase=false (omit_tfap) no-prox write+read path.
test_storage_format_v4_cloud: same coverage gated by isCloudMode() so the async-S3 upload path gets exercised.
test_storage_format_v4_query_latency: cluster-level V2 vs V4 query timing distribution.
PropertyAnalyzer, TabletIndex, OlapTable): accept inverted_index_storage_format=V4 in CREATE TABLE PROPERTIES; propagate through the protocol to BE.
What's NOT in this PR (known gaps)
currently emits a single _0 segment per column; compaction is documented as a follow-up in SPIMI_DESIGN.md.
omit_norms=true; the read side synthesizes a default-norm array. Score-using paths (MATCH_ALL with relevance ordering) fall back to V2 behavior on V4 columns. Listed in design doc.
(should_analyzer=false) and numeric (BKD) paths remain on the existing writers.
Release note
Add inverted index storage format V4, an in-house SPIMI-based writer that reduces BE write-side memory by ~55 % and CPU by ~68 % on diverse-vocab fulltext workloads while keeping segment on-disk format Lucene 2.x compatible. Enable by setting inverted_index_storage_format = "V4" in CREATE TABLE PROPERTIES.
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)