feat(kvcache): HMA-aware KV block scoring with window-aware SWA #650
Open
sagearc wants to merge 21 commits into
Conversation
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Co-authored-by: Kapil Jain <kapiljain1989@gmail.com> Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
sagearc
commented
Jun 9, 2026
6a7f853 to
7a14102
Compare
Member
In its current state, this PR only helps pure SWA models - deferring to after release. /hold
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
…arity

Align HMA window-aware scoring with vLLM's cache-hit logic and drop the homogeneous-block-size assumption:

- Convert sliding windows to request-key counts with the router's canonical block size (cdiv(window-1, canonicalBlockSize)) instead of the group's engine block size: the scan walks canonical request keys, so router units are the only correct units. The engine block size is metadata only.
- Scan same-window SWA groups jointly with AND-presence, mirroring vLLM's per-spec-group lookup (a miss in any group is a miss). A sequential per-group min could overstate hits unboundedly and was order-dependent.
- Iterate window classes to a fixed point for heterogeneous windows, mirroring vLLM's restart-on-shrink convergence; a single class (the common case) needs exactly one scan.
- Score SWA-only models (no main-attention group) through a unitary-path mirror gated on catalog topology; null-prefix blocks count at weight 1.0.
- Skip indexing masked (sparse) group stores whose token span exceeds the hash span (vLLM reachable_block_mask output); re-chunking them would fabricate presence. Metadata is still learned.
- Record phase-1 weights as prefix sums so phase-2 truncation is O(1).

Tests cover the canonical-vs-engine divisor, indexer wiring end to end, many:1 grouped writes, the masked-store guard scope, joint-AND and fixed-point semantics, SWA-only scoring, and a hot-path benchmark across the legacy / warm-HMA / worst-case regimes.

Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
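The joint AND-presence scan described in this commit message can be sketched as follows. This is a minimal illustration, not the PR's code: the names `allPresent` and `slidingWindowHit` and the boolean-slice presence view are assumptions made for the example, and the right-to-left scan is modeled on the description of vLLM's sliding-window lookup given in this thread.

```go
package main

import "fmt"

// allPresent reports whether every group holds request key i (AND-presence).
// groups[g][i] is a hypothetical presence view, not the PR's data structure.
func allPresent(groups [][]bool, i int) bool {
	for _, g := range groups {
		if i >= len(g) || !g[i] {
			return false
		}
	}
	return true
}

// slidingWindowHit scans keys [0, maxHit) from right to left and returns the
// length of the longest usable prefix: the end of the rightmost run of `need`
// contiguous keys that are present in every group.
func slidingWindowHit(groups [][]bool, maxHit, need int) int {
	if need < 1 {
		need = 1 // assumption for this sketch: require at least one key
	}
	run := 0
	for i := maxHit - 1; i >= 0; i-- {
		if !allPresent(groups, i) {
			run = 0 // a miss in any group is a miss
			continue
		}
		run++
		if run >= need {
			return i + run // keys [i, i+run) close the usable prefix
		}
	}
	// No full window was found; only a run anchored at key 0 still counts.
	return run
}

func main() {
	g0 := []bool{false, false, true, true, false, false}
	g1 := []bool{false, false, false, false, true, true}
	fmt.Println(slidingWindowHit([][]bool{g0}, 6, 2))     // 4
	fmt.Println(slidingWindowHit([][]bool{g1}, 6, 2))     // 6
	fmt.Println(slidingWindowHit([][]bool{g0, g1}, 6, 2)) // 0
}
```

In this example each group alone reports a hit (4 and 6), yet no key is present in both, so the joint scan returns 0. Chaining the single-group scans would give 0 in one order and 4 in the other, which is the order dependence the commit message mentions.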
Contributor
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR updates KV cache event handling and scoring to support hybrid attention (HMA) models by stamping engine-agnostic attention metadata onto indexed entries, adding sliding-window-aware scoring, and handling masked (sparse) grouped store events safely.
Changes:
- Map vLLM KV cache spec kinds into an engine-agnostic `AttentionKind`, and stamp that metadata onto `PodEntry` for both store and remove events.
- Add sliding-window-aware longest-prefix scoring (plus benchmarks/tests) and wire canonical block size from the token processor into the scorer.
- Skip indexing for masked/sparse grouped store events while still learning group metadata.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| pkg/kvevents/pool.go | Stamps HMA metadata onto entries, learns group metadata, and skips masked group-store indexing. |
| pkg/kvevents/events.go | Documents KVCacheSpecKind semantics and adds kind classification helpers. |
| pkg/kvevents/pool_test.go | Updates/extends tests for HMA metadata, masked stores, and block-size mismatches. |
| pkg/kvcache/kvblock_scorer.go | Implements two-phase HMA-aware scoring using stamped entry metadata. |
| pkg/kvcache/indexer.go | Wires token processor block size into scorer for SWA token→block conversion. |
| pkg/kvcache/kvblock/index.go | Adds AttentionKind and stamped attention fields to PodEntry. |
| pkg/kvcache/kvblock/hma.go | Updates group metadata to store engine-agnostic attention info and makes catalog Get nil-safe. |
| pkg/kvcache/kvblock_scorer_hma_test.go | Adds unit tests covering HMA scoring behavior and indexer wiring. |
| pkg/kvcache/kvblock_scorer_bench_test.go | Adds benchmarks for scoring hot-path scenarios (legacy vs HMA). |
Comment on lines +351 to +364

    // Masked (sparse) group stores: vLLM's reachable_block_mask
    // (mixed-page-size hybrids, retention-interval checkpointing)
    // emits token_ids spanning the full block range while
    // block_hashes covers only the kept tail blocks. Re-chunking
    // those tokens would fabricate presence for spans the engine
    // never cached, so skip indexing the event; group metadata
    // above is still learned.
    if len(ev.Tokens) > 0 && ev.BlockSize > 0 && len(ev.Tokens) != len(ev.BlockHashes)*ev.BlockSize {
        debugLogger.Info("skipping masked group store: token span != hash span",
            "podIdentifier", podIdentifier, "groupIdx", groupID,
            "numTokens", len(ev.Tokens), "numHashes", len(ev.BlockHashes),
            "blockSize", ev.BlockSize)
        continue
    }
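The guard above can be isolated as a small predicate to make its behavior easy to check. This is a sketch only: `isMaskedStore` is a hypothetical helper name, and the condition is copied from the excerpt rather than from the merged code.

```go
package main

import "fmt"

// isMaskedStore reports whether a store event's token span disagrees with its
// hash span. Hypothetical helper; it mirrors the condition in the excerpt.
func isMaskedStore(numTokens, numHashes, blockSize int) bool {
	return numTokens > 0 && blockSize > 0 && numTokens != numHashes*blockSize
}

func main() {
	// Dense store: 4 hashes * 16 tokens per block = 64 tokens.
	fmt.Println(isMaskedStore(64, 4, 16)) // false
	// Masked store: tokens span 8 blocks, hashes cover only 2 kept blocks.
	fmt.Println(isMaskedStore(128, 2, 16)) // true
	// No tokens attached: the guard does not apply.
	fmt.Println(isMaskedStore(0, 2, 16)) // false
}
```

As written, the check is a strict equality, so any event whose token count is not exactly `numHashes*blockSize` is skipped, including one that carries tokens past the last full block. Whether such events occur in practice is not stated in this thread.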
Comment on lines +359 to +362

    debugLogger.Info("skipping masked group store: token span != hash span",
        "podIdentifier", podIdentifier, "groupIdx", groupID,
        "numTokens", len(ev.Tokens), "numHashes", len(ev.BlockHashes),
        "blockSize", ev.BlockSize)
Comment on lines 61 to 66

    // HMA group-aware scoring needs no extra wiring: the pool stamps each PodEntry
    // with its own group's attention kind and window, and the scorer reads them off
    // the entry. Set CanonicalBlockSize to enable the sliding-window reduction.
    func NewKVBlockScorer(config *KVBlockScorerConfig) (*LongestPrefixScorer, error) {
        switch config.ScoringStrategy {
        case LongestPrefixMatch:
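The comment above ties the sliding-window reduction to `CanonicalBlockSize`. A minimal sketch of the token-to-request-key conversion, using the `cdiv(window-1, canonicalBlockSize)` formula quoted in the commit message, might look like this. The helper name `windowToRequestKeys` and its handling of degenerate inputs are assumptions for illustration.

```go
package main

import "fmt"

// cdiv returns ceil(a / b) for a >= 0 and b > 0.
func cdiv(a, b int) int {
	return (a + b - 1) / b
}

// windowToRequestKeys converts a sliding window measured in tokens into a
// count of canonical request keys. Hypothetical helper; returning 0 for
// unusable inputs is an assumption of this sketch.
func windowToRequestKeys(windowTokens, canonicalBlockSize int) int {
	if windowTokens <= 1 || canonicalBlockSize <= 0 {
		return 0
	}
	return cdiv(windowTokens-1, canonicalBlockSize)
}

func main() {
	fmt.Println(windowToRequestKeys(128, 16)) // 8
	fmt.Println(windowToRequestKeys(129, 16)) // 8
	fmt.Println(windowToRequestKeys(130, 16)) // 9
}
```

The divisor is the router's block size rather than the engine's because, per the commit message, the scan walks canonical request keys.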
Comment on lines +201 to 231

    // Phase 1: per-pod main-attention prefix, recording cumulative per-block
    // weights so phase 2 can truncate to the converged hit without re-summing.
    cumWeights := make(map[string][]float64)

    // Scratch map reused across iterations to avoid per-key allocation.
    curWeights := make(map[string]float64)
    s.fillMainWeights(curWeights, keyToPods[keys[0]])

    // Build weight index for the first key in a single pass over entries.
    fillMaxWeights(curWeights, keyToPods[keys[0]], s.MediumWeights)

    // activePods tracks pods still in the consecutive prefix chain.
    // Using a plain map and in-place deletion avoids allocating new sets
    // on every iteration.
    activePods := make(map[string]struct{}, len(curWeights))
    for pod, w := range curWeights {
        activePods[pod] = struct{}{}
        podScores[pod] = w
        cumWeights[pod] = []float64{w}
    }

    for i := 1; i < len(keys); i++ {
        if len(activePods) == 0 {
            break
        }

        // Reuse scratch map: clear and refill for current key.
        clear(curWeights)
        fillMaxWeights(curWeights, keyToPods[keys[i]], s.MediumWeights)
        s.fillMainWeights(curWeights, keyToPods[keys[i]])

        // In-place intersection: delete pods from activePods that are not
        // in the current key, and accumulate scores for those that remain.
        for pod := range activePods {
            if w, exists := curWeights[pod]; exists {
                podScores[pod] += w
                cum := cumWeights[pod]
                cumWeights[pod] = append(cum, cum[len(cum)-1]+w)
            } else {
                delete(activePods, pod)
            }
        }
    }
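The reason for recording cumulative weights in phase 1 is that a later truncation to a shorter hit becomes a single slice lookup. A small standalone sketch of that idea, with hypothetical helper names `prefixSums` and `truncatedScore` that do not appear in the PR:

```go
package main

import "fmt"

// prefixSums returns cumulative sums of per-block weights:
// out[i] is the total weight of blocks 0..i.
func prefixSums(weights []float64) []float64 {
	out := make([]float64, 0, len(weights))
	total := 0.0
	for _, w := range weights {
		total += w
		out = append(out, total)
	}
	return out
}

// truncatedScore returns the score of the first `hit` blocks in O(1),
// clamping hit to the recorded prefix length.
func truncatedScore(cum []float64, hit int) float64 {
	if hit <= 0 || len(cum) == 0 {
		return 0
	}
	if hit > len(cum) {
		hit = len(cum)
	}
	return cum[hit-1]
}

func main() {
	cum := prefixSums([]float64{1.0, 0.5, 1.0, 0.5})
	fmt.Println(cum)                    // [1 1.5 2.5 3]
	fmt.Println(truncatedScore(cum, 2)) // 1.5
}
```

If phase 2 shrinks a pod's hit from four blocks to two, the score is read from `cum[1]` instead of re-summing the per-block weights.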
Comment on lines +108 to +127

    // collectPodAttention builds the per-pod attention view from the entries at the
    // scored keys.
    func collectPodAttention(keyToPods map[kvblock.BlockHash][]kvblock.PodEntry) map[string]podAttention {
        meta := make(map[string]podAttention)
        for _, entries := range keyToPods {
            for _, e := range entries {
                m := meta[e.PodIdentifier]
                // Non-HMA entries (no group) and main-attention groups both anchor
                // the main-prefix path.
                if !e.HasGroup || e.AttentionKind == kvblock.AttentionMain {
                    m.hasMain = true
                }
                if e.AttentionKind == kvblock.AttentionSlidingWindow && e.SlidingWindowSize > m.slidingWindowSize {
                    m.slidingWindowSize = e.SlidingWindowSize
                }
                meta[e.PodIdentifier] = m
            }
        }
        return meta
    }
Comment on lines +690 to +693

    meta, ok := pool.groupCatalog.Get("pod-hma", kvblock.GroupID(0))
    require.True(t, ok)
    assert.Equal(t, string(KVCacheSpecKindSlidingWindow), meta.Kind)
    assert.Equal(t, 16, meta.BlockSize)
    require.NotNil(t, meta.SlidingWindowSize)
    assert.Equal(t, 128, *meta.SlidingWindowSize)
    assert.Equal(t, kvblock.AttentionSlidingWindow, meta.Kind, "sliding-window group is not main attention")
    assert.Equal(t, 128, meta.SlidingWindowSize)
Summary
Adds HMA-aware prefix scoring, building on the group metadata parsed in #612 and indexed in #627.
The scorer mirrors vLLM's hybrid cache-hit convergence: route on the prefix that all attention groups actually share, rather than treating every group as plain full attention. Scope is homogeneous block sizes — all KV cache groups share one block size, equal to the router's hash block size (the common case). Differing-block-size hybrids are deferred (see Notes).
What changed
- `LongestPrefixScorer`: `SlidingWindowManager.find_longest_cache_hit`; `cdiv(window-1, blockSize)` using the group's own block size, mirroring vLLM's `_contiguous_blocks_for_hit`
- `GroupCatalog` learns group kind / block size / window from `BlockStored` events at runtime; no static per-model config; wired into the scorer via `Indexer.SetGroupCatalog`
- `group_idx == 0` fallback when a group's kind is not yet learned, matching the Dynamo consumer (fix(kv-router): filter KV events by cache spec kind [DYN-3176] ai-dynamo/dynamo#8751)
HMA context
Third step of the HMA support tracked in #336, after #612 (parse metadata) and #627 (index group identity).
The algorithm is grounded in vLLM's `HybridKVCacheCoordinator.find_longest_cache_hit` convergence and the per-type managers, not the issue's original presence-set sketch.
Notes
Scope is homogeneous block sizes (all groups share the router's hash block size). Differing-block-size hybrids (e.g. Gemma) are unsupported at the indexing layer — the single request-key granularity cannot match a differently-sized group's blocks — and are deferred (#336).
The phase-2 reduction is a single sequential per-group min — exact for one modeled SWA group (homogeneous windows). Multiple heterogeneous SWA groups would need vLLM's fixed-point re-check; also deferred (#336).
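To illustrate why heterogeneous windows need a fixed-point re-check rather than a single pass, here is a minimal sketch. It is not the PR's implementation: `windowClass`, `scan`, and `convergeHit` are hypothetical names, presence is modeled as a boolean slice per window class, and the restart-on-shrink loop follows the description of vLLM's convergence given in this thread.

```go
package main

import "fmt"

// windowClass is a hypothetical view of one sliding-window class.
type windowClass struct {
	need    int    // contiguous request keys required at the tail of a hit
	present []bool // present[i]: key i is held by every group in the class
}

// scan returns the longest usable prefix within [0, maxHit) for one class,
// walking right to left until `need` contiguous present keys are found.
func scan(c windowClass, maxHit int) int {
	run := 0
	for i := maxHit - 1; i >= 0; i-- {
		if i >= len(c.present) || !c.present[i] {
			run = 0
			continue
		}
		run++
		if run >= c.need {
			return i + run
		}
	}
	return run // only a run anchored at key 0 remains
}

// convergeHit re-scans every class until no class shrinks the hit further.
// The hit never grows and is bounded below by 0, so the loop terminates.
func convergeHit(classes []windowClass, maxHit int) int {
	hit := maxHit
	for changed := true; changed; {
		changed = false
		for _, c := range classes {
			if h := scan(c, hit); h < hit {
				hit = h
				changed = true
			}
		}
	}
	return hit
}

func main() {
	a := windowClass{need: 2, present: []bool{true, true, false, true, true, true}}
	b := windowClass{need: 1, present: []bool{true, true, true, true, false, false}}
	fmt.Println(convergeHit([]windowClass{a, b}, 6)) // 2
}
```

In this example a single pass over class `a` and then class `b` would stop at 4, but class `a` cannot support a hit of 4 because key 2 is missing, so the second pass shrinks the result to 2. For simplicity the sketch always runs one confirming pass after any shrink, even when there is only one class.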
Related