feat(kvcache): HMA-aware KV block scoring with window-aware SWA by sagearc · Pull Request #650 · llm-d/llm-d-kv-cache

sagearc · 2026-06-09T11:34:22Z

Summary

Adds HMA-aware prefix scoring, building on the group metadata parsed in #612 and indexed in #627.

The scorer mirrors vLLM's hybrid cache-hit convergence: route on the prefix that all attention groups actually share, rather than treating every group as plain full attention. Scope is homogeneous block sizes — all KV cache groups share one block size, equal to the router's hash block size (the common case). Differing-block-size hybrids are deferred (see Notes).

What changed

- Two-phase scoring in LongestPrefixScorer:
  1. main-attention (full / MLA / sink-full) contiguous prefix from block 0: the binding constraint, since full attention needs the whole prefix
  2. sliding-window reduction: a right-to-left trailing-window scan per SWA group that can only shrink the prefix, mirroring vLLM's SlidingWindowManager.find_longest_cache_hit
- The SWA trailing-window length is cdiv(window-1, blockSize) using the group's own block size, mirroring vLLM's _contiguous_blocks_for_hit
- GroupCatalog learns group kind / block size / window from BlockStored events at runtime, with no static per-model config; wired into the scorer via Indexer.SetGroupCatalog
- group_idx == 0 fallback when a group's kind is not yet learned, matching the Dynamo consumer (fix(kv-router): filter KV events by cache spec kind [DYN-3176] ai-dynamo/dynamo#8751)
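
The sliding-window reduction listed above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the PR's code: `swaHitLength`, the `present` slice (whether the SWA group holds each block) and `limit` (the phase-1 main-attention prefix length) are hypothetical names. The scan walks right to left looking for a run of cdiv(window-1, blockSize) contiguous cached blocks and, if no full window is found, falls back to the run that starts at block 0.

```go
package main

// cdiv returns ceil(a / b) for non-negative a and positive b.
func cdiv(a, b int) int {
	return (a + b - 1) / b
}

// swaHitLength returns how many leading blocks count as a cache hit for one
// sliding-window group. present[i] says whether block i is cached for that
// group; limit is the prefix length found in phase 1, so the result can only
// be smaller than or equal to limit. All names here are hypothetical.
func swaHitLength(present []bool, limit, window, blockSize int) int {
	if limit > len(present) {
		limit = len(present)
	}
	need := cdiv(window-1, blockSize)
	if need < 1 {
		need = 1 // assumption of this sketch: always require at least one block
	}
	run := 0
	for i := limit - 1; i >= 0; i-- {
		if !present[i] {
			run = 0
			continue
		}
		run++
		if run >= need {
			// Blocks i .. i+run-1 cover the trailing window, so the hit
			// ends right after them.
			return i + run
		}
	}
	// No full window found: only the contiguous run starting at block 0 counts.
	return run
}
```

With a 128-token window and 16-token blocks, the scan needs cdiv(127, 16) = 8 contiguous blocks. A hole before the trailing window does not reduce the hit, because sliding-window attention never reads those blocks.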

HMA context

Third step of the HMA support tracked in #336, after #612 (parse metadata) and #627 (index group identity).

The algorithm is grounded in vLLM's HybridKVCacheCoordinator.find_longest_cache_hit convergence and the per-type managers, not the issue's original presence-set sketch.

Notes

Scope is homogeneous block sizes (all groups share the router's hash block size). Differing-block-size hybrids (e.g. Gemma) are unsupported at the indexing layer — the single request-key granularity cannot match a differently-sized group's blocks — and are deferred (#336).

The phase-2 reduction is a single sequential per-group min — exact for one modeled SWA group (homogeneous windows). Multiple heterogeneous SWA groups would need vLLM's fixed-point re-check; also deferred (#336).

…arity

Align HMA window-aware scoring with vLLM's cache-hit logic and drop the homogeneous-block-size assumption:

- Convert sliding windows to request-key counts with the router's canonical block size (cdiv(window-1, canonicalBlockSize)) instead of the group's engine block size: the scan walks canonical request keys, so router units are the only correct units. The engine block size is metadata only.
- Scan same-window SWA groups jointly with AND-presence, mirroring vLLM's per-spec-group lookup (a miss in any group is a miss). A sequential per-group min could overstate hits unboundedly and was order-dependent.
- Iterate window classes to a fixed point for heterogeneous windows, mirroring vLLM's restart-on-shrink convergence; a single class (the common case) needs exactly one scan.
- Score SWA-only models (no main-attention group) through a unitary-path mirror gated on catalog topology; null-prefix blocks count at weight 1.0.
- Skip indexing masked (sparse) group stores whose token span exceeds the hash span (vLLM reachable_block_mask output); re-chunking them would fabricate presence. Metadata is still learned.
- Record phase-1 weights as prefix sums so phase-2 truncation is O(1).

Tests cover the canonical-vs-engine divisor, indexer wiring end to end, many:1 grouped writes, the masked-store guard scope, joint-AND and fixed-point semantics, SWA-only scoring, and a hot-path benchmark across the legacy / warm-HMA / worst-case regimes.

Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
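
The joint AND-presence scan and the fixed-point iteration named in this commit message can be illustrated with a short Go sketch. It is written under assumptions and is not the PR's implementation: `windowClass`, `jointHit`, `convergeHit` and `requestKeysForWindow` are hypothetical names, and presence is modeled as plain boolean slices indexed by canonical request key.

```go
package main

// requestKeysForWindow converts a sliding window (in tokens) into a count of
// canonical request keys: cdiv(window-1, canonicalBlockSize).
func requestKeysForWindow(window, canonicalBlockSize int) int {
	return (window - 1 + canonicalBlockSize - 1) / canonicalBlockSize
}

// windowClass groups the SWA cache groups that share one window size.
// Hypothetical type for this sketch.
type windowClass struct {
	need    int      // contiguous request keys required for a hit
	present [][]bool // present[g][i]: group g holds request key i
}

// allPresent reports whether every group in the class holds key i.
func (c windowClass) allPresent(i int) bool {
	for _, g := range c.present {
		if i >= len(g) || !g[i] {
			return false
		}
	}
	return true
}

// jointHit scans keys[:limit] right to left. A key counts only when all
// groups hold it, so a miss in any group is a miss for the class.
func (c windowClass) jointHit(limit int) int {
	run := 0
	for i := limit - 1; i >= 0; i-- {
		if !c.allPresent(i) {
			run = 0
			continue
		}
		run++
		if run >= c.need {
			return i + run
		}
	}
	return run
}

// convergeHit re-scans every class until a full pass shrinks nothing.
// limit strictly decreases on each shrink, so the loop terminates.
func convergeHit(limit int, classes []windowClass) int {
	for {
		shrunk := false
		for _, c := range classes {
			if h := c.jointHit(limit); h < limit {
				limit = h
				shrunk = true
			}
		}
		if !shrunk {
			return limit
		}
	}
}
```

For example, take two classes that each need 2 keys over 6 request keys, where the first holds keys {0, 2, 3, 4, 5} and the second holds keys {0, 1, 2}. A single pass yields 3, but re-checking the first class at the shorter limit shrinks the hit to 1, which is why one sequential pass can overstate the hit. Unlike the single-scan shortcut described for one class, this sketch always runs a confirming pass.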

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates KV cache event handling and scoring to support hybrid attention (HMA) models by stamping engine-agnostic attention metadata onto indexed entries, adding sliding-window-aware scoring, and handling masked (sparse) grouped store events safely.

Changes:

Map vLLM KV cache spec kinds into an engine-agnostic AttentionKind, and stamp that metadata onto PodEntry for both store and remove events.
Add sliding-window-aware longest-prefix scoring (plus benchmarks/tests) and wire canonical block size from the token processor into the scorer.
Skip indexing for masked/sparse grouped store events while still learning group metadata.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

File	Description
pkg/kvevents/pool.go	Stamps HMA metadata onto entries, learns group metadata, and skips masked group-store indexing.
pkg/kvevents/events.go	Documents KVCacheSpecKind semantics and adds kind classification helpers.
pkg/kvevents/pool_test.go	Updates/extends tests for HMA metadata, masked stores, and block-size mismatches.
pkg/kvcache/kvblock_scorer.go	Implements two-phase HMA-aware scoring using stamped entry metadata.
pkg/kvcache/indexer.go	Wires token processor block size into scorer for SWA token→block conversion.
pkg/kvcache/kvblock/index.go	Adds `AttentionKind` and stamped attention fields to `PodEntry`.
pkg/kvcache/kvblock/hma.go	Updates group metadata to store engine-agnostic attention info and makes catalog `Get` nil-safe.
pkg/kvcache/kvblock_scorer_hma_test.go	Adds unit tests covering HMA scoring behavior and indexer wiring.
pkg/kvcache/kvblock_scorer_bench_test.go	Adds benchmarks for scoring hot-path scenarios (legacy vs HMA).

+				// Masked (sparse) group stores: vLLM's reachable_block_mask
+				// (mixed-page-size hybrids, retention-interval checkpointing)
+				// emits token_ids spanning the full block range while
+				// block_hashes covers only the kept tail blocks. Re-chunking
+				// those tokens would fabricate presence for spans the engine
+				// never cached, so skip indexing the event; group metadata
+				// above is still learned.
+				if len(ev.Tokens) > 0 && ev.BlockSize > 0 && len(ev.Tokens) != len(ev.BlockHashes)*ev.BlockSize {
+					debugLogger.Info("skipping masked group store: token span != hash span",
+						"podIdentifier", podIdentifier, "groupIdx", groupID,
+						"numTokens", len(ev.Tokens), "numHashes", len(ev.BlockHashes),
+						"blockSize", ev.BlockSize)
+					continue
+				}
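
The guard in this hunk boils down to a span comparison, which the following standalone Go sketch isolates. `isMaskedStore` is a hypothetical helper name introduced only for illustration; the condition itself is copied from the diff above.

```go
package main

// isMaskedStore reports whether a store event's token span disagrees with
// the span its block hashes cover (numHashes * blockSize tokens). Events
// without tokens or without a block size are never treated as masked.
// Hypothetical helper; the PR inlines this condition.
func isMaskedStore(numTokens, numHashes, blockSize int) bool {
	return numTokens > 0 && blockSize > 0 && numTokens != numHashes*blockSize
}
```

For instance, an event with 64 tokens, a block size of 16 and only 2 block hashes covers 32 tokens by hash, so it is treated as masked and skipped; the same event with 4 hashes is indexed normally.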



+// HMA group-aware scoring needs no extra wiring: the pool stamps each PodEntry
+// with its own group's attention kind and window, and the scorer reads them off
+// the entry. Set CanonicalBlockSize to enable the sliding-window reduction.
+func NewKVBlockScorer(config *KVBlockScorerConfig) (*LongestPrefixScorer, error) {
 	switch config.ScoringStrategy {
 	case LongestPrefixMatch:


+	// Phase 1: per-pod main-attention prefix, recording cumulative per-block
+	// weights so phase 2 can truncate to the converged hit without re-summing.
+	cumWeights := make(map[string][]float64)

 	// Scratch map reused across iterations to avoid per-key allocation.
 	curWeights := make(map[string]float64)
+	s.fillMainWeights(curWeights, keyToPods[keys[0]])

-	// Build weight index for the first key in a single pass over entries.
-	fillMaxWeights(curWeights, keyToPods[keys[0]], s.MediumWeights)
-
-	// activePods tracks pods still in the consecutive prefix chain.
-	// Using a plain map and in-place deletion avoids allocating new sets
-	// on every iteration.
 	activePods := make(map[string]struct{}, len(curWeights))
 	for pod, w := range curWeights {
 		activePods[pod] = struct{}{}
-		podScores[pod] = w
+		cumWeights[pod] = []float64{w}
 	}

 	for i := 1; i < len(keys); i++ {
 		if len(activePods) == 0 {
 			break
 		}

-		// Reuse scratch map: clear and refill for current key.
 		clear(curWeights)
-		fillMaxWeights(curWeights, keyToPods[keys[i]], s.MediumWeights)
+		s.fillMainWeights(curWeights, keyToPods[keys[i]])

-		// In-place intersection: delete pods from activePods that are not
-		// in the current key, and accumulate scores for those that remain.
 		for pod := range activePods {
 			if w, exists := curWeights[pod]; exists {
-				podScores[pod] += w
+				cum := cumWeights[pod]
+				cumWeights[pod] = append(cum, cum[len(cum)-1]+w)
 			} else {
 				delete(activePods, pod)
 			}
 		}
 	}
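
Why cumulative weights make the later truncation cheap can be shown with a small standalone Go sketch. It assumes per-block weights in a plain slice; `prefixSums` and `scoreAtHit` are hypothetical names, not functions from this PR.

```go
package main

// prefixSums returns cum where cum[i] = weights[0] + ... + weights[i],
// the same shape the phase-1 loop builds per pod.
func prefixSums(weights []float64) []float64 {
	cum := make([]float64, 0, len(weights))
	total := 0.0
	for _, w := range weights {
		total += w
		cum = append(cum, total)
	}
	return cum
}

// scoreAtHit returns the total weight of the first hit blocks with a single
// slice lookup, so shrinking the hit in phase 2 needs no re-summing.
func scoreAtHit(cum []float64, hit int) float64 {
	if hit <= 0 || len(cum) == 0 {
		return 0
	}
	if hit > len(cum) {
		hit = len(cum)
	}
	return cum[hit-1]
}
```

With weights 1, 0.5 and 0.25 the prefix sums are 1, 1.5 and 1.75, so truncating a three-block prefix to two blocks reads the score 1.5 directly.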


+// collectPodAttention builds the per-pod attention view from the entries at the
+// scored keys.
+func collectPodAttention(keyToPods map[kvblock.BlockHash][]kvblock.PodEntry) map[string]podAttention {
+	meta := make(map[string]podAttention)
+	for _, entries := range keyToPods {
+		for _, e := range entries {
+			m := meta[e.PodIdentifier]
+			// Non-HMA entries (no group) and main-attention groups both anchor
+			// the main-prefix path.
+			if !e.HasGroup || e.AttentionKind == kvblock.AttentionMain {
+				m.hasMain = true
+			}
+			if e.AttentionKind == kvblock.AttentionSlidingWindow && e.SlidingWindowSize > m.slidingWindowSize {
+				m.slidingWindowSize = e.SlidingWindowSize
+			}
+			meta[e.PodIdentifier] = m
+		}
+	}
+	return meta
+}


+	meta, ok := pool.groupCatalog.Get("pod-hma", kvblock.GroupID(0))
 	require.True(t, ok)
-	assert.Equal(t, string(KVCacheSpecKindSlidingWindow), meta.Kind)
-	assert.Equal(t, 16, meta.BlockSize)
-	require.NotNil(t, meta.SlidingWindowSize)
-	assert.Equal(t, 128, *meta.SlidingWindowSize)
+	assert.Equal(t, kvblock.AttentionSlidingWindow, meta.Kind, "sliding-window group is not main attention")
+	assert.Equal(t, 128, meta.SlidingWindowSize)


Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

sagearc added 2 commits June 8, 2026 16:31

hma awareness in scorer

6787388

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

hma tests

c603860

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

sagearc requested review from dannyharnik, kfirtoledo, liu-cong and vMaroon as code owners June 9, 2026 11:34

github-actions Bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) Jun 9, 2026

github-actions Bot requested review from hyeongyun0916 and yankay June 9, 2026 11:34

sagearc force-pushed the hma-scoring branch from 4bed882 to dd43c75 June 9, 2026 11:38

sagearc marked this pull request as draft June 9, 2026 11:38

sagearc and others added 3 commits June 9, 2026 14:52

window aware hma scoring with indexer owned catalog

d0609f7

Co-authored-by: Kapil Jain <kapiljain1989@gmail.com> Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

tests for window aware hma scoring

9bbe83e

Co-authored-by: Kapil Jain <kapiljain1989@gmail.com> Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

drop unused gosec nolint directives

0c8261e

Co-authored-by: Kapil Jain <kapiljain1989@gmail.com> Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

sagearc force-pushed the hma-scoring branch from dd43c75 to 0c8261e June 9, 2026 11:53

sagearc commented Jun 9, 2026

Comment thread examples/helper/events.go Outdated

sagearc commented Jun 9, 2026

Comment thread pkg/kvevents/zmq_subscriber_bench_test.go Outdated

sagearc commented Jun 9, 2026

Comment thread pkg/tokenization/pool_test.go Outdated

sagearc added 6 commits June 9, 2026 15:24

revert unrelated lint fixes

541990e

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

move KVCacheSpecKind to kvevents pkg

1334f43

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

pool owns hma catalog scorer consumes it via getter

ec89c6a

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

reduce noise

6371c57

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

drop gratuitous variable renames in scorer

c496930

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

check ok in prefix scorer type assertion for errcheck

b4ae2b0

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

github-actions Bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) and removed the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) Jun 9, 2026

sagearc marked this pull request as ready for review June 9, 2026 13:48

support homogeneous block sizes for swa scoring

8fe8265

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

sagearc force-pushed the hma-scoring branch from 00fa6b3 to 8fe8265 June 9, 2026 13:58

sagearc changed the title ~~feat(kvcache): HMA-aware KV block scoring with window-aware SWA~~ feat(kvcache): HMA-aware KV block scoring with window-aware SWA (homogeneous block sizes) Jun 9, 2026

vMaroon force-pushed the hma-scoring branch 2 times, most recently from 6a7f853 to 7a14102 June 12, 2026 11:18

github-actions Bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) and removed the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) Jun 12, 2026

vMaroon force-pushed the hma-scoring branch from 7a14102 to b775f9b June 12, 2026 12:05

github-actions Bot added the hold label (PRs that are blocked on design, other features, release cycle, etc.) Jun 13, 2026

sagearc and others added 2 commits June 15, 2026 12:54

Merge upstream/main

d7e6013

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

sagearc force-pushed the hma-scoring branch from b775f9b to 04d3d09 June 15, 2026 10:01

github-actions Bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) Jun 16, 2026

sagearc force-pushed the hma-scoring branch from f08a667 to 04d3d09 June 16, 2026 11:15

github-actions Bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) and removed the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) Jun 16, 2026

sagearc added 2 commits June 16, 2026 14:19

Merge upstream/main

5492dd2

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

add attention metadata to pod entry

7657ad3

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

sagearc changed the title ~~feat(kvcache): HMA-aware KV block scoring with window-aware SWA (homogeneous block sizes)~~ feat(kvcache): HMA-aware KV block scoring with window-aware SWA Jun 16, 2026

sagearc added 2 commits June 16, 2026 17:36

lint and remove noise

37ffcbc

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

remove redundant attention type check

d4de3ff

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

yankay requested a review from Copilot June 16, 2026 14:55

Copilot AI reviewed Jun 16, 2026

sagearc added 2 commits June 16, 2026 18:12

Revert HMA pod-entry attention metadata changes to align with 5492dd2

aaf7a37

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

index masked swa group stores by content identity

7747949

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>

github-actions Bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) Jun 22, 2026

sagearc wants to merge 21 commits into llm-d:main from sagearc:hma-scoring
