feat(kvcache): HMA-aware KV block scoring with window-aware SWA #650

Open
sagearc wants to merge 21 commits into llm-d:main from sagearc:hma-scoring

Conversation

sagearc commented Jun 9, 2026

Collaborator

Summary

Adds HMA-aware prefix scoring, building on the group metadata parsed in #612 and indexed in #627.

The scorer mirrors vLLM's hybrid cache-hit convergence: route on the prefix that all attention groups actually share, rather than treating every group as plain full attention. Scope is homogeneous block sizes — all KV cache groups share one block size, equal to the router's hash block size (the common case). Differing-block-size hybrids are deferred (see Notes).

What changed

  • two-phase scoring in LongestPrefixScorer:
    1. main-attention (full / MLA / sink-full) contiguous prefix from block 0 — the binding constraint, since full attention needs the whole prefix
    2. sliding-window reduction — a right-to-left trailing-window scan per SWA group that can only shrink the prefix, mirroring vLLM's SlidingWindowManager.find_longest_cache_hit
  • the SWA trailing-window length is cdiv(window-1, blockSize) using the group's own block size — mirroring vLLM's _contiguous_blocks_for_hit
  • GroupCatalog learns group kind / block size / window from BlockStored events at runtime — no static per-model config; wired into the scorer via Indexer.SetGroupCatalog
  • group_idx == 0 fallback when a group's kind is not yet learned, matching the Dynamo consumer (fix(kv-router): filter KV events by cache spec kind [DYN-3176] ai-dynamo/dynamo#8751)
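
The two phases above can be sketched in Go as follows. This is a minimal illustration, not the PR's code: the names `mainPrefix`, `swaHit`, and `hitBlocks` are hypothetical, presence is modeled as a plain `[]bool` per group, and a single block size is assumed for all groups (the homogeneous case described in the Summary). The fallback when no full window is found (keep only a run that reaches block 0) is an assumption based on the vLLM behavior the description refers to.

```go
package main

import "fmt"

// cdiv is ceiling division for non-negative a and positive b.
func cdiv(a, b int) int { return (a + b - 1) / b }

// mainPrefix is phase 1: the number of leading blocks that are present,
// i.e. the contiguous main-attention hit starting at block 0.
func mainPrefix(present []bool) int {
	n := 0
	for n < len(present) && present[n] {
		n++
	}
	return n
}

// swaHit is phase 2 for one sliding-window group: scan right to left over
// the first `limit` blocks, looking for the rightmost run of `need`
// contiguous present blocks. The hit ends at the right edge of that run.
func swaHit(present []bool, limit, need int) int {
	if limit > len(present) {
		limit = len(present)
	}
	run := 0
	for i := limit - 1; i >= 0; i-- {
		if !present[i] {
			run = 0
			continue
		}
		run++
		if run >= need {
			return i + run
		}
	}
	// No full window found: only a run that reaches block 0 still counts.
	return run
}

// hitBlocks combines both phases. The result never exceeds the phase-1
// prefix, so the sliding-window scan can only shrink the hit.
func hitBlocks(mainPresent, swaPresent []bool, window, blockSize int) int {
	hit := mainPrefix(mainPresent)
	need := cdiv(window-1, blockSize)
	return swaHit(swaPresent, hit, need)
}

func main() {
	mainPresent := []bool{true, true, true, true, true, true, false, true}
	swaPresent := []bool{false, false, true, true, false, true, true, true}
	// window=33 tokens, blockSize=16 -> need = cdiv(32, 16) = 2 blocks.
	fmt.Println(hitBlocks(mainPresent, swaPresent, 33, 16))
}
```

In this example the main-attention prefix is 6 blocks, but the sliding-window group has no two contiguous blocks ending at block 5, so the hit shrinks to the run covering blocks 2 and 3.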

HMA context

Third step of the HMA support tracked in #336, after #612 (parse metadata) and #627 (index group identity).

The algorithm is grounded in vLLM's HybridKVCacheCoordinator.find_longest_cache_hit convergence and the per-type managers, not the issue's original presence-set sketch.

Notes

Scope is homogeneous block sizes (all groups share the router's hash block size). Differing-block-size hybrids (e.g. Gemma) are unsupported at the indexing layer — the single request-key granularity cannot match a differently-sized group's blocks — and are deferred (#336).

The phase-2 reduction is a single sequential per-group min — exact for one modeled SWA group (homogeneous windows). Multiple heterogeneous SWA groups would need vLLM's fixed-point re-check; also deferred (#336).

Related

sagearc added 2 commits June 8, 2026 16:31
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
@github-actions github-actions Bot added the size/XL label Jun 9, 2026
@github-actions github-actions Bot requested review from hyeongyun0916 and yankay June 9, 2026 11:34
@sagearc sagearc marked this pull request as draft June 9, 2026 11:38
sagearc and others added 3 commits June 9, 2026 14:52
Co-authored-by: Kapil Jain <kapiljain1989@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Co-authored-by: Kapil Jain <kapiljain1989@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Co-authored-by: Kapil Jain <kapiljain1989@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Comment thread examples/helper/events.go Outdated
Comment thread pkg/kvevents/zmq_subscriber_bench_test.go Outdated
Comment thread pkg/tokenization/pool_test.go Outdated
sagearc added 6 commits June 9, 2026 15:24
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
@github-actions github-actions Bot added size/L and removed size/XL labels Jun 9, 2026
@sagearc sagearc marked this pull request as ready for review June 9, 2026 13:48
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
@sagearc sagearc changed the title from "feat(kvcache): HMA-aware KV block scoring with window-aware SWA" to "feat(kvcache): HMA-aware KV block scoring with window-aware SWA (homogeneous block sizes)" Jun 9, 2026
@github-actions github-actions Bot added size/XL size/XXL and removed size/L size/XL labels Jun 11, 2026
@vMaroon vMaroon force-pushed the hma-scoring branch 2 times, most recently from 6a7f853 to 7a14102 June 12, 2026 11:18
@github-actions github-actions Bot added size/XL and removed size/XXL labels Jun 12, 2026
vMaroon commented Jun 13, 2026

Member

In its current state, this PR only helps pure SWA models - deferring to after release.

/hold

@github-actions github-actions Bot added the hold label Jun 13, 2026
sagearc and others added 2 commits June 15, 2026 12:54
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
…arity

Align HMA window-aware scoring with vLLM's cache-hit logic and drop the
homogeneous-block-size assumption:

- Convert sliding windows to request-key counts with the router's canonical
  block size (cdiv(window-1, canonicalBlockSize)) instead of the group's
  engine block size: the scan walks canonical request keys, so router units
  are the only correct units. The engine block size is metadata only.
- Scan same-window SWA groups jointly with AND-presence, mirroring vLLM's
  per-spec-group lookup (a miss in any group is a miss). A sequential
  per-group min could overstate hits unboundedly and was order-dependent.
- Iterate window classes to a fixed point for heterogeneous windows,
  mirroring vLLM's restart-on-shrink convergence; a single class (the
  common case) needs exactly one scan.
- Score SWA-only models (no main-attention group) through a unitary-path
  mirror gated on catalog topology; null-prefix blocks count at weight 1.0.
- Skip indexing masked (sparse) group stores whose token span exceeds the
  hash span (vLLM reachable_block_mask output); re-chunking them would
  fabricate presence. Metadata is still learned.
- Record phase-1 weights as prefix sums so phase-2 truncation is O(1).
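
The joint AND-presence scan and the fixed-point iteration described above could look roughly like the sketch below. It is an illustration under assumptions, not the code in this commit: `windowClass`, `scanClass`, and `converge` are hypothetical names, and presence is modeled as `[]bool` slices indexed by canonical request key.

```go
package main

import "fmt"

// windowClass groups all SWA groups that share one window size.
type windowClass struct {
	windowTokens int      // sliding window size in tokens
	groups       [][]bool // per-group presence, indexed by canonical request key
}

// cdiv is ceiling division for non-negative a and positive b.
func cdiv(a, b int) int { return (a + b - 1) / b }

// allPresent implements AND-presence: a miss in any group is a miss.
func allPresent(groups [][]bool, i int) bool {
	for _, g := range groups {
		if i >= len(g) || !g[i] {
			return false
		}
	}
	return true
}

// scanClass runs one right-to-left trailing-window scan over the first
// `limit` keys, converting the window to key counts with the router's
// canonical block size rather than any engine block size.
func scanClass(c windowClass, limit, canonicalBlockSize int) int {
	need := cdiv(c.windowTokens-1, canonicalBlockSize)
	run := 0
	for i := limit - 1; i >= 0; i-- {
		if !allPresent(c.groups, i) {
			run = 0
			continue
		}
		run++
		if run >= need {
			return i + run
		}
	}
	return run
}

// converge re-scans every class whenever any class shrinks the hit, and
// stops once a full pass leaves the hit unchanged. The hit strictly
// decreases on every change, so the loop terminates.
func converge(classes []windowClass, mainHit, canonicalBlockSize int) int {
	hit := mainHit
	for changed := true; changed && hit > 0; {
		changed = false
		for _, c := range classes {
			if h := scanClass(c, hit, canonicalBlockSize); h < hit {
				hit, changed = h, true
			}
		}
	}
	return hit
}

func main() {
	classes := []windowClass{
		{windowTokens: 33, groups: [][]bool{{true, true, false, true, true, true}}},
		{windowTokens: 17, groups: [][]bool{{true, true, true, true, false, false}}},
	}
	fmt.Println(converge(classes, 6, 16))
}
```

With these inputs a single pass would stop at 4 keys (the second class shrinks the hit after the first class has already accepted 6), while the re-check lets the first class shrink it further to 2. A single window class converges after one scan.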

Tests cover the canonical-vs-engine divisor, indexer wiring end to end,
many:1 grouped writes, the masked-store guard scope, joint-AND and
fixed-point semantics, SWA-only scoring, and a hot-path benchmark across
the legacy / warm-HMA / worst-case regimes.

Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
@github-actions github-actions Bot added size/XXL and removed size/XL labels Jun 16, 2026
@github-actions github-actions Bot added size/XL and removed size/XXL labels Jun 16, 2026
sagearc added 2 commits June 16, 2026 14:19
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
@sagearc sagearc changed the title from "feat(kvcache): HMA-aware KV block scoring with window-aware SWA (homogeneous block sizes)" to "feat(kvcache): HMA-aware KV block scoring with window-aware SWA" Jun 16, 2026
sagearc added 2 commits June 16, 2026 17:36
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
@yankay yankay requested a review from Copilot June 16, 2026 14:55

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates KV cache event handling and scoring to support hybrid attention (HMA) models by stamping engine-agnostic attention metadata onto indexed entries, adding sliding-window-aware scoring, and handling masked (sparse) grouped store events safely.

Changes:

  • Map vLLM KV cache spec kinds into an engine-agnostic AttentionKind, and stamp that metadata onto PodEntry for both store and remove events.
  • Add sliding-window-aware longest-prefix scoring (plus benchmarks/tests) and wire canonical block size from the token processor into the scorer.
  • Skip indexing for masked/sparse grouped store events while still learning group metadata.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| pkg/kvevents/pool.go | Stamps HMA metadata onto entries, learns group metadata, and skips masked group-store indexing. |
| pkg/kvevents/events.go | Documents KVCacheSpecKind semantics and adds kind classification helpers. |
| pkg/kvevents/pool_test.go | Updates/extends tests for HMA metadata, masked stores, and block-size mismatches. |
| pkg/kvcache/kvblock_scorer.go | Implements two-phase HMA-aware scoring using stamped entry metadata. |
| pkg/kvcache/indexer.go | Wires token processor block size into scorer for SWA token→block conversion. |
| pkg/kvcache/kvblock/index.go | Adds AttentionKind and stamped attention fields to PodEntry. |
| pkg/kvcache/kvblock/hma.go | Updates group metadata to store engine-agnostic attention info and makes catalog Get nil-safe. |
| pkg/kvcache/kvblock_scorer_hma_test.go | Adds unit tests covering HMA scoring behavior and indexer wiring. |
| pkg/kvcache/kvblock_scorer_bench_test.go | Adds benchmarks for scoring hot-path scenarios (legacy vs HMA). |

Comment thread pkg/kvevents/pool.go Outdated
Comment on lines +351 to +364
// Masked (sparse) group stores: vLLM's reachable_block_mask
// (mixed-page-size hybrids, retention-interval checkpointing)
// emits token_ids spanning the full block range while
// block_hashes covers only the kept tail blocks. Re-chunking
// those tokens would fabricate presence for spans the engine
// never cached, so skip indexing the event; group metadata
// above is still learned.
if len(ev.Tokens) > 0 && ev.BlockSize > 0 && len(ev.Tokens) != len(ev.BlockHashes)*ev.BlockSize {
	debugLogger.Info("skipping masked group store: token span != hash span",
		"podIdentifier", podIdentifier, "groupIdx", groupID,
		"numTokens", len(ev.Tokens), "numHashes", len(ev.BlockHashes),
		"blockSize", ev.BlockSize)
	continue
}
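
The guard condition in the hunk above can be isolated as a small predicate, which makes its edge cases easier to see. This is a sketch only: `isMaskedStore` is a hypothetical helper name, not a function in the PR, and it takes plain counts instead of the event struct.

```go
package main

import "fmt"

// isMaskedStore reports whether a store event's token span disagrees with
// its hash span (numHashes * blockSize). Events with no tokens or with an
// unknown block size are never treated as masked.
func isMaskedStore(numTokens, numHashes, blockSize int) bool {
	return numTokens > 0 && blockSize > 0 && numTokens != numHashes*blockSize
}

func main() {
	fmt.Println(isMaskedStore(64, 2, 16)) // tokens cover 4 blocks, hashes cover 2
	fmt.Println(isMaskedStore(32, 2, 16)) // spans agree
	fmt.Println(isMaskedStore(0, 2, 16))  // no tokens carried on the event
}
```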
Comment thread pkg/kvevents/pool.go Outdated
Comment on lines +359 to +362
debugLogger.Info("skipping masked group store: token span != hash span",
	"podIdentifier", podIdentifier, "groupIdx", groupID,
	"numTokens", len(ev.Tokens), "numHashes", len(ev.BlockHashes),
	"blockSize", ev.BlockSize)
Comment thread pkg/kvcache/kvblock_scorer.go Outdated
Comment on lines 61 to 66
// HMA group-aware scoring needs no extra wiring: the pool stamps each PodEntry
// with its own group's attention kind and window, and the scorer reads them off
// the entry. Set CanonicalBlockSize to enable the sliding-window reduction.
func NewKVBlockScorer(config *KVBlockScorerConfig) (*LongestPrefixScorer, error) {
	switch config.ScoringStrategy {
	case LongestPrefixMatch:
Comment on lines +201 to 231
// Phase 1: per-pod main-attention prefix, recording cumulative per-block
// weights so phase 2 can truncate to the converged hit without re-summing.
cumWeights := make(map[string][]float64)

// Scratch map reused across iterations to avoid per-key allocation.
curWeights := make(map[string]float64)
s.fillMainWeights(curWeights, keyToPods[keys[0]])

// Build weight index for the first key in a single pass over entries.
fillMaxWeights(curWeights, keyToPods[keys[0]], s.MediumWeights)

// activePods tracks pods still in the consecutive prefix chain.
// Using a plain map and in-place deletion avoids allocating new sets
// on every iteration.
activePods := make(map[string]struct{}, len(curWeights))
for pod, w := range curWeights {
	activePods[pod] = struct{}{}
	podScores[pod] = w
	cumWeights[pod] = []float64{w}
}

for i := 1; i < len(keys); i++ {
	if len(activePods) == 0 {
		break
	}

	// Reuse scratch map: clear and refill for current key.
	clear(curWeights)
	fillMaxWeights(curWeights, keyToPods[keys[i]], s.MediumWeights)
	s.fillMainWeights(curWeights, keyToPods[keys[i]])

	// In-place intersection: delete pods from activePods that are not
	// in the current key, and accumulate scores for those that remain.
	for pod := range activePods {
		if w, exists := curWeights[pod]; exists {
			podScores[pod] += w
			cum := cumWeights[pod]
			cumWeights[pod] = append(cum, cum[len(cum)-1]+w)
		} else {
			delete(activePods, pod)
		}
	}
}
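
The reason for recording cumulative weights in phase 1 is that truncating a pod's score to a shorter converged hit becomes a single slice lookup. The sketch below illustrates that idea in isolation; `prefixSums` and `truncatedScore` are hypothetical names, and how the PR's phase 2 actually consumes `cumWeights` is not shown in the hunk above.

```go
package main

import "fmt"

// prefixSums returns cum where cum[i] is the sum of weights[0..i].
func prefixSums(weights []float64) []float64 {
	cum := make([]float64, len(weights))
	total := 0.0
	for i, w := range weights {
		total += w
		cum[i] = total
	}
	return cum
}

// truncatedScore returns the score for the first `hit` blocks in O(1),
// clamping hit to the recorded prefix length.
func truncatedScore(cum []float64, hit int) float64 {
	if hit <= 0 || len(cum) == 0 {
		return 0
	}
	if hit > len(cum) {
		hit = len(cum)
	}
	return cum[hit-1]
}

func main() {
	// Per-block weights for one pod along its main-attention prefix.
	cum := prefixSums([]float64{1, 1, 0.5, 0.5})
	// Phase 2 converged on a 3-block hit, so the score is cum[2].
	fmt.Println(truncatedScore(cum, 3))
}
```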
Comment thread pkg/kvcache/kvblock_scorer.go Outdated
Comment on lines +108 to +127
// collectPodAttention builds the per-pod attention view from the entries at the
// scored keys.
func collectPodAttention(keyToPods map[kvblock.BlockHash][]kvblock.PodEntry) map[string]podAttention {
	meta := make(map[string]podAttention)
	for _, entries := range keyToPods {
		for _, e := range entries {
			m := meta[e.PodIdentifier]
			// Non-HMA entries (no group) and main-attention groups both anchor
			// the main-prefix path.
			if !e.HasGroup || e.AttentionKind == kvblock.AttentionMain {
				m.hasMain = true
			}
			if e.AttentionKind == kvblock.AttentionSlidingWindow && e.SlidingWindowSize > m.slidingWindowSize {
				m.slidingWindowSize = e.SlidingWindowSize
			}
			meta[e.PodIdentifier] = m
		}
	}
	return meta
}
Comment thread pkg/kvevents/pool_test.go Outdated
Comment on lines +690 to +693
meta, ok := pool.groupCatalog.Get("pod-hma", kvblock.GroupID(0))
require.True(t, ok)
assert.Equal(t, string(KVCacheSpecKindSlidingWindow), meta.Kind)
assert.Equal(t, 16, meta.BlockSize)
require.NotNil(t, meta.SlidingWindowSize)
assert.Equal(t, 128, *meta.SlidingWindowSize)
assert.Equal(t, kvblock.AttentionSlidingWindow, meta.Kind, "sliding-window group is not main attention")
assert.Equal(t, 128, meta.SlidingWindowSize)
sagearc added 2 commits June 16, 2026 18:12
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
@github-actions github-actions Bot added size/XXL and removed size/XL labels Jun 22, 2026
Labels

hold: PRs that are blocked on design, other features, release cycle, etc.
size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.


3 participants