feat(kvevents): reference-count duplicate KV-event removals #680
Merged
github-actions[bot] merged 5 commits on Jun 25, 2026
vLLM's OffloadingConnector publishes self-describing block-granular KV events. In chunk mode, overlapping offloaded chunks legitimately re-announce the same constituent block hash, so a hash is Stored and Removed more than once on the wire. BlockRemoved was forwarded straight to index.Evict per hash, so the first such duplicate remove evicted a block a sibling chunk still referenced (premature eviction that under-credits the lower tier in routing).

Add an eventDedupFilter in pkg/kvevents that mirrors the wire stream: every BlockStored increments a per-(pod, device tier, KV-cache group) reference count, a BlockRemoved is forwarded to the index only once that count returns to zero, unknown removes pass through defensively, and AllBlocksCleared resets a pod. This matches the dimensions of the PodEntry identity an eviction targets, so gpu/cpu copies and distinct groups of a hash are counted independently.

The scope's data-parallel rank is a sentinel on current main because the index identity is pod-level and does not distinguish ranks, so counts aggregate across ranks; a future DP-aware index (llm-d#370) can wire the real rank with no change to the filter.

No changes to EventBatch or the engine adapters.

Signed-off-by: Change72 <changg@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Pull request overview
This PR adds a small reference-counting de-duplication layer to KV event processing so that duplicate BlockRemoved events (legitimately produced by vLLM chunk-mode offloading for overlapping chunks) do not prematurely evict blocks that are still referenced by another outstanding announcement.
Changes:
- Introduces an `eventDedupFilter` that reference-counts `(pod, deviceTier, groupIdx, dpRank-sentinel, blockHash)` across `BlockStored`/`BlockRemoved`.
- Wires the filter into `Pool.processEventBatch`: stores increment after successful index add; removes are filtered so eviction is forwarded only when the refcount reaches zero; `AllBlocksCleared` resets counts for the pod.
- Adds focused unit + pool-level integration tests covering duplicate store/remove behavior, tier/group isolation, defensive unknown removes, and clear/reset behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/kvevents/pool.go | Normalizes device tier consistently, tracks stores/removes through the new dedup filter, and clears filter state on AllBlocksCleared. |
| pkg/kvevents/event_dedup_filter.go | Implements the reference-counting filter keyed by pod/tier/group/dp-rank-sentinel/hash with defensive pass-through behavior. |
| pkg/kvevents/event_dedup_filter_test.go | Adds unit + end-to-end tests validating duplicate-remove suppression, isolation across tiers/groups, and clear/reset behavior. |
Signed-off-by: Change72 <changg@nvidia.com>
Address review feedback on the KV-event dedup filter:
- Add kvcache_kvevents_dedup_removed_hashes_{suppressed,forwarded}_total
counters (block-hash granularity, not event count) so suppressed removals are
observable, following the pkg/kvcache/metrics global-collector pattern; emit a
TRACE log when removals are suppressed.
- Document why the filter lives in the Pool rather than as an Index decorator.
- Tighten the concurrency doc: the mutex guards the cross-pod map and same-pod
events are serialized by Pool.AddTask sharding.
- Move the pool-level tests into pool_test.go and add a negative test that a
device-tier update resolving no keys is not reference-counted.
Signed-off-by: Change72 <changg@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The test relocation in the previous commit dropped TestPool_AllBlocksClearedResetsDedup instead of moving it into pool_test.go. Restore it next to TestAllBlocksCleared_Dispatch: it is the only regression guard that the AllBlocksCleared handler resets the dedup refcount (Dispatch only proves Index.Clear ran), so a post-clear store/remove cycle is not wrongly suppressed.
Signed-off-by: Change72 <changg@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…guard
Address second-round review feedback:
- Add DedupRemovedHashes{Suppressed,Forwarded} to TestCollectorsIncludesAllMetrics
so the registration-completeness guard fails if either counter is ever dropped
from Collectors() (previously only the 9 pre-existing collectors were guarded).
- Add TestPool_DedupMetricsCountBlockHashes verifying the counter arithmetic at
block-hash granularity: two duplicate stores then two removes record 4
suppressed (first remove) and 4 forwarded (second). Values are read via the
dto.Metric Write pattern already used by metrics.logMetrics, so no new
dependency is introduced.
- Document that the noGroupIdx (-1) sentinel cannot collide with a real group
index, since the engine adapters reject a negative group_idx.
Signed-off-by: Change72 <changg@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
Thanks, this looks consistent with Dynamo’s dedup behavior and the current llm-d index semantics. The tier/group/offload coverage looks good to me.
/lgtm
Member
/approve
What
Adds an `eventDedupFilter` in `pkg/kvevents` that reference-counts `BlockStored`/`BlockRemoved` announcements so a duplicate remove does not evict a block that another announcement still references.
Why
vLLM's `OffloadingConnector` publishes self-describing, block-granular KV events. In chunk mode (`block_size` > engine block size), when a shared prefix is not aligned to the offloaded chunk size, two overlapping chunks legitimately list the same constituent block hash, so the same hash is `Stored` and `Removed` more than once on the wire.

Today `BlockRemoved` is forwarded straight to `index.Evict` per hash, so the first of these duplicate removes evicts a block the sibling chunk still references: a premature eviction that under-credits the lower tier in routing. There is currently no de-duplication anywhere in `pkg/kvevents`. This mirrors the role of Dynamo's `EventDedupFilter` (ai-dynamo/dynamo#8012), which llm-d has no equivalent for.
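To make the failure mode concrete, here is a minimal sketch of naive forwarding, where every remove is applied directly. The `event` type and `replayNaive` function are hypothetical illustrations, not code from `pkg/kvevents`, and the index is reduced to a set of resident hashes.

```go
package main

import "fmt"

// event is a hypothetical, simplified stand-in for a wire KV event.
type event struct {
	kind   string // "stored" or "removed"
	hashes []uint64
}

// replayNaive applies every remove directly to the resident set,
// which is how the description characterizes the pre-fix path.
func replayNaive(events []event) map[uint64]bool {
	resident := make(map[uint64]bool)
	for _, ev := range events {
		for _, h := range ev.hashes {
			switch ev.kind {
			case "stored":
				resident[h] = true
			case "removed":
				delete(resident, h)
			}
		}
	}
	return resident
}

func main() {
	// Chunk A covers hashes {1, 2}; chunk B covers {2, 3}. Hash 2 is shared.
	events := []event{
		{kind: "stored", hashes: []uint64{1, 2}},
		{kind: "stored", hashes: []uint64{2, 3}},
		{kind: "removed", hashes: []uint64{1, 2}}, // chunk A leaves; chunk B still holds hash 2
	}
	resident := replayNaive(events)
	fmt.Println(resident[2], resident[3]) // prints "false true": hash 2 was evicted too early
}
```

Hash 2 disappears even though chunk B, which also contains it, has not been removed.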
How
A small filter that mirrors the wire event stream (not index state):
- `BlockStored` increments the count for its hashes (duplicates included), after the store has been applied to the index;
- `BlockRemoved` is forwarded to `index.Evict` only once its count returns to zero; removes for never-seen hashes pass through defensively (an unknown evict is a no-op) and a count never goes negative;
- `AllBlocksCleared` resets a pod's counts in lockstep with the index's pod-wide clear.

The reference-count scope is `(podIdentifier, deviceTier, groupIdx)`, matching the dimensions of the `PodEntry` identity that an eviction actually targets, so `gpu`/`cpu` copies and distinct KV-cache groups of the same hash are counted independently.
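The three rules and the scope above can be sketched roughly as follows. This is an illustrative reduction, not the code in `event_dedup_filter.go`: the `scope` and `dedupFilter` types and the method names `stored`, `removed`, and `clearPod` are assumptions made for the example, and block hashes are simplified to `uint64`.

```go
package main

import (
	"fmt"
	"sync"
)

// scope is a hypothetical key mirroring (podIdentifier, deviceTier, groupIdx).
type scope struct {
	pod      string
	tier     string
	groupIdx int
}

// dedupFilter counts outstanding BlockStored announcements per (scope, hash).
type dedupFilter struct {
	mu     sync.Mutex
	counts map[scope]map[uint64]int
}

func newDedupFilter() *dedupFilter {
	return &dedupFilter{counts: make(map[scope]map[uint64]int)}
}

// stored records one announcement per hash, duplicates included.
func (f *dedupFilter) stored(s scope, hashes []uint64) {
	f.mu.Lock()
	defer f.mu.Unlock()
	m := f.counts[s]
	if m == nil {
		m = make(map[uint64]int)
		f.counts[s] = m
	}
	for _, h := range hashes {
		m[h]++
	}
}

// removed returns the hashes whose eviction should be forwarded to the index.
func (f *dedupFilter) removed(s scope, hashes []uint64) []uint64 {
	f.mu.Lock()
	defer f.mu.Unlock()
	m := f.counts[s] // may be nil; reading a nil map is safe
	var forward []uint64
	for _, h := range hashes {
		n, known := m[h]
		switch {
		case !known: // never-seen hash: pass through defensively
			forward = append(forward, h)
		case n <= 1: // last outstanding reference: forward and forget
			delete(m, h)
			forward = append(forward, h)
		default: // a sibling announcement still references the block
			m[h] = n - 1
		}
	}
	return forward
}

// clearPod drops every scope that belongs to pod.
func (f *dedupFilter) clearPod(pod string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for s := range f.counts {
		if s.pod == pod {
			delete(f.counts, s)
		}
	}
}

func main() {
	f := newDedupFilter()
	gpu := scope{pod: "pod-a", tier: "gpu", groupIdx: 0}
	f.stored(gpu, []uint64{1, 2})
	f.stored(gpu, []uint64{2, 3})               // overlapping chunk re-announces hash 2
	fmt.Println(f.removed(gpu, []uint64{1, 2})) // prints "[1]": hash 2 is still referenced
	fmt.Println(f.removed(gpu, []uint64{2, 3})) // prints "[2 3]"
}
```

Because the scope is part of the map key, a `cpu` copy of the same hash on the same pod has its own count and is unaffected by `gpu` removes.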
Data-parallel rank
The scope carries a `dataParallelRank` field that is fixed to a sentinel on `main`, because the current index identity (`PodEntry`) is pod-level and does not distinguish data-parallel ranks, so reference counts correctly aggregate across ranks (a block is still resident on the pod until every rank has removed it). This is intentionally not a change to DP semantics. Once a DP-aware index lands (e.g. #370, which adds `DataParallelRank` to `PodEntry` and `EventBatch`), the rank can be sourced from the event at the call site so the dedup scope becomes DP-aware in lockstep, with no change to the filter itself (a `TODO(#370)` marks the single line).
Scope of this change
- No changes to `EventBatch` or the engine adapters; DP-rank propagation is left to #370 (Add DP-aware routing support to KVEvents and indexing pipeline), which owns it.
remove forwards exactly as before).
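As a rough illustration of the data-parallel sentinel and of why the filter itself would not need to change, the scope key might be built at the call site like this. The names `dedupScope`, `scopeFor`, and `noDataParallelRank` (and its value) are assumptions for the sketch; only `noGroupIdx` (-1) is named in this PR.

```go
package main

import "fmt"

const (
	noGroupIdx         = -1 // sentinel named in this PR
	noDataParallelRank = -1 // hypothetical name and value for the rank sentinel
)

// dedupScope is a hypothetical comparable key, so it can index a Go map directly.
type dedupScope struct {
	podIdentifier    string
	deviceTier       string
	groupIdx         int
	dataParallelRank int
}

// scopeFor builds the key at the call site. The event's rank is ignored for now,
// so every rank of a pod shares one set of reference counts.
func scopeFor(pod, tier string, groupIdx, eventRank int) dedupScope {
	_ = eventRank // a DP-aware index would pass this through instead of the sentinel
	return dedupScope{
		podIdentifier:    pod,
		deviceTier:       tier,
		groupIdx:         groupIdx,
		dataParallelRank: noDataParallelRank,
	}
}

func main() {
	counts := map[dedupScope]map[uint64]int{}
	for _, rank := range []int{0, 1} {
		s := scopeFor("pod-a", "gpu", noGroupIdx, rank)
		if counts[s] == nil {
			counts[s] = map[uint64]int{}
		}
		counts[s][42]++ // both ranks announce hash 42
	}
	// One scope, count 2: the block stays until both ranks have removed it.
	fmt.Println(len(counts), counts[scopeFor("pod-a", "gpu", noGroupIdx, 0)][42]) // prints "1 2"
}
```

Making the scope DP-aware would then be a one-line change inside `scopeFor`, with the map and counting logic untouched.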
Observability
Because a suppressed `BlockRemoved` is otherwise invisible, two counters are added to `pkg/kvcache/metrics` following the existing global-collector pattern (counts are at block-hash granularity, not `BlockRemoved` events):
- `kvcache_kvevents_dedup_removed_hashes_suppressed_total`
- `kvcache_kvevents_dedup_removed_hashes_forwarded_total`

A `TRACE` log is also emitted when removals are suppressed. A gauge for the outstanding per-pod reference-count map size is a planned follow-up.
Testing
New tests in `pkg/kvevents/event_dedup_filter_test.go`:
- aggregation across sources; `gpu`/`cpu` tier independence; KV-cache group independence; defensive unknown-remove pass-through; partial forward; clear resets and isolates pods; nil-safe.
- Pool-level (`pool_test.go`, `processEventBatch` + real `InMemoryIndex`): duplicate store survives the first remove and is evicted by the second; CPU/offload duplicate removal through the empty-token device-tier path (the GPU copy stays untouched); a device-tier update that resolves no keys is not reference-counted; `AllBlocksCleared` resets the filter.
AI assistance
Developed with AI assistance. The submitter has reviewed every changed line, ran
the tests, and owns the change.