Add blog post: HiCache L3 for Hybrid Attention Models by stmatengss · Pull Request #19 · kvcache-ai/kvcache-blog

stmatengss · 2026-06-20T05:49:47Z

Summary

New blog post covering the full HiCache L3 optimization stack for hybrid attention architectures (MLA + SWA)
Topics: periodic SWA checkpoints, window-aware LRU refresh, eviction correctness fixes, C4/C128 compress-state offload, NSA indexer serialization frontier, draft KV pool registration, and Mooncake group semantics
Includes benchmark data: 18 GB → 4 GB per million tokens, agent cache hit 8% → 94%, TTFT 163s → 5.68s
Both English (index.md) and Chinese (index.zh.md) versions included

Test plan

Verify Hugo builds cleanly with the new post (hugo --minify)
Check frontmatter renders correctly on blog index
Confirm both EN and ZH versions display properly
Review PR links in post body resolve correctly
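
The link-review step can be partly automated. The sketch below is a hypothetical helper, not part of this PR or of the blog's tooling. It assumes PR links in the post use the `[#N](https://github.com/<org>/<repo>/pull/N)` form seen in the diff, and it only flags links whose label number disagrees with the number in the URL. It does not confirm that a URL actually resolves.

```python
import re

# Assumed link shape: [#123](https://github.com/<org>/<repo>/pull/123)
PR_LINK = re.compile(
    r"\[#(\d+)\]\((https://github\.com/[\w.-]+/[\w.-]+/pull/(\d+))\)"
)


def mismatched_pr_links(markdown: str) -> list[tuple[str, str]]:
    """Return (label, url) pairs whose label number differs from the URL's PR number."""
    return [
        (f"#{label}", url)
        for label, url, target in PR_LINK.findall(markdown)
        if label != target
    ]
```

Running it over the contents of `index.md` and `index.zh.md` and getting an empty list means every label matches its target; a manual click-through is still needed to confirm the pages exist.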

🤖 Generated with [Qoder](https://qoder.com)

Covers the full HiCache L3 optimization stack for hybrid attention architectures: periodic SWA checkpoints in the Unified Radix Tree, window-aware LRU refresh, leaf-level lock pruning, C4/C128 compress-state offload, draft KV pool registration, and Mooncake group semantics for coherent multi-object eviction. Includes both English (index.md) and Chinese (index.zh.md) versions. 🤖 Generated with [Qoder](https://qoder.com)

gemini-code-assist

Code Review

This pull request adds a new blog post in both English and Chinese detailing the HiCache L3 optimization stack for hybrid attention models like DeepSeek V4, covering SWA checkpoints, window-aware LRU refresh, and Mooncake group semantics. The review feedback correctly identifies minor capitalization inconsistencies where "Mooncake Store" was written as "MoonCake Store" in both language versions, providing suggestions to fix them.


gemini-code-assist · 2026-06-20T05:50:31Z

+
+DeepSeek V3.2 introduced Native Sparse Attention (NSA) — a model-native sparsity pattern where the model itself decides at training time which tokens to attend to. The NSA indexer maintains a `index_k_with_scale_buffer` that records per-layer indexer state.
+
+Early work on L3 offloading for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added MoonCake Store integration for DeepSeek V3.2, including indexer key serialization. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer only handles MHA K/V and MLA latent-K formats but does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch could cause silent data corruption and shape mismatch crashes.


Inconsistent capitalization of "Mooncake". It should be "Mooncake Store" instead of "MoonCake Store" to match the rest of the post.

Suggested change

-Early work on L3 offloading for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added MoonCake Store integration for DeepSeek V3.2, including indexer key serialization. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer only handles MHA K/V and MLA latent-K formats but does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch could cause silent data corruption and shape mismatch crashes.

+Early work on L3 offloading for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added Mooncake Store integration for DeepSeek V3.2, including indexer key serialization. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer only handles MHA K/V and MLA latent-K formats but does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch could cause silent data corruption and shape mismatch crashes.

gemini-code-assist · 2026-06-20T05:50:32Z

+
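+DeepSeek V3.2 introduced Native Sparse Attention (NSA), a model-native sparsity pattern in which the model itself decides at training time which tokens to attend to. The NSA indexer maintains an `index_k_with_scale_buffer` that records per-layer indexer state.
+
+Early L3 offloading work for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added MoonCake Store integration for DeepSeek V3.2, including serialization of indexer keys. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer handles only the MHA K/V and MLA latent-K formats and does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch can cause silent data corruption and shape-mismatch crashes.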


Inconsistent capitalization of "Mooncake". It should be "Mooncake Store" instead of "MoonCake Store" to match the rest of the post.

Suggested change

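-Early L3 offloading work for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added MoonCake Store integration for DeepSeek V3.2, including serialization of indexer keys. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer handles only the MHA K/V and MLA latent-K formats and does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch can cause silent data corruption and shape-mismatch crashes.

+Early L3 offloading work for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added Mooncake Store integration for DeepSeek V3.2, including serialization of indexer keys. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer handles only the MHA K/V and MLA latent-K formats and does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch can cause silent data corruption and shape-mismatch crashes.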
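
Both comments flag the same spelling, so a quick scan of the two post files can catch any remaining occurrences. The following is a minimal sketch under stated assumptions: the function name is hypothetical, "Mooncake" is taken as the canonical spelling per the review, and every other casing is reported, including all-lowercase forms that may be legitimate inside URLs or code identifiers and therefore need a manual look.

```python
import re

CANONICAL = "Mooncake"  # spelling the review asks for
ANY_CASING = re.compile(r"mooncake", re.IGNORECASE)


def miscapitalized(text: str) -> list[tuple[int, str]]:
    """Return (line_number, spelling) for each occurrence that differs from CANONICAL."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ANY_CASING.finditer(line):
            if match.group(0) != CANONICAL:
                hits.append((lineno, match.group(0)))
    return hits
```

Calling `miscapitalized` on the text of `index.md` and `index.zh.md` lists the line numbers to revisit; an empty result means only the canonical spelling is present.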

stmatengss wants to merge 1 commit into main from blog/hicache-l3-hybrid-attention
