Add blog post: HiCache L3 for Hybrid Attention Models #19

Open
stmatengss wants to merge 1 commit into main from blog/hicache-l3-hybrid-attention

Conversation

@stmatengss

Collaborator

Summary

  • New blog post covering the full HiCache L3 optimization stack for hybrid attention architectures (MLA + SWA)
  • Topics: periodic SWA checkpoints, window-aware LRU refresh, eviction correctness fixes, C4/C128 compress-state offload, NSA indexer serialization frontier, draft KV pool registration, and Mooncake group semantics
  • Includes benchmark data: 18 GB → 4 GB per million tokens, agent cache hit rate 8% → 94%, TTFT 163 s → 5.68 s
  • Both English (index.md) and Chinese (index.zh.md) versions included

Test plan

  • Verify Hugo builds cleanly with the new post (`hugo --minify`)
  • Check frontmatter renders correctly on blog index
  • Confirm both EN and ZH versions display properly
  • Check that the PR links in the post body resolve correctly
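
For the link check in the test plan, a local pre-check could look something like the sketch below. It is only an illustration, not part of this PR: the function name `check_pr_links` is hypothetical, and it assumes the post's PR links follow the `[#NNNNN](https://github.com/sgl-project/sglang/pull/NNNNN)` shape seen in the quoted excerpt. It does not hit the network, so it can catch a label that disagrees with its URL but cannot confirm that the page actually resolves.

```python
import re

# Matches inline Markdown links of the form [text](url).
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")
# Assumed shape of an sglang pull-request URL, as used in the post body.
PR_URL_RE = re.compile(r"^https://github\.com/sgl-project/sglang/pull/(\d+)$")


def check_pr_links(markdown: str) -> list[str]:
    """Return a list of problems found in PR-style links (empty means OK)."""
    problems = []
    for text, url in LINK_RE.findall(markdown):
        if not text.startswith("#"):
            continue  # only check links labelled like "#18637"
        match = PR_URL_RE.match(url)
        if match is None:
            problems.append(f"{text}: unexpected URL {url}")
        elif match.group(1) != text[1:]:
            problems.append(f"{text}: label does not match PR number in {url}")
    return problems
```

Running it over the contents of `index.md` and `index.zh.md` and expecting an empty list would cover the mechanical part of that item; opening the links by hand is still needed to confirm they resolve.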

🤖 Generated with [Qoder](https://qoder.com)

Covers the full HiCache L3 optimization stack for hybrid attention
architectures: periodic SWA checkpoints in the Unified Radix Tree,
window-aware LRU refresh, leaf-level lock pruning, C4/C128
compress-state offload, draft KV pool registration, and Mooncake
group semantics for coherent multi-object eviction.

Includes both English (index.md) and Chinese (index.zh.md) versions.

🤖 Generated with [Qoder](https://qoder.com)

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds a new blog post in both English and Chinese detailing the HiCache L3 optimization stack for hybrid attention models like DeepSeek V4, covering SWA checkpoints, window-aware LRU refresh, and Mooncake group semantics. The review feedback correctly identifies minor capitalization inconsistencies where "Mooncake Store" was written as "MoonCake Store" in both language versions, providing suggestions to fix them.


DeepSeek V3.2 introduced Native Sparse Attention (NSA) — a model-native sparsity pattern where the model itself decides at training time which tokens to attend to. The NSA indexer maintains a `index_k_with_scale_buffer` that records per-layer indexer state.

Early work on L3 offloading for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added MoonCake Store integration for DeepSeek V3.2, including indexer key serialization. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer only handles MHA K/V and MLA latent-K formats but does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch could cause silent data corruption and shape mismatch crashes.

medium

Inconsistent capitalization of "Mooncake". It should be "Mooncake Store" instead of "MoonCake Store" to match the rest of the post.

Suggested change (original line, then replacement):
Early work on L3 offloading for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added MoonCake Store integration for DeepSeek V3.2, including indexer key serialization. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer only handles MHA K/V and MLA latent-K formats but does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch could cause silent data corruption and shape mismatch crashes.
Early work on L3 offloading for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added Mooncake Store integration for DeepSeek V3.2, including indexer key serialization. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer only handles MHA K/V and MLA latent-K formats but does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch could cause silent data corruption and shape mismatch crashes.


DeepSeek V3.2 引入了原生稀疏注意力(NSA)——一种模型原生的稀疏模式,模型在训练时自行决定对哪些 token 做注意力。NSA 索引器维护一个 `index_k_with_scale_buffer`,记录逐层的索引器状态。

早期对 NSA 模型的 L3 卸载工作([#18637](https://github.com/sgl-project/sglang/pull/18637))为 DeepSeek V3.2 添加了 MoonCake Store 集成,包括索引器 key 的序列化。但出现了安全顾虑([#20880](https://github.com/sgl-project/sglang/pull/20880)):L3 存储层仅处理 MHA K/V 和 MLA latent-K 格式,不通用地序列化 NSA 索引器缓存。不正确处理时,L3 预取可能导致静默数据损坏和形状不匹配崩溃。

medium

Inconsistent capitalization of "Mooncake". It should be "Mooncake Store" instead of "MoonCake Store" to match the rest of the post.

Suggested change (original line, then replacement):
早期对 NSA 模型的 L3 卸载工作([#18637](https://github.com/sgl-project/sglang/pull/18637))为 DeepSeek V3.2 添加了 MoonCake Store 集成,包括索引器 key 的序列化。但出现了安全顾虑([#20880](https://github.com/sgl-project/sglang/pull/20880)):L3 存储层仅处理 MHA K/V 和 MLA latent-K 格式,不通用地序列化 NSA 索引器缓存。不正确处理时,L3 预取可能导致静默数据损坏和形状不匹配崩溃。
早期对 NSA 模型的 L3 卸载工作([#18637](https://github.com/sgl-project/sglang/pull/18637))为 DeepSeek V3.2 添加了 Mooncake Store 集成,包括索引器 key 的序列化。但出现了安全顾虑([#20880](https://github.com/sgl-project/sglang/pull/20880)):L3 存储层仅处理 MHA K/V 和 MLA latent-K 格式,不通用地序列化 NSA 索引器缓存。不正确处理时,L3 预取可能导致静默数据损坏和形状不匹配崩溃。
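
Both comments flag the same spelling issue, so a quick scan of the two post files may be easier than fixing occurrences one at a time. The following is a minimal sketch, assuming "Mooncake" is the canonical spelling as the review states; the helper name `find_inconsistent_spellings` is hypothetical and not part of this PR.

```python
import re

CANONICAL = "Mooncake"
# Case-insensitive match, so "MoonCake", "mooncake", etc. are all found.
VARIANT_RE = re.compile(r"mooncake", re.IGNORECASE)


def find_inconsistent_spellings(text: str) -> list[tuple[int, str]]:
    """Return (line_number, spelling) for each occurrence that differs from CANONICAL."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in VARIANT_RE.finditer(line):
            if match.group(0) != CANONICAL:
                hits.append((lineno, match.group(0)))
    return hits
```

Note that this also reports lowercase `mooncake` inside URLs or code identifiers, where the lowercase form may be intentional, so the hits need a manual look before changing anything.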
