Add blog post: HiCache L3 for Hybrid Attention Models #19
Covers the full HiCache L3 optimization stack for hybrid attention architectures: periodic SWA checkpoints in the Unified Radix Tree, window-aware LRU refresh, leaf-level lock pruning, C4/C128 compress-state offload, draft KV pool registration, and Mooncake group semantics for coherent multi-object eviction. Includes both English (index.md) and Chinese (index.zh.md) versions. 🤖 Generated with [Qoder](https://qoder.com)
Code Review
This pull request adds a new blog post in both English and Chinese detailing the HiCache L3 optimization stack for hybrid attention models like DeepSeek V4, covering SWA checkpoints, window-aware LRU refresh, and Mooncake group semantics. The review feedback correctly identifies minor capitalization inconsistencies where "Mooncake Store" was written as "MoonCake Store" in both language versions, providing suggestions to fix them.
| DeepSeek V3.2 introduced Native Sparse Attention (NSA) — a model-native sparsity pattern where the model itself decides at training time which tokens to attend to. The NSA indexer maintains a `index_k_with_scale_buffer` that records per-layer indexer state. | ||
| Early work on L3 offloading for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added MoonCake Store integration for DeepSeek V3.2, including indexer key serialization. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer only handles MHA K/V and MLA latent-K formats but does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch could cause silent data corruption and shape mismatch crashes. |
Inconsistent capitalization of "Mooncake". It should be "Mooncake Store" instead of "MoonCake Store" to match the rest of the post.
| Early work on L3 offloading for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added MoonCake Store integration for DeepSeek V3.2, including indexer key serialization. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer only handles MHA K/V and MLA latent-K formats but does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch could cause silent data corruption and shape mismatch crashes. | |
| Early work on L3 offloading for NSA models ([#18637](https://github.com/sgl-project/sglang/pull/18637)) added Mooncake Store integration for DeepSeek V3.2, including indexer key serialization. However, a safety concern emerged ([#20880](https://github.com/sgl-project/sglang/pull/20880)): the L3 storage layer only handles MHA K/V and MLA latent-K formats but does not generically serialize the NSA indexer cache. Without proper handling, L3 prefetch could cause silent data corruption and shape mismatch crashes. |
| DeepSeek V3.2 引入了原生稀疏注意力(NSA)——一种模型原生的稀疏模式,模型在训练时自行决定对哪些 token 做注意力。NSA 索引器维护一个 `index_k_with_scale_buffer`,记录逐层的索引器状态。 | ||
| 早期对 NSA 模型的 L3 卸载工作([#18637](https://github.com/sgl-project/sglang/pull/18637))为 DeepSeek V3.2 添加了 MoonCake Store 集成,包括索引器 key 的序列化。但出现了安全顾虑([#20880](https://github.com/sgl-project/sglang/pull/20880)):L3 存储层仅处理 MHA K/V 和 MLA latent-K 格式,不通用地序列化 NSA 索引器缓存。不正确处理时,L3 预取可能导致静默数据损坏和形状不匹配崩溃。 |
Inconsistent capitalization of "Mooncake". It should be "Mooncake Store" instead of "MoonCake Store" to match the rest of the post.
| 早期对 NSA 模型的 L3 卸载工作([#18637](https://github.com/sgl-project/sglang/pull/18637))为 DeepSeek V3.2 添加了 MoonCake Store 集成,包括索引器 key 的序列化。但出现了安全顾虑([#20880](https://github.com/sgl-project/sglang/pull/20880)):L3 存储层仅处理 MHA K/V 和 MLA latent-K 格式,不通用地序列化 NSA 索引器缓存。不正确处理时,L3 预取可能导致静默数据损坏和形状不匹配崩溃。 | |
| 早期对 NSA 模型的 L3 卸载工作([#18637](https://github.com/sgl-project/sglang/pull/18637))为 DeepSeek V3.2 添加了 Mooncake Store 集成,包括索引器 key 的序列化。但出现了安全顾虑([#20880](https://github.com/sgl-project/sglang/pull/20880)):L3 存储层仅处理 MHA K/V 和 MLA latent-K 格式,不通用地序列化 NSA 索引器缓存。不正确处理时,L3 预取可能导致静默数据损坏和形状不匹配崩溃。 |
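The capitalization issue flagged in both comments above could also be caught mechanically before review. The following is a minimal sketch, not part of this PR: the function name `find_miscapitalized` and the sample text are hypothetical, and it assumes "Mooncake" is the only accepted spelling in prose.

```python
import re

# Canonical spelling used in the rest of the post (per the review comments).
CANONICAL = "Mooncake"
# Match any casing of the word so variants such as "MoonCake" are caught.
PATTERN = re.compile(r"mooncake", re.IGNORECASE)


def find_miscapitalized(text: str) -> list[tuple[int, str]]:
    """Return (line_number, variant) pairs for spellings that differ from CANONICAL.

    Hypothetical helper for illustration; line numbers are 1-based.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            if match.group(0) != CANONICAL:
                hits.append((lineno, match.group(0)))
    return hits


# Made-up sample text echoing the two spellings seen in this PR.
sample = "added MoonCake Store integration\nMooncake group semantics\n"
print(find_miscapitalized(sample))  # prints [(1, 'MoonCake')]
```

Running the same check over the contents of both index.md and index.zh.md would cover the two language versions. Note that this sketch also flags lowercase occurrences inside URLs or code identifiers, so those would need to be excluded in practice.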
Summary
English (index.md) and Chinese (index.zh.md) versions included
Test plan
hugo --minify)