Official website source for KVCache.ai — the home of open-source projects and research on KV cache management and LLM serving optimization.
KVCache.ai advances the state of the art in Large Language Model (LLM) inference optimization. In decoder-only Transformer models, data from diverse modalities can ultimately be transformed into KV cache, making it a central component of modern LLM serving systems. As a result, KV cache has become a key focus for improving inference efficiency through techniques such as caching, scheduling, compression, offloading, and disaggregated serving architectures.
Through open-source projects and academic research, KVCache.ai develops effective, practical, and high-performance solutions for KV cache management and LLM serving optimization. The goal is to make LLM deployment more accessible, efficient, and cost-effective for organizations of all sizes.
Projects:
- Mooncake — A KV cache-centric disaggregated architecture for LLM serving.
- KTransformers — A CPU/GPU heterogeneous LLM inference and fine-tuning framework for running and tuning 100B+ models on accessible workstation hardware.
- TrEnv-X — An open-source runtime platform designed for AI Agent applications.

Tools:
- KV Cache Size Calculator — Estimate KV cache capacity for common production LLM families, including DeepSeek, GLM, Kimi, Qwen, MiniMax, MiMo, and others.
- KV Cache Hit Rate Simulator — Calculate the KV cache hit rate for a preset trace or your own trace under different memory budgets.
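To give a feel for what the calculator and simulator estimate, here is a minimal Python sketch. It is not the code behind the tools on the site. It assumes a standard MHA/GQA model that stores one key and one value vector per layer and KV head, and it assumes an LRU eviction policy over fixed-size cache blocks. The function names, the model dimensions, and the toy trace are all hypothetical. Architectures that compress the cache, such as multi-head latent attention, need a different size formula.

```python
from collections import OrderedDict


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache that one token occupies (assumes plain MHA/GQA).

    The leading factor 2 counts the key and the value stored per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def lru_hit_rate(trace, capacity_blocks: int) -> float:
    """Replay a trace of block IDs through an LRU cache; return the hit fraction."""
    cache = OrderedDict()  # insertion order doubles as recency order
    hits = 0
    for block_id in trace:
        if block_id in cache:
            hits += 1
            cache.move_to_end(block_id)    # mark as most recently used
        elif capacity_blocks > 0:
            if len(cache) >= capacity_blocks:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block_id] = True
    return hits / len(trace) if trace else 0.0


# Hypothetical GQA configuration: 32 layers, 8 KV heads, head_dim 128, FP16.
per_token = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Toy trace of block IDs replayed under two memory budgets (in blocks).
trace = list("abacba")
print(lru_hit_rate(trace, capacity_blocks=2))
print(lru_hit_rate(trace, capacity_blocks=3))  # 0.5
```

A larger budget never hurts under LRU, so sweeping `capacity_blocks` over a real trace gives a monotone hit-rate curve that can be read against the per-token size to pick a memory budget.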
This site is built with:
- Hugo (v0.126.3) — static site generator
- Hugo Blox — theme and page builder (Tailwind CSS)
- Pagefind — static search (used in production builds on Netlify)
Content is written in Markdown with YAML front matter. Custom layouts and shortcodes live under layouts/.

Links:
- Website: https://kvcache.ai
- GitHub Organization: https://github.com/kvcache-ai
- X (Twitter): @KVCache_AI