Exploiting Attention Haze for Structure-Preserving KV-Cache Compression
ElasticKV is a training-free KV-cache sparsification method that removes Attention Haze — the high-density, low-magnitude activation regime that dominates LLM KV caches — via per-head adaptive thresholding with sink token protection. It operates as a post-attention CUDA hook with zero modifications to model weights or architecture.
Operationally, we use attention haze as shorthand for low-magnitude, low-selectivity KV mass outside the sink region. For each KV head, ElasticKV defines a local threshold τ_h and suppresses the sub-threshold region while preserving sink tokens. The updated paper does not treat haze as only a visual intuition: on real Qwen2.5-0.5B K caches, under the τ_h-band diagnostic frame, replayed H2O and ScissorHands evictions remove 46.3% and 75.2% of the available L2 mass, respectively, illustrating the difference between haze-selective sparsification and token eviction.
| Metric | Value |
|---|---|
| Compression | 1.48–1.57× |
| PPL degradation (GQA models) | < 0.31% |
| Needle-in-a-Haystack | 45/45 perfect |
| Decode overhead | 0.4% |
| EKV vs random ablation ratio | 3,400× |
This repository ships the ElasticKV hook, the paper source, the archived raw outputs behind the reported tables and figures, and the scripts used to generate those artifacts:
- `cuda_hook/` — The complete CUDA hook (568 lines) that implements ElasticKV as a patch for llama.cpp.
- `cuda_hook/experimental/` — The patched `llama-context.cpp` snapshot used for the supplementary local H2O and ScissorHands head-to-head runs.
- `scripts/` — All evaluation, analysis, and plotting scripts used to generate every table and figure in the paper.
- `results/` — Raw experimental data (JSON, logs) behind every claim.
- `results/supplementary/local_baselines/` — Same-run H2O, ScissorHands, and EKV local comparisons on GTX 1070, including JSON records, the real-KV τ_h-band analyses, and stderr logs.
- `results/supplementary/throughput_gtx1070_llama_bench/` — Local GTX 1070 llama-bench sweeps, raw outputs, and English summaries for `evict_every` sensitivity.
- `paper/` — LaTeX source, compiled PDF, and all figures.
The release is artifact-complete, not one-click end-to-end: rerunning the
full paper sweep still requires the matching GGUF models and a local
llama.cpp checkout at the tested commit.
The supplementary source snapshot under cuda_hook/experimental/ is a
redistributed derivative of llama-context.cpp; its provenance note and a
copy of the upstream MIT license are shipped alongside that file.
The following are directions for future work discussed in the paper:
- Sparse-aware attention backends (FlashAttention/PagedAttention integration)
- Production inference server integration
- Quantization extensions
This repository uses attention haze as a mechanistic term, not as a purely illustrative label.
- High-density: many KV entries fall below the per-head threshold.
- Low-energy: that dense sub-threshold region contributes little L1/L2 mass relative to its element count.
- Low-selectivity: removing it sharpens attention slightly, while matched random zeroing destroys attention structure.
- Not a token-eviction synonym: the H2O/ScissorHands supplementary replay shows that token eviction removes much more high-energy cache structure than EKV under the same |K|/τ_h diagnostic frame.
The paper therefore uses three evidence tiers together: an illustrative toy figure, controlled entropy/selectivity simulations, and real-KV diagnostics released in this repository.
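The high-density/low-energy property above can be checked directly on a dumped K cache. Below is a minimal NumPy sketch of a τ_h-band-style diagnostic; the tensor layout and the mean-|K| threshold form are illustrative assumptions, and the released scripts under `scripts/analysis/` implement the actual analysis.

```python
import numpy as np

def tau_band_diagnostic(K, scale=0.45, sink=64):
    """Per-head fraction of sub-threshold entries vs. the L2 mass they carry.

    K: array of shape [num_kv_heads, seq_len, head_dim] (a dumped K cache).
    The threshold form (scale * mean |K_h|) is an illustrative assumption;
    the exact statistic is defined in the paper.
    """
    stats = []
    for h, Kh in enumerate(K):
        tau_h = scale * np.abs(Kh).mean()        # per-head magnitude threshold
        mask = np.abs(Kh) < tau_h                 # sub-threshold ("haze") entries
        mask[:sink] = False                       # sink tokens are excluded
        density = mask.mean()                     # how many entries fall below tau_h
        energy = np.sum((Kh * mask) ** 2) / np.sum(Kh ** 2)  # L2 mass they carry
        stats.append((h, density, energy))
    return stats

# Example on synthetic Gaussian data: the sub-threshold band is dense in
# element count but carries only a small fraction of the L2 mass.
K = np.random.standard_normal((2, 1024, 64)).astype(np.float32)
for h, density, energy in tau_band_diagnostic(K):
    print(f"head {h}: {density:.1%} of entries below tau_h, {energy:.1%} of L2 mass")
```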
Concurrent work (Mao et al., 2026) identifies Q/K concentration in pre-RoPE space, showing that attention follows structured distance preferences. This offers a mechanistic perspective on Attention Haze: positions outside these preferred regions accumulate diffuse, low-magnitude KV activations — precisely the regime targeted by ElasticKV's per-head adaptive threshold.
| Component | Version |
|---|---|
| NVIDIA GPU | Any CUDA-capable (tested: RTX 5090, GTX 1070) |
| CUDA Toolkit | ≥ 11.8 |
| llama.cpp | commit b5390 (March 2026) |
| Python | ≥ 3.10 (for experiment scripts only) |
| OS | Linux (tested on Ubuntu 22.04 WSL2) |
```bash
git clone https://github.com/infolake/elastickv.git
cd elastickv
pip install -r requirements.txt
```

```bash
# Clone llama.cpp at the tested commit
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b5390

# Apply the ElasticKV patch in-place
bash ../cuda_hook/patch_llamacpp.sh "$(pwd)"

# Build with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
cd ..
```

```bash
# Enable ElasticKV with sweet-spot configuration
export ELASTICKV=1
export ELASTICKV_SCALE_K=0.45
export ELASTICKV_SCALE_V=0.50
export ELASTICKV_SINK=64

# Run perplexity evaluation
./llama.cpp/build/bin/llama-perplexity \
  -m /path/to/model.gguf \
  -f /path/to/wikitext-2-raw/wiki.test.raw \
  -b 2048
```

ElasticKV is controlled entirely via environment variables:
| Variable | Default | Description |
|---|---|---|
| `ELASTICKV` | `0` | Enable hook (set to `1`) |
| `ELASTICKV_SCALE_K` | `0.35` | Key threshold scale |
| `ELASTICKV_SCALE_V` | `0.17` | Value threshold scale |
| `ELASTICKV_SINK` | `64` | Sink tokens to protect |
| `ELASTICKV_EVICT_EVERY` | `64` | Sparsification interval (tokens) |
| `ELASTICKV_WARM` | `128` | Warmup tokens before first sparsification |
| `ELASTICKV_HEAD_DIM` | `auto` | KV head dimension (auto-detected) |
| `ELASTICKV_VERBOSE` | `0` | Print diagnostics to stderr |
Preset configurations:
| Name | `SCALE_K` | `SCALE_V` | Use case |
|---|---|---|---|
| Sweet spot | 0.45 | 0.50 | Minimal quality impact (< 0.31% PPL) |
| Aggressive | 0.50 | 0.50 | Maximum compression (1.57×) |
| K-only | 0.35 | 0.00 | Conservative, keys only |
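For scripted sweeps, the presets can be applied per run through the environment. A minimal Python sketch using `subprocess`, assuming the llama.cpp build and placeholder paths from the quick-start above:

```python
import os
import subprocess

# Preset values from the table above; "sweet" matches the quick-start example.
PRESETS = {
    "sweet":      {"ELASTICKV_SCALE_K": "0.45", "ELASTICKV_SCALE_V": "0.50"},
    "aggressive": {"ELASTICKV_SCALE_K": "0.50", "ELASTICKV_SCALE_V": "0.50"},
    "k_only":     {"ELASTICKV_SCALE_K": "0.35", "ELASTICKV_SCALE_V": "0.00"},
}

def run_ppl(preset, model_path, corpus_path):
    """Run llama-perplexity with one ElasticKV preset (paths are placeholders)."""
    env = os.environ.copy()
    env.update({"ELASTICKV": "1", "ELASTICKV_SINK": "64", **PRESETS[preset]})
    subprocess.run(
        ["./llama.cpp/build/bin/llama-perplexity",
         "-m", model_path, "-f", corpus_path, "-b", "2048"],
        env=env, check=True,
    )

run_ppl("aggressive", "/path/to/model.gguf", "./data/wiki.test.raw")
```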
This repository contains the released raw outputs behind every table and
figure in the paper, plus the scripts used to regenerate the individual
analyses. The full paper sweep is not a single turnkey command: most scripts
still require explicit model paths and a local llama.cpp checkout at commit
b5390. See results/MANIFEST.md for the complete
mapping (13 tables, 8 figures, 11 inline claims, plus supplementary local
baseline artifacts).
The supplementary H2O/ScissorHands head-to-head released with the paper is
archived under results/supplementary/local_baselines/. The local GTX 1070
llama-bench sweeps used to study frequency sensitivity are archived under
results/supplementary/throughput_gtx1070_llama_bench/.
Some archived JSON and log artifacts intentionally preserve the original
runner/model paths as sanitized placeholders such as <models>/... and
<llama.cpp>/..., rather than workstation-specific absolute paths.
That same local-baseline package now also includes the real-KV
attention_haze_tau_bands_qwen05b.json and
attention_h2o_scissor_evictions_qwen05b.json artifacts, with reproduction
scripts under scripts/analysis/. These two files are the cleanest empirical
anchor for the updated haze claim: the first shows EKV's sub-threshold region
is dense but low-energy, and the second shows that matched H2O/ScissorHands
evictions remove substantially more high-energy structure under the same
diagnostic frame.
| Paper artifact | Script | Output |
|---|---|---|
| Table 1 (Threshold ablation) | `scripts/analysis/f4_threshold_ablation.py` | `results/f4_thresh_ablation_results.json` |
| Table 2 (PPL, 8 models) | `scripts/eval/run_ppl.sh` | `results/results_all.json` |
| Table 3 (NIAH 45/45) | `scripts/eval/f5_niah.py` | `results/logs/final_tests_full.log` |
| Table 4 (Long context) | `scripts/eval/run_ppl.sh` | `results/long_context_results.json` |
| Table 5 (Sink ablation) | `scripts/eval/run_ppl.sh` | `results/logs/final_tests_full.log` |
| Table 6 (Throughput) | `scripts/eval/throughput_bench.sh` | `results/throughput/bench_*.md` |
| Table 7 (LongBench-v2) | `scripts/eval/longbench_v2_eval.py` | `results/longbench/` |
| Table 8 (Per-domain LB) | `scripts/eval/longbench_v2_eval.py` | `results/longbench/qwen25_seed*/` |
| Table 9 (PPL vs accuracy) | `scripts/analysis/f1_ppl_analysis.py` | `results/` |
| Table 10 (Entropy sim) | `scripts/analysis/simulate_attention_haze.py` | `results/attention_entropy_v2_results.json` |
| Table 11 (EKV vs random) | `scripts/analysis/f4_threshold_ablation.py` | `results/ablation_random_vs_ekv_results.json` |
| Table 12 (Paired chunks) | `scripts/analysis/f1_deep_analysis.py` | `results/ppl/` |
| Table 13 (SPR/ASI) | `scripts/analysis/f2_spr_real_kv.py` | `results/f2_spr_real_kv_results.json` |
| Fig. 1 (Sparsity curve) | `scripts/plot/gen_fig65_sparsity_curve.py` | `paper/figures/fig65_sparsity_curve.pdf` |
| Fig. 2 (Architecture) | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig0_architecture.pdf` |
| Fig. 3 (KV regression) | `scripts/analysis/f1_ppl_analysis.py` | `paper/figures/f1_kv_heads_vs_ppl.png` |
| Fig. 4 (Three-seed) | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig25_three_seed.pdf` |
| Fig. 5 (KV head depend.) | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig35_kv_head_dependence.pdf` |
| Fig. 6 (Attention haze) | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig15_attention_haze.pdf` |
| Fig. 7 (Entropy vs spar.) | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig45_entropy_vs_sparsity.pdf` |
| Fig. 8 (Per-domain Δ) | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig55_precision_recall.pdf` |
| Compression proof | `scripts/analysis/f7_compression_proof.py` | `results/f7_compression_proof.json` |
| SPR real KV | `scripts/analysis/f2_spr_real_kv.py` | `results/f2_spr_real_kv_results.json` |
```bash
# Download WikiText-2
python scripts/data/download_wikitext.py data/wiki.test.raw

# Run baseline + sweet + aggressive for one model
bash scripts/eval/run_ppl.sh \
  --llama-cpp-dir ./llama.cpp \
  --model /path/to/model.gguf \
  --corpus ./data/wiki.test.raw
```

```bash
python scripts/eval/f5_niah.py \
  --llamacpp ./llama.cpp/build/bin/llama-cli \
  --model /path/to/model.gguf \
  --model-name "My Model" \
  --ctx 4096 \
  --template chatml
```

```bash
python scripts/analysis/f4_threshold_ablation.py
```

| Model | Parameters | KV Heads | GQA | head_dim | GPU |
|---|---|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | 2 | 14:1 | 64 | GTX 1070 |
| Qwen2.5-1.5B | 1.5B | 2 | 6:1 | 128 | GTX 1070 |
| TinyLlama-1.1B | 1.1B | 4 | 8:1 | 64 | GTX 1070 |
| Phi-2 | 2.78B | 32 | 1:1 (MHA) | 80 | GTX 1070 |
| Qwen3-4B | 4B | 8 | 4:1 | 128 | RTX 5090 |
| Mistral-7B | 7.24B | 8 | 4:1 | 128 | RTX 5090 |
| Qwen2.5-7B | 7.62B | 4 | 7:1 | 128 | RTX 5090 |
| Llama-3-8B | 8B | 8 | 4:1 | 128 | RTX 5090 |
| Llama-3.1-8B | 8B | 8 | 4:1 | 128 | GTX 1070 |
Primary models evaluated as FP16/BF16; supplementary (†F1) and GTX 1070 models as Q3_K_M / Q4_K_M GGUF quantizations.
ElasticKV computes a per-head adaptive threshold from the magnitude statistics of each head's cached keys and values, scaled by `ELASTICKV_SCALE_K` and `ELASTICKV_SCALE_V` respectively (a rough sketch follows the design-choice list below; the exact threshold definition is given in the paper).
Key design choices:
- Per-head (not per-layer): prevents cross-head magnitude bias at long contexts
- Sink protection: preserves attention anchors (first 64 tokens by default)
- K/V asymmetry: separate scales exploit that keys tolerate ~2× more aggressive sparsification than values
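As a NumPy-level illustration of these choices (not the actual 568-line CUDA hook, and with an assumed mean-|·| threshold statistic):

```python
import numpy as np

def sparsify_kv_head(X, scale, sink=64):
    """Zero sub-threshold entries of one KV head's cache in place.

    X: [seq_len, head_dim] slice of the K or V cache for a single head.
    scale: ELASTICKV_SCALE_K or ELASTICKV_SCALE_V.
    The threshold form (scale * mean |X|) is an illustrative assumption;
    the exact statistic is defined in the paper.
    """
    tau_h = scale * np.abs(X).mean()   # per-head, not per-layer
    mask = np.abs(X) < tau_h
    mask[:sink] = False                # sink tokens are always preserved
    X[mask] = 0.0                      # suppress the sub-threshold ("haze") region
    return X

# Example: sparsify a random K-head slice with the sweet-spot key scale.
K_head = np.random.standard_normal((1024, 64)).astype(np.float32)
sparsify_kv_head(K_head, scale=0.45)
```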
For full details, see the paper.
This release targets artifact-backed reproducibility of the paper. The
raw outputs behind the claims are bundled here, while full reruns still depend
on matching model files and a compatible llama.cpp checkout. The following
constraints apply:
- Hardware: Results were obtained on RTX 5090 (32 GB) and GTX 1070 (8 GB). Different GPUs may produce numerically identical PPL values (the hook is deterministic), but latency measurements will differ.
- llama.cpp version: Commit `b5390` (March 2026). Other versions may work but are not tested.
- GGUF models: Use the model families and quantizations listed in this README together with the archived result files. A Zenodo archival release will add a consolidated provenance table for the exact model artifacts.
- Seeds: LongBench-v2 experiments use seeds 42, 123, 456. PPL evaluation is deterministic (no seed dependence).
Zenodo concept DOI for all versions: 10.5281/zenodo.19503351
Machine-readable archival metadata is included in CITATION.cff and .zenodo.json.
```bibtex
@misc{camargo2026elastickv,
  title = {{ElasticKV}: Exploiting Attention Haze for Structure-Preserving {KV}-Cache Compression},
  author = {Camargo, Guilherme de},
  year = {2026},
  howpublished = {Zenodo software release},
  note = {Concept DOI for the ElasticKV software record},
  version = {v0.1.1},
  doi = {10.5281/zenodo.19503351},
  url = {https://doi.org/10.5281/zenodo.19503351}
}
```

If you find this work useful, consider supporting the project.
For questions about the project: camargo@phiq.io
Most of this repository is licensed under the Apache License 2.0 — see
LICENSE. The redistributed llama.cpp-derived snapshot under
cuda_hook/experimental/ also carries the upstream MIT license copy in
cuda_hook/experimental/LICENSE.llama.cpp.


