ElasticKV

Exploiting Attention Haze for Structure-Preserving KV-Cache Compression


ElasticKV is a training-free KV-cache sparsification method that removes Attention Haze — the high-density, low-magnitude activation regime that dominates LLM KV caches — via per-head adaptive thresholding with sink token protection. It operates as a post-attention CUDA hook with zero modifications to model weights or architecture.

*Figure: the ElasticKV pipeline, per-head adaptive sparsification.*

Operationally, we use attention haze as shorthand for low-magnitude, low-selectivity KV mass outside the sink region. For each KV head, ElasticKV defines a local threshold

$$\tau_h = \max\!\left(\frac{\sigma_h}{2},\; \mu_h\right) \times s$$

and suppresses the sub-threshold region while preserving sink tokens. The updated paper does not treat haze as a purely visual intuition: on real Qwen2.5-0.5B K caches at $s = 0.45$, the removed region contains 36.6% of non-sink elements but only 0.53% of the L2 energy, while the $> 2\tau_h$ tail contains 39.6% of elements and 98.0% of the L2 energy. Under the same $\tau_h$ frame, replayed H2O and ScissorHands evictions remove 46.3% and 75.2% of the available L2 energy, respectively, showing the difference between haze-selective sparsification and token eviction.
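
To make the element/energy split concrete, here is a minimal NumPy sketch of the $\tau_h$-band diagnostic: for each head it reports how many non-sink elements fall below the threshold and how much L2 energy that band carries. The array shape and names are illustrative assumptions, not the released analysis script (that lives under scripts/analysis/):

```python
import numpy as np

def tau_band_stats(k_cache, s=0.45, n_sink=64):
    """Per-head fraction of non-sink elements vs. L2 energy below tau_h.

    k_cache: assumed shape (n_kv_heads, seq_len, head_dim); illustrative only.
    """
    stats = []
    for h in range(k_cache.shape[0]):
        x = np.abs(k_cache[h, n_sink:, :]).ravel()  # non-sink magnitudes
        tau = max(x.std() / 2.0, x.mean()) * s      # tau_h = max(sigma_h/2, mu_h) * s
        below = x < tau
        stats.append({
            "tau": float(tau),
            "element_fraction": float(below.mean()),
            "l2_energy_fraction": float((x[below] ** 2).sum() / (x ** 2).sum()),
        })
    return stats
```

A haze-dominated head shows a large element_fraction with a tiny l2_energy_fraction, matching the 36.6% vs. 0.53% figures above.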

| Metric | Value |
| --- | --- |
| Compression | 1.48–1.57× |
| PPL degradation (GQA models) | < 0.31% |
| Needle-in-a-Haystack | 45/45 perfect |
| Decode overhead | 0.4% |
| EKV vs random ablation ratio | 3,400× |

What Is Released

This repository ships the ElasticKV hook, the paper source, the archived raw outputs behind the reported tables and figures, and the scripts used to generate those artifacts:

  • cuda_hook/ — The complete CUDA hook (568 lines) that implements ElasticKV as a patch for llama.cpp.
  • cuda_hook/experimental/ — The patched llama-context.cpp snapshot used for the supplementary local H2O and ScissorHands head-to-head runs.
  • scripts/ — All evaluation, analysis, and plotting scripts used to generate every table and figure in the paper.
  • results/ — Raw experimental data (JSON, logs) behind every claim.
  • results/supplementary/local_baselines/ — Same-run H2O, ScissorHands, and EKV local comparisons on GTX 1070, including JSON records, the real-KV τ_h-band analyses, and stderr logs.
  • results/supplementary/throughput_gtx1070_llama_bench/ — Local GTX 1070 llama-bench sweeps, raw outputs, and English summaries for evict_every sensitivity.
  • paper/ — LaTeX source, compiled PDF, and all figures.

The release is artifact-complete, not one-click end-to-end: rerunning the full paper sweep still requires the matching GGUF models and a local llama.cpp checkout at the tested commit. The supplementary source snapshot under cuda_hook/experimental/ is a redistributed derivative of llama-context.cpp; its provenance note and a copy of the upstream MIT license are shipped alongside that file.

What Is Not Included

  • Sparse-aware attention backends (FlashAttention/PagedAttention integration)
  • Production inference server integration
  • Quantization extensions

These are directions for future work discussed in the paper.

What We Mean By Attention Haze

This repository uses attention haze as a mechanistic term, not as a purely illustrative label.

  • High-density: many KV entries fall below the per-head threshold.
  • Low-energy: that dense sub-threshold region contributes little L1/L2 mass relative to its element count.
  • Low-selectivity: removing it sharpens attention slightly, while matched random zeroing destroys attention structure.
  • Not a token-eviction synonym: the H2O/ScissorHands supplementary replay shows that token eviction removes much more high-energy cache structure than EKV under the same $|K|/\tau_h$ diagnostic frame.

The paper therefore uses three evidence tiers together: an illustrative toy figure, controlled entropy/selectivity simulations, and real-KV diagnostics released in this repository.
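
The low-selectivity property can be illustrated with a toy simulation: zero the sub-threshold band, zero the same number of elements at random, and compare the resulting attention entropy. This is a hedged sketch on assumed Gaussian toy data, not the paper's controlled study (see scripts/analysis/simulate_attention_haze.py for that):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=64)                        # one toy query vector
K = rng.normal(size=(512, 64))                 # toy key cache (seq_len, head_dim)

def attention_entropy(keys):
    logits = keys @ q / np.sqrt(q.size)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

mag = np.abs(K)
tau = max(mag.std() / 2.0, mag.mean()) * 0.45  # per-head threshold at s = 0.45
haze = mag < tau                               # dense, low-magnitude band

K_ekv = np.where(haze, 0.0, K)                 # threshold-based zeroing
K_rand = K.copy()                              # matched random zeroing: same
idx = rng.permutation(K.size)[: int(haze.sum())]  # element count, random positions
K_rand.flat[idx] = 0.0

print("baseline :", attention_entropy(K))
print("EKV-style:", attention_entropy(K_ekv))   # typically stays close to baseline
print("random   :", attention_entropy(K_rand))  # typically perturbed far more
```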

*Figure: attention haze, the diffuse low-magnitude activations removed by ElasticKV.*

Concurrent work (Mao et al., 2026) identifies Q/K concentration in pre-RoPE space, showing that attention follows structured distance preferences. This offers a mechanistic perspective on Attention Haze: positions outside these preferred regions accumulate diffuse, low-magnitude KV activations — precisely the regime targeted by ElasticKV's per-head adaptive threshold.


Quick Start

Requirements

| Component | Version |
| --- | --- |
| NVIDIA GPU | Any CUDA-capable (tested: RTX 5090, GTX 1070) |
| CUDA Toolkit | ≥ 11.8 |
| llama.cpp | commit b5390 (March 2026) |
| Python | ≥ 3.10 (for experiment scripts only) |
| OS | Linux (tested on Ubuntu 22.04 WSL2) |

1. Clone and install Python dependencies

git clone https://github.com/infolake/elastickv.git
cd elastickv
pip install -r requirements.txt

2. Build llama.cpp with ElasticKV

# Clone llama.cpp at the tested commit
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b5390

# Apply the ElasticKV patch in-place
bash ../cuda_hook/patch_llamacpp.sh "$(pwd)"

# Build with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
cd ..

3. Run ElasticKV

# Enable ElasticKV with sweet-spot configuration
export ELASTICKV=1
export ELASTICKV_SCALE_K=0.45
export ELASTICKV_SCALE_V=0.50
export ELASTICKV_SINK=64

# Run perplexity evaluation
./llama.cpp/build/bin/llama-perplexity \
    -m /path/to/model.gguf \
    -f /path/to/wikitext-2-raw/wiki.test.raw \
    -b 2048

Configuration

ElasticKV is controlled entirely via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| ELASTICKV | 0 | Enable hook (set to 1) |
| ELASTICKV_SCALE_K | 0.35 | Key threshold scale |
| ELASTICKV_SCALE_V | 0.17 | Value threshold scale |
| ELASTICKV_SINK | 64 | Sink tokens to protect |
| ELASTICKV_EVICT_EVERY | 64 | Sparsification interval (tokens) |
| ELASTICKV_WARM | 128 | Warmup tokens before first sparsification |
| ELASTICKV_HEAD_DIM | auto | KV head dimension (auto-detected) |
| ELASTICKV_VERBOSE | 0 | Print diagnostics to stderr |

Preset configurations:

| Name | SCALE_K | SCALE_V | Use case |
| --- | --- | --- | --- |
| Sweet spot | 0.45 | 0.50 | Minimal quality impact (< 0.31% PPL) |
| Aggressive | 0.50 | 0.50 | Maximum compression (1.57×) |
| K-only | 0.35 | 0.00 | Conservative, keys only |
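
For scripted runs, a small launcher can apply a preset by setting the environment variables above before invoking llama.cpp. This is an illustrative sketch; the binary and file paths are placeholders to adapt to your checkout:

```python
import os
import subprocess

# Preset scales from the table above.
PRESETS = {
    "sweet":      {"ELASTICKV_SCALE_K": "0.45", "ELASTICKV_SCALE_V": "0.50"},
    "aggressive": {"ELASTICKV_SCALE_K": "0.50", "ELASTICKV_SCALE_V": "0.50"},
    "k_only":     {"ELASTICKV_SCALE_K": "0.35", "ELASTICKV_SCALE_V": "0.00"},
}

def run_ppl(preset, model_path, corpus_path):
    env = os.environ.copy()
    env.update({"ELASTICKV": "1", "ELASTICKV_SINK": "64"})
    env.update(PRESETS[preset])
    subprocess.run(
        ["./llama.cpp/build/bin/llama-perplexity",
         "-m", model_path, "-f", corpus_path, "-b", "2048"],
        env=env, check=True,
    )

run_ppl("sweet", "/path/to/model.gguf", "data/wiki.test.raw")
```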

Reproducing Paper Results

This repository contains the released raw outputs behind every table and figure in the paper, plus the scripts used to regenerate the individual analyses. The full paper sweep is not a single turnkey command: most scripts still require explicit model paths and a local llama.cpp checkout at commit b5390. See results/MANIFEST.md for the complete mapping (13 tables, 8 figures, 11 inline claims, plus supplementary local baseline artifacts).

The supplementary H2O/ScissorHands head-to-head released with the paper is archived under results/supplementary/local_baselines/. The local GTX 1070 llama-bench sweeps used to study frequency sensitivity are archived under results/supplementary/throughput_gtx1070_llama_bench/. Some archived JSON and log artifacts intentionally preserve the original runner/model paths as sanitized placeholders such as <models>/... and <llama.cpp>/..., rather than workstation-specific absolute paths.

The local-baseline package also includes the real-KV attention_haze_tau_bands_qwen05b.json and attention_h2o_scissor_evictions_qwen05b.json artifacts, with reproduction scripts under scripts/analysis/. These two files are the cleanest empirical anchor for the updated haze claim: the first shows that EKV's sub-threshold region is dense but low-energy, and the second shows that matched H2O/ScissorHands evictions remove substantially more high-energy structure under the same diagnostic frame.

Key experiments

| Paper artifact | Script | Output |
| --- | --- | --- |
| Table 1 (Threshold ablation) | scripts/analysis/f4_threshold_ablation.py | results/f4_thresh_ablation_results.json |
| Table 2 (PPL, 8 models) | scripts/eval/run_ppl.sh | results/results_all.json |
| Table 3 (NIAH 45/45) | scripts/eval/f5_niah.py | results/logs/final_tests_full.log |
| Table 4 (Long context) | scripts/eval/run_ppl.sh | results/long_context_results.json |
| Table 5 (Sink ablation) | scripts/eval/run_ppl.sh | results/logs/final_tests_full.log |
| Table 6 (Throughput) | scripts/eval/throughput_bench.sh | results/throughput/bench_*.md |
| Table 7 (LongBench-v2) | scripts/eval/longbench_v2_eval.py | results/longbench/ |
| Table 8 (Per-domain LB) | scripts/eval/longbench_v2_eval.py | results/longbench/qwen25_seed*/ |
| Table 9 (PPL vs accuracy) | scripts/analysis/f1_ppl_analysis.py | results/ |
| Table 10 (Entropy sim) | scripts/analysis/simulate_attention_haze.py | results/attention_entropy_v2_results.json |
| Table 11 (EKV vs random) | scripts/analysis/f4_threshold_ablation.py | results/ablation_random_vs_ekv_results.json |
| Table 12 (Paired chunks) | scripts/analysis/f1_deep_analysis.py | results/ppl/ |
| Table 13 (SPR/ASI) | scripts/analysis/f2_spr_real_kv.py | results/f2_spr_real_kv_results.json |
| Fig. 1 (Sparsity curve) | scripts/plot/gen_fig65_sparsity_curve.py | paper/figures/fig65_sparsity_curve.pdf |
| Fig. 2 (Architecture) | scripts/plot/generate_paper_figures.py | paper/figures/fig0_architecture.pdf |
| Fig. 3 (KV regression) | scripts/analysis/f1_ppl_analysis.py | paper/figures/f1_kv_heads_vs_ppl.png |
| Fig. 4 (Three-seed) | scripts/plot/generate_paper_figures.py | paper/figures/fig25_three_seed.pdf |
| Fig. 5 (KV head depend.) | scripts/plot/generate_paper_figures.py | paper/figures/fig35_kv_head_dependence.pdf |
| Fig. 6 (Attention haze) | scripts/plot/generate_paper_figures.py | paper/figures/fig15_attention_haze.pdf |
| Fig. 7 (Entropy vs spar.) | scripts/plot/generate_paper_figures.py | paper/figures/fig45_entropy_vs_sparsity.pdf |
| Fig. 8 (Per-domain Δ) | scripts/plot/generate_paper_figures.py | paper/figures/fig55_precision_recall.pdf |
| Compression proof | scripts/analysis/f7_compression_proof.py | results/f7_compression_proof.json |
| SPR real KV | scripts/analysis/f2_spr_real_kv.py | results/f2_spr_real_kv_results.json |

Example: reproduce PPL results

# Download WikiText-2
python scripts/data/download_wikitext.py data/wiki.test.raw

# Run baseline + sweet + aggressive for one model
bash scripts/eval/run_ppl.sh \
  --llama-cpp-dir ./llama.cpp \
  --model /path/to/model.gguf \
  --corpus ./data/wiki.test.raw

Example: reproduce NIAH

python scripts/eval/f5_niah.py \
  --llamacpp ./llama.cpp/build/bin/llama-cli \
  --model /path/to/model.gguf \
  --model-name "My Model" \
  --ctx 4096 \
  --template chatml

Example: reproduce ablation (EKV vs random zeroing)

python scripts/analysis/f4_threshold_ablation.py

Models Tested

| Model | Parameters | KV heads | GQA | head_dim | GPU |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-0.5B | 0.5B | 2 | 14:1 | 64 | GTX 1070 |
| Qwen2.5-1.5B | 1.5B | 2 | 6:1 | 128 | GTX 1070 |
| TinyLlama-1.1B | 1.1B | 4 | 8:1 | 64 | GTX 1070 |
| Phi-2 | 2.78B | 32 | 1:1 (MHA) | 80 | GTX 1070 |
| Qwen3-4B | 4B | 8 | 4:1 | 128 | RTX 5090 |
| Mistral-7B | 7.24B | 8 | 4:1 | 128 | RTX 5090 |
| Qwen2.5-7B | 7.62B | 4 | 7:1 | 128 | RTX 5090 |
| Llama-3-8B | 8B | 8 | 4:1 | 128 | RTX 5090 |
| Llama-3.1-8B | 8B | 8 | 4:1 | 128 | GTX 1070 |

Primary models evaluated as FP16/BF16; supplementary (†F1) and GTX 1070 models as Q3_K_M / Q4_K_M GGUF quantizations.


Method Overview

ElasticKV computes per-head adaptive thresholds based on magnitude statistics:

$$\tau_h = \max\!\left(\frac{\sigma_h}{2},\; \mu_h\right) \times s$$

where $\mu_h$ and $\sigma_h$ are the mean and standard deviation of absolute values for KV head $h$ (excluding sink positions), and $s$ is the user-specified scale factor. Elements below $\tau_h$ are zeroed in-place. The first $n_\text{sink}$ positions are never modified.

Key design choices:

  • Per-head (not per-layer): prevents cross-head magnitude bias at long contexts
  • Sink protection: preserves attention anchors (first 64 tokens by default)
  • K/V asymmetry: separate scales exploit that keys tolerate ~2× more aggressive sparsification than values
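
For intuition, the update can be sketched in a few lines of NumPy; the shipped implementation is the in-place CUDA hook under cuda_hook/, and the shapes and names below are illustrative assumptions:

```python
import numpy as np

def elastickv_step(cache, scale, n_sink=64):
    """Zero sub-threshold entries per KV head, never touching sink positions.

    cache: assumed shape (n_kv_heads, seq_len, head_dim), modified in place.
    scale: the user scale s (0.45 for K and 0.50 for V in the sweet spot).
    """
    for h in range(cache.shape[0]):
        view = cache[h, n_sink:, :]                     # sink tokens stay untouched
        mag = np.abs(view)                              # stats exclude sink positions
        tau = max(mag.std() / 2.0, mag.mean()) * scale  # tau_h = max(sigma_h/2, mu_h) * s
        view[mag < tau] = 0.0                           # suppress the haze band in place

# K and V use separate scales, reflecting the K/V asymmetry above:
# elastickv_step(k_cache, scale=0.45); elastickv_step(v_cache, scale=0.50)
```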

For full details, see the paper.

*Figure: EKV sparsity–quality trade-off.*


Reproducibility Scope

This release targets artifact-backed reproducibility of the paper. The raw outputs behind the claims are bundled here, while full reruns still depend on matching model files and a compatible llama.cpp checkout. The following constraints apply:

  • Hardware: Results were obtained on RTX 5090 (32 GB) and GTX 1070 (8 GB). Different GPUs may produce numerically identical PPL values (the hook is deterministic), but latency measurements will differ.
  • llama.cpp version: Commit b5390 (March 2026). Other versions may work but are not tested.
  • GGUF models: Use the model families and quantizations listed in this README together with the archived result files. A Zenodo archival release will add a consolidated provenance table for the exact model artifacts.
  • Seeds: LongBench-v2 experiments use seeds 42, 123, 456. PPL evaluation is deterministic (no seed dependence).

Citation

Zenodo concept DOI for all versions: 10.5281/zenodo.19503351

Machine-readable archival metadata is included in CITATION.cff and .zenodo.json.

@misc{camargo2026elastickv,
  title        = {{ElasticKV}: Exploiting Attention Haze for Structure-Preserving {KV}-Cache Compression},
  author       = {Camargo, Guilherme de},
  year         = {2026},
  howpublished = {Zenodo software release},
  note         = {Concept DOI for the ElasticKV software record},
  version      = {v0.1.1},
  doi          = {10.5281/zenodo.19503351},
  url          = {https://doi.org/10.5281/zenodo.19503351}
}

Support

If you find this work useful, consider supporting:

👉 GitHub Sponsors

For questions about the project: camargo@phiq.io


License

Most of this repository is licensed under the Apache License 2.0 — see LICENSE. The redistributed llama.cpp-derived snapshot under cuda_hook/experimental/ also carries the upstream MIT license copy in cuda_hook/experimental/LICENSE.llama.cpp.
