ElasticKV

Exploiting Attention Haze for Structure-Preserving KV-Cache Compression


ElasticKV is a training-free KV-cache sparsification method that removes Attention Haze — the high-density, low-magnitude activation regime that dominates LLM KV caches — via per-head adaptive thresholding with sink token protection. It operates as a post-attention CUDA hook with zero modifications to model weights or architecture.

*Figure: the ElasticKV pipeline, per-head adaptive sparsification.*

Operationally, we use attention haze as shorthand for low-magnitude, low-selectivity KV mass outside the sink region. For each KV head, ElasticKV defines a local threshold

$$\tau_h = \max\!\left(\frac{\sigma_h}{2},\; \mu_h\right) \times s$$

and suppresses the sub-threshold region while preserving sink tokens. The updated paper does not treat haze as a purely visual intuition: on real Qwen2.5-0.5B K caches at $s = 0.45$, the removed region contains 36.6% of non-sink elements but only 0.53% of the L2 energy, while the $> 2\tau_h$ tail contains 39.6% of elements and 98.0% of the L2 energy. Under the same $\tau_h$ frame, replayed H2O and ScissorHands evictions remove 46.3% and 75.2% of the available L2 energy, respectively, showing the difference between haze-selective sparsification and token eviction.
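
To make the element/energy split concrete, here is a minimal NumPy sketch of the $\tau_h$-band diagnostic: for each head it reports how many non-sink elements fall below the threshold and how much L2 energy that band carries. The array shape and names are illustrative assumptions, not the released analysis script (that lives under scripts/analysis/):

```python
import numpy as np

def tau_band_stats(k_cache, s=0.45, n_sink=64):
    """Per-head fraction of non-sink elements vs. L2 energy below tau_h.

    k_cache: assumed shape (n_kv_heads, seq_len, head_dim); illustrative only.
    """
    stats = []
    for h in range(k_cache.shape[0]):
        x = np.abs(k_cache[h, n_sink:, :]).ravel()  # non-sink magnitudes
        tau = max(x.std() / 2.0, x.mean()) * s      # tau_h = max(sigma_h/2, mu_h) * s
        below = x < tau
        stats.append({
            "tau": float(tau),
            "element_fraction": float(below.mean()),
            "l2_energy_fraction": float((x[below] ** 2).sum() / (x ** 2).sum()),
        })
    return stats
```

A haze-dominated head shows a large element_fraction with a tiny l2_energy_fraction, matching the 36.6% vs. 0.53% figures above.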

| Metric | Value |
| --- | --- |
| Compression | 1.48–1.57× |
| PPL degradation (GQA models) | < 0.31% |
| Needle-in-a-Haystack | 45/45 perfect |
| Decode overhead | 0.4% |
| EKV vs random ablation ratio | 3,400× |

What Is Released

This repository ships the ElasticKV hook, the paper source, the archived raw outputs behind the reported tables and figures, and the scripts used to generate those artifacts:

  • cuda_hook/ — The complete CUDA hook (568 lines) that implements ElasticKV as a patch for llama.cpp.
  • cuda_hook/experimental/ — The patched llama-context.cpp snapshot used for the supplementary local H2O and ScissorHands head-to-head runs.
  • scripts/ — All evaluation, analysis, and plotting scripts used to generate every table and figure in the paper.
  • results/ — Raw experimental data (JSON, logs) behind every claim.
  • results/supplementary/local_baselines/ — Same-run H2O, ScissorHands, and EKV local comparisons on GTX 1070, including JSON records, the real-KV τ_h-band analyses, and stderr logs.
  • results/supplementary/throughput_gtx1070_llama_bench/ — Local GTX 1070 llama-bench sweeps, raw outputs, and English summaries for evict_every sensitivity.
  • paper/ — LaTeX source, compiled PDF, and all figures.

The release is artifact-complete, not one-click end-to-end: rerunning the full paper sweep still requires the matching GGUF models and a local llama.cpp checkout at the tested commit. The supplementary source snapshot under cuda_hook/experimental/ is a redistributed derivative of llama-context.cpp; its provenance note and a copy of the upstream MIT license are shipped alongside that file.

What Is Not Included

  • Sparse-aware attention backends (FlashAttention/PagedAttention integration)
  • Production inference server integration
  • Quantization extensions

These are directions for future work discussed in the paper.

What We Mean By Attention Haze

This repository uses attention haze as a mechanistic term, not as a purely illustrative label.

  • High-density: many KV entries fall below the per-head threshold.
  • Low-energy: that dense sub-threshold region contributes little L1/L2 mass relative to its element count.
  • Low-selectivity: removing it sharpens attention slightly, while matched random zeroing destroys attention structure.
  • Not a token-eviction synonym: the H2O/ScissorHands supplementary replay shows that token eviction removes much more high-energy cache structure than EKV under the same $|K|/\tau_h$ diagnostic frame.

The paper therefore uses three evidence tiers together: an illustrative toy figure, controlled entropy/selectivity simulations, and real-KV diagnostics released in this repository.
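
The low-selectivity property can be illustrated with a toy simulation: zero the sub-threshold band, zero the same number of elements at random, and compare the resulting attention entropy. This is a hedged sketch on assumed Gaussian toy data, not the paper's controlled study (see scripts/analysis/simulate_attention_haze.py for that):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=64)                        # one toy query vector
K = rng.normal(size=(512, 64))                 # toy key cache (seq_len, head_dim)

def attention_entropy(keys):
    logits = keys @ q / np.sqrt(q.size)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

mag = np.abs(K)
tau = max(mag.std() / 2.0, mag.mean()) * 0.45  # per-head threshold at s = 0.45
haze = mag < tau                               # dense, low-magnitude band

K_ekv = np.where(haze, 0.0, K)                 # threshold-based zeroing
K_rand = K.copy()                              # matched random zeroing: same
idx = rng.permutation(K.size)[: int(haze.sum())]  # element count, random positions
K_rand.flat[idx] = 0.0

print("baseline :", attention_entropy(K))
print("EKV-style:", attention_entropy(K_ekv))   # typically stays close to baseline
print("random   :", attention_entropy(K_rand))  # typically perturbed far more
```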

*Figure: attention haze, the diffuse low-magnitude activations removed by ElasticKV.*

Concurrent work (Mao et al., 2026) identifies Q/K concentration in pre-RoPE space, showing that attention follows structured distance preferences. This offers a mechanistic perspective on Attention Haze: positions outside these preferred regions accumulate diffuse, low-magnitude KV activations — precisely the regime targeted by ElasticKV's per-head adaptive threshold.


Quick Start

Requirements

| Component | Version |
| --- | --- |
| NVIDIA GPU | Any CUDA-capable (tested: RTX 5090, GTX 1070) |
| CUDA Toolkit | ≥ 11.8 |
| llama.cpp | commit b5390 (March 2026) |
| Python | ≥ 3.10 (for experiment scripts only) |
| OS | Linux (tested on Ubuntu 22.04 WSL2) |

1. Clone and install Python dependencies

git clone https://github.com/infolake/elastickv.git
cd elastickv
pip install -r requirements.txt

2. Build llama.cpp with ElasticKV

# Clone llama.cpp at the tested commit
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b5390

# Apply the ElasticKV patch in-place
bash ../cuda_hook/patch_llamacpp.sh "$(pwd)"

# Build with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
cd ..

3. Run ElasticKV

# Enable ElasticKV with sweet-spot configuration
export ELASTICKV=1
export ELASTICKV_SCALE_K=0.45
export ELASTICKV_SCALE_V=0.50
export ELASTICKV_SINK=64

# Run perplexity evaluation
./llama.cpp/build/bin/llama-perplexity \
    -m /path/to/model.gguf \
    -f /path/to/wikitext-2-raw/wiki.test.raw \
    -b 2048

Configuration

ElasticKV is controlled entirely via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| ELASTICKV | 0 | Enable hook (set to 1) |
| ELASTICKV_SCALE_K | 0.35 | Key threshold scale |
| ELASTICKV_SCALE_V | 0.17 | Value threshold scale |
| ELASTICKV_SINK | 64 | Sink tokens to protect |
| ELASTICKV_EVICT_EVERY | 64 | Sparsification interval (tokens) |
| ELASTICKV_WARM | 128 | Warmup tokens before first sparsification |
| ELASTICKV_HEAD_DIM | auto | KV head dimension (auto-detected) |
| ELASTICKV_VERBOSE | 0 | Print diagnostics to stderr |

Preset configurations:

| Name | SCALE_K | SCALE_V | Use case |
| --- | --- | --- | --- |
| Sweet spot | 0.45 | 0.50 | Minimal quality impact (< 0.31% PPL) |
| Aggressive | 0.50 | 0.50 | Maximum compression (1.57×) |
| K-only | 0.35 | 0.00 | Conservative, keys only |
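
For scripted runs, a small launcher can apply a preset by setting the environment variables above before invoking llama.cpp. This is an illustrative sketch; the binary and file paths are placeholders to adapt to your checkout:

```python
import os
import subprocess

# Preset scales from the table above.
PRESETS = {
    "sweet":      {"ELASTICKV_SCALE_K": "0.45", "ELASTICKV_SCALE_V": "0.50"},
    "aggressive": {"ELASTICKV_SCALE_K": "0.50", "ELASTICKV_SCALE_V": "0.50"},
    "k_only":     {"ELASTICKV_SCALE_K": "0.35", "ELASTICKV_SCALE_V": "0.00"},
}

def run_ppl(preset, model_path, corpus_path):
    env = os.environ.copy()
    env.update({"ELASTICKV": "1", "ELASTICKV_SINK": "64"})
    env.update(PRESETS[preset])
    subprocess.run(
        ["./llama.cpp/build/bin/llama-perplexity",
         "-m", model_path, "-f", corpus_path, "-b", "2048"],
        env=env, check=True,
    )

run_ppl("sweet", "/path/to/model.gguf", "data/wiki.test.raw")
```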

Reproducing Paper Results

This repository contains the released raw outputs behind every table and figure in the paper, plus the scripts used to regenerate the individual analyses. The full paper sweep is not a single turnkey command: most scripts still require explicit model paths and a local llama.cpp checkout at commit b5390. See results/MANIFEST.md for the complete mapping (13 tables, 8 figures, 11 inline claims, plus supplementary local baseline artifacts).

The supplementary H2O/ScissorHands head-to-head released with the paper is archived under results/supplementary/local_baselines/. The local GTX 1070 llama-bench sweeps used to study frequency sensitivity are archived under results/supplementary/throughput_gtx1070_llama_bench/. Some archived JSON and log artifacts intentionally preserve the original runner/model paths as sanitized placeholders such as <models>/... and <llama.cpp>/..., rather than workstation-specific absolute paths.

The local-baseline package also includes the real-KV attention_haze_tau_bands_qwen05b.json and attention_h2o_scissor_evictions_qwen05b.json artifacts, with reproduction scripts under scripts/analysis/. These two files are the cleanest empirical anchor for the updated haze claim: the first shows that EKV's sub-threshold region is dense but low-energy, and the second shows that matched H2O/ScissorHands evictions remove substantially more high-energy structure under the same diagnostic frame.

Key experiments

| Paper artifact | Script | Output |
| --- | --- | --- |
| Table 1 (Threshold ablation) | scripts/analysis/f4_threshold_ablation.py | results/f4_thresh_ablation_results.json |
| Table 2 (PPL, 8 models) | scripts/eval/run_ppl.sh | results/results_all.json |
| Table 3 (NIAH 45/45) | scripts/eval/f5_niah.py | results/logs/final_tests_full.log |
| Table 4 (Long context) | scripts/eval/run_ppl.sh | results/long_context_results.json |
| Table 5 (Sink ablation) | scripts/eval/run_ppl.sh | results/logs/final_tests_full.log |
| Table 6 (Throughput) | scripts/eval/throughput_bench.sh | results/throughput/bench_*.md |
| Table 7 (LongBench-v2) | scripts/eval/longbench_v2_eval.py | results/longbench/ |
| Table 8 (Per-domain LB) | scripts/eval/longbench_v2_eval.py | results/longbench/qwen25_seed*/ |
| Table 9 (PPL vs accuracy) | scripts/analysis/f1_ppl_analysis.py | results/ |
| Table 10 (Entropy sim) | scripts/analysis/simulate_attention_haze.py | results/attention_entropy_v2_results.json |
| Table 11 (EKV vs random) | scripts/analysis/f4_threshold_ablation.py | results/ablation_random_vs_ekv_results.json |
| Table 12 (Paired chunks) | scripts/analysis/f1_deep_analysis.py | results/ppl/ |
| Table 13 (SPR/ASI) | scripts/analysis/f2_spr_real_kv.py | results/f2_spr_real_kv_results.json |
| Fig. 1 (Sparsity curve) | scripts/plot/gen_fig65_sparsity_curve.py | paper/figures/fig65_sparsity_curve.pdf |
| Fig. 2 (Architecture) | scripts/plot/generate_paper_figures.py | paper/figures/fig0_architecture.pdf |
| Fig. 3 (KV regression) | scripts/analysis/f1_ppl_analysis.py | paper/figures/f1_kv_heads_vs_ppl.png |
| Fig. 4 (Three-seed) | scripts/plot/generate_paper_figures.py | paper/figures/fig25_three_seed.pdf |
| Fig. 5 (KV head depend.) | scripts/plot/generate_paper_figures.py | paper/figures/fig35_kv_head_dependence.pdf |
| Fig. 6 (Attention haze) | scripts/plot/generate_paper_figures.py | paper/figures/fig15_attention_haze.pdf |
| Fig. 7 (Entropy vs spar.) | scripts/plot/generate_paper_figures.py | paper/figures/fig45_entropy_vs_sparsity.pdf |
| Fig. 8 (Per-domain Δ) | scripts/plot/generate_paper_figures.py | paper/figures/fig55_precision_recall.pdf |
| Compression proof | scripts/analysis/f7_compression_proof.py | results/f7_compression_proof.json |
| SPR real KV | scripts/analysis/f2_spr_real_kv.py | results/f2_spr_real_kv_results.json |

Example: reproduce PPL results

# Download WikiText-2
python scripts/data/download_wikitext.py data/wiki.test.raw

# Run baseline + sweet + aggressive for one model
bash scripts/eval/run_ppl.sh \
  --llama-cpp-dir ./llama.cpp \
  --model /path/to/model.gguf \
  --corpus ./data/wiki.test.raw

Example: reproduce NIAH

python scripts/eval/f5_niah.py \
  --llamacpp ./llama.cpp/build/bin/llama-cli \
  --model /path/to/model.gguf \
  --model-name "My Model" \
  --ctx 4096 \
  --template chatml

Example: reproduce ablation (EKV vs random zeroing)

python scripts/analysis/f4_threshold_ablation.py

Models Tested

| Model | Parameters | KV heads | GQA | head_dim | GPU |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-0.5B | 0.5B | 2 | 14:1 | 64 | GTX 1070 |
| Qwen2.5-1.5B | 1.5B | 2 | 6:1 | 128 | GTX 1070 |
| TinyLlama-1.1B | 1.1B | 4 | 8:1 | 64 | GTX 1070 |
| Phi-2 | 2.78B | 32 | 1:1 (MHA) | 80 | GTX 1070 |
| Qwen3-4B | 4B | 8 | 4:1 | 128 | RTX 5090 |
| Mistral-7B | 7.24B | 8 | 4:1 | 128 | RTX 5090 |
| Qwen2.5-7B | 7.62B | 4 | 7:1 | 128 | RTX 5090 |
| Llama-3-8B | 8B | 8 | 4:1 | 128 | RTX 5090 |
| Llama-3.1-8B | 8B | 8 | 4:1 | 128 | GTX 1070 |

Primary models evaluated as FP16/BF16; supplementary (†F1) and GTX 1070 models as Q3_K_M / Q4_K_M GGUF quantizations.


Method Overview

ElasticKV computes per-head adaptive thresholds based on magnitude statistics:

$$\tau_h = \max\!\left(\frac{\sigma_h}{2},\; \mu_h\right) \times s$$

where $\mu_h$ and $\sigma_h$ are the mean and standard deviation of absolute values for KV head $h$ (excluding sink positions), and $s$ is the user-specified scale factor. Elements below $\tau_h$ are zeroed in-place. The first $n_\text{sink}$ positions are never modified.

Key design choices:

  • Per-head (not per-layer): prevents cross-head magnitude bias at long contexts
  • Sink protection: preserves attention anchors (first 64 tokens by default)
  • K/V asymmetry: separate scales exploit that keys tolerate ~2× more aggressive sparsification than values
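
For intuition, the update can be sketched in a few lines of NumPy; the shipped implementation is the in-place CUDA hook under cuda_hook/, and the shapes and names below are illustrative assumptions:

```python
import numpy as np

def elastickv_step(cache, scale, n_sink=64):
    """Zero sub-threshold entries per KV head, never touching sink positions.

    cache: assumed shape (n_kv_heads, seq_len, head_dim), modified in place.
    scale: the user scale s (0.45 for K and 0.50 for V in the sweet spot).
    """
    for h in range(cache.shape[0]):
        view = cache[h, n_sink:, :]                     # sink tokens stay untouched
        mag = np.abs(view)                              # stats exclude sink positions
        tau = max(mag.std() / 2.0, mag.mean()) * scale  # tau_h = max(sigma_h/2, mu_h) * s
        view[mag < tau] = 0.0                           # suppress the haze band in place

# K and V use separate scales, reflecting the K/V asymmetry above:
# elastickv_step(k_cache, scale=0.45); elastickv_step(v_cache, scale=0.50)
```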

For full details, see the paper.

*Figure: EKV sparsity–quality trade-off.*


Reproducibility Scope

This release targets artifact-backed reproducibility of the paper. The raw outputs behind the claims are bundled here, while full reruns still depend on matching model files and a compatible llama.cpp checkout. The following constraints apply:

  • Hardware: Results were obtained on RTX 5090 (32 GB) and GTX 1070 (8 GB). Different GPUs may produce numerically identical PPL values (the hook is deterministic), but latency measurements will differ.
  • llama.cpp version: Commit b5390 (March 2026). Other versions may work but are not tested.
  • GGUF models: Use the model families and quantizations listed in this README together with the archived result files. A Zenodo archival release will add a consolidated provenance table for the exact model artifacts.
  • Seeds: LongBench-v2 experiments use seeds 42, 123, 456. PPL evaluation is deterministic (no seed dependence).

Citation

Zenodo concept DOI for all versions: 10.5281/zenodo.19503351

Machine-readable archival metadata is included in CITATION.cff and .zenodo.json.

@misc{camargo2026elastickv,
  title        = {{ElasticKV}: Exploiting Attention Haze for Structure-Preserving {KV}-Cache Compression},
  author       = {Camargo, Guilherme de},
  year         = {2026},
  howpublished = {Zenodo software release},
  note         = {Concept DOI for the ElasticKV software record},
  version      = {v0.1.1},
  doi          = {10.5281/zenodo.19503351},
  url          = {https://doi.org/10.5281/zenodo.19503351}
}

Support

If you find this work useful, consider supporting:

👉 GitHub Sponsors

For questions about the project: camargo@phiq.io


License

Most of this repository is licensed under the Apache License 2.0 — see LICENSE. The redistributed llama.cpp-derived snapshot under cuda_hook/experimental/ also carries the upstream MIT license copy in cuda_hook/experimental/LICENSE.llama.cpp.
