
# Results Manifest

Maps the main paper tables, figures, and inline claims to the scripts that produced them and to the raw data files that back them. The raw artifacts listed here are bundled with the repository; some scripts are turnkey, while others still require explicit GGUF model paths, cached Hugging Face weights, or a local llama.cpp checkout at the tested commit. Supplementary local artifacts released with the repository are listed separately below. Where provenance fields from the original capture environment were preserved, supplementary JSON/log files use sanitized placeholder runner/model paths.

LaTeX table/figure numbers follow source order (all floats use `[ht]` or `[t]`).


## Tables (13)

| # | Label | Description | Script | Data |
|---|-------|-------------|--------|------|
| 1 | thresh_ablation | Threshold formula ablation (Qwen2.5-0.5B real K) | `scripts/analysis/f4_threshold_ablation.py` | `results/f4_thresh_ablation_results.json` |
| 2 | ppl_crossmodel | PPL across 7+2 models, 3 configs | `scripts/eval/run_ppl.sh` | `results/results_all.json`, `results/ppl/*/` |
| 3 | niah | NIAH retrieval 45/45 (3 models) | `scripts/eval/f5_niah.py` | `results/logs/final_tests_full.log` |
| 4 | longctx | Long-context PPL (4K/8K/16K) | `scripts/eval/run_ppl.sh` | `results/long_context_results.json` |
| 5 | sink | Sink-token ablation (Mistral-7B) | `scripts/eval/run_ppl.sh` | `results/logs/final_tests_full.log` |
| 6 | throughput | RTX 5090 throughput (llama-bench) | `scripts/eval/throughput_bench.sh` | `results/throughput/bench_*.md` |
| 7 | longbench_overall | LongBench-v2 overall accuracy (503 examples, 3 seeds) | `scripts/eval/longbench_v2_eval.py` | `results/longbench/*/results_all.json` |
| 8 | longbench_domain | Qwen2.5-7B per-domain LongBench-v2 | `scripts/eval/longbench_v2_eval.py` | `results/longbench/qwen25_seed*/` |
| 9 | divergence | PPL degradation vs task accuracy | `scripts/analysis/f1_ppl_analysis.py` | `results/results_sweet.json`, `results/longbench/` |
| 10 | entropy | Attention entropy under simulated EKV | `scripts/analysis/simulate_attention_haze.py` | `results/attention_entropy_v2_results.json` |
| 11 | ablation_random | EKV vs random zeroing (matched sparsity) | `scripts/analysis/f4_threshold_ablation.py` | `results/ablation_random_vs_ekv_results.json` |
| 12 | paired_chunks | Paired per-chunk analysis (†F1 models) | `scripts/analysis/f1_deep_analysis.py` | `results/ppl/*/`, `results/logs/final_tests_full.log` |
| 13 | spr_asi | SPR and ASI across architectures | `scripts/analysis/f2_spr_real_kv.py` | `results/f2_spr_real_kv_results.json`, paper Table 13 |

## Figures (8)

| # | Label | Description | Script | File |
|---|-------|-------------|--------|------|
| 1 | sparsity_curve | Sparsity–quality trade-off (Llama-3.1-8B) | `scripts/plot/gen_fig65_sparsity_curve.py` | `paper/figures/fig65_sparsity_curve.pdf` |
| 2 | architecture | ElasticKV pipeline diagram | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig0_architecture.pdf` |
| 3 | f1_kv_regression | KV heads vs PPL regression (7 models) | `scripts/analysis/f1_ppl_analysis.py` | `paper/figures/f1_kv_heads_vs_ppl.png` |
| 4 | three_seed | Three-seed LongBench reproducibility | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig25_three_seed.pdf` |
| 5 | kv_head_dependence | KV head count moderation effect | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig35_kv_head_dependence.pdf` |
| 6 | attention_haze | Attention haze visualization (64-token) | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig15_attention_haze.pdf` |
| 7 | entropy_sparsity | Entropy change vs sparsity (EKV + random) | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig45_entropy_vs_sparsity.pdf` |
| 8 | precision_recall | Per-domain accuracy change (Qwen2.5-7B) | `scripts/plot/generate_paper_figures.py` | `paper/figures/fig55_precision_recall.pdf` |

## Inline Claims

| Claim | Location | Script | Data |
|-------|----------|--------|------|
| Compression 1.48× (Mistral sweet) | Abstract, §4.1 | `scripts/eval/run_ppl.sh` | `results/results_sweet.json` |
| <0.31% PPL degradation (GQA) | Abstract, §4.1 | `scripts/eval/run_ppl.sh` | `results/results_all.json` |
| 3,400× ablation divergence | Abstract, §5.3 | `scripts/analysis/f4_threshold_ablation.py` | `results/ablation_random_vs_ekv_results.json` |
| SPR_cond = 54.6% (real KV) | §5.2 | `scripts/analysis/f2_spr_real_kv.py` | `results/f2_spr_real_kv_results.json` |
| EKV removes 36.6% of elements but only 0.53% of L2 at s=0.45 | §4.3, §5.2 | `scripts/analysis/f4b_tau_band_analysis.py` | `results/supplementary/local_baselines/attention_haze_tau_bands_qwen05b.json` |
| H2O/ScissorHands remove 46.3% / 75.2% of L2 under the same τ_h frame | §4.3, §5.2 | `scripts/analysis/f4c_h2o_scissor_eviction_analysis.py` | `results/supplementary/local_baselines/attention_h2o_scissor_evictions_qwen05b.json` |
| zlib 1.52× / lz4 1.23× | §7 | `scripts/analysis/f7_compression_proof.py` | `results/f7_compression_proof.json` |
| +0.8 pp Qwen2.5-7B regularization | §4.4, §5.3 | `scripts/eval/longbench_v2_eval.py` | `results/longbench/qwen25_seed*/` |
| ΔPPL regression R²=0.53 | §4.1, §5.1 | `scripts/analysis/f1_ppl_analysis.py` | `results/results_all.json` |
| 0.4% decode overhead | §4.3 | `scripts/eval/throughput_bench.sh` | `results/throughput/bench_*.md` |
| Paired t>100 (chunk-level) | §5.1 | `scripts/analysis/f1_deep_analysis.py` | `results/ppl/*/` |
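
The "36.6% of elements but only 0.53% of L2" claim rests on a simple property of magnitude thresholding: the smallest-magnitude elements carry disproportionately little squared-L2 energy. A minimal sketch of that arithmetic on synthetic Gaussian data (the real-KV numbers come from `f4b_tau_band_analysis.py`; real K tensors are heavier-tailed, so their removed-energy fraction is far smaller than this toy's):

```python
import numpy as np

rng = np.random.default_rng(0)
k = rng.standard_normal(100_000)  # synthetic stand-in for a K tensor

# Zero the smallest-magnitude elements until the target sparsity is hit,
# then compare the fraction of elements removed with the fraction of the
# tensor's squared-L2 energy those elements carried.
target_sparsity = 0.45
tau = np.quantile(np.abs(k), target_sparsity)  # magnitude threshold
removed = np.abs(k) < tau

frac_elements = removed.mean()
frac_l2 = (k[removed] ** 2).sum() / (k ** 2).sum()
print(f"elements removed: {frac_elements:.1%}, L2 removed: {frac_l2:.1%}")
```

Even on a Gaussian, removing 45% of elements this way removes only about 5% of the energy; the gap widens with heavier-tailed activations.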

## Supplementary Local Artifacts

| Artifact | Description | Script or patch | Data |
|----------|-------------|-----------------|------|
| local_baselines | Same-run local H2O, ScissorHands, and EKV comparison on GTX 1070 (llama-perplexity, flash_attn off, c=256, b=256, ub=32, 8 chunks) | `cuda_hook/experimental/llama-context_attention_baselines.cpp` | `results/supplementary/local_baselines/h2o_wsl_compatible_models_comparison.json`, `results/supplementary/local_baselines/logs/` |
| h2o_smoke | Short smoke validation showing score capture and heavy+recent eviction in the local H2O path | `cuda_hook/experimental/llama-context_attention_baselines.cpp` | `results/supplementary/local_baselines/h2o_smoke_qwen05b_comparison.json` |
| tau_band_qwen05b | Real-KV pre-threshold `\|K\|/τ_h` band analysis showing that EKV's removed region is high-density but low-energy on Qwen2.5-0.5B | `scripts/analysis/f4b_tau_band_analysis.py` | `results/supplementary/local_baselines/attention_haze_tau_bands_qwen05b.json` |
| baseline_eviction_geometry | Real-KV replay that projects H2O and ScissorHands evictions onto the same `\|K\|/τ_h` bands for mechanistic comparison with EKV | `scripts/analysis/f4c_h2o_scissor_eviction_analysis.py` | `results/supplementary/local_baselines/attention_h2o_scissor_evictions_qwen05b.json` |
| gtx1070_llama_bench | Local GTX 1070 throughput sweep for evict_every sensitivity and sweet/aggressive decode trade-offs | `scripts/eval/run_gtx1070_llama_bench.sh` | `results/supplementary/throughput_gtx1070_llama_bench/` |

## Data Files

### Top-level aggregates (`results/`)

| File | Format | Description |
|------|--------|-------------|
| `results_all.json` | JSON | All PPL results across models and configs (Mistral primary) |
| `results_baseline.json` | JSON | Baseline (no EKV) PPL |
| `results_sweet.json` | JSON | Sweet-spot config (sK=0.45/sV=0.50) |
| `results_aggressive.json` | JSON | Aggressive config (sK=sV=0.50) |
| `ablation_random_vs_ekv_results.json` | JSON | Matched-sparsity EKV vs random ablation |
| `attention_entropy_v2_results.json` | JSON | Entropy simulation results |
| `f4_thresh_ablation_results.json` | JSON | Threshold-formula ablation archive for Table 1 |
| `f2_spr_real_kv_results.json` | JSON | Archived SPR/ASI empirical summary for Table 13 and real-KV claims |
| `f7_compression_proof.json` | JSON | Archived compression-proof summary for Section 7 |
| `long_context_results.json` | JSON | 4K/8K/16K context PPL scaling |
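
Degradation claims such as "<0.31% PPL degradation" compare a config's perplexity against the baseline aggregate. A minimal sketch of that arithmetic with invented perplexities (the actual values and JSON schema live in `results_baseline.json` / `results_sweet.json` and are not reproduced here):

```python
# Hypothetical baseline vs sweet-config perplexities; numbers are invented
# for illustration, not taken from the repository's result files.
baseline_ppl = 5.920
sweet_ppl = 5.936

# Relative PPL degradation in percent.
delta_pct = 100.0 * (sweet_ppl - baseline_ppl) / baseline_ppl
print(f"PPL degradation: {delta_pct:+.3f}%")
```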

### Per-model PPL (`results/ppl/`)

| Directory | Model | Contents |
|-----------|-------|----------|
| `ppl/llama3/` | Llama-3-8B FP16 | `results_{all,baseline,sweet,aggressive}.json` |
| `ppl/qwen25/` | Qwen2.5-7B FP16 | `results_{all,baseline,sweet,aggressive}.json` |
| `ppl/qwen3/` | Qwen3-4B BF16 | `results_{all,baseline,sweet,aggressive}.json` |

### LongBench-v2 3-seed (`results/longbench/`)

| Directory | Model × Seed | Contents |
|-----------|--------------|----------|
| `longbench/qwen25_seed{1,2,3}/` | Qwen2.5-7B × 3 seeds | `results_{all,baseline,sweet,aggressive}.json` |
| `longbench/qwen3_seed{1,2,3}/` | Qwen3-4B × 3 seeds | `results_{all,baseline,sweet,aggressive}.json` |
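
The three-seed layout supports mean-and-spread reporting per model. A minimal sketch of that aggregation with invented accuracies (the real per-seed values live under `results/longbench/qwen25_seed{1,2,3}/`; the schema assumed here is hypothetical):

```python
from statistics import mean, stdev

# Hypothetical overall accuracies for the three seeds; numbers are invented
# for illustration, not read from the repository's result files.
seed_accuracy = [0.312, 0.318, 0.306]

print(f"mean={mean(seed_accuracy):.3f} sd={stdev(seed_accuracy):.3f} "
      f"n={len(seed_accuracy)}")
```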

### Throughput (`results/throughput/`)

| File | Description |
|------|-------------|
| `bench_baseline.md` | llama-bench output, `ELASTICKV=0` |
| `bench_sweet.md` | llama-bench output, sweet config |
| `bench_aggressive.md` | llama-bench output, aggressive config |

### Logs (`results/logs/`)

| File | Description |
|------|-------------|
| `final_tests_full.log` | Complete PPL+NIAH+sink test run |
| `ablation_results.log` | Ablation experiment log |
| `longbench_3seeds_full.log` | 3-seed LongBench Qwen2.5-7B run |
| `longbench_3seeds_qwen3_full.log` | 3-seed LongBench Qwen3-4B run |