
Which cache type to use? #727

Answered by turboderp
Mithadon asked this question in Q&A

There is always some accuracy loss with cache quantization, but whether it matters is difficult to say, and it varies between models. I have some updated data, though it's a bit old by now:

| Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
|---|---|---|---|---|---|
| Qwen2-7B | FP16 | Q4 | 19.74% | 46.34% | 40.72 |
| Qwen2-7B | FP16 | Q6 | 61.65% | 81.70% | 15.20 |
| Qwen2-7B | FP16 | Q8 | 62.37% | 81.09% | 15.18 |
| Qwen2-7B | FP16 | FP16 | 61.16% | 82.31% | 15.16 |
| Qwen2-72B | 6.0bpw | Q4 | 70.36% | 87.19% | 10.31 |
| Qwen2-72B | 6.0bpw | Q6 | 69.32% | 85.36% | 10.26 |
| Qwen2-72B | 6.0bpw | Q8 | 71.28% | 85.36% | 10.23 |
| Qwen2-72B | 6.0bpw | FP16 | 70.80% | 83.50% | 10.17 |
| Llama3-8B-instruct | FP16 | Q4 | 58.29% | 78.65% | 17.76 |
| Llama3-8B-instruct | FP16 | Q6 | 61.58% | 77.43% | 17.70 |
| Llama3-8B-instruct | FP16 | Q8 | 6… | | |
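
In practice, choosing a cache type in exllamav2 comes down to which cache class you construct at load time. Below is a minimal sketch assuming the `ExLlamaV2Cache` / `ExLlamaV2Cache_Q4` / `ExLlamaV2Cache_Q6` / `ExLlamaV2Cache_Q8` classes and the lazy/autosplit loading pattern from recent exllamav2 releases; the model path is hypothetical, and class names or signatures may differ in your installed version.

```python
# A minimal sketch, assuming the cache classes exported by recent exllamav2
# releases (ExLlamaV2Cache, ExLlamaV2Cache_Q4, ExLlamaV2Cache_Q6,
# ExLlamaV2Cache_Q8) and the lazy/autosplit loading pattern; verify the
# names against your installed version.
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,      # FP16 cache: reference accuracy, highest VRAM use
    ExLlamaV2Cache_Q4,   # 4-bit cache: smallest, can hurt badly (see Qwen2-7B above)
    ExLlamaV2Cache_Q6,   # 6-bit cache: tracks FP16 closely in the table above
    ExLlamaV2Cache_Q8,   # 8-bit cache: near-lossless in these benchmarks
)

config = ExLlamaV2Config("/path/to/model")   # hypothetical model directory
model = ExLlamaV2(config)

# Q8 is a reasonable default when VRAM is tight; drop to Q6 only if you must,
# and be wary of Q4 on models that degrade sharply with it.
cache = ExLlamaV2Cache_Q8(model, lazy=True)  # lazy: allocate during autosplit
model.load_autosplit(cache)
```

Swapping `ExLlamaV2Cache_Q8` for one of the other classes is the only change needed to compare cache types on your own workload.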
