
Which cache type to use? #727

Answered by turboderp
Mithadon asked this question in Q&A

There is always some accuracy loss with cache quantization, but whether it matters is difficult to say, and it varies between models. I have some updated data, though it's a bit old by now:

| Model | Quant | Cache | pass@1 | pass@10 | Wikitext 5x1k |
|---|---|---|---|---|---|
| Qwen2-7B | FP16 | Q4 | 19.74% | 46.34% | 40.72 |
| Qwen2-7B | FP16 | Q6 | 61.65% | 81.70% | 15.20 |
| Qwen2-7B | FP16 | Q8 | 62.37% | 81.09% | 15.18 |
| Qwen2-7B | FP16 | FP16 | 61.16% | 82.31% | 15.16 |
| Qwen2-72B | 6.0bpw | Q4 | 70.36% | 87.19% | 10.31 |
| Qwen2-72B | 6.0bpw | Q6 | 69.32% | 85.36% | 10.26 |
| Qwen2-72B | 6.0bpw | Q8 | 71.28% | 85.36% | 10.23 |
| Qwen2-72B | 6.0bpw | FP16 | 70.80% | 83.50% | 10.17 |
| Llama3-8B-instruct | FP16 | Q4 | 58.29% | 78.65% | 17.76 |
| Llama3-8B-instruct | FP16 | Q6 | 61.58% | 77.43% | 17.70 |
| Llama3-8B-instruct | FP16 | Q8 | 6… | | |
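
In practice, choosing a cache type in exllamav2 comes down to which cache class you construct at load time. Below is a minimal sketch assuming the `ExLlamaV2Cache` / `ExLlamaV2Cache_Q4` / `ExLlamaV2Cache_Q6` / `ExLlamaV2Cache_Q8` classes and the lazy/autosplit loading pattern from recent exllamav2 releases; the model path is hypothetical, and class names or signatures may differ in your installed version.

```python
# A minimal sketch, assuming the cache classes exported by recent exllamav2
# releases (ExLlamaV2Cache, ExLlamaV2Cache_Q4, ExLlamaV2Cache_Q6,
# ExLlamaV2Cache_Q8) and the lazy/autosplit loading pattern; verify the
# names against your installed version.
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,      # FP16 cache: reference accuracy, highest VRAM use
    ExLlamaV2Cache_Q4,   # 4-bit cache: smallest, can hurt badly (see Qwen2-7B above)
    ExLlamaV2Cache_Q6,   # 6-bit cache: tracks FP16 closely in the table above
    ExLlamaV2Cache_Q8,   # 8-bit cache: near-lossless in these benchmarks
)

config = ExLlamaV2Config("/path/to/model")   # hypothetical model directory
model = ExLlamaV2(config)

# Q8 is a reasonable default when VRAM is tight; drop to Q6 only if you must,
# and be wary of Q4 on models that degrade sharply with it.
cache = ExLlamaV2Cache_Q8(model, lazy=True)  # lazy: allocate during autosplit
model.load_autosplit(cache)
```

Swapping `ExLlamaV2Cache_Q8` for one of the other classes is the only change needed to compare cache types on your own workload.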
