What's the state of cache quantization in exl2? I've been using Q4 for a long time, but Q6 and Q8 have since been added. The cache eval post is very old now (https://github.com/turboderp-org/exllamav2/blob/master/doc/qcache_eval.md), but that showed essentially equal results between fp16 and Q4. So, is there any reason to use Q6 or Q8? Thanks for the advice.
Replies: 1 comment
There is always an accuracy loss with cache quantization, but whether it matters is difficult to say, and it varies between models. I have some updated data, though it's also a bit old by now:
As evident, all models show a slight but probably insignificant improvement in perplexity going from Q4 to Q6, while the HumanEval scores remain within the margin of error. Qwen2-7B is an exception, which I think comes down to its aggressive use of GQA (there isn't much data in the cache to begin with), but there could be other models that are similarly sensitive to quantization.
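For anyone who wants to compare the cache modes themselves, here is a minimal sketch of how cache precision is selected when loading a model with the exllamav2 Python API. The model path is a placeholder, and the quantized cache class names (ExLlamaV2Cache_Q4, ExLlamaV2Cache_Q6, ExLlamaV2Cache_Q8) are assumed from recent versions of the library:

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,       # full-precision (FP16) cache
    ExLlamaV2Cache_Q4,    # 4-bit quantized cache
    ExLlamaV2Cache_Q6,    # 6-bit quantized cache
    ExLlamaV2Cache_Q8,    # 8-bit quantized cache
)

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder path
config.prepare()

model = ExLlamaV2(config)

# Choosing the cache precision is just a matter of which class you
# instantiate. Q4 uses roughly half the cache memory of Q8, with Q6
# in between; per the discussion above, the quality difference between
# Q4 and Q6 is slight for most models.
cache = ExLlamaV2Cache_Q6(model, lazy = True)
model.load_autosplit(cache)
```

If memory allows, Q8 (or the FP16 ExLlamaV2Cache) is the conservative choice for models that are unusually sensitive to cache quantization, such as ones with aggressive GQA.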