For matrix multiplication operations with quantized models, is the fp32 src1 first quantized to Q8_1? #11743
BishmoyPaul asked this question in Q&A · Unanswered
I was looking into ggml_cuda_mul_mat (ggml-cuda.cu L1844) to understand how it works for quantized models. It seems that when src1 (which is often the input/hidden state) is in FP32 format, it is first converted to Q8_1 before the actual operation - for example, in L1899, quantize_row_q8_1_cuda is passed as an argument.

Am I correct in assuming it is indeed using Q8_1 for src1? If so, why Q8_1? For simpler quantizations like Q4_0, where the model weights are Q4_0 quantized, why convert src1 to Q8_1 rather than Q8_0?
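For reference, here is my (simplified) understanding of the two 8-bit block layouts, written as plain C with float scales instead of ggml_half and a block size of 32 assumed - just a sketch of the difference I'm asking about, not the exact ggml definitions:

```c
// Simplified sketch of the two 8-bit block layouts as I understand them
// (plain C, float scales instead of ggml_half, block size of 32 assumed).
#include <stdint.h>

#define QK8_0 32
#define QK8_1 32

// Q8_0: per-block scale only.
typedef struct {
    float  d;           // delta (scale)
    int8_t qs[QK8_0];   // quants
} block_q8_0;

// Q8_1: per-block scale plus a precomputed, scaled sum of the quants.
typedef struct {
    float  d;           // delta (scale)
    float  s;           // d * sum(qs[i]), computed at quantization time (if I read the block definition right)
    int8_t qs[QK8_1];   // quants
} block_q8_1;
```

If that sketch is roughly right, the only difference I can see is the extra precomputed sum per block, so I'd like to understand what that buys when the weight format is Q4_0.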