For matrix multiplication operations with quantized models, is the fp32 src1 first quantized to Q8_1? #11743
BishmoyPaul asked this question in Q&A · Unanswered
I was looking into ggml_cuda_mul_mat (ggml-cuda.cu L1844) to understand how it works for quantized models. It seems that when src1 (which is often the input/hidden state) is in FP32 format, it is first converted to Q8_1 before the actual operation - for example, in L1899, quantize_row_q8_1_cuda is passed as an argument.

Am I correct in assuming it is indeed using Q8_1 for src1? If so, why Q8_1? For simpler quantizations like Q4_0, where the model weights are Q4_0 quantized, why convert src1 to Q8_1 rather than Q8_0?
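For reference, here is my (simplified) understanding of the two 8-bit block layouts, written as plain C with float scales instead of ggml_half and a block size of 32 assumed - just a sketch of the difference I'm asking about, not the exact ggml definitions:

```c
// Simplified sketch of the two 8-bit block layouts as I understand them
// (plain C, float scales instead of ggml_half, block size of 32 assumed).
#include <stdint.h>

#define QK8_0 32
#define QK8_1 32

// Q8_0: per-block scale only.
typedef struct {
    float  d;           // delta (scale)
    int8_t qs[QK8_0];   // quants
} block_q8_0;

// Q8_1: per-block scale plus a precomputed, scaled sum of the quants.
typedef struct {
    float  d;           // delta (scale)
    float  s;           // d * sum(qs[i]), computed at quantization time (if I read the block definition right)
    int8_t qs[QK8_1];   // quants
} block_q8_1;
```

If that sketch is roughly right, the only difference I can see is the extra precomputed sum per block, so I'd like to understand what that buys when the weight format is Q4_0.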