Why aren't all tensors quantized in the GGUF format? #2937
-
I have loaded a quantized Llama-2 GGUF model and noticed that not all of its tensors are actually quantized: 65 tensors are still stored in F32. Why are these left unquantized? Thank you!
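In case it is useful, this is roughly how the tensor types can be counted. A minimal sketch, assuming the gguf-py package from the llama.cpp repo (`pip install gguf`); the file name is only a placeholder:

```python
# Minimal sketch: count tensors per storage type in a GGUF file.
# Assumes the gguf-py package from the llama.cpp repo; the path is a placeholder.
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("llama-2-7b.Q3_K_M.gguf")
counts = Counter(t.tensor_type.name for t in reader.tensors)

for type_name, n in counts.most_common():
    print(f"{type_name:>6}: {n} tensors")
```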
-
One-dimensional tensors are tiny, so there's no point in quantizing them. Just for example, an f32 4096-element 1D tensor is about 16KB, while a 4096x4096 Q3_K tensor would be around 6MB if it were exactly 3 bits per element (I think the real number is a bit higher, since Q3_K also stores per-block scales). Anyway, even in a 40GB 70B model, all the f32 tensors probably add up to less than 20MB. It's the multi-dimensional ones that are actually big enough to be worth quantizing.
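A quick back-of-envelope check of those figures (a sketch only; the 4096 dimension matches the Llama-2 7B hidden size, and ~3.44 bits/weight assumes Q3_K's nominal 110 bytes per 256-weight block):

```python
# Back-of-envelope tensor sizes for the comparison above.

def size_bytes(n_elements: int, bits_per_element: float) -> float:
    """Storage for a tensor at a given average bit width."""
    return n_elements * bits_per_element / 8

hidden = 4096  # hidden size used in the example (Llama-2 7B)

norm_f32 = size_bytes(hidden, 32)               # 1D norm weight kept in f32
mat_3bit = size_bytes(hidden * hidden, 3.0)     # idealized 3 bits per weight
mat_q3_k = size_bytes(hidden * hidden, 3.4375)  # assumed Q3_K: 110 bytes / 256 weights

print(f"1D f32 norm weight:          {norm_f32 / 1024:.0f} KiB")     # ~16 KiB
print(f"4096x4096 at 3.0 bits/w:     {mat_3bit / 1024**2:.1f} MiB")  # ~6.0 MiB
print(f"4096x4096 at Q3_K ~3.44 b/w: {mat_q3_k / 1024**2:.1f} MiB")  # ~6.9 MiB
```

Either way, the 1D tensors are three orders of magnitude smaller than the weight matrices, which is why leaving them in f32 costs essentially nothing.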
-
My (kindergarten) understanding is that the "K" series quantizations shrink most of the tensors... but if a tensor is exceptional, an outlier, then it will be quantized less aggressively or left as is. Would be very happy if someone could provide a better one-liner...
-
@KerfuffleV2 Thank you very much, that was very informative! I'll leave TheBloke's quantization table here as a reference!