Why aren't all tensors quantized in the GGUF format? #2937
-
I have loaded a quantized Llama-2 GGUF model and noticed that not all of its tensors are actually quantized: 65 tensors are still stored in F32. Why are these left unquantized? Thank you!
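In case it is useful, this is roughly how the tensor types can be counted. A minimal sketch, assuming the gguf-py package from the llama.cpp repo (`pip install gguf`); the file name is only a placeholder:

```python
# Minimal sketch: count tensors per storage type in a GGUF file.
# Assumes the gguf-py package from the llama.cpp repo; the path is a placeholder.
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("llama-2-7b.Q3_K_M.gguf")
counts = Counter(t.tensor_type.name for t in reader.tensors)

for type_name, n in counts.most_common():
    print(f"{type_name:>6}: {n} tensors")
```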
-
One-dimensional tensors are tiny, so there's no point in quantizing them. Just for example, an f32 4096-element 1D tensor is about 16KB, while a 4096x4096 Q3_K tensor would be around 6MB if it were exactly 3 bits per element (I think the real number is a bit higher, since Q3_K also stores per-block scales). Anyway, even in a 40GB 70B model, all the f32 tensors probably add up to less than 20MB. It's the multi-dimensional ones that are actually big enough to be worth quantizing.
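A quick back-of-envelope check of those figures (a sketch only; the 4096 dimension matches the Llama-2 7B hidden size, and ~3.44 bits/weight assumes Q3_K's nominal 110 bytes per 256-weight block):

```python
# Back-of-envelope tensor sizes for the comparison above.

def size_bytes(n_elements: int, bits_per_element: float) -> float:
    """Storage for a tensor at a given average bit width."""
    return n_elements * bits_per_element / 8

hidden = 4096  # hidden size used in the example (Llama-2 7B)

norm_f32 = size_bytes(hidden, 32)               # 1D norm weight kept in f32
mat_3bit = size_bytes(hidden * hidden, 3.0)     # idealized 3 bits per weight
mat_q3_k = size_bytes(hidden * hidden, 3.4375)  # assumed Q3_K: 110 bytes / 256 weights

print(f"1D f32 norm weight:          {norm_f32 / 1024:.0f} KiB")     # ~16 KiB
print(f"4096x4096 at 3.0 bits/w:     {mat_3bit / 1024**2:.1f} MiB")  # ~6.0 MiB
print(f"4096x4096 at Q3_K ~3.44 b/w: {mat_q3_k / 1024**2:.1f} MiB")  # ~6.9 MiB
```

Either way, the 1D tensors are three orders of magnitude smaller than the weight matrices, which is why leaving them in f32 costs essentially nothing.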
-
My (kindergarten) understanding is that the "K" series quantizations shrink most of the tensors... but if a tensor is exceptional, an outlier, then it will be quantized less aggressively or left as is. Would be very happy if someone could provide a better one-liner...
-
@KerfuffleV2 Thank you very much, that was very informative! I'll leave TheBloke's quantization table here as a reference!