DRAFT: Update llama-quant.cpp llama_tensor_get_type with DeepSeek friendly modifications #12727

Draft: wants to merge 7 commits into base: master

bartowski1182 (Contributor) commented Apr 3, 2025

In draft because I'm still working on some numbers (sizes and PPL), but I want to post it so it can start being looked at.

After some discussion with ikawrakow (and with some inspiration from Unsloth's work), it feels like llama_tensor_get_type needs some love for MoE models, in particular DeepSeek.

There are several places where we check for n_expert == 8, which means we forgo optimizations on models with any other number of experts.

There are two sets of improvements: a large change for models below 2 BPW (LLAMA_FTYPE_MOSTLY_IQ2_M and lower), and another set of changes for the rest of the BPWs that is less impactful but brings them more in line with the general practices for those sizes.

Huge thanks to ikawrakow and the Unsloth team for their initial investigations, and to Artus for helping me crunch some of these PPL numbers.

Performance updates:

| Quant type | Metric | Main | This branch | Difference |
|---|---|---|---|---|
| Q2_K_L* | Size | 244.93GB | 248.90GB | +3.97GB (~1.6%) |
| | Perplexity | 3.9012 +/- 0.02243 | 3.6025 +/- 0.02111 | -0.2987 (~7.7%) |
| IQ2_M | Size | 217.43GB | 224.49GB | +7.06GB (~3.2%) |
| | Perplexity | 4.1846 +/- 0.02583 | 3.7678 +/- 0.02179 | -0.4168 (~10.0%) |
| IQ2_XXS | Size | 174.43GB | 179.78GB | +5.35GB (~3.1%) |
| | Perplexity | 5.4764 +/- 0.03602 | 4.3861 +/- 0.02597 | -1.0903 (~20%) |
| IQ1_M | Size | 148.88GB | 154.78GB | +5.9GB (~4%) |
| | Perplexity | 6.0219 +/- 0.11989** | 4.7366 +/- 0.09601** | -1.2853 (~21.3%) |

*Q2_K_L is Q2_K but with the embedding and output tensors kept at q8_0. It's the only size I made originally, which is why I'm comparing it here as well.
**Only the first 50 chunks; I get NaN after that for some reason on main.
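The "Difference" column can be recomputed from the other two columns. Here is a minimal sketch of that arithmetic; the helper names `relative_change_percent` and `clearly_separated` are made up for illustration and are not part of llama.cpp. The separation check is a deliberately conservative assumption (gap larger than the sum of both reported +/- values), not a formal significance test.

```cpp
#include <cassert>
#include <cmath>

// Relative change of `branch_value` versus `main_value`, in percent.
// Negative means the branch value is lower (better, for perplexity).
double relative_change_percent(double main_value, double branch_value) {
    assert(main_value != 0.0);
    return (branch_value - main_value) / main_value * 100.0;
}

// True when two perplexity estimates differ by more than the sum of their
// reported +/- uncertainties (a conservative, illustrative criterion).
bool clearly_separated(double a, double a_err, double b, double b_err) {
    return std::fabs(a - b) > (a_err + b_err);
}
```

For the IQ2_XXS row, `relative_change_percent(5.4764, 4.3861)` comes out at roughly -19.9%, and even the noisier 50-chunk IQ1_M numbers pass `clearly_separated`.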

We get a massive change to IQ1_M with a pretty large size increase. However, compared to dense models, the final BPW is actually lower than that of other IQ1_M quants.

For example, Cohere's 111B model at IQ1_M ends up around ~1.93 BPW. DeepSeek with my changes comes to around ~1.84 BPW, so I probably even could have let it go a bit higher and still been within "IQ1_M" territory.
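As a rough check of the ~1.84 figure, BPW is just total bits divided by total weights. The sketch below makes two assumptions that are not stated above: that the sizes in the table are decimal gigabytes (1e9 bytes), and that DeepSeek V3 has about 671 billion parameters. The function name `bits_per_weight` is illustrative.

```cpp
#include <cassert>

// Bits per weight for a file of `size_gb` decimal gigabytes (1e9 bytes)
// holding `n_params_billion` billion weights. The 1e9 factors cancel.
double bits_per_weight(double size_gb, double n_params_billion) {
    assert(size_gb > 0.0 && n_params_billion > 0.0);
    return size_gb * 8.0 / n_params_billion;
}
```

Under those assumptions, `bits_per_weight(154.78, 671.0)` lands between 1.84 and 1.85, which is consistent with the number quoted for IQ1_M on this branch. If the sizes were GiB instead, the result would be higher.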

I also added a bit of love to the early ffn_gate and ffn_up for MoE models. I could take it or leave it if people are opposed, but I doubt the first 1/16 of layers at Q4 (and the next 1/16 at IQ3_S or Q2_K) will affect much.

Overall, I think these changes move MoE models, in particular DeepSeek, towards a more appropriate and logical place for overall BPW.
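To make the "first 1/16, next 1/16" idea concrete, here is a hypothetical sketch of such a tiering rule. The enum `FfnTier`, the function `ffn_tier_for_layer`, and the use of integer division for the cut-offs are all assumptions for illustration; the actual conditions in the diff may be written differently.

```cpp
#include <cassert>

// Hypothetical tiers; the names are illustrative, not ggml types.
enum class FfnTier { High, Mid, Base };

// Pick a tier for the ffn_gate/ffn_up tensors of layer `i_layer`.
// Assumption: both cut-offs use integer division of the layer count.
FfnTier ffn_tier_for_layer(int i_layer, int n_layers) {
    assert(n_layers > 0 && i_layer >= 0 && i_layer < n_layers);
    if (i_layer < n_layers / 16) return FfnTier::High;  // e.g. Q4
    if (i_layer < n_layers / 8)  return FfnTier::Mid;   // e.g. IQ3_S or Q2_K
    return FfnTier::Base;                               // the ftype's default
}
```

With 61 layers this rule would put layers 0 to 2 in the first tier and layers 3 to 6 in the second, so only a handful of layers get the larger types.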

Perplexity was calculated against wiki.test.raw with the following command:

./llama-perplexity -m /models/deepseek-ai_DeepSeek-V3-0324-Q2_K_L/deepseek-ai_DeepSeek-V3-0324-Q2_K_L-00001-of-0007.gguf -ctk q8_0 -fa --ctx-size 512 --ubatch-size 512 -f wiki.test.raw --seed 1337 --threads 120

These settings were chosen in part for a bit of a speedup, as well as for repeatability when comparing against another existing source of DeepSeek quant perplexities.

#define IQ3S_N_SCALE QK_K/64
// 3.4375 bpw
Contributor Author

It was bothering me that my IDE couldn't see the BPW from the docstring.
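For reference, the 3.4375 bpw figure follows from the size of one IQ3_S super-block. The sketch below mirrors the field layout of `block_iq3_s` in ggml-common.h as I understand it (an fp16 scale plus the `qs`, `qh`, `signs` and `scales` arrays); treat the field sizes as an assumption and check them against the header.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t QK_K         = 256;        // weights per super-block
constexpr std::size_t IQ3S_N_SCALE = QK_K / 64;  // 4 scale bytes

// Bytes in one super-block, field by field (assumed layout).
constexpr std::size_t iq3s_block_bytes =
      sizeof(std::uint16_t)  // d: fp16 super-block scale
    + QK_K / 4               // qs
    + QK_K / 32              // qh
    + QK_K / 8               // signs
    + IQ3S_N_SCALE;          // scales

constexpr double iq3s_bpw = iq3s_block_bytes * 8.0 / QK_K;

static_assert(iq3s_block_bytes == 110, "2 + 64 + 8 + 32 + 4 bytes");
static_assert(iq3s_bpw == 3.4375, "110 * 8 / 256");
```

The two `static_assert`s make the compiler confirm that 110 bytes per 256 weights is exactly 3.4375 bits per weight.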

@@ -119,6 +131,23 @@ static void llama_tensor_dequantize_impl(
workers.clear();
}

// Check if ftype is specifically IQ2_S or IQ2_M
static inline bool is_iq2s_or_iq2m(llama_ftype ftype) {
Contributor Author

bartowski1182 commented Apr 3, 2025

Since this is used all over the place, I made it an inline helper. Happy to change it back if changes like these are unwanted (same below with is_iq1_group and get_expert_exps_type).
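The hunk above is cut off after the signature, so here is a self-contained sketch of what helpers of this kind could look like. It uses a stand-in enum (`llama_ftype_sketch`, with made-up values) instead of the real `llama_ftype` from llama.h, and the body of `is_iq1_group` is an assumption based only on its name.

```cpp
// Stand-in for the llama_ftype values from llama.h; only the names needed
// here are declared, and the numeric values are NOT the real ones.
enum llama_ftype_sketch {
    FTYPE_IQ2_XXS, FTYPE_IQ2_XS, FTYPE_IQ2_S, FTYPE_IQ2_M,
    FTYPE_IQ1_S, FTYPE_IQ1_M, FTYPE_Q2_K,
};

// Check if ftype is specifically IQ2_S or IQ2_M
static inline bool is_iq2s_or_iq2m(llama_ftype_sketch ftype) {
    return ftype == FTYPE_IQ2_S || ftype == FTYPE_IQ2_M;
}

// Assumed meaning of "IQ1 group": the two IQ1 file types.
static inline bool is_iq1_group(llama_ftype_sketch ftype) {
    return ftype == FTYPE_IQ1_S || ftype == FTYPE_IQ1_M;
}
```

Naming the predicate keeps the long `||` chains out of the per-tensor branches, which is the readability gain the comment above is arguing for.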

github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) Apr 3, 2025

ubergarm commented Apr 3, 2025

I pulled and built this branch and successfully tested quantizing a V3-0324-Q2_K with default options. Given my bf16 has the MLA tensors, here is what we end up with:

| Quantization | Tensors | Names |
|---|---|---|
| f32 | 361 | |
| q8_0 | 64 | attn_kv_a_mqa, [0-2].attn_kv_b |
| q2_K | 361 | token_embd*, attn_q_b, attn_v_b, ffn_gate, ffn_up |
| q3_K | 119 | ffn_down |
| q4_K | 31 | [??].attn_kv_b |
| q5_K | 122 | attn_output, attn_q_a |
| q6_K | 28 | output*, [??].attn_kv_b |
| iq4_nl | 61 | attn_k_b |

I noticed that attn_kv_b switches between q6_K and q4_K every third layer, though it starts out as q8_0.
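One rule that would produce exactly this pattern has the shape of the `use_more_bits` heuristic in llama-quant.cpp. The version below is reproduced from memory, so verify it against the source, and whether attn_kv_b really goes through this rule on this branch is an assumption. `count_selected` is an illustrative helper.

```cpp
#include <cassert>

// Selects the first 1/8 and last 1/8 of layers, plus every third layer
// in between (same shape as use_more_bits; reproduced from memory).
bool use_more_bits(int i_layer, int n_layers) {
    return i_layer < n_layers / 8 ||
           i_layer >= 7 * n_layers / 8 ||
           (i_layer - n_layers / 8) % 3 == 2;
}

// Count the layers in [first, n_layers) that the rule selects.
int count_selected(int first, int n_layers) {
    assert(first >= 0 && first <= n_layers);
    int n = 0;
    for (int i = first; i < n_layers; ++i) {
        if (use_more_bits(i, n_layers)) ++n;
    }
    return n;
}
```

With 61 layers and the first three held at q8_0, the rule selects 27 of the remaining 58 layers and leaves 31. That matches the table: 31 tensors at q4_K, and 28 at q6_K once output is counted alongside the 27 attn_kv_b tensors.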

I added a note about specifying token_embd and output in actual usage.

Quantization Logs
# Probably should pass --token-embedding-type and --output-tensor-type explicitly for better quality output without much increase in size

./build/bin/llama-quantize \
    /mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf \
    /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q2_K-bartowski-mainline.gguf \
    Q2_K \
    24

main: build = 5029 (feae28b8)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: quantizing '/mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf' to '/mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q2_K-bartowski-mainline.gguf' as Q2_K using 24 threads
llama_model_loader: additional 29 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 49 key-value pairs and 1147 tensors from /mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V3 0324
llama_model_loader: - kv   3:                            general.version str              = V3-0324
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek
llama_model_loader: - kv   5:                         general.size_label str              = 256x21B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   8:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   9:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv  10:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  11:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  12:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  13:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  14: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  16:                          general.file_type u32              = 32
llama_model_loader: - kv  17:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  18:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  19:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  20:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  21:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  22:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  23:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  24:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  25:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  26:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  27:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  28:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  29:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  30:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  31:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  32: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  33: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,129280]  = ["...
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,129280]  = [3...
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,127741]  = ["...
llama_model_loader: - kv  39:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                                   split.no u16              = 0
llama_model_loader: - kv  47:                                split.count u16              = 30
llama_model_loader: - kv  48:                        split.tensors.count i32              = 1147
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type bf16:  786 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
[   1/1147]                        output.weight - [ 7168, 129280,     1,     1], type =   bf16, converting to q6_K .. size =  1767.50 MiB ->   724.95 MiB
[   2/1147]                   output_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[   3/1147]                    token_embd.weight - [ 7168, 129280,     1,     1], type =   bf16, converting to q2_K .. size =  1767.50 MiB ->   289.98 MiB
[   4/1147]                blk.0.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, 

llama_tensor_get_type : tensor cols 128 x 65536 are not divisible by 256, required for q2_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =    16.00 MiB ->     4.50 MiB
[   5/1147]           blk.0.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, converting to q8_0 .. size =     7.88 MiB ->     4.18 MiB
[   6/1147]          blk.0.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[   7/1147]               blk.0.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[   8/1147]               blk.0.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[   9/1147]             blk.0.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, converting to q5_K .. size =   224.00 MiB ->    77.00 MiB
[  10/1147]                blk.0.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, converting to q5_K .. size =    21.00 MiB ->     7.22 MiB
[  11/1147]           blk.0.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  12/1147]                blk.0.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, converting to q2_K .. size =    72.00 MiB ->    11.81 MiB
[  13/1147]                blk.0.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, converting to q2_K .. size =    16.00 MiB ->     2.62 MiB
[  14/1147]                blk.0.ffn_down.weight - [18432,  7168,     1,     1], type =   bf16, converting to q3_K .. size =   252.00 MiB ->    54.14 MiB
[  15/1147]                blk.0.ffn_gate.weight - [ 7168, 18432,     1,     1], type =   bf16, converting to q2_K .. size =   252.00 MiB ->    41.34 MiB
[  16/1147]                blk.0.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  17/1147]                  blk.0.ffn_up.weight - [ 7168, 18432,     1,     1], type =   bf16, converting to q2_K .. size =   252.00 MiB ->    41.34 MiB
[  18/1147]                blk.1.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, 

llama_tensor_get_type : tensor cols 128 x 65536 are not divisible by 256, required for q2_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =    16.00 MiB ->     4.50 MiB
[  19/1147]           blk.1.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, converting to q8_0 .. size =     7.88 MiB ->     4.18 MiB
[  20/1147]          blk.1.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  21/1147]               blk.1.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  22/1147]               blk.1.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  23/1147]             blk.1.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, converting to q5_K .. size =   224.00 MiB ->    77.00 MiB
[  24/1147]                blk.1.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, converting to q5_K .. size =    21.00 MiB ->     7.22 MiB
[  25/1147]           blk.1.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  26/1147]                blk.1.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, converting to q2_K .. size =    72.00 MiB ->    11.81 MiB
[  27/1147]                blk.1.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, converting to q2_K .. size =    16.00 MiB ->     2.62 MiB
[  28/1147]                blk.1.ffn_down.weight - [18432,  7168,     1,     1], type =   bf16, converting to q3_K .. size =   252.00 MiB ->    54.14 MiB
[  29/1147]                blk.1.ffn_gate.weight - [ 7168, 18432,     1,     1], type =   bf16, converting to q2_K .. size =   252.00 MiB ->    41.34 MiB
[  30/1147]                blk.1.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  31/1147]                  blk.1.ffn_up.weight - [ 7168, 18432,     1,     1], type =   bf16, converting to q2_K .. size =   252.00 MiB ->    41.34 MiB
[  32/1147]                blk.2.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, 

llama_tensor_get_type : tensor cols 128 x 65536 are not divisible by 256, required for q2_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =    16.00 MiB ->     4.50 MiB
[  33/1147]           blk.2.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, converting to q8_0 .. size =     7.88 MiB ->     4.18 MiB
[  34/1147]          blk.2.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  35/1147]               blk.2.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  36/1147]               blk.2.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  37/1147]             blk.2.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, converting to q5_K .. size =   224.00 MiB ->    77.00 MiB
[  38/1147]                blk.2.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, converting to q5_K .. size =    21.00 MiB ->     7.22 MiB
[  39/1147]           blk.2.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  40/1147]                blk.2.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, converting to q2_K .. size =    72.00 MiB ->    11.81 MiB
[  41/1147]                blk.2.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, converting to q2_K .. size =    16.00 MiB ->     2.62 MiB
[  42/1147]                blk.2.ffn_down.weight - [18432,  7168,     1,     1], type =   bf16, converting to q3_K .. size =   252.00 MiB ->    54.14 MiB
[  43/1147]                blk.2.ffn_gate.weight - [ 7168, 18432,     1,     1], type =   bf16, converting to q2_K .. size =   252.00 MiB ->    41.34 MiB
[  44/1147]                blk.2.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  45/1147]                  blk.2.ffn_up.weight - [ 7168, 18432,     1,     1], type =   bf16, converting to q2_K .. size =   252.00 MiB ->    41.34 MiB
[  46/1147]                blk.3.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, 

llama_tensor_get_type : tensor cols 128 x 65536 are not divisible by 256, required for q2_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =    16.00 MiB ->     4.50 MiB
[  47/1147]           blk.3.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, converting to q8_0 .. size =     7.88 MiB ->     4.18 MiB
[  48/1147]          blk.3.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  49/1147]               blk.3.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, converting to q6_K .. size =    32.00 MiB ->    13.12 MiB
[  50/1147]               blk.3.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  51/1147]             blk.3.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, converting to q5_K .. size =   224.00 MiB ->    77.00 MiB
[  52/1147]                blk.3.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, converting to q5_K .. size =    21.00 MiB ->     7.22 MiB
[  53/1147]           blk.3.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  54/1147]                blk.3.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, converting to q2_K .. size =    72.00 MiB ->    11.81 MiB
[  55/1147]                blk.3.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, converting to q2_K .. size =    16.00 MiB ->     2.62 MiB
[  56/1147]               blk.3.exp_probs_b.bias - [  256,     1,     1,     1], type =    f32, size =    0.001 MB
[  57/1147]           blk.3.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, converting to q3_K .. size =  7168.00 MiB ->  1540.00 MiB
[  58/1147]          blk.3.ffn_down_shexp.weight - [ 2048,  7168,     1,     1], type =   bf16, converting to q3_K .. size =    28.00 MiB ->     6.02 MiB
[  59/1147]           blk.3.ffn_gate_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, converting to q2_K .. size =  7168.00 MiB ->  1176.00 MiB
[  60/1147]            blk.3.ffn_gate_inp.weight - [ 7168,   256,     1,     1], type =    f32, size =    7.000 MB
[  61/1147]          blk.3.ffn_gate_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, converting to q2_K .. size =    28.00 MiB ->     4.59 MiB
[  62/1147]                blk.3.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  63/1147]             blk.3.ffn_up_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, converting to q2_K .. size =  7168.00 MiB ->  1176.00 MiB
[  64/1147]            blk.3.ffn_up_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, converting to q2_K .. size =    28.00 MiB ->     4.59 MiB
[  65/1147]                blk.4.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, 

llama_tensor_get_type : tensor cols 128 x 65536 are not divisible by 256, required for q2_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =    16.00 MiB ->     4.50 MiB
[  66/1147]           blk.4.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, converting to q8_0 .. size =     7.88 MiB ->     4.18 MiB
[  67/1147]          blk.4.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[  68/1147]               blk.4.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, converting to q6_K .. size =    32.00 MiB ->    13.12 MiB
[  69/1147]               blk.4.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  70/1147]             blk.4.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, converting to q5_K .. size =   224.00 MiB ->    77.00 MiB
[  71/1147]                blk.4.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, converting to q5_K .. size =    21.00 MiB ->     7.22 MiB
[  72/1147]           blk.4.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[  73/1147]                blk.4.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, converting to q2_K .. size =    72.00 MiB ->    11.81 MiB
[  74/1147]                blk.4.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, converting to q2_K .. size =    16.00 MiB ->     2.62 MiB
[  75/1147]               blk.4.exp_probs_b.bias - [  256,     1,     1,     1], type =    f32, size =    0.001 MB
[  76/1147]           blk.4.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, converting to q3_K .. size =  7168.00 MiB ->  1540.00 MiB
[  77/1147]          blk.4.ffn_down_shexp.weight - [ 2048,  7168,     1,     1], type =   bf16, converting to q3_K .. size =    28.00 MiB ->     6.02 MiB
[  78/1147]           blk.4.ffn_gate_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, converting to q2_K .. size =  7168.00 MiB ->  1176.00 MiB
[  79/1147]            blk.4.ffn_gate_inp.weight - [ 7168,   256,     1,     1], type =    f32, size =    7.000 MB
[  80/1147]          blk.4.ffn_gate_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, converting to q2_K .. size =    28.00 MiB ->     4.59 MiB
[  81/1147]                blk.4.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[  82/1147]             blk.4.ffn_up_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, converting to q2_K .. size =  7168.00 MiB ->  1176.00 MiB
[  83/1147]            blk.4.ffn_up_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, converting to q2_K .. size =    28.00 MiB ->     4.59 MiB

.
.
.

[1129/1147]               blk.60.attn_k_b.weight - [  128, 65536,     1,     1], type =   bf16, 

llama_tensor_get_type : tensor cols 128 x 65536 are not divisible by 256, required for q2_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =    16.00 MiB ->     4.50 MiB
[1130/1147]          blk.60.attn_kv_a_mqa.weight - [ 7168,   576,     1,     1], type =   bf16, converting to q8_0 .. size =     7.88 MiB ->     4.18 MiB
[1131/1147]         blk.60.attn_kv_a_norm.weight - [  512,     1,     1,     1], type =    f32, size =    0.002 MB
[1132/1147]              blk.60.attn_kv_b.weight - [  512, 32768,     1,     1], type =   bf16, converting to q6_K .. size =    32.00 MiB ->    13.12 MiB
[1133/1147]              blk.60.attn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[1134/1147]            blk.60.attn_output.weight - [16384,  7168,     1,     1], type =   bf16, converting to q5_K .. size =   224.00 MiB ->    77.00 MiB
[1135/1147]               blk.60.attn_q_a.weight - [ 7168,  1536,     1,     1], type =   bf16, converting to q5_K .. size =    21.00 MiB ->     7.22 MiB
[1136/1147]          blk.60.attn_q_a_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[1137/1147]               blk.60.attn_q_b.weight - [ 1536, 24576,     1,     1], type =   bf16, converting to q2_K .. size =    72.00 MiB ->    11.81 MiB
[1138/1147]               blk.60.attn_v_b.weight - [  512, 16384,     1,     1], type =   bf16, converting to q2_K .. size =    16.00 MiB ->     2.62 MiB
[1139/1147]              blk.60.exp_probs_b.bias - [  256,     1,     1,     1], type =    f32, size =    0.001 MB
[1140/1147]          blk.60.ffn_down_exps.weight - [ 2048,  7168,   256,     1], type =   bf16, converting to q3_K .. size =  7168.00 MiB ->  1540.00 MiB
[1141/1147]         blk.60.ffn_down_shexp.weight - [ 2048,  7168,     1,     1], type =   bf16, converting to q3_K .. size =    28.00 MiB ->     6.02 MiB
[1142/1147]          blk.60.ffn_gate_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, converting to q2_K .. size =  7168.00 MiB ->  1176.00 MiB
[1143/1147]           blk.60.ffn_gate_inp.weight - [ 7168,   256,     1,     1], type =    f32, size =    7.000 MB
[1144/1147]         blk.60.ffn_gate_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, converting to q2_K .. size =    28.00 MiB ->     4.59 MiB
[1145/1147]               blk.60.ffn_norm.weight - [ 7168,     1,     1,     1], type =    f32, size =    0.027 MB
[1146/1147]            blk.60.ffn_up_exps.weight - [ 7168,  2048,   256,     1], type =   bf16, converting to q2_K .. size =  7168.00 MiB ->  1176.00 MiB
[1147/1147]           blk.60.ffn_up_shexp.weight - [ 7168,  2048,     1,     1], type =   bf16, converting to q2_K .. size =    28.00 MiB ->     4.59 MiB
llama_model_quantize_impl: model size  = 1282038.27 MB
llama_model_quantize_impl: quant size  = 235685.20 MB
llama_model_quantize_impl: WARNING: 61 of 786 tensor(s) required fallback quantization

main: quantize time = 1333334.02 ms
main:    total time = 1333334.02 ms
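The per-tensor sizes in the log above can be reproduced from the block sizes of each type, which also shows why attn_k_b needs a fallback. This is a small sketch: the bytes-per-block values are the ggml block sizes as I know them and should be treated as assumptions, and `BlockInfo`, `row_fits` and `quantized_mib` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

// Weights per block and bytes per block for the types seen in the log.
struct BlockInfo { std::int64_t weights; std::int64_t bytes; };

constexpr BlockInfo Q2_K   {256,  84};
constexpr BlockInfo Q3_K   {256, 110};
constexpr BlockInfo Q5_K   {256, 176};
constexpr BlockInfo Q6_K   {256, 210};
constexpr BlockInfo Q8_0   { 32,  34};
constexpr BlockInfo IQ4_NL { 32,  18};

// A row can only be quantized if its length is a whole number of blocks.
bool row_fits(std::int64_t n_cols, BlockInfo t) {
    return n_cols % t.weights == 0;
}

// Size in MiB of a tensor with n_rows rows of n_cols weights in type t.
double quantized_mib(std::int64_t n_cols, std::int64_t n_rows, BlockInfo t) {
    assert(row_fits(n_cols, t));
    return double(n_cols / t.weights * t.bytes * n_rows) / (1024.0 * 1024.0);
}
```

For example, `quantized_mib(7168, 18432, Q2_K)` gives 41.34375, matching the 41.34 MiB logged for blk.0.ffn_gate. `row_fits(128, Q2_K)` is false, which is the condition behind the "not divisible by 256" warnings, while `quantized_mib(128, 65536, IQ4_NL)` gives the logged 4.50 MiB.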

I also ran a perplexity check on the resulting file over on ik_llama.cpp, given that I don't know how to prune out the MLA tensors and this branch is still open, so I didn't use that.

Perplexity Logs
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-perplexity \
    --model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q2_K-bartowski-mainline.gguf \
    -ctk q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    --seed 1337 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --threads 24

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
main: build = 3620 (2ee6263e)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: seed  = 1337
llama_model_loader: loaded meta data with 46 key-value pairs and 1147 tensors from /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q2_K-bartowski-mainline.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V3 0324
llama_model_loader: - kv   3:                            general.version str              = V3-0324
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek
llama_model_loader: - kv   5:                         general.size_label str              = 256x21B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   8:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   9:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv  10:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  11:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  12:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  13:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  14: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  16:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  17:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  18:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  19:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  20:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  21:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  22:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  23:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  24:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  25:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  26:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  27:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  28:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  29:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  30:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  31: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  32: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,129280]  = ["...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,129280]  = [3...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,127741]  = ["...
llama_model_loader: - kv  38:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  41:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  42:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - kv  45:                          general.file_type u32              = 10
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0:   64 tensors
llama_model_loader: - type q2_K:  361 tensors
llama_model_loader: - type q3_K:  119 tensors
llama_model_loader: - type q4_K:   31 tensors
llama_model_loader: - type q5_K:  122 tensors
llama_model_loader: - type q6_K:   28 tensors
llama_model_loader: - type iq4_nl:   61 tensors
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 129280
llm_load_print_meta: n_merges         = 127741
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18432
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 672.050 B
llm_load_print_meta: model size       = 230.161 GiB (2.942 BPW) 
llm_load_print_meta: repeating layers = 229.170 GiB (2.937 BPW, 670.196 B parameters)
llm_load_print_meta: general.name     = DeepSeek V3 0324
llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
llm_load_tensors: ggml ctx size =    0.93 MiB
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:        CPU buffer size = 233761.45 MiB
llm_load_tensors:        CPU buffer size =   289.98 MiB
llm_load_tensors:      CUDA0 buffer size =  9659.23 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 2
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init:      CUDA0 KV buffer size =    72.94 MiB
llama_new_context_with_model: KV self size  =   72.91 MiB, c^KV (q8_0):   72.91 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.97 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1787.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    50.01 MiB
llama_new_context_with_model: graph nodes  = 3548
llama_new_context_with_model: graph splits = 118

system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 601.086 ms
perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 33.22 seconds per pass - ETA 1 hours 17.65 minutes
[1]2.8025,[2]3.4950,[3]2.5232,[4]2.1165,[5]1.9401,[6]1.8205,[7]1.7292,[8]1.6705,[9]1.6357,[10]1.5911,[11]1.5945,[12]1.6806,[13]1.7071,[14]1.8407,[15]1.9910,[16]2.0411,[17]2.2135,[18]2.3464,[19]2.2972,[20]2.3034,[21]2.4065,[22]2.3677,[23]2.3250,[24]2.3390,[25]2.3089,[26]2.2750,[27]2.3238,[28]2.3344,[29]2.3897,[30]2.4185,[31]2.4574,[32]2.4768,[33]2.5232,[34]2.5756,[35]2.6311,[36]2.6908,[37]2.7204,[38]2.7764,[39]2.8207,[40]2.8853,[41]2.9276,[42]2.9347,[43]2.9889,[44]3.0013,[45]3.0868,[46]3.1387,[47]3.1005,[48]3.0553,[49]3.0419,[50]3.0599,[51]3.1112,[52]3.1245,[53]3.1799,[54]3.1975,[55]3.2318,[56]3.2628,[57]3.2862,[58]3.3234,[59]3.3280,[60]3.3765,[61]3.4200,[62]3.4797,[63]3.5140,[64]3.5619,[65]3.5723,[66]3.5617,[67]3.5417,[68]3.5698,[69]3.5717,[70]3.5880,[71]3.6039,[72]3.6192,[73]3.6316,[74]3.6530,[75]3.6277,[76]3.5748,[77]3.5320,[78]3.5274,[79]3.5120,[80]3.5074,[81]3.4670,[82]3.4734,[83]3.4499,[84]3.4181,[85]3.3916,[86]3.3697,[87]3.3781,[88]3.3542,[89]3.3433,[90]3.3220,[91]3.3046,[92]3.2844,[93]3.2595,[94]3.2435,[95]3.2253,[96]3.2312,[97]3.2416,[98]3.2290,[99]3.2105,[100]3.2125,[101]3.2031,[102]3.2207,[103]3.2487,[104]3.2686,[105]3.2637,[106]3.2917,[107]3.3176,[108]3.3380,[109]3.3719,[110]3.4076,[111]3.4303,[112]3.4012,[113]3.3860,[114]3.3654,[115]3.3477,[116]3.3417,[117]3.3169,[118]3.2918,[119]3.2711,[120]3.2502,[121]3.2330,[122]3.2097,[123]3.1909,[124]3.1687,[125]3.1495,[126]3.1300,[127]3.1173,[128]3.1132,[129]3.1027,[130]3.0943,[131]3.0871,[132]3.0914,[133]3.1001,[134]3.1062,[135]3.1179,[136]3.1354,[137]3.1507,[138]3.1584,[139]3.1699,[140]3.1668,[141]3.1660,[142]3.1615,[143]3.1603,[144]3.1538,[145]3.1428,[146]3.1390,[147]3.1412,[148]3.1392,[149]3.1389,[150]3.1298,[151]3.1261,[152]3.1217,[153]3.1146,[154]3.1123,[155]3.1153,[156]3.1140,[157]3.1184,[158]3.1260,[159]3.1284,[160]3.1368,[161]3.1447,[162]3.1542,[163]3.1616,[164]3.1846,[165]3.2110,[166]3.2306,[167]3.2450,[168]3.2722,[169]3.2975,[170]3.3224,[171]3.3468,[172]3.3273,[173]3.3069,[174]3.2921,[175]3.2808,[176]3.2
698,[177]3.2591,[178]3.2451,[179]3.2316,[180]3.2353,[181]3.2506,[182]3.2674,[183]3.2837,[184]3.2976,[185]3.3071,[186]3.3236,[187]3.3398,[188]3.3553,[189]3.3663,[190]3.3662,[191]3.3728,[192]3.3742,[193]3.3768,[194]3.3982,[195]3.4077,[196]3.4212,[197]3.4303,[198]3.4336,[199]3.4378,[200]3.4339,[201]3.4490,[202]3.4417,[203]3.4468,[204]3.4485,[205]3.4484,[206]3.4503,[207]3.4589,[208]3.4691,[209]3.4778,[210]3.4764,[211]3.4694,[212]3.4689,[213]3.4771,[214]3.4778,[215]3.4829,[216]3.4817,[217]3.4751,[218]3.4746,[219]3.4745,[220]3.4724,[221]3.4728,[222]3.4711,[223]3.4711,[224]3.4767,[225]3.4782,[226]3.4684,[227]3.4672,[228]3.4674,[229]3.4703,[230]3.4773,[231]3.4831,[232]3.4726,[233]3.4663,[234]3.4689,[235]3.4711,[236]3.4806,[237]3.4895,[238]3.4992,[239]3.5100,[240]3.5196,[241]3.5314,[242]3.5471,[243]3.5604,[244]3.5688,[245]3.5817,[246]3.5931,[247]3.5907,[248]3.5852,[249]3.5815,[250]3.5738,[251]3.5697,[252]3.5701,[253]3.5728,[254]3.5794,[255]3.5853,[256]3.5873,[257]3.5891,[258]3.5897,[259]3.5921,[260]3.5937,[261]3.5946,[262]3.5920,[263]3.5970,[264]3.5998,[265]3.5994,[266]3.6004,[267]3.6019,[268]3.6055,[269]3.6084,[270]3.6057,[271]3.6033,[272]3.5948,[273]3.5959,[274]3.5889,[275]3.5772,[276]3.5667,[277]3.5678,[278]3.5789,[279]3.5853,[280]3.5933,[281]3.6007,[282]3.6066,[283]3.6143,[284]3.6199,[285]3.6349,[286]3.6366,[287]3.6388,[288]3.6434,[289]3.6446,[290]3.6356,[291]3.6274,[292]3.6293,[293]3.6306,[294]3.6292,[295]3.6279,[296]3.6305,[297]3.6308,[298]3.6368,[299]3.6442,[300]3.6463,[301]3.6498,[302]3.6523,[303]3.6535,[304]3.6518,[305]3.6639,[306]3.6710,[307]3.6829,[308]3.6701,[309]3.6652,[310]3.6562,[311]3.6603,[312]3.6635,[313]3.6690,[314]3.6707,[315]3.6736,[316]3.6742,[317]3.6759,[318]3.6765,[319]3.6767,[320]3.6809,[321]3.6803,[322]3.6810,[323]3.6868,[324]3.6874,[325]3.6926,[326]3.6974,[327]3.7014,[328]3.7038,[329]3.7050,[330]3.7116,[331]3.7153,[332]3.7189,[333]3.7170,[334]3.7163,[335]3.7158,[336]3.7148,[337]3.7152,[338]3.7151,[339]3.7177,[340]3.7209,[341]3.7261,[342]3.7350,[343
]3.7449,[344]3.7502,[345]3.7429,[346]3.7355,[347]3.7329,[348]3.7260,[349]3.7228,[350]3.7218,[351]3.7274,[352]3.7436,[353]3.7531,[354]3.7673,[355]3.7773,[356]3.7835,[357]3.7955,[358]3.8071,[359]3.8099,[360]3.8165,[361]3.8268,[362]3.8362,[363]3.8415,[364]3.8488,[365]3.8552,[366]3.8663,[367]3.8753,[368]3.8829,[369]3.8911,[370]3.9003,[371]3.9156,[372]3.9243,[373]3.9270,[374]3.9300,[375]3.9348,[376]3.9479,[377]3.9598,[378]3.9619,[379]3.9613,[380]3.9579,[381]3.9624,[382]3.9680,[383]3.9715,[384]3.9762,[385]3.9802,[386]3.9862,[387]3.9924,[388]3.9954,[389]3.9831,[390]3.9732,[391]3.9619,[392]3.9558,[393]3.9467,[394]3.9379,[395]3.9289,[396]3.9184,[397]3.9090,[398]3.8984,[399]3.8866,[400]3.8782,[401]3.8672,[402]3.8555,[403]3.8457,[404]3.8340,[405]3.8231,[406]3.8114,[407]3.8008,[408]3.7918,[409]3.7829,[410]3.7758,[411]3.7773,[412]3.7727,[413]3.7771,[414]3.7805,[415]3.7778,[416]3.7785,[417]3.7806,[418]3.7750,[419]3.7767,[420]3.7732,[421]3.7722,[422]3.7741,[423]3.7737,[424]3.7781,[425]3.7777,[426]3.7782,[427]3.7772,[428]3.7802,[429]3.7815,[430]3.7842,[431]3.7857,[432]3.7847,[433]3.7809,[434]3.7816,[435]3.7750,[436]3.7691,[437]3.7649,[438]3.7627,[439]3.7620,[440]3.7674,[441]3.7729,[442]3.7806,[443]3.7787,[444]3.7781,[445]3.7790,[446]3.7841,[447]3.7867,[448]3.7888,[449]3.7918,[450]3.7954,[451]3.7986,[452]3.8010,[453]3.8030,[454]3.8016,[455]3.8040,[456]3.8037,[457]3.8060,[458]3.8110,[459]3.8111,[460]3.8108,[461]3.8064,[462]3.8098,[463]3.8177,[464]3.8227,[465]3.8164,[466]3.8154,[467]3.8146,[468]3.8173,[469]3.8144,[470]3.8118,[471]3.8124,[472]3.8135,[473]3.8124,[474]3.8114,[475]3.8126,[476]3.8113,[477]3.8101,[478]3.8109,[479]3.8134,[480]3.8158,[481]3.8113,[482]3.8153,[483]3.8140,[484]3.8172,[485]3.8239,[486]3.8273,[487]3.8312,[488]3.8368,[489]3.8389,[490]3.8438,[491]3.8505,[492]3.8553,[493]3.8553,[494]3.8561,[495]3.8583,[496]3.8603,[497]3.8634,[498]3.8638,[499]3.8631,[500]3.8670,[501]3.8716,[502]3.8706,[503]3.8687,[504]3.8710,[505]3.8738,[506]3.8821,[507]3.8850,[508]3.8886,[509]3.8796,
[510]3.8752,[511]3.8695,[512]3.8650,[513]3.8596,[514]3.8589,[515]3.8613,[516]3.8567,[517]3.8572,[518]3.8567,[519]3.8570,[520]3.8618,[521]3.8596,[522]3.8579,[523]3.8641,[524]3.8628,[525]3.8611,[526]3.8564,[527]3.8504,[528]3.8483,[529]3.8444,[530]3.8411,[531]3.8378,[532]3.8309,[533]3.8243,[534]3.8206,[535]3.8217,[536]3.8242,[537]3.8277,[538]3.8308,[539]3.8344,[540]3.8399,[541]3.8442,[542]3.8476,[543]3.8426,[544]3.8388,[545]3.8383,[546]3.8305,[547]3.8246,[548]3.8174,[549]3.8109,[550]3.8055,[551]3.8000,[552]3.7940,[553]3.7892,[554]3.7893,[555]3.7871,[556]3.7900,[557]3.7944,[558]3.8006,[559]3.8050,[560]3.8109,[561]3.8081,
llama_print_timings:        load time =   62019.65 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 4620095.85 ms / 287232 tokens (   16.08 ms per token,    62.17 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 4623706.39 ms / 287233 tokens

Final estimate: PPL = 3.8081 +/- 0.02183

@bartowski1182
Contributor Author

@ubergarm right, I should have mentioned that the Q2_K in my comparison is made with the embedding and output tensors kept at q8_0, which may account for the PPL difference

@bartowski1182
Contributor Author

@ubergarm

Re: attn_kv_b, yes, it's using the helper function use_more_bits. I'm not sure why it elects to use more bits every third layer, but I assume there's some method to the madness:

https://github.com/ggml-org/llama.cpp/pull/12727/files#diff-bcc71d02e033731949e3b56973150c7385b3e2a1f6d62d34557fce3d4c0dec22R158

It uses more bits for the first 1/8 of layers, every third layer, and the last 1/8.

I also added a few extra provisions to use even MORE bits for the first 1/16 of layers, since those seem to be really impactful for DeepSeek.
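
For readers following along, here is a minimal sketch of that layer selection. The `use_more_bits` expression is written from the description above (first 1/8, last 1/8, every third layer in between); check the linked diff for the exact condition. The `precision_tier` wrapper and its tier numbers are hypothetical and only illustrate how the extra 1/16 rule could sit on top of it.

```cpp
// Sketch of the layer-selection rule described above (not copied from the PR).
// True for the first 1/8 of layers, the last 1/8, and every third layer
// in between. All divisions are integer divisions.
constexpr bool use_more_bits(int i_layer, int n_layers) {
    return i_layer < n_layers / 8
        || i_layer >= 7 * n_layers / 8
        || (i_layer - n_layers / 8) % 3 == 2;
}

// Hypothetical wrapper, for illustration only:
//   2 = highest precision (first 1/16 of layers)
//   1 = "more bits" according to use_more_bits
//   0 = base type for the chosen ftype
constexpr int precision_tier(int i_layer, int n_layers) {
    return i_layer < n_layers / 16 ? 2
         : use_more_bits(i_layer, n_layers) ? 1
         : 0;
}
```

With the 61 layers reported in the log, integer division gives 61/16 = 3 and 61/8 = 7, so under these assumptions layers 0-2 land in the top tier, layers 3-6, 53-60 and 9, 12, 15, ... get more bits, and the rest use the base type. That is one way a single tensor such as attn_kv_b could end up at three different quant levels.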

@ubergarm

ubergarm commented Apr 3, 2025

> @ubergarm right I should have mentioned that Q2_K in my comparison is made with embeddings and output tensors kept at q8_0, which may account for the PPL difference

Aye, makes sense! So the new mix definitely looks better and fits in just 10 GiB of VRAM using -ot exps=CPU (compared to my 17.33 GiB VRAM version that uses q8_0 for all GPU tensors).
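
As a rough illustration of why `-ot exps=CPU` moves only the routed experts, here is a small sketch. It assumes the part before `=` is treated as a regular expression that is searched within each tensor name, which is consistent with the "buffer type overriden to CPU" lines in the log; `matches_override` is a hypothetical helper, not a llama.cpp function.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if a tensor name would be caught by an override
// pattern, assuming the pattern is applied as a regex *search* (a match
// anywhere in the name), not a full-string match.
bool matches_override(const std::string & tensor_name, const std::string & pattern) {
    return std::regex_search(tensor_name, std::regex(pattern));
}
```

Under that assumption, `exps` matches names such as `blk.3.ffn_gate_exps.weight` but not a shared-expert name like `blk.3.ffn_down_shexp.weight`, so the shared experts and attention tensors would stay on the GPU.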

The only odd bit that stuck out to me in the logs was the handling of the attention tensors: attn_kv_b seemed to get bounced around between 3 different quant levels, and attn_k_b is falling back to iq4_nl due to its small 128 cols. That said, MLA and the best attention quant levels are still an ongoing discussion at the moment: #12725 (comment)

The only other thing I'd consider is bumping up the quality of the shexp shared experts, since they take up a very small percentage of the overall size yet are in the compute path for every token.

Great progress!

@bartowski1182
Contributor Author

bartowski1182 commented Apr 3, 2025

> The only odd bit that stuck out to me in the logs was dealing with the attention tensors (specifically attn_kv_b seemed to get bounced around between 3 different quant levels, and attn_k_b is falling back to iq4_nl due to small 128 cols.).

Didn't notice the iq4_nl, but I must not have been paying enough attention (heh)

The bouncing is what I explained above: I force the first 1/16 of layers to high precision, then use the use_more_bits helper function, which forces the first 1/8, last 1/8, and every third layer to higher bits. Why every third? Couldn't tell you..

As for shexp, I did consider that a little. The problem is that currently they're being caught by the overall ffn_up/down/gate name matcher, so I could add 3 more catches for ffn_up_shexp/ffn_down_shexp/ffn_gate_shexp, especially since, as you said, they contribute a negligible amount to the final size.. but I've already added SO many lines of code to this function, which was already bloated to begin with..

I'll do it locally and run another PPL (which will take another ~6 hours of course) to see what can be gained. I know my Q2_K is a bit below the BPW of most dense models (~2.95 versus ~3.06 for dense), so if the quality gain per added size is massive then I'll make the argument for including it
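
For reference, the BPW figure can be reproduced from the model size and parameter count. The sketch below assumes BPW is simply total bits divided by total parameters, with the size reported in GiB; the function and constant names are made up for illustration.

```cpp
// Bits per weight from a model size in GiB and a parameter count.
// Assumes BPW = (size in bytes * 8) / number of parameters.
constexpr double bits_per_weight(double size_gib, double n_params) {
    return size_gib * 1024.0 * 1024.0 * 1024.0 * 8.0 / n_params;
}

// Figures from the Q2_K log earlier in this thread:
// model size 230.161 GiB, 672.050 B parameters.
constexpr double kLoggedQ2KBpw = bits_per_weight(230.161, 672.050e9);  // about 2.94
```

That agrees with the 2.942 BPW printed by the loader, and shows how each additional GiB on a model of this size moves BPW by only about 0.013.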

@jukofyork
Contributor

Subscribed - glad someone is looking into this!

@bartowski1182
Contributor Author

bartowski1182 commented Apr 3, 2025

Okay, finished another quant for Q2_K_L with a bit more weight on the shexp: size increased by 1.3GB, and the PPL score went down another 0.07

Questionable if it's worth it, but it is now 2.97 BPW, which again is even closer to what Q2_K should be when comparing it to dense models..

That means the difference in size is now 3.97GB for a 0.2987 improvement in PPL, so a 1.6% size increase for a 7.7% improvement.. sounds worth it..?
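
The percentages can be checked with a couple of lines. This is only a sketch of the arithmetic, using the Q2_K_L sizes and perplexities from the table in the PR description; the function and constant names are made up for illustration.

```cpp
// Relative change of `after` versus `before`, in percent.
constexpr double pct_change(double before, double after) {
    return (after - before) / before * 100.0;
}

// Q2_K_L figures from the table in the PR description (main vs this branch).
constexpr double kSizePct = pct_change(244.93, 248.90);  // size in GB, about +1.6
constexpr double kPplPct  = pct_change(3.9012, 3.6025);  // perplexity, about -7.7
```

Note that the two percentages are relative to different baselines (file size and perplexity on main), so they are best read side by side rather than as a single ratio.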

Labels
ggml changes relating to the ggml tensor library for machine learning