Summary
ggml_mul_mat_id produces degenerate output when the per-expert weight stack (as) is K-quantized (Q4_K, Q5_K, Q6_K). F32, F16, and Q8_0 sources work correctly. Reproduced on CPU with vendored ggml; we believe CUDA is also affected but have not isolated that yet.
What we see
Two GGUFs of OLMoE-1B-7B-Instruct, identical model, identical inference code, only the expert weight dtype changes:
| GGUF | expert dtype | output for "The capital of France is" |
|---|---|---|
| Q4_K_M (from Meshwa/OLMoE-1b-7b-0924-Instruct-gguf) | mostly Q4_K, some Q6_K | "Dub Dub Dub Dub Dub Dub Dub Dub" |
| q8_0 (same repo) | Q8_0 | "called Paris." |
Both runs use:
- Same router weights (F16)
- Attention weights at each file's own quantization (Q4_K for one, Q8_0 for the other), which go through ggml_mul_mat rather than mul_mat_id
- Same model graph:
router · x → softmax → top_k → 3× mul_mat_id → silu·up → broadcast-weight → sum-across-K
On the mul_mat_id path, the only difference between the two runs is the dtype of the expert weight stack (the as argument). Q4_K produces degenerate output; Q8_0 produces coherent, factual text.
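For readers less familiar with the op, here is a plain-F32 sketch of what we expect each mul_mat_id call in the graph to compute: for every token and every selected slot, pick the expert matrix named by the id and multiply it with that slot's input vector. This is our reading of the op's contract, not ggml's kernel. The function name mul_mat_id_ref, the row-major layout, and the flat std::vector arguments are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical F32 reference for mul_mat_id (illustration only, not ggml code).
//   as  : n_expert matrices, each rows x cols, row-major, stacked contiguously
//   b   : n_tokens x n_used input vectors, each of length cols
//   ids : n_tokens x n_used expert indices into the stack
// Returns n_tokens x n_used output vectors, each of length rows.
std::vector<float> mul_mat_id_ref(const std::vector<float>&   as,
                                  const std::vector<float>&   b,
                                  const std::vector<int32_t>& ids,
                                  int n_expert, int rows, int cols,
                                  int n_used, int n_tokens) {
    assert(as.size()  == (std::size_t) n_expert * rows * cols);
    assert(b.size()   == (std::size_t) n_tokens * n_used * cols);
    assert(ids.size() == (std::size_t) n_tokens * n_used);

    std::vector<float> out((std::size_t) n_tokens * n_used * rows, 0.0f);
    for (int t = 0; t < n_tokens; ++t) {
        for (int u = 0; u < n_used; ++u) {
            const std::size_t slot = (std::size_t) t * n_used + u;
            const int32_t e = ids[slot];
            assert(e >= 0 && e < n_expert);  // an out-of-range id would read another expert

            const float* W = as.data() + (std::size_t) e * rows * cols;  // expert e's matrix
            const float* x = b.data()  + slot * cols;                    // this slot's input
            float*       y = out.data() + slot * rows;                   // this slot's output

            for (int r = 0; r < rows; ++r) {
                float sum = 0.0f;
                for (int c = 0; c < cols; ++c) {
                    sum += W[(std::size_t) r * cols + c] * x[c];
                }
                y[r] = sum;
            }
        }
    }
    return out;
}
```

A reference like this is what the Q4_K and Q8_0 outputs could be compared against after dequantizing the expert weights to F32.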
Why we suspect a kernel gap
tests/test-backend-ops.cpp::test_mul_mat_id only exercises mul_mat_id with F32, F16, and Q8_0 source weights (see test registrations around lines 8264, 8408, 8438 in HEAD as of this writing). K-quants aren't in the test matrix. Our observation is consistent with mul_mat_id either:
- Falling back to a wrong arithmetic path for K-quant sources, or
- Mis-indexing into the K-quant block layout (K-quants have a hierarchical scale + per-sub-block offset structure that's significantly different from the Q4_0 / Q8_0 per-block scale).
ggml_mul_mat (non-id) does work with K-quant sources on similar tensor shapes — we have OLMoE's attention weights at Q4_K on the same model and the per-head Q/K/V matmuls produce correct output. So K-quants aren't broken in general; just specifically through the mul_mat_id dispatch.
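To make the block-layout point concrete, the sketch below mirrors our reading of the Q8_0 and Q4_K block structs in ggml-common.h and works out the byte offset of one expert's slice in the stacked tensor. The struct and function names (block_q8_0_sketch, block_q4_K_sketch, row_bytes, expert_offset) are ours, and the field sizes should be checked against the vendored headers. It shows the arithmetic an indexing bug would get wrong; it is not a claim about where the bug actually is.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed mirror of ggml's Q8_0 block: one F16 scale for 32 int8 values.
struct block_q8_0_sketch {
    uint16_t d;       // F16 scale, stored as raw bits
    int8_t   qs[32];  // quantized values
};

// Assumed mirror of ggml's Q4_K block: a super-block of 256 values split into
// 8 sub-blocks, each with its own 6-bit scale and 6-bit min.
struct block_q4_K_sketch {
    uint16_t d;           // F16 super-block scale applied to the sub-block scales
    uint16_t dmin;        // F16 super-block scale applied to the sub-block mins
    uint8_t  scales[12];  // 8 scales + 8 mins, packed at 6 bits each
    uint8_t  qs[128];     // 256 values at 4 bits each
};

static_assert(sizeof(block_q8_0_sketch) == 34,  "Q8_0: 34 bytes per 32 values");
static_assert(sizeof(block_q4_K_sketch) == 144, "Q4_K: 144 bytes per 256 values");

// Bytes in one row of k values; k must be a whole number of blocks.
inline std::size_t row_bytes(std::size_t k, std::size_t block_elems, std::size_t block_bytes) {
    assert(k % block_elems == 0);
    return (k / block_elems) * block_bytes;
}

// Byte offset of expert e in a contiguous stack of [rows x k] matrices.
inline std::size_t expert_offset(std::size_t e, std::size_t rows, std::size_t bytes_per_row) {
    return e * rows * bytes_per_row;
}
```

With k = 4096 this gives 2304 bytes per Q4_K row versus 4352 bytes per Q8_0 row, so a code path that stepped through a K-quant stack using the 32-element block geometry would land on the wrong bytes for every expert after the first.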
Suggested test addition
// tests/test-backend-ops.cpp, in init_gguf_tests or similar:
for (auto t : {GGML_TYPE_Q4_K, GGML_TYPE_Q5_K, GGML_TYPE_Q6_K}) {
    test_cases.emplace_back(new test_mul_mat_id(t, GGML_TYPE_F32,
        /*n_mats*/ 8, /*n_used*/ 2, /*b (broadcast)*/ false,
        /*m*/ 2048, /*n*/ 1, /*k*/ 4096));
}
These cases would surface the bug we hit. (Concrete shape numbers above are OLMoE-shaped.)
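For context on how such a case would fail: as far as we can tell, test-backend-ops compares outputs by normalized mean squared error, and a standalone reproducer could apply the same metric against an F32 reference. A minimal sketch of that metric follows. The function name nmse and the zero-reference fallback are our own choices, not copied from the test file, and any pass/fail threshold would need to be taken from the test harness.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Normalized mean squared error between a reference and a candidate output:
// sum((out - ref)^2) / sum(ref^2). Accumulates in double to limit rounding.
double nmse(const std::vector<float>& ref, const std::vector<float>& out) {
    assert(ref.size() == out.size());
    double err  = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    // If the reference is all zeros, fall back to the unnormalized error.
    return norm > 0.0 ? err / norm : err;
}
```

A correct quantized path should give a small value relative to the F32 reference, while output as degenerate as the "Dub Dub Dub" run would be expected to sit far above any reasonable threshold.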
Reproducer
Project: https://github.com/OriPekelman/toy. The full inference path is wrapped in Ruby + ggml FFI; the failing path is documented in our notes at docs/notes/mul_mat_id_quants.md (commit upcoming). Happy to extract a minimal C reproducer if useful.
The user-side repro is:
# Setup: download both GGUFs from Meshwa/OLMoE-1b-7b-0924-Instruct-gguf
# Build: make example_inference
# Run:
GGUF=OLMoE-1b-7b-0924-Instruct-Q4_K_M.gguf \
  PROMPT='The capital of France is' N_NEW=8 ./examples/example_inference
# → "The capital of France is Dub Dub Dub Dub Dub Dub Dub Dub"
GGUF=OLMoE-1b-7b-0924-Instruct-q8_0.gguf \
  PROMPT='The capital of France is' N_NEW=8 ./examples/example_inference
# → "The capital of France is called Paris."
Workaround
Quantize the expert weights of MoE models to Q8_0 when converting. Non-expert weights (embeddings, attention, norms) can stay at K-quants.