mul_mat_id produces wrong output with K-quantized source weights #1506

@OriPekelman

Summary

ggml_mul_mat_id produces degenerate output when the per-expert weight stack (as) is K-quantized (Q4_K, Q5_K, Q6_K). F32, F16, and Q8_0 sources work correctly. Reproduced on CPU with vendored ggml; we believe CUDA is also affected but haven't isolated it yet.

What we see

Two GGUFs of OLMoE-1B-7B-Instruct, identical model, identical inference code, only the expert weight dtype changes:

| GGUF | expert dtype | output for "The capital of France is" |
|---|---|---|
| Q4_K_M (from Meshwa/OLMoE-1b-7b-0924-Instruct-gguf) | mostly Q4_K, some Q6_K | "Dub Dub Dub Dub Dub Dub Dub Dub" |
| q8_0 (same repo) | Q8_0 | "called Paris." |

Both runs use:

  • Same router weights (F16)
  • Attention weights from the same checkpoint (Q4_K in one file, Q8_0 in the other)
  • Same model graph: router · x → softmax → top_k → 3× mul_mat_id → silu·up → broadcast-weight → sum-across-K

Apart from the attention dtype, the only difference is the dtype of the expert weights passed as mul_mat_id's as argument. Q4_K produces degenerate output; Q8_0 produces coherent factual text.
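For reference, here is a minimal F32 sketch of what we expect the mul_mat_id step in that graph to compute. It is not ggml's implementation: the name mul_mat_id_ref, the flat row-major layout, and the choice to feed every selected expert the same input vector are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Simplified F32 reference for the expected mul_mat_id result (hypothetical helper).
// experts: n_expert matrices of shape rows x cols, row-major, stacked contiguously.
// x:       n_tokens input vectors of length cols.
// ids:     n_tokens * n_used expert indices, grouped per token.
// returns: n_tokens * n_used output vectors of length rows.
std::vector<float> mul_mat_id_ref(const std::vector<float>& experts,
                                  std::size_t rows, std::size_t cols,
                                  const std::vector<float>& x,
                                  const std::vector<int>& ids,
                                  std::size_t n_tokens, std::size_t n_used) {
    std::vector<float> out(n_tokens * n_used * rows, 0.0f);
    for (std::size_t t = 0; t < n_tokens; ++t) {
        const float* xv = &x[t * cols];
        for (std::size_t s = 0; s < n_used; ++s) {
            // Select the expert matrix chosen by the router for this slot.
            const std::size_t e = static_cast<std::size_t>(ids[t * n_used + s]);
            const float* w = &experts[e * rows * cols];
            float* o = &out[(t * n_used + s) * rows];
            for (std::size_t r = 0; r < rows; ++r) {
                float acc = 0.0f;
                for (std::size_t c = 0; c < cols; ++c) {
                    acc += w[r * cols + c] * xv[c];
                }
                o[r] = acc;
            }
        }
    }
    return out;
}
```

Dequantizing the expert stack to F32 and running it through a reference like this gives a baseline to compare the quantized path against.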

Why we suspect a kernel gap

tests/test-backend-ops.cpp::test_mul_mat_id only exercises mul_mat_id with F32, F16, and Q8_0 source weights (see test registrations around lines 8264, 8408, 8438 in HEAD as of this writing). K-quants aren't in the test matrix. Our observation is consistent with mul_mat_id either:

  • Falling back to a wrong arithmetic path for K-quant sources, or
  • Mis-indexing into the K-quant block layout (K-quants have a hierarchical scale + per-sub-block offset structure that's significantly different from the Q4_0 / Q8_0 per-block scale).
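To make the layout difference concrete, the sketch below contrasts a flat per-block scale with a two-level scale-plus-minimum scheme. Both structs are hypothetical simplifications: they use float scales and unpacked values, whereas ggml's real blocks pack these fields more tightly, so treat this as an illustration of the arithmetic only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical flat block in the spirit of Q8_0: one scale for 32 values.
struct FlatBlock {
    float d;                        // per-block scale
    std::array<std::int8_t, 32> q;  // quantized values
};

// Hypothetical two-level block in the spirit of Q4_K: a 256-value super-block
// split into 8 sub-blocks of 32, each with its own scale and minimum.
struct TwoLevelBlock {
    float d;                               // super-block scale for sub-block scales
    float dmin;                            // super-block scale for sub-block minimums
    std::array<std::uint8_t, 8> sub_scale; // per-sub-block scale (unpacked here)
    std::array<std::uint8_t, 8> sub_min;   // per-sub-block minimum (unpacked here)
    std::array<std::uint8_t, 256> q;       // 4-bit values, stored one per byte here
};

// Flat scheme: value = d * q.
inline float dequant_flat(const FlatBlock& b, std::size_t i) {
    return b.d * static_cast<float>(b.q[i]);
}

// Two-level scheme: value = d * sub_scale[s] * q - dmin * sub_min[s],
// where s is the sub-block that element i belongs to.
inline float dequant_two_level(const TwoLevelBlock& b, std::size_t i) {
    const std::size_t s = i / 32;
    return b.d * static_cast<float>(b.sub_scale[s]) * static_cast<float>(b.q[i])
         - b.dmin * static_cast<float>(b.sub_min[s]);
}
```

A kernel that applies the flat formula to two-level data, or that computes the per-expert byte stride from the wrong block size, would produce garbage for every expert, which would match the symptom we see.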

ggml_mul_mat (non-id) does work with K-quant sources on similar tensor shapes — we have OLMoE's attention weights at Q4_K on the same model and the per-head Q/K/V matmuls produce correct output. So K-quants aren't broken in general; just specifically through the mul_mat_id dispatch.

Suggested test addition

// tests/test-backend-ops.cpp, in init_gguf_tests or similar:
for (auto t : {GGML_TYPE_Q4_K, GGML_TYPE_Q5_K, GGML_TYPE_Q6_K}) {
    test_cases.emplace_back(new test_mul_mat_id(t, GGML_TYPE_F32,
                            /*n_mats*/ 8, /*n_used*/ 2, /*b (broadcast b)*/ false,
                            /*m*/ 2048, /*n*/ 1, /*k*/ 4096));
}

These cases would surface the bug we hit. (Concrete shape numbers above are OLMoE-shaped.)
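A check along these lines needs a way to decide whether the K-quant output is "close enough" to an F32 reference. One possible metric is a normalized mean squared error, sketched below; the helper name nmse and the choice of metric are our assumptions for illustration, not a statement about how the existing tests are implemented.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Normalized mean squared error between a reference and a candidate output:
// sum((ref - out)^2) / sum(ref^2). Returns infinity on size mismatch or a
// zero-energy reference so that callers treat those cases as failures.
double nmse(const std::vector<float>& ref, const std::vector<float>& out) {
    if (ref.size() != out.size()) {
        return std::numeric_limits<double>::infinity();
    }
    double err = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(ref[i]) - static_cast<double>(out[i]);
        err += d * d;
        norm += static_cast<double>(ref[i]) * static_cast<double>(ref[i]);
    }
    if (norm == 0.0) {
        return std::numeric_limits<double>::infinity();
    }
    return err / norm;
}
```

Ordinary quantization noise should keep this value small, while a mis-indexed kernel would be expected to push it toward 1 or beyond.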

Reproducer

Project: https://github.com/OriPekelman/toy. The full inference path is wrapped in Ruby + ggml FFI; the failing path is documented in our notes at docs/notes/mul_mat_id_quants.md (commit upcoming). Happy to extract a minimal C reproducer if useful.

The user-side repro is:

# Setup: download both GGUFs from Meshwa/OLMoE-1b-7b-0924-Instruct-gguf
# Build: make example_inference
# Run:
GGUF=OLMoE-1b-7b-0924-Instruct-Q4_K_M.gguf \
  PROMPT='The capital of France is' N_NEW=8 ./examples/example_inference
#   → "The capital of France is Dub Dub Dub Dub Dub Dub Dub Dub"

GGUF=OLMoE-1b-7b-0924-Instruct-q8_0.gguf \
  PROMPT='The capital of France is' N_NEW=8 ./examples/example_inference
#   → "The capital of France is called Paris."

Workaround

Quantize the expert weights of MoE models to Q8_0 when converting. Non-expert weights (embeddings, attention, norms) can stay at K-quants.
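As a rough sketch of that per-tensor choice, the helper below picks Q8_0 for expert tensors and keeps the default type for everything else. It assumes expert tensors can be recognized by an "_exps" substring in their name (as in names like blk.0.ffn_up_exps.weight); both that naming assumption and the function names are hypothetical and should be checked against the GGUF being converted.

```cpp
#include <string>

// Assumption: stacked expert tensors carry "_exps" in their name.
bool is_expert_tensor(const std::string& name) {
    return name.find("_exps") != std::string::npos;
}

// Returns the quantization type to use for a tensor: Q8_0 for expert
// weights (the workaround), otherwise the caller's default type.
std::string pick_quant_type(const std::string& name, const std::string& default_type) {
    return is_expert_tensor(name) ? std::string("Q8_0") : default_type;
}
```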
