mul_mat_id produces wrong output with K-quantized source weights #1506

## Summary

`ggml_mul_mat_id` produces degenerate output when the per-expert weight stack (`as`) is K-quantized (Q4_K, Q5_K, Q6_K). F32, F16, and Q8_0 sources work correctly. Reproduced on CPU with a vendored ggml; we believe CUDA is also affected but haven't isolated it.

## What we see

Two GGUFs of OLMoE-1B-7B-Instruct, run through identical inference code. The two files differ only in weight quantization:

| GGUF | expert dtype | output for "The capital of France is" |
|---|---|---|
| `Q4_K_M` (from `Meshwa/OLMoE-1b-7b-0924-Instruct-gguf`) | mostly Q4_K, some Q6_K | "Dub Dub Dub Dub Dub Dub Dub Dub" |
| `q8_0` (same repo) | Q8_0 | "called Paris." |

Both runs use:
- Same router weights (F16)
- Same model graph: `router · x → softmax → top_k → 3× mul_mat_id → silu·up → broadcast-weight → sum-across-K`

The attention weights also change dtype between the two files (Q4_K in one, Q8_0 in the other), but they go through plain `ggml_mul_mat`, which handles K-quants correctly (see below). That leaves the **expert weight dtype** passed as the `as` parameter of `mul_mat_id` as the only difference on the suspect path. Q4_K produces degenerate output; Q8_0 produces coherent factual text.
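For reference, here is a minimal sketch of the semantics we expect from an indexed matmul, written against already-dequantized float weights. The function name `mul_mat_id_ref` and the flat row-major layout are our own choices for illustration, not ggml's API or memory layout. Whatever the storage dtype of `as`, the result should match this up to quantization error.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference semantics of an indexed matmul over float experts (illustrative only).
// experts: n_expert matrices, each m x k, row-major, stored back to back.
// x:       one input vector of length k per (token, slot) pair.
// ids:     one expert index per (token, slot) pair.
// returns: one output vector of length m per (token, slot) pair.
std::vector<float> mul_mat_id_ref(const std::vector<float>& experts,
                                  const std::vector<float>& x,
                                  const std::vector<int32_t>& ids,
                                  std::size_t m, std::size_t k) {
    const std::size_t n_pairs = ids.size();
    std::vector<float> out(n_pairs * m, 0.0f);
    for (std::size_t p = 0; p < n_pairs; ++p) {
        // Select the expert matrix for this pair, then do a plain mat-vec.
        const float* w  = experts.data() + static_cast<std::size_t>(ids[p]) * m * k;
        const float* xv = x.data() + p * k;
        for (std::size_t r = 0; r < m; ++r) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < k; ++c) {
                acc += w[r * k + c] * xv[c];
            }
            out[p * m + r] = acc;
        }
    }
    return out;
}
```

A dequantize-then-reference path like this is what we would compare the quantized `mul_mat_id` output against when isolating the bug.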

## Why we suspect a kernel gap

`tests/test-backend-ops.cpp::test_mul_mat_id` only exercises mul_mat_id with F32, F16, and Q8_0 source weights (see test registrations around lines 8264, 8408, 8438 in HEAD as of this writing). K-quants aren't in the test matrix. Our observation is consistent with mul_mat_id either:
- Falling back to a wrong arithmetic path for K-quant sources, or
- Mis-indexing into the K-quant block layout (K-quants group weights into 256-element super-blocks with per-sub-block scales, plus per-sub-block mins for Q4_K and Q5_K, which is significantly different from the single per-block scale of Q4_0 / Q8_0).
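To make the layout difference concrete, here is a simplified sketch of the two dequantization schemes. The `SuperBlock` struct is hypothetical and deliberately not byte-compatible with ggml's packed blocks (the real format packs sub-block scales and mins into a few bits each); it only illustrates the arithmetic `w = d * scale * q - dmin * min` versus the flat `w = d * q`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Flat scheme (Q8_0-style): a single scale per block, w = d * q.
std::vector<float> dequant_flat(float d, const std::vector<int8_t>& q) {
    std::vector<float> w(q.size());
    for (std::size_t i = 0; i < q.size(); ++i) {
        w[i] = d * static_cast<float>(q[i]);
    }
    return w;
}

// Hierarchical scheme (Q4_K-style), simplified and unpacked for illustration.
struct SuperBlock {
    float d;                         // super-block scale applied to sub-block scales
    float dmin;                      // super-block scale applied to sub-block mins
    std::vector<uint8_t> sub_scale;  // one integer scale per sub-block
    std::vector<uint8_t> sub_min;    // one integer min per sub-block
    std::vector<uint8_t> q;          // unsigned quants, sub_len per sub-block
};

std::vector<float> dequant_hier(const SuperBlock& b, std::size_t sub_len) {
    std::vector<float> w(b.q.size());
    for (std::size_t i = 0; i < b.q.size(); ++i) {
        const std::size_t s = i / sub_len;  // which sub-block this weight is in
        const float scale  = b.d    * static_cast<float>(b.sub_scale[s]);
        const float offset = b.dmin * static_cast<float>(b.sub_min[s]);
        w[i] = scale * static_cast<float>(b.q[i]) - offset;
    }
    return w;
}
```

If the indexed path computed the sub-block index `s` or the per-expert base offset incorrectly, every weight would pick up the wrong scale and min, which would be consistent with the fully degenerate output we see.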

`ggml_mul_mat` (non-id) **does** work with K-quant sources on similar tensor shapes: we have OLMoE's attention weights at Q4_K on the same model, and the per-head Q/K/V matmuls produce correct output. So K-quants aren't broken in general; the problem appears specific to the mul_mat_id dispatch.

## Suggested test addition

```cpp
// tests/test-backend-ops.cpp, wherever the other test_mul_mat_id cases are registered.
// K-quants need k to be a multiple of 256; 4096 satisfies that.
for (auto t : {GGML_TYPE_Q4_K, GGML_TYPE_Q5_K, GGML_TYPE_Q6_K}) {
    test_cases.emplace_back(new test_mul_mat_id(t, GGML_TYPE_F32,
                            /*n_mats*/ 8, /*n_used*/ 2, /*b (broadcast)*/ false,
                            /*m*/ 2048, /*n*/ 1, /*k*/ 4096));
}
```

These cases would surface the bug we hit. (The concrete shape numbers above are meant to be OLMoE-like. K-quants require `k` to be a multiple of 256, which 4096 satisfies.)
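As far as we understand, the backend tests judge correctness with a normalized error against a reference result rather than exact equality. Below is a minimal sketch of such a check; the names `nmse` and `outputs_match` and the threshold passed in are assumptions for illustration, not the test suite's actual code. Since we reproduce on CPU, the reference for these cases would probably need to come from a dequantize-to-F32 path rather than from the CPU backend's own quantized kernel.

```cpp
#include <cstddef>
#include <vector>

// Normalized mean squared error: sum((a - ref)^2) / sum(ref^2).
// Falls back to the raw squared error when the reference is all zeros.
double nmse(const std::vector<float>& a, const std::vector<float>& ref) {
    double err = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(ref[i]);
        err  += d * d;
        norm += static_cast<double>(ref[i]) * static_cast<double>(ref[i]);
    }
    return norm > 0.0 ? err / norm : err;
}

// True when both outputs have the same length and the normalized error
// stays under the caller-chosen threshold.
bool outputs_match(const std::vector<float>& a, const std::vector<float>& ref,
                   double max_err) {
    return a.size() == ref.size() && nmse(a, ref) <= max_err;
}
```

A correct quantized kernel should land at a small normalized error; output as degenerate as ours would be expected to fail by a wide margin.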

## Reproducer

Project: <https://github.com/OriPekelman/toy>. The full inference path is wrapped in Ruby + ggml FFI; the failing path is documented in our notes at `docs/notes/mul_mat_id_quants.md` (commit upcoming). Happy to extract a minimal C reproducer if useful.

The user-side repro is:

```bash
# Setup: download both GGUFs from Meshwa/OLMoE-1b-7b-0924-Instruct-gguf
# Build: make example_inference
# Run:
GGUF=OLMoE-1b-7b-0924-Instruct-Q4_K_M.gguf \
  PROMPT='The capital of France is' N_NEW=8 ./examples/example_inference
#   → "The capital of France is Dub Dub Dub Dub Dub Dub Dub Dub"

GGUF=OLMoE-1b-7b-0924-Instruct-q8_0.gguf \
  PROMPT='The capital of France is' N_NEW=8 ./examples/example_inference
#   → "The capital of France is called Paris."
```

## Workaround

When converting MoE models, quantize the expert weights to Q8_0. Non-expert weights (embeddings, attention, norms) can stay at K-quants.
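A minimal sketch of the per-tensor selection this implies, assuming the GGUF follows the llama.cpp naming convention where stacked expert tensors carry an `_exps` suffix (for example `blk.0.ffn_gate_exps.weight`). Both helper names are hypothetical, and type names are plain strings here to keep the example free of ggml headers.

```cpp
#include <string>

// Hypothetical helper: true when the tensor name looks like a stacked
// per-expert weight (assumes llama.cpp-style "*_exps.weight" naming).
bool is_expert_tensor(const std::string& name) {
    return name.find("_exps.") != std::string::npos;
}

// Force Q8_0 for expert stacks; keep the requested type for everything else.
std::string pick_quant_type(const std::string& name,
                            const std::string& default_type) {
    return is_expert_tensor(name) ? std::string("Q8_0") : default_type;
}
```

With this rule the router (`ffn_gate_inp`) and attention tensors keep their K-quant type, and only the tensors that flow through `mul_mat_id` are moved to Q8_0.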
