[MLX] Add Q5_K quantization support by JaynouOliver · Pull Request #20617 · pytorch/executorch

JaynouOliver · 2026-06-30T07:18:02Z

Summary

Adds fused Q5_K Metal kernels (linear + embedding) for the MLX backend and registers Q5_K on the export side, matching the existing Q6_K (#20004) and Q4_K (#20172) support. Resolves #20614.

Q5_K = Q4_K's affine super-block (d/dmin + 6-bit packed sub-block scales/mins via get_scale_min_k4) plus a Q6_K-style high-bit array: each weight is a 5-bit code whose low 4 bits come from qs and whose 5th bit comes from qh. The kernels consume the raw block_q5_K directly — no export-time repack.

Changes

extension/llm/export/gguf.py — register GGML_Q5_K = 13, _Q5_K_BLOCK_BYTES = 176 (2 + 2 + 12 + QK_K//8 + QK_K//2), add "q5_k" to _GGML_ID_BY_TYPE / _BLOCK_BYTES_BY_TYPE. (No torchao Int*Tensor conversion path — the MLX kernels read the raw blob.)
backends/mlx/custom_kernel_ops/gguf/q5k/ (new):
- common.py — _Q5K_HEADER with the block_q5_K struct, dequant_q5k_elem (per-element, embedding) and dequantize_q5_K_16 (vectorized half4x4, matmul), reusing get_scale_min_k4 / get_scale_min_k4_just2.
- linear.py — emit_linear: mat-vec (decode), tiled simdgroup mat-mat (prefill), dynamic-M IfNode. Byte-wise decode with the 5th bit from qh; affine d*scale / dmin*min per sub-block.
- embedding.py — emit_embedding: per-element Q5_K dequant gather.
- __init__.py.
backends/mlx/custom_kernel_ops/gguf/patterns.py — add "q5_k" to _LINEAR_TYPES / _EMBEDDING_TYPES and a dispatch branch in both handlers.
Tests — make_q5_k_blob + q5_k configs in test/test_linear.py and test/test_embedding.py; Q5_K coverage in extension/llm/export/test/test_gguf.py.
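As a quick check of the 176-byte figure above, the block size and field offsets can be derived from the field widths. This is an illustrative sketch only: `Q5_K_FIELDS`, `field_offsets`, and `blob_bytes` are hypothetical names, not identifiers from gguf.py, and the field order is assumed from llama.cpp's block_q5_K.

```python
QK_K = 256          # weights per super-block
K_SCALE_SIZE = 12   # bytes of packed 6-bit scales/mins

# (field name, size in bytes), in the assumed on-disk order
Q5_K_FIELDS = (
    ("d", 2),
    ("dmin", 2),
    ("scales", K_SCALE_SIZE),
    ("qh", QK_K // 8),
    ("qs", QK_K // 2),
)
Q5_K_BLOCK_BYTES = sum(size for _, size in Q5_K_FIELDS)


def field_offsets(fields):
    """Return ({name: (offset, size)}, total_bytes) for a packed struct."""
    offsets, pos = {}, 0
    for name, size in fields:
        offsets[name] = (pos, size)
        pos += size
    return offsets, pos


def blob_bytes(n_rows, k, block_bytes=Q5_K_BLOCK_BYTES):
    """Bytes needed for an [n_rows, k] weight stored as raw blocks."""
    if k % QK_K:
        raise ValueError("k must be a multiple of QK_K")
    return n_rows * (k // QK_K) * block_bytes
```

With these widths, a block stores 256 weights in 176 bytes, i.e. 5.5 bits per weight.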

Kernels are ported from llama.cpp (ggml-common.h / ggml-metal.metal: block_q5_K, dequantize_q5_K, kernel_mul_mv_q5_K_f32_impl, kernel_mul_mm), which is MIT-licensed (Copyright (c) 2023-2024 The ggml authors); the inline "ported from ..." notes are kept, as in the Q6_K / Q4_K kernels.

Validation

Done locally on Apple silicon (macOS 26.4, MLX 0.31.2):

✅ Export side — extension/llm/export/test/test_gguf.py passes (9/9), including the new Q5_K cases (dequantize == gguf.dequantize exactly; torchao::dequantize_gguf op; torch.export lowering; unsupported-type guard).
✅ Kernel math — the per-element (dequant_q5k_elem), vectorized (dequantize_q5_K_16), and mat-vec arithmetic were each transcribed and checked numerically against gguf.dequantize / x @ Wᵀ: exact for the dequant paths, ~3e-7 rel for the mat-vec (fp32 rounding).
✅ MLX backend imports with Q5_K wired into both pattern handlers.
⚠️ On-device Metal kernel execution (test_linear.py / test_embedding.py run) was not run locally: building the op_test_runner compiles MLX from source via xcrun metal, which needs full Xcode (only Command Line Tools available on the dev box). Relying on CI to run these on the GPU. The mat-mat kernel is structurally identical to the merged Q4_K/Q6_K kernels and differs only in the (numerically validated) dequantize_q5_K_16 weight decode.
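To illustrate the kind of relative-error figure quoted for the mat-vec check, the sketch below compares an fp32-accumulated mat-vec against an fp64 reference. It is not the author's validation script: the helper names, the random data, and the shapes are all assumptions, and the error it measures depends on the data used.

```python
import random
import struct


def f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def matvec_f32(weights, x):
    """y = W @ x with every product and partial sum rounded to fp32."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            acc = f32(acc + f32(w * xi))
        out.append(acc)
    return out


def matvec_f64(weights, x):
    """Reference y = W @ x accumulated in fp64."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def max_rel_error(y, y_ref):
    """Largest absolute difference, normalized by the largest reference value."""
    scale = max(abs(v) for v in y_ref)
    return max(abs(a - b) for a, b in zip(y, y_ref)) / scale


rng = random.Random(0)
K, N = 256, 8   # hypothetical shapes: one super-block per row
W = [[f32(rng.uniform(0.0, 1.0)) for _ in range(K)] for _ in range(N)]
x = [f32(rng.uniform(0.0, 1.0)) for _ in range(K)]
rel = max_rel_error(matvec_f32(W, x), matvec_f64(W, x))
```

In a real check, `W` would be the dequantized weight matrix and the tolerance would be chosen to allow for fp32 rounding only.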

Test plan

On an Apple-silicon machine:

python -m executorch.backends.mlx.custom_kernel_ops.gguf.test.test_linear run -v
python -m executorch.backends.mlx.custom_kernel_ops.gguf.test.test_embedding run -v

Add fused Q5_K Metal kernels (linear + embedding) for the MLX backend, matching the existing Q6_K (pytorch#20004) and Q4_K (pytorch#20172) support, and register Q5_K on the export side so it lowers through the MLX pattern handlers.

Q5_K combines Q4_K's affine super-block (d/dmin + 6-bit packed scales/mins unpacked via get_scale_min_k4) with a Q6_K-style high-bit array: each weight is a 5-bit code whose low 4 bits come from qs and whose 5th bit comes from qh. The kernels read the raw block_q5_K directly (no export-time repack).

Changes:
- extension/llm/export/gguf.py: register GGML_Q5_K = 13, _Q5_K_BLOCK_BYTES = 176, and add "q5_k" to the id / block-bytes maps.
- backends/mlx/custom_kernel_ops/gguf/q5k/{common,linear,embedding,__init__}.py: block_q5_K struct + per-element (embedding) and vectorized half4x4 (matmul) dequant helpers, mat-vec (decode) / mat-mat (prefill) / dynamic-M IfNode linear, and a per-element gather embedding. Ported from llama.cpp.
- backends/mlx/custom_kernel_ops/gguf/patterns.py: wire q5_k into the linear / embedding handlers and the supported-type sets.
- tests: add make_q5_k_blob + q5_k configs to test_linear.py / test_embedding.py, and Q5_K coverage to extension/llm/export/test/test_gguf.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

pytorch-bot · 2026-06-30T07:18:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20617

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Unclassified Failure

As of commit 8aeec5f with merge base 63b2b5c:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

MLX / test-mlx / test-mlx (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
mlx/backend/metal/kernels/utils.h:487:15: error: cannot combine with previous 'int' declaration specifier

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla · 2026-06-30T07:18:08Z

Hi @JaynouOliver!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

linux-foundation-easycla · 2026-06-30T07:18:12Z

The committers listed above are authorized under a signed CLA.

✅ Unknown (ab1e446)
✅ login: JaynouOliver / name: Suvrakamal Das (71361ac, ab1e446)

One or more co-authors of this pull request were not found. You must specify co-authors in commit message trailer via:

Co-authored-by: name <email>

Supported Co-authored-by: formats include:

Anything <id+login@users.noreply.github.com> - it will locate your GitHub user by id part.
Anything <login@users.noreply.github.com> - it will locate your GitHub user by login part.
Anything <public-email> - it will locate your GitHub user by public-email part. Note that this email must be made public on Github.
Anything <other-email> - it will locate your GitHub user by other-email part but only if that email was used before for any other CLA as a main commit author.
login <any-valid-email> - it will locate your GitHub user by login part, note that login part must be at least 3 characters long.

Alternatively, if the co-author should not be included, remove the Co-authored-by: line from the commit message.

Please update your commit message(s) by doing git commit --amend and then git push [--force] and then request re-running CLA check via commenting on this pull request:

/easycla

github-actions · 2026-06-30T07:18:45Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-cla · 2026-06-30T08:35:37Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

JaynouOliver · 2026-06-30T15:06:25Z

Hi @metascroy, could you take a look, please?

metascroy · 2026-06-30T16:47:17Z

@claude review this code

claude · 2026-06-30T16:48:03Z

Claude finished @metascroy's task in 3m 10s

Code review: Q5_K MLX support

Read the changed files
Review kernel math & correctness
Review export-side registration
Review tests

Overall this is a clean, well-structured PR that closely mirrors the merged Q4_K/Q6_K kernels. I traced the Q5_K-specific math against the llama.cpp references and it checks out. A few notes below, none blocking.

Correctness — verified ✅

I cross-checked the three places where Q5_K diverges from Q4_K (the 5th bit from qh) against llama.cpp:

dequantize_q5_K_16 (q5k/common.py:133) matches dequantize_q5_K in ggml-metal.metal exactly — is/q/qh pointer math, ul = 1 << (il/2), the il<2 ? 16 : 256 high-bit value, and the d/16 for the high nibble.
dequant_q5k_elem (q5k/common.py:109) matches dequantize_row_q5_K in ggml-quants.c: chunk c=p>>6, sub=2*c+half, low/high nibble, and qh[l] bit sub (≡ u1<<=2/u2<<=2 per chunk).
matvec masks (q5k/linear.py:92) — hm1=1<<(2*iq), hm2,hm3,hm4 and the qh = blk->qh + 8*ir indexing correctly map the four sub-blocks (2iq, 2iq+1, 2iq+4, 2iq+5) a thread touches. The qh byte index is correctly independent of iq (only the selected bit differs).
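The mask-to-sub-block mapping in the last item can be checked with a few lines of Python. This is a sketch under assumptions: only hm1 = 1 << (2*iq) is stated above, and the derivations of hm2, hm3, hm4 below follow the llama.cpp kernel as I recall it; the function names are hypothetical.

```python
def q5k_matvec_masks(iq):
    """Return the four qh bit masks a thread uses for iq in {0, 1}."""
    hm1 = 1 << (2 * iq)
    hm2 = hm1 << 1   # assumed: next sub-block in the same 64-weight chunk
    hm3 = hm1 << 4   # assumed: same position, two chunks further on
    hm4 = hm2 << 4
    return hm1, hm2, hm3, hm4


def touched_sub_blocks(iq):
    """Translate each single-bit mask into the sub-block index it selects."""
    return [mask.bit_length() - 1 for mask in q5k_matvec_masks(iq)]
```

Under these assumptions, the two values of iq between them select each of the eight qh bit positions exactly once, which is consistent with the (2iq, 2iq+1, 2iq+4, 2iq+5) list in the review.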

The mat-mat kernel and the if-node/grid plumbing are structurally identical to Q4_K, with only the weight decode swapped — consistent with the PR description.

Minor improvement worth noting

q5k/linear.py:117-118 adds an if (r >= N) { break; } guard inside the row loop before dereferencing blk = xrows + r*nb + ib. Q4_K's matvec lacks this and reads the block pointer for r >= N on a partial final row group (relying on the write being masked). The Q5_K version is the safer pattern here — good catch, not a regression.
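The effect of that guard can be modelled in Python. This is an illustration only: `rows_for_group` and its parameters are hypothetical, and the real kernel's row grouping may differ from this simplified model.

```python
def rows_for_group(group, rows_per_group, n_rows, guard=True):
    """List the weight rows one group would dereference.

    With guard=True the loop stops at the first row index >= n_rows,
    mirroring `if (r >= N) { break; }`. With guard=False every row slot is
    visited, including slots past the end of the weight matrix.
    """
    first = group * rows_per_group
    rows = []
    for slot in range(rows_per_group):
        r = first + slot
        if guard and r >= n_rows:
            break
        rows.append(r)
    return rows


def out_of_range_rows(group, rows_per_group, n_rows, guard=True):
    """Rows whose block pointer would be formed past the last valid row."""
    visited = rows_for_group(group, rows_per_group, n_rows, guard)
    return [r for r in visited if r >= n_rows]
```

For example, with 5 rows and 2 rows per group, the last group covers row 4 only when guarded, and rows 4 and 5 when unguarded.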

Nits (non-blocking)

q5k/embedding.py is a near-verbatim copy of q6k/embedding.py (only the type name, dequant_*_elem, and the group-size comment differ). Same for the _emit_*_matvec/_emit_*_matmul scaffolding in linear.py vs q4k/linear.py. This duplication is pre-existing across q4k/q6k, so matching it is the right call for this PR — flagging only in case a future refactor wants to factor out the shared gather/emit scaffolding.
patterns.py:47-48 — _LINEAR_TYPES and _EMBEDDING_TYPES are now identical sets {"q4_k","q5_k","q6_k"}. Also pre-existing; fine to leave.
Test data: _make_q5k_raw (export test) pins scales[4:16]=0x21, while make_q5_k_blob (kernel test) randomizes them. Both are valid for their respective oracles (gguf.dequantize vs x @ Wᵀ); just noting the intentional difference.

Validation gap (the main thing for reviewers to weigh)

The PR is upfront that the on-device Metal kernels (test_linear.py run / test_embedding.py run) were not executed locally; only the export-side tests and a numerical transcription of the dequant/matvec math were. The mat-mat tiling is byte-for-byte the merged Q4_K/Q6_K kernel, and the dequant paths are validated, so the residual risk is concentrated in dequantize_q5_K_16 + the matvec arithmetic (both of which I re-derived above and believe are correct). Still, the GPU run tests in CI on Apple silicon are the real gate: it is worth confirming they are green before merge, given that nothing has exercised the compiled kernels yet. The new configs (ragged N=300, multi-dim indices, K=5376 production shape, dynamic-M both branches) give good coverage once they run.

No code changes requested. LGTM pending the CI GPU run jobs passing.
· branch add-q5k-mlx

metascroy · 2026-06-30T17:31:55Z

Thanks @JaynouOliver!

Overall it looks good, but a few comments:

Dequant (dequantize_q5_K_16) needs no change: it is an exact port of llama.cpp's dequantize_q5_K.
Matmat is a faithful port of llama.cpp: same 64×32 tiles, NK=32, tile indexing, K-loop, simdgroup ops, and matching launch params (128 threads / 4 simdgroups, grid = ceil(M/32) × ceil(N/64)), so overall it looks good. However, please shrink sb[4096] to sb[1024], since only 1024 halves are used. (Note that sa[4096] is correctly sized: it is reused as the 8 KB float output-staging buffer.)

Also parallelize the write out: see #20643 for similar improvements on q4k/q6k.
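The tile and buffer figures in the mat-mat comment can be reproduced with simple arithmetic. This sketch rests on assumptions: that the 64 in the 64×32 tile spans weight rows (N) and the 32 spans activation rows (M), and that sb holds one 32×NK activation tile in halves. The names are illustrative, not taken from the kernel source.

```python
TILE_N = 64      # output columns (weight rows) per threadgroup
TILE_M = 32      # activation rows per threadgroup
NK = 32          # K elements consumed per loop iteration
HALF_BYTES = 2
FLOAT_BYTES = 4


def matmat_grid(m, n):
    """Threadgroup grid: ceil(M/32) x ceil(N/64), using integer arithmetic."""
    return (m + TILE_M - 1) // TILE_M, (n + TILE_N - 1) // TILE_N


def sb_halves_needed():
    """Halves needed for one TILE_M x NK activation tile."""
    return TILE_M * NK


def staging_bytes():
    """Bytes needed to stage one TILE_N x TILE_M float output tile."""
    return TILE_N * TILE_M * FLOAT_BYTES


def staging_as_halves():
    """The same staging buffer expressed as a count of half elements."""
    return staging_bytes() // HALF_BYTES
```

Under these assumptions the activation tile needs 1024 halves, while the float output staging needs 8192 bytes, which is 4096 halves, matching the sa[4096] sizing.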

Matvec
Numerically equivalent to llama.cpp, but not a port (different thread-to-work mapping, get_scale_min_k4 instead of the inline kmask, folded accumulation). It also uses N_R0=2, nsg=2 (borrowed from Q4_K) vs. llama.cpp Q5_K's N_R0=1, nsg=2. Please replace the Q4_K-derived matvec with a faithful port of llama.cpp's kernel_mul_mv_q5_K_f32_impl.

Merge branch 'main' into add-q5k-mlx

71361ac

meta-cla Bot added the CLA Signed label Jun 30, 2026

Merge branch 'main' into add-q5k-mlx

c9a06e2

JaynouOliver force-pushed the add-q5k-mlx branch from c9a06e2 to f5bb7c2 June 30, 2026 14:58

[MLX] Q5_K: collapse block-bytes constant (lintrunner UFMT)

e3eba84

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

JaynouOliver force-pushed the add-q5k-mlx branch from f5bb7c2 to e3eba84 June 30, 2026 15:01

JaynouOliver marked this pull request as ready for review June 30, 2026 15:01

JaynouOliver requested review from larryliu0820 and mergennachin as code owners June 30, 2026 15:01

Merge branch 'main' into add-q5k-mlx

8aeec5f

JaynouOliver wants to merge 5 commits into pytorch:main from JaynouOliver:add-q5k-mlx
