[MLX] Add Q5_K quantization support #20617

Open
JaynouOliver wants to merge 5 commits into pytorch:main from JaynouOliver:add-q5k-mlx

@JaynouOliver

Summary

Adds fused Q5_K Metal kernels (linear + embedding) for the MLX backend and registers Q5_K on the export side, matching the existing Q6_K (#20004) and Q4_K (#20172) support. Resolves #20614.

Q5_K = Q4_K's affine super-block (d/dmin + 6-bit packed sub-block scales/mins via get_scale_min_k4) plus a Q6_K-style high-bit array: each weight is a 5-bit code whose low 4 bits come from qs and whose 5th bit comes from qh. The kernels consume the raw block_q5_K directly — no export-time repack.
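For reviewers who want the layout spelled out, here is a minimal pure-Python sketch of the per-element decode described above. It assumes the llama.cpp block_q5_K field order (d, dmin, 12 scale bytes, 32 qh bytes, 128 qs bytes) and transcribes get_scale_min_k4. The name dequant_q5k_block is hypothetical, and this is not the Metal kernel from the PR.

```python
import struct

QK_K = 256
Q5_K_BLOCK_BYTES = 2 + 2 + 12 + QK_K // 8 + QK_K // 2  # 176


def get_scale_min_k4(j, scales):
    """Unpack the 6-bit (scale, min) pair of sub-block j from 12 packed bytes."""
    if j < 4:
        return scales[j] & 63, scales[j + 4] & 63
    sc = (scales[j + 4] & 0x0F) | ((scales[j - 4] >> 6) << 4)
    mn = (scales[j + 4] >> 4) | ((scales[j] >> 6) << 4)
    return sc, mn


def dequant_q5k_block(block):
    """Dequantize one raw 176-byte Q5_K block into 256 Python floats."""
    assert len(block) == Q5_K_BLOCK_BYTES
    d, dmin = struct.unpack_from("<2e", block, 0)  # two little-endian fp16 values
    scales = block[4:16]
    qh = block[16:48]
    qs = block[48:176]
    out = []
    for p in range(QK_K):
        c = p >> 6            # 64-weight chunk index (0..3)
        half = (p >> 5) & 1   # 0 = low nibble, 1 = high nibble
        l = p & 31            # position inside the 32-weight sub-block
        sub = 2 * c + half    # sub-block index (0..7), also the qh bit index
        sc, mn = get_scale_min_k4(sub, scales)
        byte = qs[32 * c + l]
        low4 = (byte >> 4) if half else (byte & 0x0F)
        high = (qh[l] >> sub) & 1
        code = low4 | (high << 4)  # 5-bit code in 0..31
        out.append(d * sc * code - dmin * mn)
    return out
```

With every sub-block scale set to 1 and every min set to 0, each output equals the raw 5-bit code, which makes it easy to see which bit of qh feeds which weight.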

Changes

  • extension/llm/export/gguf.py: register GGML_Q5_K = 13 and _Q5_K_BLOCK_BYTES = 176 (2 + 2 + 12 + QK_K//8 + QK_K//2), and add "q5_k" to _GGML_ID_BY_TYPE / _BLOCK_BYTES_BY_TYPE. (No torchao Int*Tensor conversion path; the MLX kernels read the raw blob.)
  • backends/mlx/custom_kernel_ops/gguf/q5k/ (new):
    • common.py: _Q5K_HEADER with the block_q5_K struct, dequant_q5k_elem (per-element, embedding) and dequantize_q5_K_16 (vectorized half4x4, matmul), reusing get_scale_min_k4 / get_scale_min_k4_just2.
    • linear.py: emit_linear with mat-vec (decode), tiled simdgroup mat-mat (prefill), and a dynamic-M IfNode. Byte-wise decode with the 5th bit from qh; affine d*scale / dmin*min per sub-block.
    • embedding.py: emit_embedding, a per-element Q5_K dequant gather.
    • __init__.py.
  • backends/mlx/custom_kernel_ops/gguf/patterns.py: add "q5_k" to _LINEAR_TYPES / _EMBEDDING_TYPES and a dispatch branch in both handlers.
  • Tests: make_q5_k_blob + q5_k configs in test/test_linear.py and test/test_embedding.py; Q5_K coverage in extension/llm/export/test/test_gguf.py.
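The 176-byte block size can be double-checked with a few lines of arithmetic. The sketch below only restates the sum given above; q5k_blob_bytes is a hypothetical helper and the shape in the usage line is an arbitrary example.

```python
QK_K = 256  # weights per K-quant super-block

# d (fp16) + dmin (fp16) + 12 packed scale/min bytes
# + qh (1 bit per weight) + qs (4 bits per weight)
Q5_K_BLOCK_BYTES = 2 + 2 + 12 + QK_K // 8 + QK_K // 2


def q5k_blob_bytes(n_rows, k):
    """Bytes needed to store an (n_rows, k) weight as raw Q5_K blocks."""
    if k % QK_K != 0:
        raise ValueError("K must be a multiple of QK_K")
    return n_rows * (k // QK_K) * Q5_K_BLOCK_BYTES


bits_per_weight = Q5_K_BLOCK_BYTES * 8 / QK_K
print(Q5_K_BLOCK_BYTES, bits_per_weight, q5k_blob_bytes(300, 5376))
```

The result is 5.5 bits per weight: 5 bits of code plus 0.5 bits of amortized scale and min metadata.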

Kernels ported from llama.cpp (ggml-common.h / ggml-metal.metal: block_q5_K, dequantize_q5_K, kernel_mul_mv_q5_K_f32_impl, kernel_mul_mm), MIT-licensed (Copyright (c) 2023-2024 The ggml authors); inline ported from ... notes kept as in the Q6_K / Q4_K kernels.

Validation

Done locally on Apple silicon (macOS 26.4, MLX 0.31.2):

  • Export side: extension/llm/export/test/test_gguf.py passes (9/9), including the new Q5_K cases (dequantize == gguf.dequantize exactly; torchao::dequantize_gguf op; torch.export lowering; unsupported-type guard).
  • Kernel math — the per-element (dequant_q5k_elem), vectorized (dequantize_q5_K_16), and mat-vec arithmetic were each transcribed and checked numerically against gguf.dequantize / x @ Wᵀ: exact for the dequant paths, ~3e-7 rel for the mat-vec (fp32 rounding).
  • ✅ MLX backend imports with Q5_K wired into both pattern handlers.
  • ⚠️ On-device Metal kernel execution (test_linear.py / test_embedding.py run) was not run locally: building the op_test_runner compiles MLX from source via xcrun metal, which needs full Xcode (only Command Line Tools available on the dev box). Relying on CI to run these on the GPU. The mat-mat kernel is structurally identical to the merged Q4_K/Q6_K kernels and differs only in the (numerically validated) dequantize_q5_K_16 weight decode.
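For context on the mat-vec figure, the sketch below shows one stdlib-only way to measure how much error fp32 accumulation alone introduces in a dot product. It is not the check used for this PR: all names are hypothetical, and the error is normalized by the sum of absolute products, which may differ from the metric behind the number quoted above.

```python
import math
import random
import struct


def f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_f32(a, b):
    """Dot product with every product and partial sum rounded to fp32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc


def rel_error(a, b):
    """|fp32 result - exact result| divided by the sum of |a_i * b_i|."""
    ref = math.fsum(x * y for x, y in zip(a, b))
    scale = math.fsum(abs(x * y) for x, y in zip(a, b))
    return abs(dot_f32(a, b) - ref) / scale


rng = random.Random(0)  # fixed seed so the run is repeatable
a = [f32(rng.uniform(-1, 1)) for _ in range(256)]
b = [f32(rng.uniform(-1, 1)) for _ in range(256)]
err = rel_error(a, b)
print(err)
```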

Test plan

On an Apple-silicon machine:

python -m executorch.backends.mlx.custom_kernel_ops.gguf.test.test_linear run -v
python -m executorch.backends.mlx.custom_kernel_ops.gguf.test.test_embedding run -v

Add fused Q5_K Metal kernels (linear + embedding) for the MLX backend,
matching the existing Q6_K (pytorch#20004) and Q4_K (pytorch#20172) support, and register
Q5_K on the export side so it lowers through the MLX pattern handlers.

Q5_K combines Q4_K's affine super-block (d/dmin + 6-bit packed scales/mins
unpacked via get_scale_min_k4) with a Q6_K-style high-bit array: each weight is
a 5-bit code whose low 4 bits come from qs and whose 5th bit comes from qh. The
kernels read the raw block_q5_K directly (no export-time repack).

Changes:
- extension/llm/export/gguf.py: register GGML_Q5_K = 13, _Q5_K_BLOCK_BYTES = 176,
  and add "q5_k" to the id / block-bytes maps.
- backends/mlx/custom_kernel_ops/gguf/q5k/{common,linear,embedding,__init__}.py:
  block_q5_K struct + per-element (embedding) and vectorized half4x4 (matmul)
  dequant helpers, mat-vec (decode) / mat-mat (prefill) / dynamic-M IfNode
  linear, and a per-element gather embedding. Ported from llama.cpp.
- backends/mlx/custom_kernel_ops/gguf/patterns.py: wire q5_k into the linear /
  embedding handlers and the supported-type sets.
- tests: add make_q5_k_blob + q5_k configs to test_linear.py / test_embedding.py,
  and Q5_K coverage to extension/llm/export/test/test_gguf.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pytorch-bot

pytorch-bot Bot commented Jun 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20617

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Unclassified Failure

As of commit 8aeec5f with merge base 63b2b5c:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

  • MLX / test-mlx / test-mlx (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
    mlx/backend/metal/kernels/utils.h:487:15: error: cannot combine with previous 'int' declaration specifier

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla

meta-cla Bot commented Jun 30, 2026

Hi @JaynouOliver!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@linux-foundation-easycla

linux-foundation-easycla Bot commented Jun 30, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

One or more co-authors of this pull request were not found. You must specify co-authors in commit message trailer via:

Co-authored-by: name <email>

Supported Co-authored-by: formats include:

  1. Anything <id+login@users.noreply.github.com> - it will locate your GitHub user by id part.
  2. Anything <login@users.noreply.github.com> - it will locate your GitHub user by login part.
  3. Anything <public-email> - it will locate your GitHub user by public-email part. Note that this email must be made public on Github.
  4. Anything <other-email> - it will locate your GitHub user by other-email part but only if that email was used before for any other CLA as a main commit author.
  5. login <any-valid-email> - it will locate your GitHub user by login part, note that login part must be at least 3 characters long.

Alternatively, if the co-author should not be included, remove the Co-authored-by: line from the commit message.

Please update your commit message(s) by doing git commit --amend and then git push [--force] and then request re-running CLA check via commenting on this pull request:

/easycla

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla meta-cla Bot added the CLA Signed label Jun 30, 2026
@meta-cla

meta-cla Bot commented Jun 30, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@JaynouOliver JaynouOliver marked this pull request as ready for review June 30, 2026 15:01
@JaynouOliver

Author

Hi @metascroy, can you have a look please?

@metascroy

Contributor

@claude review this code

@claude

claude Bot commented Jun 30, 2026

Claude finished @metascroy's task in 3m 10s


Code review: Q5_K MLX support

  • Read the changed files
  • Review kernel math & correctness
  • Review export-side registration
  • Review tests

Overall this is a clean, well-structured PR that closely mirrors the merged Q4_K/Q6_K kernels. I traced the Q5_K-specific math against the llama.cpp references and it checks out. A few notes below, none blocking.


Correctness — verified ✅

I cross-checked the three places where Q5_K diverges from Q4_K (the 5th bit from qh) against llama.cpp:

  • dequantize_q5_K_16 (q5k/common.py:133) matches dequantize_q5_K in ggml-metal.metal exactly — is/q/qh pointer math, ul = 1 << (il/2), the il<2 ? 16 : 256 high-bit value, and the d/16 for the high nibble.
  • dequant_q5k_elem (q5k/common.py:109) matches dequantize_row_q5_K in ggml-quants.c: chunk c=p>>6, sub=2*c+half, low/high nibble, and qh[l] bit sub (≡ u1<<=2/u2<<=2 per chunk).
  • matvec masks (q5k/linear.py:92) — hm1=1<<(2*iq), hm2,hm3,hm4 and the qh = blk->qh + 8*ir indexing correctly map the four sub-blocks (2iq, 2iq+1, 2iq+4, 2iq+5) a thread touches. The qh byte index is correctly independent of iq (only the selected bit differs).
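The mask bookkeeping in the last bullet can be checked with a few lines of Python. This is only a sketch: it assumes, per the dequant_q5k_elem bullet, that sub-block s takes its 5th bit from bit s of the qh byte, and the helper name is made up for illustration.

```python
def q5k_matvec_high_bit_masks(iq):
    """Return (hm1, hm2, hm3, hm4) for the four sub-blocks a thread touches."""
    sub_blocks = (2 * iq, 2 * iq + 1, 2 * iq + 4, 2 * iq + 5)
    return tuple(1 << s for s in sub_blocks)


for iq in (0, 1):
    masks = q5k_matvec_high_bit_masks(iq)
    print(iq, [format(m, "08b") for m in masks])
```

The two iq values together cover each of the eight bits of a qh byte exactly once, which is consistent with the byte index depending only on ir.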

The mat-mat kernel and the if-node/grid plumbing are structurally identical to Q4_K, with only the weight decode swapped — consistent with the PR description.

Minor improvement worth noting

q5k/linear.py:117-118 adds an if (r >= N) { break; } guard inside the row loop before dereferencing blk = xrows + r*nb + ib. Q4_K's matvec lacks this and reads the block pointer for r >= N on a partial final row group (relying on the write being masked). The Q5_K version is the safer pattern here — good catch, not a regression.
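As a toy model of that guard (plain Python, not the Metal source; the function and parameter names are hypothetical), the row loop stops before it would form an index past the last row:

```python
def rows_for_group(first_row, rows_per_group, n_rows):
    """Rows a row group actually processes when the final group is partial."""
    rows = []
    for i in range(rows_per_group):
        r = first_row + i
        if r >= n_rows:
            break  # mirrors `if (r >= N) { break; }`: no block pointer is formed
        rows.append(r)
    return rows


print(rows_for_group(2, 2, 5))  # full group
print(rows_for_group(4, 2, 5))  # partial final group
```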

Nits (non-blocking)

  • q5k/embedding.py is a near-verbatim copy of q6k/embedding.py (only the type name, dequant_*_elem, and the group-size comment differ). Same for the _emit_*_matvec/_emit_*_matmul scaffolding in linear.py vs q4k/linear.py. This duplication is pre-existing across q4k/q6k, so matching it is the right call for this PR — flagging only in case a future refactor wants to factor out the shared gather/emit scaffolding.
  • patterns.py:47-48: _LINEAR_TYPES and _EMBEDDING_TYPES are now identical sets {"q4_k","q5_k","q6_k"}. Also pre-existing; fine to leave.
  • Test data: _make_q5k_raw (export test) pins scales[4:16]=0x21, while make_q5_k_blob (kernel test) randomizes them. Both are valid for their respective oracles (gguf.dequantize vs x @ Wᵀ); just noting the intentional difference.
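To make the last nit concrete, this sketch decodes twelve scale bytes that are all 0x21 using a Python transcription of llama.cpp's get_scale_min_k4. It assumes scales[4:16] in _make_q5k_raw refers to the 12-byte scales field at byte offset 4 of the block, which matches the 2 + 2 + 12 layout but was not checked against the test source.

```python
def get_scale_min_k4(j, scales):
    """Unpack the 6-bit (scale, min) pair of sub-block j from 12 packed bytes."""
    if j < 4:
        return scales[j] & 63, scales[j + 4] & 63
    sc = (scales[j + 4] & 0x0F) | ((scales[j - 4] >> 6) << 4)
    mn = (scales[j + 4] >> 4) | ((scales[j] >> 6) << 4)
    return sc, mn


pinned = bytes([0x21] * 12)
pairs = [get_scale_min_k4(j, pinned) for j in range(8)]
print(pairs)
```

Under that assumption the pinned bytes still give the first four and last four sub-blocks different (scale, min) pairs, so both unpacking branches of get_scale_min_k4 are exercised.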

Validation gap (the main thing for reviewers to weigh)

The PR is upfront that the on-device Metal kernels (test_linear.py run / test_embedding.py run) were not executed locally — only the export-side tests and a numerical transcription of the dequant/matvec math. The mat-mat tiling is byte-for-byte the merged Q4_K/Q6_K kernel, and the dequant paths are validated, so the residual risk is concentrated in dequantize_q5_K_16 + the matvec arithmetic (both of which I re-derived above and believe are correct). Still, the GPU run tests in CI on Apple silicon are the real gate — worth confirming those green before merge given nothing exercised the compiled kernels yet. The new configs (ragged N=300, multi-dim indices, K=5376 production shape, dynamic-M both branches) give good coverage once they run.


No code changes requested. LGTM pending the CI GPU run jobs passing.
· branch add-q5k-mlx

@metascroy

metascroy commented Jun 30, 2026

Contributor

Thanks @JaynouOliver!

Overall it looks good, but a few comments:

  • Dequant (dequantize_q5_K_16) needs no change (exact port of llama.cpp dequantize_q5_K)
  • Matmat is a faithful port of llama.cpp: same 64×32 tiles, NK=32, tile indexing, K-loop, simdgroup ops, and matching launch params (128 threads / 4 simdgroups, grid = ceil(M/32) × ceil(N/64)), so it looks good overall. However, please shrink sb[4096] to sb[1024], since only 1024 halves are used. (Note that sa[4096] is correctly sized: it is reused as the 8 KB float output-staging buffer.)

Also parallelize the write-out; see #20643 for similar improvements on q4k/q6k.

  • Matvec
    Numerically equivalent to llama.cpp, but not a port (different thread-to-work mapping, get_scale_min_k4 instead of the inline kmask, folded accumulation). It also uses N_R0=2, nsg=2 (borrowed from Q4_K), whereas llama.cpp's Q5_K uses N_R0=1, nsg=2. Please replace the Q4_K-derived matvec with a faithful port of llama.cpp's kernel_mul_mv_q5_K_f32_impl.
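For reference, the mat-mat numbers in the comment above work out as in the sketch below. It assumes sa and sb are both arrays of 2-byte halves, as the "1024 halves" and "8 KB" remarks suggest; the helper names are hypothetical.

```python
import math

HALF_BYTES = 2  # size of one Metal `half`


def matmat_grid(m, n):
    """Threadgroup grid for the tiling quoted above: ceil(M/32) x ceil(N/64)."""
    return (math.ceil(m / 32), math.ceil(n / 64))


def threadgroup_bytes(sa_halves, sb_halves):
    """Total threadgroup memory for the two staging buffers, in bytes."""
    return (sa_halves + sb_halves) * HALF_BYTES


before = threadgroup_bytes(4096, 4096)  # sa[4096] + sb[4096]
after = threadgroup_bytes(4096, 1024)   # sa[4096] + sb[1024]
print(matmat_grid(1, 300), before, after, before - after)
```

Under that assumption, shrinking sb frees 6144 bytes of threadgroup memory per threadgroup.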

Labels

CLA Signed

Development

Successfully merging this pull request may close these issues.

[MLX][Good first issue] Add Q5K quantization support

3 participants