[MLX] Add Q5_K quantization support #20617
Add fused Q5_K Metal kernels (linear + embedding) for the MLX backend, matching the existing Q6_K (pytorch#20004) and Q4_K (pytorch#20172) support, and register Q5_K on the export side so it lowers through the MLX pattern handlers.

Q5_K combines Q4_K's affine super-block (d/dmin + 6-bit packed scales/mins unpacked via get_scale_min_k4) with a Q6_K-style high-bit array: each weight is a 5-bit code whose low 4 bits come from qs and whose 5th bit comes from qh. The kernels read the raw block_q5_K directly (no export-time repack).

Changes:
- extension/llm/export/gguf.py: register GGML_Q5_K = 13, _Q5_K_BLOCK_BYTES = 176, and add "q5_k" to the id / block-bytes maps.
- backends/mlx/custom_kernel_ops/gguf/q5k/{common,linear,embedding,__init__}.py: block_q5_K struct + per-element (embedding) and vectorized half4x4 (matmul) dequant helpers, mat-vec (decode) / mat-mat (prefill) / dynamic-M IfNode linear, and a per-element gather embedding. Ported from llama.cpp.
- backends/mlx/custom_kernel_ops/gguf/patterns.py: wire q5_k into the linear / embedding handlers and the supported-type sets.
- tests: add make_q5_k_blob + q5_k configs to test_linear.py / test_embedding.py, and Q5_K coverage to extension/llm/export/test/test_gguf.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
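For readers unfamiliar with the format, the decode described above can be sketched in pure Python. This is a minimal illustration, not the PR's Metal kernel: it assumes the llama.cpp `block_q5_K` field order (`d`, `dmin`, `scales[12]`, `qh[32]`, `qs[128]`) and the reference reconstruction `d * scale * q - dmin * min`. The names `dequantize_q5_k_block`, `BLOCK_BYTES`, and `K_SCALE_SIZE` are hypothetical and chosen only for this example.

```python
import struct

QK_K = 256          # weights per super-block
K_SCALE_SIZE = 12   # bytes holding eight packed 6-bit (scale, min) pairs
# d (fp16) + dmin (fp16) + scales + qh + qs
BLOCK_BYTES = 2 + 2 + K_SCALE_SIZE + QK_K // 8 + QK_K // 2


def get_scale_min_k4(j, scales):
    """Unpack the 6-bit (scale, min) pair for sub-block j in 0..7."""
    if j < 4:
        return scales[j] & 63, scales[j + 4] & 63
    # Sub-blocks 4..7: low 4 bits live in scales[8..11], the top 2 bits
    # are borrowed from the upper bits of scales[0..7].
    sc = (scales[j + 4] & 0x0F) | ((scales[j - 4] >> 6) << 4)
    mn = (scales[j + 4] >> 4) | ((scales[j] >> 6) << 4)
    return sc, mn


def dequantize_q5_k_block(block):
    """Decode one raw Q5_K block (bytes) into a list of 256 floats."""
    assert len(block) == BLOCK_BYTES
    d, dmin = struct.unpack_from("<2e", block, 0)  # fp16 super-block scale / min
    scales = block[4:16]
    qh = block[16:48]    # one high (5th) bit per weight, 8 bit-planes over 32 bytes
    qs = block[48:176]   # two 4-bit low codes per byte
    out = []
    for chunk in range(QK_K // 64):  # each chunk covers two 32-weight sub-blocks
        ql = qs[32 * chunk: 32 * chunk + 32]
        sc_a, m_a = get_scale_min_k4(2 * chunk, scales)
        sc_b, m_b = get_scale_min_k4(2 * chunk + 1, scales)
        bit_a = 1 << (2 * chunk)  # qh bit-plane for the low-nibble sub-block
        bit_b = 2 << (2 * chunk)  # qh bit-plane for the high-nibble sub-block
        for l in range(32):
            q = (ql[l] & 0x0F) + (16 if qh[l] & bit_a else 0)
            out.append(d * sc_a * q - dmin * m_a)
        for l in range(32):
            q = (ql[l] >> 4) + (16 if qh[l] & bit_b else 0)
            out.append(d * sc_b * q - dmin * m_b)
    return out
```

With this layout `BLOCK_BYTES` evaluates to 176, which agrees with the `_Q5_K_BLOCK_BYTES` value registered on the export side. The sketch favors readability over speed; a vectorized or on-device kernel would process many weights per step.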
c9a06e2 to f5bb7c2
f5bb7c2 to e3eba84
Hi @metascroy, can you have a look please?
@claude review this code |
Claude finished @metascroy's task in 3m 10s
Code review: Q5_K MLX support
Overall this is a clean, well-structured PR that closely mirrors the merged Q4_K/Q6_K kernels. I traced the Q5_K-specific math against the llama.cpp references and it checks out. A few notes below, none blocking.
Correctness: verified ✅
I cross-checked the three places where Q5_K diverges from Q4_K (the 5th bit from
The mat-mat kernel and the if-node/grid plumbing are structurally identical to Q4_K, with only the weight decode swapped, consistent with the PR description.
Minor improvement worth noting
Nits (non-blocking)
Validation gap (the main thing for reviewers to weigh)
The PR is upfront that the on-device Metal kernels (
No code changes requested. LGTM pending the CI GPU
Thanks @JaynouOliver! Overall it looks good, but a few comments:
Also parallelize the write out: see #20643 for similar improvements on q4k/q6k.
Summary
Adds fused Q5_K Metal kernels (linear + embedding) for the MLX backend and registers Q5_K on the export side, matching the existing Q6_K (#20004) and Q4_K (#20172) support. Resolves #20614.
Q5_K = Q4_K's affine super-block (`d`/`dmin` + 6-bit packed sub-block scales/mins via `get_scale_min_k4`) plus a Q6_K-style high-bit array: each weight is a 5-bit code whose low 4 bits come from `qs` and whose 5th bit comes from `qh`. The kernels consume the raw `block_q5_K` directly, with no export-time repack.
Changes
- `extension/llm/export/gguf.py`: register `GGML_Q5_K = 13`, `_Q5_K_BLOCK_BYTES = 176` (2 + 2 + 12 + QK_K//8 + QK_K//2), add `"q5_k"` to `_GGML_ID_BY_TYPE` / `_BLOCK_BYTES_BY_TYPE`. (No torchao `Int*Tensor` conversion path: the MLX kernels read the raw blob.)
- `backends/mlx/custom_kernel_ops/gguf/q5k/` (new):
  - `common.py`: `_Q5K_HEADER` with the `block_q5_K` struct, `dequant_q5k_elem` (per-element, embedding) and `dequantize_q5_K_16` (vectorized `half4x4`, matmul), reusing `get_scale_min_k4` / `get_scale_min_k4_just2`.
  - `linear.py`: `emit_linear` with mat-vec (decode), tiled simdgroup mat-mat (prefill), and dynamic-M `IfNode`. Byte-wise decode with the 5th bit from `qh`; affine `d*scale` / `dmin*min` per sub-block.
  - `embedding.py`: `emit_embedding`, a per-element Q5_K dequant gather.
  - `__init__.py`.
- `backends/mlx/custom_kernel_ops/gguf/patterns.py`: add `"q5_k"` to `_LINEAR_TYPES` / `_EMBEDDING_TYPES` and a dispatch branch in both handlers.
- `make_q5_k_blob` + `q5_k` configs in `test/test_linear.py` and `test/test_embedding.py`; Q5_K coverage in `extension/llm/export/test/test_gguf.py`.

Kernels ported from llama.cpp (`ggml-common.h` / `ggml-metal.metal`: `block_q5_K`, `dequantize_q5_K`, `kernel_mul_mv_q5_K_f32_impl`, `kernel_mul_mm`), MIT-licensed (Copyright (c) 2023-2024 The ggml authors); inline `ported from ...` notes kept as in the Q6_K / Q4_K kernels.
Validation
Done locally on Apple silicon (macOS 26.4, MLX 0.31.2):
- `extension/llm/export/test/test_gguf.py` passes (9/9), including the new Q5_K cases (`dequantize` == `gguf.dequantize` exactly; `torchao::dequantize_gguf` op; `torch.export` lowering; unsupported-type guard).
- The per-element (`dequant_q5k_elem`), vectorized (`dequantize_q5_K_16`), and mat-vec arithmetic were each transcribed and checked numerically against `gguf.dequantize` / `x @ Wᵀ`: exact for the dequant paths, ~3e-7 relative error for the mat-vec (fp32 rounding).
- The on-device kernel test (`test_linear.py` / `test_embedding.py run`) was not run locally: building the `op_test_runner` compiles MLX from source via `xcrun metal`, which needs full Xcode (only Command Line Tools were available on the dev box). Relying on CI to run these on the GPU. The mat-mat kernel is structurally identical to the merged Q4_K/Q6_K kernels and differs only in the (numerically validated) `dequantize_q5_K_16` weight decode.
Test plan
On an Apple-silicon machine: