# llama.cpp

Last updated on Mar 4th, 2025.

This repo is a clone of llama.cpp at commit `06c2b1561d8b882bc018554591f8c35eb04ad30e`. It is compatible with llama-cpp-python commit `710e19a81284e5af0d5db93cef7a9063b3e8534f`.
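If you intend to use the build through llama-cpp-python, a hedged sketch of pinning the compatible revision is shown below. How this fork is wired into llama-cpp-python's vendored llama.cpp is an assumption here, not something this README documents; `CMAKE_ARGS` is llama-cpp-python's standard way of forwarding CMake flags.

```sh
# Hedged sketch: pin the compatible llama-cpp-python commit.
git clone https://github.com/abetlen/llama-cpp-python
cd llama-cpp-python
git checkout 710e19a81284e5af0d5db93cef7a9063b3e8534f
# Assumption: this fork is swapped in as the vendored llama.cpp before building.
CMAKE_ARGS="-DQK4_0=128" pip install -e .
```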

## Customize quantization group size at compilation (CPU inference only)

The only difference from upstream is that you pass the `-DQK4_0` flag when configuring with CMake. For example, to build with a group size of 128:

```sh
cmake -B build_cpu_g128 -DQK4_0=128
cmake --build build_cpu_g128
```
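The same pattern should work for other group sizes. As an illustrative sketch (the directory name simply mirrors the convention above, and 64 is only an example value, assuming this fork accepts it for `QK4_0`):

```sh
# Illustrative only: a separate build tree for a different group size.
cmake -B build_cpu_g64 -DQK4_0=64
cmake --build build_cpu_g64
```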

To quantize a model with the customized group size, run:

```sh
./build_cpu_g128/bin/llama-quantize <model_path.gguf> <quantization_type>
```
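For example (the paths are illustrative; `Q4_0` is one of the standard llama.cpp quantization types, and llama-quantize also accepts an optional explicit output path before the type):

```sh
# Illustrative: quantize an f16 GGUF to Q4_0 using the group-128 build.
./build_cpu_g128/bin/llama-quantize ./models/model-f16.gguf ./models/model-q4_0-g128.gguf Q4_0
```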

To run inference with the quantized model, run:

```sh
./build_cpu_g128/bin/llama-cli -m <quantized_model_path.gguf>
```
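For example, a short test generation (the model path is illustrative; `-p` and `-n` are standard llama-cli options for the prompt and the number of tokens to generate):

```sh
# Illustrative: generate 64 tokens from a short prompt with the group-128 build.
./build_cpu_g128/bin/llama-cli -m ./models/model-q4_0-g128.gguf -p "Hello," -n 64
```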

Note:

Make sure the model you run was quantized with the same group size the binary was compiled with; otherwise you will get a runtime error when the model is loaded.