sync : llama.cpp #1356

ggerganov · 2025-09-29T09:35:12Z

No description provided.

* CUDA: add a fused top-K MoE kernel

  This kernel does the following:
  1. softmax over the logits per token [n_experts, n_tokens]
  2. argmax reduce over the top-k (n_experts_used) logits
  3. write weights + ids to global memory

  It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models
* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
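For readers who want a reference for what the fused kernel computes per token, the three steps can be sketched on the CPU in Python. This is only an illustration under assumptions: the function name `topk_moe`, the `normalize` parameter (standing in for the "optional norm"), and the tie-breaking rule (lowest expert index wins) are not taken from the kernel source.

```python
import math

def topk_moe(logits, k, normalize=False):
    """Return (weights, ids) for the k largest softmax probabilities of one token."""
    # Step 1: numerically stable softmax over all experts.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 2: repeated argmax, masking out each winner before the next round.
    weights, ids = [], []
    remaining = list(probs)
    for _ in range(k):
        # Assumption: ties resolve to the lowest expert index.
        best = max(range(len(remaining)), key=lambda i: (remaining[i], -i))
        ids.append(best)
        weights.append(probs[best])
        remaining[best] = -math.inf

    # Optional step: renormalize the selected weights so they sum to 1.
    if normalize:
        s = sum(weights)
        weights = [w / s for w in weights]

    # Step 3: the kernel would write these to global memory; here they are returned.
    return weights, ids
```

Calling `topk_moe([1.0, 3.0, 2.0, 3.0], k=2, normalize=True)` selects experts 1 and 3 with equal weights, which is the kind of tie the "fix tie breakers" commit is concerned with.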

Signed-off-by: Xiaodong Ye <[email protected]>

* ggml-cpu: impl mxfp4 s390x Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: missing s = sumf Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix incorrect kval_mxfp4 type Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: rework mxfp4 Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: missing delta calc Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix typo Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix typo for vec_splats Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: expand to 2 blocks per loop Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: add unroll to boost perf Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: back to 1 block per loop to test perf Signed-off-by: Aaron Teo <[email protected]>
* Revert "ggml-cpu: back to 1 block per loop to test perf"
  This reverts commit 1fe55724e2dc295701101bf838bdd4a512237492. Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: rm unroll from single block Signed-off-by: Aaron Teo <[email protected]>

---------
Signed-off-by: Aaron Teo <[email protected]>
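As background for the mxfp4 commits, a scalar reference for decoding one MXFP4 block may help when checking a SIMD port. The sketch below follows the general MX format (4-bit E2M1 elements sharing one E8M0 power-of-two scale). The function name and the nibble layout (low nibbles fill the first half of the block, high nibbles the second half) are assumptions made here and are not confirmed by this PR.

```python
# E2M1 magnitudes for the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequant_mxfp4_block(scale_e8m0, qs):
    """Decode one block.

    scale_e8m0: shared exponent byte, interpreted as 2**(scale_e8m0 - 127).
    qs: packed bytes, two 4-bit codes per byte.
    """
    d = 2.0 ** (scale_e8m0 - 127)
    half = len(qs)
    out = [0.0] * (2 * half)
    for j, byte in enumerate(qs):
        # Assumed layout: low nibble -> element j, high nibble -> element j + half.
        for code, idx in ((byte & 0x0F, j), (byte >> 4, j + half)):
            mag = E2M1[code & 0x7]
            out[idx] = (-mag if code & 0x8 else mag) * d
    return out
```

With a scale byte of 127 the block is decoded at unit scale, so a packed byte `0x21` yields 0.5 for its low nibble and 1.0 for its high nibble.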

…6185)

* vendor : update httplib Signed-off-by: Adrien Gallouët <[email protected]>
* common : use cpp-httplib as a cURL alternative for downloads
  The existing cURL implementation is intentionally left untouched to prevent any regressions and to allow for safe, side-by-side testing by toggling the `LLAMA_CURL` CMake option. Signed-off-by: Adrien Gallouët <[email protected]>
* ggml : Bump to Windows 10 Signed-off-by: Adrien Gallouët <[email protected]>

---------
Signed-off-by: Adrien Gallouët <[email protected]>

* devops: move s390x and ppc64le ci build
  we have access to ubuntu-24.04-s390x and ppc64le images now Signed-off-by: Aaron Teo <[email protected]>
* devops: disable ppc64le for now since they have compiler errors Signed-off-by: Aaron Teo <[email protected]>
* devops: stop warnings as errors Signed-off-by: Aaron Teo <[email protected]>
* devops: switch to non-macro flag Signed-off-by: Aaron Teo <[email protected]>
* devops: going the llama macro route Signed-off-by: Aaron Teo <[email protected]>
* devops: add big-endian gguf test models Signed-off-by: Aaron Teo <[email protected]>
* devops: disable ppc64le to test s390x, check test build Signed-off-by: Aaron Teo <[email protected]>
* devops: dup .gguf.inp files for big-endian tests Signed-off-by: Aaron Teo <[email protected]>
* devops: dup .gguf.out files for big-endian too Signed-off-by: Aaron Teo <[email protected]>
* devops: add python setup and endian byteswap Signed-off-by: Aaron Teo <[email protected]>
* devops: poor thing does not have s390x python3 Signed-off-by: Aaron Teo <[email protected]>
* devops: add missing rust compiler for s390x Signed-off-by: Aaron Teo <[email protected]>
* devops: try rust actions runner Signed-off-by: Aaron Teo <[email protected]>
* Revert "devops: try rust actions runner"
  This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c. Signed-off-by: Aaron Teo <[email protected]>
* devops: try a different path for rust Signed-off-by: Aaron Teo <[email protected]>
* devops: dump home directory and user info Signed-off-by: Aaron Teo <[email protected]>
* devops: install gguf-py only Signed-off-by: Aaron Teo <[email protected]>
* devops: missed relative path Signed-off-by: Aaron Teo <[email protected]>
* devops: remove big-endian files since local swapping is working Signed-off-by: Aaron Teo <[email protected]>
* devops: revert test-tokenizer-0 cmakelists Signed-off-by: Aaron Teo <[email protected]>
* Fix unicode flags conversion from and to uint16_t
  Bitfields are allocated in different order on s390x Signed-off-by: Aaron Teo <[email protected]>
* Simplify byteswap command Signed-off-by: Aaron Teo <[email protected]>
* Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs Signed-off-by: Aaron Teo <[email protected]>
* Fix endianness detection in vocab loader Signed-off-by: Aaron Teo <[email protected]>
* Disable test-thread-safety on s390x
  In this test a model is downloaded, then immediately loaded to check if more downloads are needed, and then used for test. There is no clean way to separate all those steps to add byteswapping between them, so just skip this test. Signed-off-by: Aaron Teo <[email protected]>
* Fix q8_0 test in test-quantize-fns
  vec_signed uses unexpected rounding mode. Explicitly use different rounding function. Signed-off-by: Aaron Teo <[email protected]>
* devops: add big-endian stories260K Signed-off-by: Aaron Teo <[email protected]>
* devops: add s390x test-eval-callback Signed-off-by: Aaron Teo <[email protected]>
* devops: fix test does not exist Signed-off-by: Aaron Teo <[email protected]>
* devops: fix model not found llama-eval-callback Signed-off-by: Aaron Teo <[email protected]>
* Fix q3_K dot product error in test-quantize-fns on s390x
  Array q8bytes had only 4 elements allocated, but 8 elements accessed. This led to an out-of-bounds write, a later out-of-bounds read of the overwritten values, and an incorrect result. Signed-off-by: Aaron Teo <[email protected]>
* devops: re-enable ppc64le for testing Signed-off-by: Aaron Teo <[email protected]>
* devops: activate test-thread-safety for s390x Signed-off-by: Aaron Teo <[email protected]>
* devops: disable ppc64le tests
  for some reason it keeps failing test-thread-safety tests and I do not have a machine that is able to replicate the tests. Signed-off-by: Aaron Teo <[email protected]>
* devops: LLAMA_FATAL_WARNINGS=ON Signed-off-by: Aaron Teo <[email protected]>
* Correct repository URL for s390x for test-thread-safety model Signed-off-by: Aaron Teo <[email protected]>
* Fix fs_get_cache_directory
  Ensure it works even if both XDG_CACHE_HOME and HOME are unset. This might happen in containers. Signed-off-by: Aaron Teo <[email protected]>
* Re-enable CI for ppc64le Signed-off-by: Aaron Teo <[email protected]>
* Fortify ggml_rope_impl
  Only memcpy data from sections argument if it's non-NULL. Signed-off-by: Aaron Teo <[email protected]>
* Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way
* Update URL for big-endian model
* Update .github/workflows/build.yml Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Update remaining mentions of BE models to ggml-org/models repo

---------
Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by: Aleksei Nikiforov <[email protected]>
Co-authored-by: Aleksei Nikiforov <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
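The bitfield and byteswap items above both come down to not relying on the host's memory layout. A small Python sketch of the two ideas follows. The flag names and bit positions are hypothetical and are not the actual members of `unicode_cpt_flags`; the point is only that explicit shifts give the same 16-bit value on every platform, while a compiler-laid-out bitfield may not.

```python
import struct

# Hypothetical flag names, each pinned to an explicit bit position.
FLAG_BITS = {"is_number": 0, "is_letter": 1, "is_whitespace": 2}

def flags_to_u16(flags):
    """Pack boolean flags into an integer using explicit shifts."""
    value = 0
    for name, bit in FLAG_BITS.items():
        if flags.get(name):
            value |= 1 << bit
    return value

def flags_from_u16(value):
    """Unpack the integer back into named booleans."""
    return {name: bool((value >> bit) & 1) for name, bit in FLAG_BITS.items()}

def byteswap_u32(raw):
    """Re-encode a 4-byte little-endian field as big-endian."""
    (v,) = struct.unpack("<I", raw)
    return struct.pack(">I", v)
```

Because the packing never depends on how a compiler orders bitfield members, the round trip `flags_from_u16(flags_to_u16(f))` is stable across little-endian and big-endian hosts.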

The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few changes - add a_offset and divide iqs by 2. It's probably possible to call these functions from mul_mm_funcs and avoid the duplication, but I didn't go that far in this change.



…lama/16277)

* CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32
  This commit adds mul_mat_id support for ncols_dst >= 16. It does this by packing ncols_dst tiles into the blockDim.y. My tests on a RTX 3090 show that this is faster than the cuBLAS fallback for f16 till bs=64, and for f32 till bs=32
* Review: refactor if statement
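For context, the operation being accelerated is an indirect matrix multiply: each token is multiplied only by the expert matrices selected for it. A plain Python reference, written as a simplification, is sketched below. It assumes one input vector per token shared by all of that token's selected experts, and the function and parameter names are illustrative rather than the ggml API.

```python
def mul_mat_id(experts, x, ids):
    """Indirect matmul reference.

    experts[e]: weight matrix of expert e as a list of rows.
    x[t]:       input vector of token t.
    ids[t]:     expert indices selected for token t.
    Returns out[t][slot], one output vector per selected expert.
    """
    out = []
    for t, chosen in enumerate(ids):
        per_token = []
        for e in chosen:
            w = experts[e]
            # Dense matrix-vector product with the selected expert only.
            per_token.append([sum(wi * xi for wi, xi in zip(row, x[t])) for row in w])
        out.append(per_token)
    return out
```

A fused GPU kernel produces the same values while batching the per-expert products; a reference like this is mainly useful for checking results on small shapes.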


The "Clamp" spec constant is already based on whether KV is a multiple of Bc, so use that to control whether bounds checking is performed. Add bounds checking to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V tensors are already optionally clamped, nothing else needed to be changed).
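The idea of letting one flag decide whether bounds checks run at all can be sketched in Python. Only the rule "clamp when KV is not a multiple of Bc" comes from the description above; `tiled_sum` and the reduction it performs are illustrative stand-ins for the shader's tile loop.

```python
def tiled_sum(values, kv, bc):
    """Sum values[0:kv] in tiles of bc columns."""
    # Analogous to the "Clamp" spec constant: true only if the last tile is partial.
    clamp = kv % bc != 0
    total = 0.0
    for start in range(0, kv, bc):
        for c in range(start, start + bc):
            # The check is evaluated only when clamping is enabled; when KV is a
            # multiple of Bc every index is in range by construction.
            if clamp and c >= kv:
                continue
            total += values[c]
    return total
```

With `kv=5, bc=4` the second tile covers columns 4..7, and the check skips columns 5..7 instead of reading past the end.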

* vulkan: handle mat_mul with A matrix > 4GB
  This change splits mat_mul operations with huge A matrix into chunks in the M dimension. This works well for stable-diffusion use cases where the im2col matrix has very large M.
  Fix the order of setting the stride in mul_mm_cm2 - setting the dimension clobbers the stride, so stride should be set after.
* build fixes
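Splitting along M amounts to choosing row ranges whose byte size stays within a limit. A minimal Python sketch of that bookkeeping is below. The helper name `split_rows`, its parameters, and the default limit of 2**32 bytes are assumptions for illustration and do not describe the backend's actual chunking code.

```python
def split_rows(m, k, elem_size, max_bytes=2**32):
    """Return (row_start, row_count) chunks of an m x k matrix,
    each chunk occupying at most max_bytes."""
    row_bytes = k * elem_size
    if row_bytes > max_bytes:
        raise ValueError("a single row exceeds the limit; splitting along M cannot help")
    rows_per_chunk = max(1, max_bytes // row_bytes)
    chunks = []
    for start in range(0, m, rows_per_chunk):
        chunks.append((start, min(rows_per_chunk, m - start)))
    return chunks
```

Each chunk can then be multiplied separately, since the rows of the output depend only on the matching rows of A.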


* metal : support mul_mm with src1->type == GGML_TYPE_F16
* metal : support mul_mm_id with src1->type == GGML_TYPE_F16 [no ci]
* metal : mul_mm support ne00 % 32 != 0
* metal : support mul_mm_id with ne00 % 32 != 0
* cont : remove unnecessary unrolls
* cont : simplify data loading
* metal : optimize mul_mm when output bounds checks are not needed

* vulkan: 64-bit im2col
  Add variants of the im2col shaders that use buffer_device_address/buffer_reference, and use 64-bit address calculations. This is needed for large convolutions used in stable-diffusion.cpp.
* fix validation error for large im2col
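Why 64-bit address calculations matter for large im2col buffers can be shown with a small Python model of unsigned wraparound. The helper names and the example sizes are made up for illustration.

```python
def offset_u32(row, row_stride, col):
    """Element offset computed with 32-bit unsigned wraparound."""
    return (row * row_stride + col) & 0xFFFFFFFF

def offset_u64(row, row_stride, col):
    """Element offset computed with 64-bit unsigned arithmetic."""
    return (row * row_stride + col) & 0xFFFFFFFFFFFFFFFF
```

For `row=70000, row_stride=70000, col=5` the true offset is 4900000005, which no longer fits in 32 bits, so `offset_u32` silently wraps to a much smaller index while `offset_u64` returns the correct one.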




junchao-loongson and others added 23 commits September 29, 2025 12:29

ggml : fix loongarch lsx compilation error (llama/15864)

459520b

musa: fix build warnings (llama/15611)

b4f6383

Signed-off-by: Xiaodong Ye <[email protected]>

metal : report OOM errors (llama/16274)

004a5e5

vulkan: throw system error instead of SIGABRT during init on older devices (llama/16156)

698d6e3

* Throw system error on old Vulkan driver rather than SIGABRT
* Optionally handle any potential error in vulkan init
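The difference between aborting and raising a catchable error can be sketched in Python. Everything here is hypothetical: `BackendInitError`, `init_backend`, the version tuple, and the minimum version are stand-ins chosen for illustration, not the Vulkan backend's code.

```python
class BackendInitError(RuntimeError):
    """Hypothetical error type standing in for a system error raised during init."""

def init_backend(driver_version, min_version=(1, 2)):
    """Raise instead of terminating the process when the driver is too old."""
    if driver_version < min_version:
        raise BackendInitError(f"driver {driver_version} is older than {min_version}")
    return {"driver": driver_version}

def device_count(driver_version):
    """Caller-side handling: treat an init failure as 'no usable devices'."""
    try:
        init_backend(driver_version)
    except BackendInitError:
        return 0
    return 1
```

An exception leaves the decision to the caller, which can fall back to another backend, whereas an abort ends the whole process.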

CUDA: refactor and deduplicate vector FA kernels (llama/16208)

511afb1

* CUDA: refactor and deduplicate vector FA kernels

vulkan : make the vulkan.hpp dynamic dispatcher instance private (llama/16224)

285ab85

* don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE which can cause conflicts if application or other libraries do the same

metal : fuse non-sequential nodes (llama/16102)

f00fab9

* metal : fuse non-sequential nodes
* cont : add comment
* cont : simplify bounds checks

ggml : fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 (llama/16307)

d647928

* fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32
* add test that fails on simd
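An argument-order bug in a fused multiply-add is easy to see in a scalar model. The sketch below assumes an accumulator-first convention, `vec_fma(a, b, c) = a + b * c`, and a multiply-add of the form `y[i] = x[i] * s + b`. Both the convention and the particular wrong ordering shown are assumptions for illustration and are not taken from the diff.

```python
def vec_fma(a, b, c):
    """Assumed convention: accumulator first, a + b * c per lane."""
    return [ai + bi * ci for ai, bi, ci in zip(a, b, c)]

def mad1(x, s, b):
    """y[i] = x[i] * s + b"""
    n = len(x)
    return vec_fma([b] * n, x, [s] * n)

def mad1_wrong_order(x, s, b):
    """One possible mistake: passing x as the accumulator, giving x[i] + s * b."""
    n = len(x)
    return vec_fma(x, [s] * n, [b] * n)
```

For `x = [1.0, 2.0]`, `s = 3.0`, `b = 0.5` the two versions disagree, which is why a test that exercises the SIMD path catches the bug even when the scalar fallback is correct.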

vulkan: Fix validation failure in quantized flash attention (llama/16292)

9d51698

ggml : fix dependencies for ggml_set_rows (llama/16318)

7986874

ggml : check cuda and metal argsort limits and add test (llama/16323)

8514f7d

* check cuda argsort limits and add test
* add metal check
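A backend-side limit check of this kind can be sketched as a small predicate. The sketch assumes a sort that handles one row per thread block and pads the row to a power of two; the thread and shared-memory limits are illustrative defaults, not values read from the CUDA or Metal backends.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p

def argsort_supported(ncols, max_threads=1024, max_shared_bytes=48 * 1024, index_bytes=4):
    """Hypothetical capability check for a one-block sort over a padded row."""
    padded = next_pow2(ncols)
    return padded <= max_threads and padded * index_bytes <= max_shared_bytes
```

Under these assumed limits a row of 1000 columns is accepted (padded to 1024), while 1025 columns would need 2048 threads and is rejected so the scheduler can route the op elsewhere.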

sync : llama.cpp

79d1a9c

cmake : remove metal flag (llama/0)

0879f02

ggerganov merged commit f3d690c into master Sep 29, 2025
8 checks passed

ggerganov deleted the sync-llama.cpp-25-09-29 branch September 29, 2025 09:48


11 participants
