
opencl: split ggml-opencl.cl into multiple files and cleanup #12886


Merged · 10 commits merged into ggml-org:master on Apr 15, 2025

Conversation

@lhez (Contributor) commented on Apr 11, 2025

This PR splits ggml-opencl.cl into multiple .cl files, with some cleanup. It also allows the OpenCL backend to run on older Adreno GPUs such as the Adreno 660. Currently, Adreno compilers newer than E031.38.01.00 should work.
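For illustration, a minimal sketch of how such a compiler-version gate could work, assuming the version is taken from the `Compiler Exx.xx.xx.xx` token in the driver string (as seen in the logs below). The helper names and comparison logic here are hypothetical, not the exact code in this PR:

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: pull the Adreno compiler version out of a driver
// string such as "OpenCL 2.0 QUALCOMM build: ... Compiler E031.38.01.00".
static bool parse_adreno_cl_compiler_version(const char * driver_version,
                                             int * v /* v[4] */) {
    const char * tok = std::strstr(driver_version, "Compiler E");
    return tok && std::sscanf(tok, "Compiler E%d.%d.%d.%d",
                              &v[0], &v[1], &v[2], &v[3]) == 4;
}

// Hypothetical gate: true if the parsed version is newer than the
// E031.38.01.00 reference mentioned in the PR description.
static bool adreno_compiler_is_newer(const char * driver_version) {
    int cur[4];
    if (!parse_adreno_cl_compiler_version(driver_version, cur)) {
        return false;
    }
    const int ref[4] = { 31, 38, 1, 0 };  // E031.38.01.00
    for (int i = 0; i < 4; ++i) {
        if (cur[i] != ref[i]) {
            return cur[i] > ref[i];  // compare component by component
        }
    }
    return false;  // exactly equal is not "newer"
}
```

For example, the E031.37.12.07 compiler from the Adreno 650 logs below would compare as older than the reference.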

@lhez changed the title from "opencl: break ggml-opencl.cl into multiple files and cleanup" to "opencl: split ggml-opencl.cl into multiple files and cleanup" on Apr 11, 2025
@max-krasnyansky (Collaborator) left a comment

Very nice! Love the new clean kernel names and things.

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 11, 2025
@zhouwg (Contributor) commented on Apr 11, 2025

It's a good idea, and I borrowed it for my candidate PR accordingly: splitting makes the highly complex, frequently changing code clearer, while stable code can be kept in a single self-contained source file.

Thanks!

@lhez marked this pull request as ready for review on April 11, 2025 at 18:42
@zhouwg (Contributor) commented on Apr 12, 2025

> Very nice! Love the new clean kernel names and things.

Max, sorry to bother you. I know your time is valuable, and I know from the threadpool PR that you are a staff tech expert. Thanks again for your breakthrough reminder on 03/18/2025.

I observed that GGML_OP_ADD performance through HWACCEL_CDSP is faster than the default ggml backend on a Snapdragon 8 Elite phone, and much faster than QNN-NPU (latest QNN SDK) on the same phone. Could your team help verify this with the latest source code in that PR? I'd like to contribute that PR to your team (I'm not sure about your team's relationship with Linaro, since I see your team's codebase is on CodeLinaro) and, if this is really the correct direction, collaborate on further development of this topic as a volunteer programmer.

@tomaszduda23 commented
Cool work. It seems to work on the Adreno 650. The performance results are kind of strange, though: it seems slower than the CPU.

OpenCL

LD_LIBRARY_PATH="/vendor/lib64" llama-bench -m qwen2.5-0.5b-instruct-q4_0.gguf -t 4 -p 32 -n 32 
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) (OpenCL 2.0 Adreno(TM) 650)'
ggml_opencl: OpenCL driver: OpenCL 2.0 QUALCOMM build: commit #b213cd5627 changeid #I42f35bf1e0 Date: 06/11/23 Sun Local Branch:  Remote Branch:  Compiler E031.37.12.07
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................
load_backend: loaded OpenCL backend from /data/data/com.termux/files/usr/bin/../lib/libggml-opencl.so
load_backend: loaded CPU backend from /data/data/com.termux/files/usr/bin/../lib/libggml-cpu.so
| model                          |       size |     params | backend    | ngl | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 403.20 MiB |   630.17 M | OpenCL     |  99 |       4 |          pp32 |         85.34 ± 4.14 |
| qwen2 1B Q4_0                  | 403.20 MiB |   630.17 M | OpenCL     |  99 |       4 |          tg32 |         18.05 ± 0.74 |

CPU

LD_LIBRARY_PATH="/vendor/lib64" llama-bench -m qwen2.5-0.5b-instruct-q4_0.gguf -t 4 -p 32 -n 32 -ngl 0
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) (OpenCL 2.0 Adreno(TM) 650)'
ggml_opencl: OpenCL driver: OpenCL 2.0 QUALCOMM build: commit #b213cd5627 changeid #I42f35bf1e0 Date: 06/11/23 Sun Local Branch:  Remote Branch:  Compiler E031.37.12.07
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................
load_backend: loaded OpenCL backend from /data/data/com.termux/files/usr/bin/../lib/libggml-opencl.so
load_backend: loaded CPU backend from /data/data/com.termux/files/usr/bin/../lib/libggml-cpu.so
| model                          |       size |     params | backend    | ngl | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 403.20 MiB |   630.17 M | OpenCL     |   0 |       4 |          pp32 |         83.18 ± 0.06 |
| qwen2 1B Q4_0                  | 403.20 MiB |   630.17 M | OpenCL     |   0 |       4 |          tg32 |         49.03 ± 0.50 |

@max-krasnyansky merged commit 80f19b4 into ggml-org:master on Apr 15, 2025
51 checks passed
@max-krasnyansky (Collaborator) commented
> Very nice! Love the new clean kernel names and things.
>
> Max, sorry to bother you. I know your time is valuable, and I know from the threadpool PR that you are a staff tech expert. Thanks again for your breakthrough reminder on 03/18/2025.
>
> I observed that GGML_OP_ADD performance through HWACCEL_CDSP is faster than the default ggml backend on a Snapdragon 8 Elite phone, and much faster than QNN-NPU (latest QNN SDK) on the same phone. Could your team help verify this with the latest source code in that PR? I'd like to contribute that PR to your team (I'm not sure about your team's relationship with Linaro, since I see your team's codebase is on CodeLinaro) and, if this is really the correct direction, collaborate on further development of this topic as a volunteer programmer.

Sorry for the delayed feedback. I'll try to spend some time reviewing that PR later this week (a bit too much going on right now).

@max-krasnyansky (Collaborator) commented
> Cool work. It seems to work on the Adreno 650. The performance results are kind of strange, though: it seems slower than the CPU.

Please try a pure Q4_0 model, i.e. use the --pure option for llama-quantize. The Q6_K layers that we add by default to Q4_0 models are not fully optimized at this point.
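
For reference, producing a pure Q4_0 file would look roughly like this (the filenames are illustrative, assuming an f16 GGUF as the starting point):

llama-quantize --pure qwen2.5-0.5b-instruct-f16.gguf qwen2.5-0.5b-instruct-q4_0-pure.gguf Q4_0

and then rerunning the same llama-bench command against the resulting file.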

colout pushed a commit to colout/llama.cpp that referenced this pull request on Apr 21, 2025

…org#12886)

* opencl: refactor - split the kernel files
* opencl: split more kernels into separate files
* opencl: specify subgroup size instead of querying it
* opencl: refine Adreno cl compiler version parsing
* opencl: skip some kernels not used by Adreno on old compilers
* opencl: refine logic for selecting Adreno kernels
* opencl: refine Adreno cl compiler version
* opencl: cleanup preprocessor for kernels
* opencl: add final newline for `mul_mv_f16_f16.cl`
* opencl: consider Adreno CL compiler on Windows

---------

Co-authored-by: Shangqing Gu <[email protected]>
Labels: ggml (changes relating to the ggml tensor library for machine learning)
4 participants