
opencl: split ggml-opencl.cl into multiple files and cleanup #12886


Merged · 10 commits merged into ggml-org:master on Apr 15, 2025

Conversation

@lhez (Contributor) commented on Apr 11, 2025

This PR splits ggml-opencl.cl into multiple .cl files, with some cleanup. It also allows the OpenCL backend to run on older Adreno GPUs such as the Adreno 660. Currently, Adreno compilers newer than E031.38.01.00 should work.
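For illustration, a minimal sketch of how such a compiler-version gate could work, assuming the version is taken from the `Compiler Exx.xx.xx.xx` token in the driver string (as seen in the logs below). The helper names and comparison logic here are hypothetical, not the exact code in this PR:

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical helper: pull the Adreno compiler version out of a driver
// string such as "OpenCL 2.0 QUALCOMM build: ... Compiler E031.38.01.00".
static bool parse_adreno_cl_compiler_version(const char * driver_version,
                                             int * v /* v[4] */) {
    const char * tok = std::strstr(driver_version, "Compiler E");
    return tok && std::sscanf(tok, "Compiler E%d.%d.%d.%d",
                              &v[0], &v[1], &v[2], &v[3]) == 4;
}

// Hypothetical gate: true if the parsed version is newer than the
// E031.38.01.00 reference mentioned in the PR description.
static bool adreno_compiler_is_newer(const char * driver_version) {
    int cur[4];
    if (!parse_adreno_cl_compiler_version(driver_version, cur)) {
        return false;
    }
    const int ref[4] = { 31, 38, 1, 0 };  // E031.38.01.00
    for (int i = 0; i < 4; ++i) {
        if (cur[i] != ref[i]) {
            return cur[i] > ref[i];  // compare component by component
        }
    }
    return false;  // exactly equal is not "newer"
}
```

For example, the E031.37.12.07 compiler from the Adreno 650 logs below would compare as older than the reference.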

@lhez changed the title from "opencl: break ggml-opencl.cl into multiple files and cleanup" to "opencl: split ggml-opencl.cl into multiple files and cleanup" on Apr 11, 2025
@max-krasnyansky (Collaborator) left a comment

Very nice! Love the new clean kernel names and things.

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 11, 2025
@zhouwg (Contributor) commented on Apr 11, 2025

It's a good idea, and I borrowed it for my candidate PR accordingly: splitting makes the highly complex, frequently changing code clearer, while stable code can be kept in a single self-contained source file.

Thanks!

@lhez marked this pull request as ready for review on April 11, 2025 at 18:42
@zhouwg (Contributor) commented on Apr 12, 2025

> Very nice! Love the new clean kernel names and things.

Max, sorry to bother you. I know your time is valuable, and I know from the threadpool PR that you are a staff tech expert. Thanks again for your breakthrough reminder on 03/18/2025.

I observed that GGML_OP_ADD performance through HWACCEL_CDSP is faster than the default ggml backend on a Snapdragon 8 Elite phone, and much faster than QNN-NPU (latest QNN SDK) on the same phone. Could your team help verify this with the latest source code in that PR? I'd like to contribute that PR to your team (I'm not sure about your team's relationship with Linaro, since I see your team's codebase is on CodeLinaro) and, if this is really the correct direction, collaborate on further development of this topic as a volunteer programmer.

@tomaszduda23 commented
Cool work. It seems to work on the Adreno 650. The performance results are kind of strange, though: it seems slower than the CPU.

OpenCL

LD_LIBRARY_PATH="/vendor/lib64" llama-bench -m qwen2.5-0.5b-instruct-q4_0.gguf -t 4 -p 32 -n 32 
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) (OpenCL 2.0 Adreno(TM) 650)'
ggml_opencl: OpenCL driver: OpenCL 2.0 QUALCOMM build: commit #b213cd5627 changeid #I42f35bf1e0 Date: 06/11/23 Sun Local Branch:  Remote Branch:  Compiler E031.37.12.07
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................
load_backend: loaded OpenCL backend from /data/data/com.termux/files/usr/bin/../lib/libggml-opencl.so
load_backend: loaded CPU backend from /data/data/com.termux/files/usr/bin/../lib/libggml-cpu.so
| model                          |       size |     params | backend    | ngl | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 403.20 MiB |   630.17 M | OpenCL     |  99 |       4 |          pp32 |         85.34 ± 4.14 |
| qwen2 1B Q4_0                  | 403.20 MiB |   630.17 M | OpenCL     |  99 |       4 |          tg32 |         18.05 ± 0.74 |

CPU

LD_LIBRARY_PATH="/vendor/lib64" llama-bench -m qwen2.5-0.5b-instruct-q4_0.gguf -t 4 -p 32 -n 32 -ngl 0
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) (OpenCL 2.0 Adreno(TM) 650)'
ggml_opencl: OpenCL driver: OpenCL 2.0 QUALCOMM build: commit #b213cd5627 changeid #I42f35bf1e0 Date: 06/11/23 Sun Local Branch:  Remote Branch:  Compiler E031.37.12.07
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................
load_backend: loaded OpenCL backend from /data/data/com.termux/files/usr/bin/../lib/libggml-opencl.so
load_backend: loaded CPU backend from /data/data/com.termux/files/usr/bin/../lib/libggml-cpu.so
| model                          |       size |     params | backend    | ngl | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 403.20 MiB |   630.17 M | OpenCL     |   0 |       4 |          pp32 |         83.18 ± 0.06 |
| qwen2 1B Q4_0                  | 403.20 MiB |   630.17 M | OpenCL     |   0 |       4 |          tg32 |         49.03 ± 0.50 |

@max-krasnyansky merged commit 80f19b4 into ggml-org:master on Apr 15, 2025
51 checks passed
@max-krasnyansky (Collaborator) commented
> Very nice! Love the new clean kernel names and things.
>
> Max, sorry to bother you. I know your time is valuable, and I know from the threadpool PR that you are a staff tech expert. Thanks again for your breakthrough reminder on 03/18/2025.
>
> I observed that GGML_OP_ADD performance through HWACCEL_CDSP is faster than the default ggml backend on a Snapdragon 8 Elite phone, and much faster than QNN-NPU (latest QNN SDK) on the same phone. Could your team help verify this with the latest source code in that PR? I'd like to contribute that PR to your team (I'm not sure about your team's relationship with Linaro, since I see your team's codebase is on CodeLinaro) and, if this is really the correct direction, collaborate on further development of this topic as a volunteer programmer.

Sorry for the delayed feedback. I'll try to spend some time reviewing that PR later this week (a bit too much going on right now).

@max-krasnyansky (Collaborator) commented
> Cool work. It seems to work on the Adreno 650. The performance results are kind of strange, though: it seems slower than the CPU.

Please try a pure Q4_0 model, i.e. use the --pure option for llama-quantize. The Q6_K layers that we add by default to Q4_0 models are not fully optimized at this point.
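
For reference, producing a pure Q4_0 file would look roughly like this (the filenames are illustrative, assuming an f16 GGUF as the starting point):

llama-quantize --pure qwen2.5-0.5b-instruct-f16.gguf qwen2.5-0.5b-instruct-q4_0-pure.gguf Q4_0

and then rerunning the same llama-bench command against the resulting file.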

colout pushed a commit to colout/llama.cpp that referenced this pull request on Apr 21, 2025

…org#12886)

* opencl: refactor - split the kernel files
* opencl: split more kernels into separate files
* opencl: specify subgroup size instead of querying it
* opencl: refine Adreno cl compiler version parsing
* opencl: skip some kernels not used by Adreno on old compilers
* opencl: refine logic for selecting Adreno kernels
* opencl: refine Adreno cl compiler version
* opencl: cleanup preprocessor for kernels
* opencl: add final newline for `mul_mv_f16_f16.cl`
* opencl: consider Adreno CL compiler on Windows

---------

Co-authored-by: Shangqing Gu <[email protected]>
Labels: ggml (changes relating to the ggml tensor library for machine learning)
4 participants