opencl: split ggml-opencl.cl into multiple files and cleanup #12886
Conversation
Very nice! Love the new clean kernel names and things.
It's a good idea, and I borrowed it for my candidate PR accordingly: keep the highly complex and frequently changing code clearly separated, while stable code can be put into a single self-contained source file. Thanks!
Max, sorry to bother you. I know your time is valuable, and I know from the threadpool PR that you are a staff tech expert; thanks again for your breakthrough reminder on 03/18/2025. I observed that GGML_OP_ADD's performance through HWACCEL_CDSP is faster than the default ggml backend on a Snapdragon 8 Elite phone, and much faster than QNN-NPU (latest QNN SDK) on the same phone. Could you help verify this in your team with the latest source code in that PR? I'd like to contribute that PR to your team (I'm not sure of your team's relationship with Linaro, since I see your team's codebase is on CodeLinaro) and collaborate on further development of related topics as a volunteer programmer, if this is really the correct direction.
Cool work. It seems to work on Adreno 650. The performance results are kind of strange, though: it seems slower than the CPU.

OpenCL:
LD_LIBRARY_PATH="/vendor/lib64" llama-bench -m qwen2.5-0.5b-instruct-q4_0.gguf -t 4 -p 32 -n 32
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) (OpenCL 2.0 Adreno(TM) 650)'
ggml_opencl: OpenCL driver: OpenCL 2.0 QUALCOMM build: commit #b213cd5627 changeid #I42f35bf1e0 Date: 06/11/23 Sun Local Branch: Remote Branch: Compiler E031.37.12.07
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................
load_backend: loaded OpenCL backend from /data/data/com.termux/files/usr/bin/../lib/libggml-opencl.so
load_backend: loaded CPU backend from /data/data/com.termux/files/usr/bin/../lib/libggml-cpu.so
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | OpenCL | 99 | 4 | pp32 | 85.34 ± 4.14 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | OpenCL | 99 | 4 | tg32 | 18.05 ± 0.74 |
CPU:
LD_LIBRARY_PATH="/vendor/lib64" llama-bench -m qwen2.5-0.5b-instruct-q4_0.gguf -t 4 -p 32 -n 32 -ngl 0
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) (OpenCL 2.0 Adreno(TM) 650)'
ggml_opencl: OpenCL driver: OpenCL 2.0 QUALCOMM build: commit #b213cd5627 changeid #I42f35bf1e0 Date: 06/11/23 Sun Local Branch: Remote Branch: Compiler E031.37.12.07
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................
load_backend: loaded OpenCL backend from /data/data/com.termux/files/usr/bin/../lib/libggml-opencl.so
load_backend: loaded CPU backend from /data/data/com.termux/files/usr/bin/../lib/libggml-cpu.so
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | OpenCL | 0 | 4 | pp32 | 83.18 ± 0.06 |
| qwen2 1B Q4_0                  | 403.20 MiB |   630.17 M | OpenCL     |   0 |       4 |          tg32 |         49.03 ± 0.50 |
Sorry for the delayed feedback. I'll try to spend some time reviewing that PR later this week.
Please try the pure Q4_0 model, i.e.
…org#12886)

* opencl: refactor - split the kernel files

---------

Co-authored-by: Shangqing Gu <[email protected]>

* opencl: split more kernels into separate files
* opencl: specify subgroup size instead of querying it
* opencl: refine Adreno cl compiler version parsing
* opencl: skip some kernels not used by Adreno on old compilers
* opencl: refine logic for selecting Adreno kernels
* opencl: refine Adreno cl compiler version
* opencl: cleanup preprocessor for kernels
* opencl: consider Adreno CL compiler on Windows
* opencl: add final newline for `mul_mv_f16_f16.cl`

---------

Co-authored-by: Shangqing Gu <[email protected]>
This PR splits ggml-opencl.cl into multiple .cl files, with some cleanup. This also allows the OpenCL backend to run on older Adreno GPUs such as the Adreno 660. Currently, compilers newer than E031.38.01.00 should work.