sycl: flash-attention implementation #16969
Conversation
NeoZhangJianyu
left a comment
I get a compile error on https://github.com/ye-NX/llama.cpp/tree/saf-ye/flash-attn:
/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/ggml-sycl.cpp:3843:13: error: use of undeclared identifier 'ggml_sycl_op_flash_attn'
3843 | ggml_sycl_op_flash_attn(ctx, dst);
| ^
/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/ggml-sycl.cpp:4508:20: error: use of undeclared identifier 'ggml_sycl_flash_attn_ext_supported'
4508 | return ggml_sycl_flash_attn_ext_supported(op);
| ^
/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/pad.cpp:64:30: warning: unused parameter 'item_ct1' [-Wunused-parameter]
64 | [=](sycl::nd_item<3> item_ct1) {
| ^
2 errors generated.
Co-authored-by: safranowith <bsh155762@gmail.com> Co-authored-by: ye-NX <y8703470@gmail.com>
The build passes now.
Thanks for the feedback!
OK, that's great! In fact, I'm implementing flash attention too. It will support more data types, but it needs several weeks to finish. Should I cancel my task and depend on your implementation? What do you think?
What a coincidence...
Yes! I will support you! I want to contact you by email, but I can't see your email address. Thank you!
@ye-NX We expect the PR to provide the full functionality. If your PR is not ready for demo/review, please mark it as a draft. Thank you!
@ye-NX
We’re not completely sure. We tried to run the tests on our machine using the following command:
./build-llama.cpp/bin/test-backend-ops test -o FLASH_ATTN_EXT -b SYCL0
However, we’re not sure whether the SYCL output is actually being compared against the CPU reference. Could this be the case? If so, we would appreciate guidance on how to properly verify correctness.
Apart from that, unlike what we initially expected, we still have some development time (our demo is in two weeks), so we can continue improving the implementation.
Thank you!
On Wed, Dec 3, 2025 at 15:03, Neo Zhang Jianyu <***@***.***> wrote:
NeoZhangJianyu left a comment (ggml-org/llama.cpp#16969)
@ye-NX
Does the latest commit fix the UT issue?
Is the PR ready to be tested and reviewed?
Yes, use the command above to run the unit-test case for flash attention; it does compare against the CPU result. Could you set this PR as a draft, since it's not ready for review? Thank you!
@ye-NX @NeoZhangJianyu any news regarding the FA implementation for SYCL? :)
This PR still has some issues to be fixed. At the same time, I'm working on another PR to support FA. Thank you!
This PR adds basic Flash Attention support for the SYCL backend, enabling more efficient attention computation on Intel GPUs.
Authors:
Joint work by @safranowith and @ye-NX
Notes: