
sycl: flash-attention implementation#16969

Open
ye-NX wants to merge 5 commits into ggml-org:master from ye-NX:saf-ye/flash-attn

Conversation

@ye-NX
Contributor

@ye-NX ye-NX commented Nov 3, 2025

This PR adds basic Flash Attention support for the SYCL backend, enabling more efficient attention computation on Intel GPUs.

  • Implemented Flash Attention kernel for SYCL backend
  • Added forward pass implementation with block-wise computation
  • Integrated with existing GGML SYCL infrastructure
  • Support for F32 (additional data types planned)
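For readers unfamiliar with the block-wise scheme mentioned above: flash attention processes K/V in blocks while maintaining a running maximum, normalizer, and output accumulator (the "online softmax" recurrence), so the full attention matrix is never materialized. A plain C++ sketch of one query row, for illustration only (this is not the PR's SYCL kernel):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

using vec = std::vector<float>;

// One query row attended over K/V, processed in blocks of `block` rows,
// keeping a running max m, normalizer l, and unnormalized output o.
vec flash_attn_row(const vec & q, const std::vector<vec> & K,
                   const std::vector<vec> & V, size_t block) {
    const float scale = 1.0f / std::sqrt((float) q.size());
    float m = -std::numeric_limits<float>::infinity();
    float l = 0.0f;
    vec o(V[0].size(), 0.0f);
    for (size_t start = 0; start < K.size(); start += block) {
        const size_t end = std::min(start + block, K.size());
        // scaled dot-product scores for this block only
        vec s;
        float block_max = -std::numeric_limits<float>::infinity();
        for (size_t i = start; i < end; ++i) {
            float dot = 0.0f;
            for (size_t d = 0; d < q.size(); ++d) dot += q[d] * K[i][d];
            s.push_back(scale * dot);
            block_max = std::max(block_max, s.back());
        }
        // rescale the running state to the new maximum, then accumulate
        const float m_new = std::max(m, block_max);
        const float corr  = std::isinf(m) ? 0.0f : std::exp(m - m_new);
        l *= corr;
        for (float & oi : o) oi *= corr;
        for (size_t i = start; i < end; ++i) {
            const float w = std::exp(s[i - start] - m_new);
            l += w;
            for (size_t d = 0; d < o.size(); ++d) o[d] += w * V[i][d];
        }
        m = m_new;
    }
    for (float & oi : o) oi /= l;  // final softmax normalization
    return o;
}
```

The result matches naive softmax attention; only the memory traffic pattern differs, which is what makes the approach attractive for GPU backends.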

Authors:
Joint work by @safranowith and @ye-NX

Notes:

  • This is an initial implementation
  • Performance benchmarks and optimizations are planned for future iterations
  • Feedback and suggestions are welcome!

@github-actions github-actions Bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Nov 3, 2025
Contributor

@NeoZhangJianyu NeoZhangJianyu left a comment


I hit a compile error building https://github.com/ye-NX/llama.cpp/tree/saf-ye/flash-attn:

/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/ggml-sycl.cpp:3843:13: error: use of undeclared identifier 'ggml_sycl_op_flash_attn'
 3843 |             ggml_sycl_op_flash_attn(ctx, dst);
      |             ^
/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/ggml-sycl.cpp:4508:20: error: use of undeclared identifier 'ggml_sycl_flash_attn_ext_supported'
 4508 |             return ggml_sycl_flash_attn_ext_supported(op);
      |                    ^
/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/pad.cpp:64:30: warning: unused parameter 'item_ct1' [-Wunused-parameter]
   64 |         [=](sycl::nd_item<3> item_ct1) {
      |                              ^
2 errors generated.
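For reference, "use of undeclared identifier" here suggests the two new entry points are defined in the flash-attn sources but never declared anywhere ggml-sycl.cpp can see; the usual fix is a header declaration included before the call sites. A minimal stand-alone sketch of that pattern — the identifier names come from the error log, but the signatures and stand-in types below are assumptions for illustration only:

```cpp
// Stand-in forward declarations (the real types live in ggml headers).
struct ggml_tensor;
struct ggml_backend_sycl_context;

// What a flash-attn header would need to expose to ggml-sycl.cpp
// (hypothetical signatures):
void ggml_sycl_op_flash_attn(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
bool ggml_sycl_flash_attn_ext_supported(const ggml_tensor * op);

// Stub definitions so this sketch compiles and links on its own.
void ggml_sycl_op_flash_attn(ggml_backend_sycl_context &, ggml_tensor *) {}
bool ggml_sycl_flash_attn_ext_supported(const ggml_tensor *) { return true; }
```

With declarations visible at the call sites, both errors above should disappear.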

Comment thread ggml/src/ggml-sycl/flash-attn/flash-attn-sycl.cpp Outdated
Co-authored-by: safranowith <bsh155762@gmail.com>
Co-authored-by: ye-NX <y8703470@gmail.com>
@NeoZhangJianyu
Contributor

NeoZhangJianyu commented Nov 5, 2025

The build passes, but flash attention is not enabled on the GPU:

llama_context: layer 0 is assigned to device SYCL0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled

  1. How can flash attention be enabled?
  2. What is the performance benefit?
  3. Is there a plan to support more data types? This PR covers only FP32.

@ye-NX
Contributor Author

ye-NX commented Nov 6, 2025

Thanks for the feedback!
We're currently investigating why Flash Attention isn't being enabled on the GPU, and we're continuing to refine the implementation.
We also plan to add support for F16. If there are other data types that are important to your use cases, we'd love to hear about them.

@NeoZhangJianyu
Contributor

NeoZhangJianyu commented Nov 10, 2025

> Thanks for the feedback! We're currently investigating why Flash Attention isn't being enabled on the GPU, and we're continuing to refine the implementation. We also plan to add support for F16. If there are other data types that are important to your use cases, we'd love to hear about them.

OK, that's great!

In fact, I'm implementing flash attention too. It will support more data types and will need several weeks to finish.
I'm not sure how to handle my current task. :)

Should I cancel my task and depend on your implementation,
or continue my task and merge it if mine turns out better than yours?

What do you think?

@ye-NX
Contributor Author

ye-NX commented Nov 10, 2025

> OK, that's great!
>
> In fact, I'm implementing flash attention too. It will support more data types and will need several weeks to finish. I'm not sure how to handle my current task. :)
>
> Should I cancel my task and depend on your implementation, or continue my task and merge it if mine turns out better than yours?
>
> What do you think?

What a coincidence...
For us, this is actually our final project, which we’ll be presenting at a demo in about three weeks.
We’d really appreciate it if you could let us continue developing it under your guidance.
If we don’t manage to polish everything perfectly by our deadline, maybe you could continue improving it afterward.
Does that sound okay to you?

@NeoZhangJianyu
Contributor

> OK, that's great!
> In fact, I'm implementing flash attention too. It will support more data types and will need several weeks to finish. I'm not sure how to handle my current task. :)
> Should I cancel my task and depend on your implementation, or continue my task and merge it if mine turns out better than yours?
> What do you think?

> What a coincidence... For us, this is actually our final project, which we'll be presenting at a demo in about three weeks. We'd really appreciate it if you could let us continue developing it under your guidance. If we don't manage to polish everything perfectly by our deadline, maybe you could continue improving it afterward. Does that sound okay to you?

Yes, I will support you!
There's no time limit. Please go ahead!

I'd like to contact you by email, but I can't see your email address.
Could you send me an email at zhang.jianyu@outlook.com so we can discuss further?

Thank you!

@ye-NX ye-NX force-pushed the saf-ye/flash-attn branch from dcd7ca5 to c62b98b on November 23, 2025 15:09
@ye-NX ye-NX requested a review from ggerganov as a code owner December 2, 2025 09:33
@ye-NX ye-NX force-pushed the saf-ye/flash-attn branch from 0d4a24c to e1511c3 on December 2, 2025 09:44
@github-actions github-actions Bot added the testing Everything test related label Dec 2, 2025
@NeoZhangJianyu
Contributor

@ye-NX
I hit a crash when running the PR against the FLASH_ATTN_EXT unit-test case.
Have you tested it with ./build-ci-release/bin/test-backend-ops -o FLASH_ATTN_EXT?

We expect the PR to provide complete functionality; testing and passing the UT is a basic requirement.

If your PR is not ready for demo/review, please mark it as 'draft',
so reviewers don't need to keep monitoring the PR status.

Thank you!

@NeoZhangJianyu
Contributor

@ye-NX
Does the latest commit fix the UT issue?
Is the PR ready to be tested and reviewed?

@ye-NX
Contributor Author

ye-NX commented Dec 4, 2025 via email

@NeoZhangJianyu
Contributor

> We're not completely sure. We tried to run the tests on our machine using the following command: ./build-llama.cpp/bin/test-backend-ops test -o FLASH_ATTN_EXT -b SYCL0. However, we're not sure whether the SYCL output is actually being compared against the CPU reference. Could this be the case? If so, we would appreciate guidance on how to properly verify correctness. Apart from that, unlike what we initially expected, we still have some development time (our demo is in two weeks), so we can continue improving the implementation! Thank you

Yes, use the command above to run the UT cases for flash attention.
Once they pass for FP32, your PR should be OK.
The other flash-attention cases should be skipped and must not lead to a crash or fault.

In fact, it compares against the CPU result.

Could you set this PR to 'draft', since it's not ready for review?

Thank you!

@andreyzagoruy

andreyzagoruy commented Feb 7, 2026

@ye-NX @NeoZhangJianyu any news regarding FA implementation for SYCL? :)

@NeoZhangJianyu
Contributor

> @ye-NX @NeoZhangJianyu any news regarding FA implementation for SYCL? :)

This PR still has some issues to fix.
It looks like there has been no update for a long time, and I don't know whether the author will continue this work.

In the meantime, I'm implementing another PR to support FA.
Testing and debugging are ongoing; I hope the basic version will be ready soon.

Thank you!


5 participants