
sycl: flash-attention implementation#16969

Open
ye-NX wants to merge 5 commits into ggml-org:master from ye-NX:saf-ye/flash-attn

Conversation

@ye-NX
Contributor

@ye-NX ye-NX commented Nov 3, 2025

This PR adds basic Flash Attention support for the SYCL backend, enabling more efficient attention computation on Intel GPUs.

  • Implemented Flash Attention kernel for SYCL backend
  • Added forward pass implementation with block-wise computation
  • Integrated with existing GGML SYCL infrastructure
  • Support for F32 (additional data types planned)
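For readers unfamiliar with the block-wise scheme mentioned above: flash attention processes K/V in blocks while maintaining a running maximum, normalizer, and output accumulator (the "online softmax" recurrence), so the full attention matrix is never materialized. A plain C++ sketch of one query row, for illustration only (this is not the PR's SYCL kernel):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

using vec = std::vector<float>;

// One query row attended over K/V, processed in blocks of `block` rows,
// keeping a running max m, normalizer l, and unnormalized output o.
vec flash_attn_row(const vec & q, const std::vector<vec> & K,
                   const std::vector<vec> & V, size_t block) {
    const float scale = 1.0f / std::sqrt((float) q.size());
    float m = -std::numeric_limits<float>::infinity();
    float l = 0.0f;
    vec o(V[0].size(), 0.0f);
    for (size_t start = 0; start < K.size(); start += block) {
        const size_t end = std::min(start + block, K.size());
        // scaled dot-product scores for this block only
        vec s;
        float block_max = -std::numeric_limits<float>::infinity();
        for (size_t i = start; i < end; ++i) {
            float dot = 0.0f;
            for (size_t d = 0; d < q.size(); ++d) dot += q[d] * K[i][d];
            s.push_back(scale * dot);
            block_max = std::max(block_max, s.back());
        }
        // rescale the running state to the new maximum, then accumulate
        const float m_new = std::max(m, block_max);
        const float corr  = std::isinf(m) ? 0.0f : std::exp(m - m_new);
        l *= corr;
        for (float & oi : o) oi *= corr;
        for (size_t i = start; i < end; ++i) {
            const float w = std::exp(s[i - start] - m_new);
            l += w;
            for (size_t d = 0; d < o.size(); ++d) o[d] += w * V[i][d];
        }
        m = m_new;
    }
    for (float & oi : o) oi /= l;  // final softmax normalization
    return o;
}
```

The result matches naive softmax attention; only the memory traffic pattern differs, which is what makes the approach attractive for GPU backends.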

Authors:
Joint work by @safranowith and @ye-NX

Notes:

  • This is an initial implementation
  • Performance benchmarks and optimizations are planned for future iterations
  • Feedback and suggestions are welcome!

@github-actions github-actions Bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Nov 3, 2025
Contributor

@NeoZhangJianyu NeoZhangJianyu left a comment


I hit a compile error building https://github.com/ye-NX/llama.cpp/tree/saf-ye/flash-attn:

/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/ggml-sycl.cpp:3843:13: error: use of undeclared identifier 'ggml_sycl_op_flash_attn'
 3843 |             ggml_sycl_op_flash_attn(ctx, dst);
      |             ^
/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/ggml-sycl.cpp:4508:20: error: use of undeclared identifier 'ggml_sycl_flash_attn_ext_supported'
 4508 |             return ggml_sycl_flash_attn_ext_supported(op);
      |                    ^
/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/pad.cpp:64:30: warning: unused parameter 'item_ct1' [-Wunused-parameter]
   64 |         [=](sycl::nd_item<3> item_ct1) {
      |                              ^
2 errors generated.
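For reference, "use of undeclared identifier" here suggests the two new entry points are defined in the flash-attn sources but never declared anywhere ggml-sycl.cpp can see; the usual fix is a header declaration included before the call sites. A minimal stand-alone sketch of that pattern — the identifier names come from the error log, but the signatures and stand-in types below are assumptions for illustration only:

```cpp
// Stand-in forward declarations (the real types live in ggml headers).
struct ggml_tensor;
struct ggml_backend_sycl_context;

// What a flash-attn header would need to expose to ggml-sycl.cpp
// (hypothetical signatures):
void ggml_sycl_op_flash_attn(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
bool ggml_sycl_flash_attn_ext_supported(const ggml_tensor * op);

// Stub definitions so this sketch compiles and links on its own.
void ggml_sycl_op_flash_attn(ggml_backend_sycl_context &, ggml_tensor *) {}
bool ggml_sycl_flash_attn_ext_supported(const ggml_tensor *) { return true; }
```

With declarations visible at the call sites, both errors above should disappear.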

Comment thread ggml/src/ggml-sycl/flash-attn/flash-attn-sycl.cpp Outdated
Co-authored-by: safranowith <bsh155762@gmail.com>
Co-authored-by: ye-NX <y8703470@gmail.com>
@NeoZhangJianyu
Contributor

NeoZhangJianyu commented Nov 5, 2025

The build passes, but flash attention is not enabled on the GPU:

llama_context: layer 0 is assigned to device SYCL0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled

  1. How can flash attention be enabled?
  2. What is the performance benefit?
  3. Is there a plan to support more data types? This PR covers only FP32.

@ye-NX
Contributor Author

ye-NX commented Nov 6, 2025

Thanks for the feedback!
We're currently investigating why Flash Attention isn't being enabled on the GPU, and we're continuing to refine the implementation.
We also plan to add support for F16. If there are other data types that are important to your use cases, we'd love to hear about them.

@NeoZhangJianyu
Contributor

NeoZhangJianyu commented Nov 10, 2025

> Thanks for the feedback! We're currently investigating why Flash Attention isn't being enabled on the GPU, and we're continuing to refine the implementation. We also plan to add support for F16. If there are other data types that are important to your use cases, we'd love to hear about them.

OK, that's great!

In fact, I'm implementing flash attention too. It will support more data types and will need several weeks to finish.
I'm not sure how to handle my current task. :)

Should I cancel my task and depend on your implementation,
or continue my task and merge it if mine turns out better than yours?

What do you think?

@ye-NX
Contributor Author

ye-NX commented Nov 10, 2025

> OK, that's great!
>
> In fact, I'm implementing flash attention too. It will support more data types and will need several weeks to finish. I'm not sure how to handle my current task. :)
>
> Should I cancel my task and depend on your implementation, or continue my task and merge it if mine turns out better than yours?
>
> What do you think?

What a coincidence...
For us, this is actually our final project, which we’ll be presenting at a demo in about three weeks.
We’d really appreciate it if you could let us continue developing it under your guidance.
If we don’t manage to polish everything perfectly by our deadline, maybe you could continue improving it afterward.
Does that sound okay to you?

@NeoZhangJianyu
Contributor

> OK, that's great!
> In fact, I'm implementing flash attention too. It will support more data types and will need several weeks to finish. I'm not sure how to handle my current task. :)
> Should I cancel my task and depend on your implementation, or continue my task and merge it if mine turns out better than yours?
> What do you think?

> What a coincidence... For us, this is actually our final project, which we'll be presenting at a demo in about three weeks. We'd really appreciate it if you could let us continue developing it under your guidance. If we don't manage to polish everything perfectly by our deadline, maybe you could continue improving it afterward. Does that sound okay to you?

Yes, I will support you!
There's no time limit. Please go ahead!

I'd like to contact you by email, but I can't see your email address.
Could you send me an email at zhang.jianyu@outlook.com so we can discuss further?

Thank you!

@ye-NX ye-NX force-pushed the saf-ye/flash-attn branch from dcd7ca5 to c62b98b on November 23, 2025 15:09
@ye-NX ye-NX requested a review from ggerganov as a code owner December 2, 2025 09:33
@ye-NX ye-NX force-pushed the saf-ye/flash-attn branch from 0d4a24c to e1511c3 on December 2, 2025 09:44
@github-actions github-actions Bot added the testing Everything test related label Dec 2, 2025
@NeoZhangJianyu
Contributor

@ye-NX
I hit a crash when running the PR against the FLASH_ATTN_EXT unit-test case.
Have you tested it with ./build-ci-release/bin/test-backend-ops -o FLASH_ATTN_EXT?

We expect the PR to provide complete functionality; testing and passing the UT is a basic requirement.

If your PR is not ready for demo/review, please mark it as 'draft',
so reviewers don't need to keep monitoring the PR status.

Thank you!

@NeoZhangJianyu
Contributor

@ye-NX
Does the latest commit fix the UT issue?
Is the PR ready to be tested and reviewed?

@ye-NX
Contributor Author

ye-NX commented Dec 4, 2025 via email

@NeoZhangJianyu
Contributor

> We're not completely sure. We tried to run the tests on our machine using the following command: ./build-llama.cpp/bin/test-backend-ops test -o FLASH_ATTN_EXT -b SYCL0. However, we're not sure whether the SYCL output is actually being compared against the CPU reference. Could this be the case? If so, we would appreciate guidance on how to properly verify correctness. Apart from that, unlike what we initially expected, we still have some development time (our demo is in two weeks), so we can continue improving the implementation! Thank you

Yes, use the command above to run the UT cases for flash attention.
Once they pass for FP32, your PR should be OK.
The other flash-attention cases should be skipped and must not lead to a crash or fault.

In fact, it compares against the CPU result.

Could you set this PR to 'draft', since it's not ready for review?

Thank you!

@andreyzagoruy

andreyzagoruy commented Feb 7, 2026

@ye-NX @NeoZhangJianyu any news regarding FA implementation for SYCL? :)

@NeoZhangJianyu
Contributor

> @ye-NX @NeoZhangJianyu any news regarding FA implementation for SYCL? :)

This PR still has some issues to fix.
It looks like there has been no update for a long time, and I don't know whether the author will continue this work.

In the meantime, I'm implementing another PR to support FA.
Testing and debugging are ongoing; I hope the basic version will be ready soon.

Thank you!


5 participants