Add Flash Decoding kernel (splitKV) #671

Open
wuxun-zhang wants to merge 24 commits into intel:main from wuxun-zhang:wuxun/split-reduction-kernel

@wuxun-zhang wuxun-zhang commented Dec 18, 2025

Description

This PR adds a new split-reduction kernel (splitKV) for flash attention, which benefits long-context-length scenarios.

Note: The code has not been cleaned up yet, but it is ready for testing.

What's newly added in this PR

  • new FMHAFwdKernel named XeFMHAFwdSplitKVKernel
  • new split reduce kernel named ReduceSplitK
  • new tile scheduler named XeReduceSplitKTileScheduler
  • support for variable-length sequences
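
For reviewers unfamiliar with the split-KV idea, here is a minimal reference sketch in plain Python. It is not the code in this PR: the function names (`partial_attention`, `reduce_splits`, `split_kv_attention`) are hypothetical, it handles a single query vector, and it assumes each split returns a normalized partial output plus its log-sum-exp, which the reduce step then uses to re-weight the partials.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over one KV chunk.

    Returns the chunk's normalized output and the log-sum-exp of its scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)


def reduce_splits(partials):
    """Combine per-split (output, lse) pairs into the final output."""
    lse_max = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - lse_max) for _, lse in partials]
    total = sum(coeffs)
    dim = len(partials[0][0])
    return [sum(c * out[d] for c, (out, _) in zip(coeffs, partials)) / total
            for d in range(dim)]


def split_kv_attention(q, keys, values, num_splits):
    """Split the KV sequence into chunks, attend per chunk, then reduce."""
    chunk = math.ceil(len(keys) / num_splits)
    partials = [partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
                for i in range(0, len(keys), chunk)]
    return reduce_splits(partials)
```

In this sketch each chunk is independent, so the chunks can be processed in parallel before a separate reduce pass. That independence is what makes splitting attractive for decoding over long contexts.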

Limitation

  • decoding only (GQA packing optimization)

Type

  • Bug
  • Feature
  • Performance
  • Refactor

Testing

  • Tests pass
  • Xe12
  • Xe20

Performance

Metric Before After

References

Fixes #

Checklist

  • Copyright
  • Co-pilot Review
  • Deprecated APIs not used

@wuxun-zhang wuxun-zhang force-pushed the wuxun/split-reduction-kernel branch from 11ab8d0 to 2ad4764 Compare December 22, 2025 04:05
@wuxun-zhang wuxun-zhang changed the title Add split reduction kernel for Flash Attention decoding Add Flash Decoding kernel (splitKV) Jan 12, 2026

wuxun-zhang commented Jan 16, 2026

This kernel has now been verified and integrated into vllm-xpu-kernels, where it brings a large end-to-end performance improvement (Llama 3.1 8B).

accuracy passed
performance WIP
each work-group handles a whole group of query heads and packs them into a single MMA call
Query heads packed for single MMA call

Limitations: GQA group size <= DPAS max repeat count (8)

Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
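
The GQA packing described in the commit messages above can be illustrated with a small plain-Python sketch. This is an assumption-laden illustration, not the kernel code: `matmul_bt` and `packed_scores` are hypothetical names, and a list-based matrix product stands in for one MMA call. The idea shown is that the query heads sharing one KV head are stacked as rows of a single matrix, so one product yields the scores for the whole group; the limit of 8 rows is the DPAS max repeat count stated above.

```python
MAX_REPEAT_COUNT = 8  # limit quoted in the commit message above


def matmul_bt(a, b):
    """Compute A @ B^T for lists of rows; stands in for a single MMA call."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def packed_scores(q_heads, keys):
    """Score every query head of one GQA group against the shared keys.

    q_heads: one query vector per head in the group (the decoding case).
    keys:    key vectors of the KV head shared by the group.
    """
    if len(q_heads) > MAX_REPEAT_COUNT:
        raise ValueError("GQA group size exceeds the packing limit")
    # Rows of the packed matrix are the query heads, so one product covers
    # the whole group instead of one vector-matrix product per head.
    return matmul_bt(q_heads, keys)
```

The result has one row of scores per query head, identical to what per-head products would give.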
@wuxun-zhang wuxun-zhang force-pushed the wuxun/split-reduction-kernel branch from 3dd09d7 to b364cec Compare February 4, 2026 12:53