SAGE ATTENTION: 2.1 TIMES FASTER THAN FLASH ATTENTION 2 AND 2.7 TIMES FASTER THAN XFORMERS #9901
-
@ggerganov @ikawrakow @kalomaze @slaren @JohannesGaessler @calvintwr @LostRuins @bartowski1182 Thoughts? 😋
-
This should be interesting: it performs the attention computation in INT8, so it should be supported on most hardware.
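For a sense of what INT8 attention looks like, here is a minimal NumPy sketch of the general idea (illustrative only, not SageAttention's actual kernels): Q and K are quantized to INT8 with symmetric per-tensor scales, the QK^T matmul accumulates in INT32 (the operation INT8 tensor cores / dp4a provide on most GPUs), and the scores are dequantized back to FP32 for the softmax and the PV product. All names and scale choices below are assumptions for the sketch.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: returns the int8 tensor and its scale."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_attention(Q, K, V):
    """Illustrative INT8 attention: QK^T in INT8/INT32, softmax and PV in FP32."""
    d = Q.shape[-1]
    q_q, q_scale = quantize_int8(Q)
    k_q, k_scale = quantize_int8(K)
    # INT8 x INT8 matmul accumulated in INT32
    scores_i32 = q_q.astype(np.int32) @ k_q.astype(np.int32).T
    # Dequantize and apply the usual 1/sqrt(d) scaling
    scores = scores_i32.astype(np.float32) * (q_scale * k_scale) / np.sqrt(d)
    # Softmax and the PV product stay in floating point
    scores -= scores.max(axis=-1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V.astype(np.float32)

# Example on random data
rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 64)).astype(np.float32)
K = rng.standard_normal((16, 64)).astype(np.float32)
V = rng.standard_normal((16, 64)).astype(np.float32)
out = int8_attention(Q, K, V)
```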
-
Updated version, SageAttention2: "SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization". "The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 5x on RTX4090, respectively. Comprehensive experiments confirm that our approach incurs negligible end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation." https://arxiv.org/abs/2411.10958
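The "outlier smoothing" part can be illustrated with a small hedged sketch: assuming the smoothing amounts to subtracting a per-channel mean from K (an assumption based on the title/abstract, not a claim about the library's exact implementation), each row of attention scores only shifts by a constant, so the softmax output is mathematically unchanged, while the centered K has a much smaller dynamic range and therefore quantizes to INT8/INT4 with far less error.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 64))
K = rng.standard_normal((8, 64)) + 5.0  # simulate a channel-wise offset/outlier in K

# Smoothing (assumed form): subtract K's mean over the token dimension
K_smooth = K - K.mean(axis=0, keepdims=True)

# Q @ (K - k_mean)^T differs from Q @ K^T only by a per-row constant,
# so the softmax (and hence the attention output) is identical...
P_ref    = softmax(Q @ K.T / np.sqrt(64))
P_smooth = softmax(Q @ K_smooth.T / np.sqrt(64))
assert np.allclose(P_ref, P_smooth)

# ...but the centered K has a much smaller magnitude range, so low-bit
# quantization of K after smoothing loses far less precision.
print(np.abs(K).max(), np.abs(K_smooth).max())
```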
-
https://arxiv.org/abs/2410.02367
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
ABSTRACT: