SAGE ATTENTION: 2.1 TIMES FASTER THAN FLASH ATTENTION 2 AND 2.7 TIMES FASTER THAN XFORMERS #9901
-
@ggerganov @ikawrakow @kalomaze @slaren @JohannesGaessler @calvintwr @LostRuins @bartowski1182 Thoughts? 😋
-
This should be interesting: it performs the attention computation in INT8, so it should be supported on most hardware.
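For a sense of what INT8 attention looks like, here is a minimal NumPy sketch of the general idea (illustrative only, not SageAttention's actual kernels): Q and K are quantized to INT8 with symmetric per-tensor scales, the QK^T matmul accumulates in INT32 (the operation INT8 tensor cores / dp4a provide on most GPUs), and the scores are dequantized back to FP32 for the softmax and the PV product. All names and scale choices below are assumptions for the sketch.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: returns the int8 tensor and its scale."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_attention(Q, K, V):
    """Illustrative INT8 attention: QK^T in INT8/INT32, softmax and PV in FP32."""
    d = Q.shape[-1]
    q_q, q_scale = quantize_int8(Q)
    k_q, k_scale = quantize_int8(K)
    # INT8 x INT8 matmul accumulated in INT32
    scores_i32 = q_q.astype(np.int32) @ k_q.astype(np.int32).T
    # Dequantize and apply the usual 1/sqrt(d) scaling
    scores = scores_i32.astype(np.float32) * (q_scale * k_scale) / np.sqrt(d)
    # Softmax and the PV product stay in floating point
    scores -= scores.max(axis=-1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V.astype(np.float32)

# Example on random data
rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 64)).astype(np.float32)
K = rng.standard_normal((16, 64)).astype(np.float32)
V = rng.standard_normal((16, 64)).astype(np.float32)
out = int8_attention(Q, K, V)
```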
-
Updated version, SageAttention2: "SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization". "The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 5x on RTX4090, respectively. Comprehensive experiments confirm that our approach incurs negligible end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation." https://arxiv.org/abs/2411.10958
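The "outlier smoothing" part can be illustrated with a small hedged sketch: assuming the smoothing amounts to subtracting a per-channel mean from K (an assumption based on the title/abstract, not a claim about the library's exact implementation), each row of attention scores only shifts by a constant, so the softmax output is mathematically unchanged, while the centered K has a much smaller dynamic range and therefore quantizes to INT8/INT4 with far less error.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 64))
K = rng.standard_normal((8, 64)) + 5.0  # simulate a channel-wise offset/outlier in K

# Smoothing (assumed form): subtract K's mean over the token dimension
K_smooth = K - K.mean(axis=0, keepdims=True)

# Q @ (K - k_mean)^T differs from Q @ K^T only by a per-row constant,
# so the softmax (and hence the attention output) is identical...
P_ref    = softmax(Q @ K.T / np.sqrt(64))
P_smooth = softmax(Q @ K_smooth.T / np.sqrt(64))
assert np.allclose(P_ref, P_smooth)

# ...but the centered K has a much smaller magnitude range, so low-bit
# quantization of K after smoothing loses far less precision.
print(np.abs(K).max(), np.abs(K_smooth).max())
```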
-
https://arxiv.org/abs/2410.02367
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
ABSTRACT: