
Commit 114d774

Authored Dec 6, 2024
Merge branch 'site' into rk119-fix-quantization-links
2 parents e0ccab5 + 58618eb commit 114d774

19 files changed: +494 −4 lines
 

Diff for: ‎_ecosystem/doctr

+10
@@ -0,0 +1,10 @@
---
layout: ecosystem_detail
title: docTR
summary: docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
link: https://github.com/mindee/doctr
summary-home: docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
featured-home: false
github-id: mindee/doctr
date-added: 12/3/24
---

Diff for: ‎_ecosystem/vllm

+10
@@ -0,0 +1,10 @@
---
layout: ecosystem_detail
title: vllm
summary: vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
link: https://github.com/vllm-project/vllm
summary-home: vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
featured-home: false
github-id: vllm-project/vllm
date-added: 12/3/24
---

Diff for: ‎_events/pt-korea.md

+10
@@ -0,0 +1,10 @@
---
category: event
title: "PyTorch Korea User Group Meetup"
date: November 30, 2024
---

**Date**: November 30, 2024
**Location**: Seoul, South Korea

[Event info](https://festa.io/events/6409)

Diff for: ‎_events/pt-shanghai.md

+10
@@ -0,0 +1,10 @@
---
category: event
title: "PyTorch Shanghai Meetup"
date: August 15, 2024
---

**Date**: August 15, 2024
**Location**: Shanghai, China

[Read the notes](https://pytorch.org/blog/pytorch-shanghai-notes/)

Diff for: ‎_posts/2024-10-28-unleashing-ai-mobile.md

+1
@@ -2,6 +2,7 @@
 layout: blog_detail
 title: "Unleashing the Power of AI on Mobile: LLM Inference for Llama 3.2 Quantized Models with ExecuTorch and KleidiAI"
 author: Gian Marco Iodice, Arm and Digant Desai, Meta
+excerpt: "At the recent PyTorch Conference, Arm highlighted the widespread impact of its technology, spanning from cloud to edge, emphasizing its commitment to delivering its advanced AI computing capabilities seamlessly to millions of developers worldwide."
 ---
 
 ## Introduction

Diff for: ‎_posts/2024-11-01-cutlass-ping-pong-gemm-kernel.md

+1
@@ -2,6 +2,7 @@
 layout: blog_detail
 title: "Deep Dive on CUTLASS Ping-Pong GEMM Kernel"
 author: Less Wright, Adnan Hoque
+excerpt: "In this post, we provide an overview, with relevant FP8 inference kernel benchmarking, of the CUTLASS Ping-Pong GEMM kernel."
 ---
 
 ![Figure 1. FP8 GEMM Throughput Comparison CUTLASS vs Triton](/assets/images/cutlass-ping-pong-gemm-kernel/fg1.png){:style="width:100%"}

Diff for: ‎_posts/2024-11-21-rebellions.md

+1
@@ -1,6 +1,7 @@
 ---
 layout: blog_detail
 title: "Rebellions Joins the PyTorch Foundation as a General Member"
+excerpt: "The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Rebellions has joined as a general member."
 ---
 
 ![Rebellions logo](/assets/images/rebellions-logo.svg){:style="max-width:350px;width:100%;float:right;margin: 20px;"}

Diff for: ‎_posts/2024-11-25-training-using-float8-fsdp2.md

+243
@@ -0,0 +1,243 @@
---
layout: blog_detail
title: "Supercharging Training using float8 and FSDP2"
author: "IBM and Meta"
excerpt: "In this blog, we will demonstrate how we achieve up to 50% throughput speedup while achieving loss and evaluation benchmark parity in training over FSDP1 bf16 training."
---

**IBM**: Tuan Hoang Trong, Alexei Karve, Yan Koyfman, Linsong Chu, Divya Kumari, Shweta Salaria, Robert Walkup, Praneet Adusumilli, Nirmit Desai, Raghu Ganti, Seetharami Seelam
**Meta**: Less Wright, Wei Feng, Vasiliy Kuznetsov, Driss Guessous


In this blog, we will demonstrate how we achieve up to **50% throughput speedup** while achieving loss and evaluation benchmark parity in training over [FSDP1 bf16 training](https://pytorch.org/blog/maximizing-training-throughput/). We achieve this speedup by leveraging FSDP2, DTensor, and torch.compile with torchao's float8 via linear layer updates (compute), and float8 all_gathers for weight communication. We showcase these improvements across a spectrum of Meta Llama model architecture sizes, ranging from a small 1.8B model all the way to a 405B model, making training faster than ever.

We demonstrate these improvements using the Meta Llama3 architecture, and then perform model quality studies at two scales: 100B tokens at the 8B model size and 50B tokens at the 70B model size, which provide an exact comparison of float8 and bf16 training loss curves. These runs show loss convergence identical to their `bf16` counterparts. Further, we train a 3B model to 1T tokens using the FineWeb-edu dataset and run standard evaluation benchmarks to ensure that model quality is intact and comparable to a `bf16` run.

At IBM Research, we plan to adopt these capabilities for our data ablations to increase the number of experiments we can perform within a given GPU budget. Longer term, we will follow up with a larger-scale model run to demonstrate the end-to-end feasibility of `float8` training.


## What is Float8?

The `float8` format for training models was introduced by NVIDIA, ARM, and Intel in a [2022 paper](https://arxiv.org/abs/2209.05433), which demonstrated the feasibility of training in lower-precision float8 without sacrificing model quality. With the introduction of newer GPUs like the NVIDIA Hopper series, FP8 training became practical, with the potential of more than a 2x improvement in training throughput thanks to native float8 Tensor Core support. There are a few challenges to realizing this promise: \
(i) Enable the core model operations like `matmul` and `attention` in `float8`, \
(ii) Enable `float8` training in a distributed framework, and \
(iii) Enable weight communication between GPUs in `float8`. \
While the `float8` `matmul` was enabled by NVIDIA libraries, the latter two were provided in recent updates to `FSDP2` and `torchao`.
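
As a concrete illustration of what "training in float8" means numerically, here is a minimal sketch (ours, not torchao's implementation) of casting a tensor to float8 e4m3 with a single per-tensor scale; the `torch.float8_e4m3fn` dtype and the dequantize-and-compare step assume a recent PyTorch build.

```python
import torch

def to_float8_tensorwise(x: torch.Tensor):
    """Cast to float8 (e4m3) with one per-tensor scale: a toy version of
    tensorwise scaling; real float8 training also handles autograd, etc."""
    f8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for e4m3fn
    amax = x.abs().max().float().clamp(min=1e-12)   # avoid division by zero
    scale = f8_max / amax                           # map amax to the float8 max
    x_f8 = (x.float() * scale).to(torch.float8_e4m3fn)
    return x_f8, scale

x = torch.randn(1024, 1024, dtype=torch.bfloat16)
x_f8, scale = to_float8_tensorwise(x)
x_back = x_f8.float() / scale                       # dequantize for comparison
print((x.float() - x_back).abs().max())             # small quantization error
```

FSDP2 and torchao extend this idea so that both the linear-layer matmuls and the weight all-gathers operate on float8 data with the appropriate scales.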

In this blog, we are using [torchtitan](https://github.com/pytorch/torchtitan) as the entry point for training, IBM's deterministic data loader, the <code>float8</code> linear layer implementation from [torchao](https://github.com/pytorch/ao/tree/main/torchao/float8), and the <code>float8 all gather</code> from the latest PyTorch nightlies in conjunction with FSDP2. For this training, we are using float8 per-tensor (tensorwise) scaling granularity rather than rowwise. We leverage <code>torch.compile</code> to ensure that we get maximum performance gains. We compute <code>attention</code> in <code>bf16</code> using SDPA and are currently working on moving this to float8 as well.
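
Putting these pieces together looks roughly like the sketch below. It assumes the torchao float8 API (`convert_to_float8_training`, `Float8LinearConfig`) and a recent PyTorch nightly; the toy model, the module filter, and the commented-out FSDP2 call are illustrative stand-ins for torchtitan's actual Llama setup.

```python
import torch
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Toy stand-in for a transformer block stack (illustrative only);
# requires a GPU with float8 support (e.g., H100).
model = nn.Sequential(*[nn.Linear(4096, 4096, bias=False) for _ in range(4)])
model = model.to(torch.bfloat16).to("cuda")

# Swap nn.Linear modules for float8 linears: matmuls run in float8 with
# tensorwise scaling, and FSDP2 weight all-gathers are also done in float8.
config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
convert_to_float8_training(
    model,
    config=config,
    # keep the last projection in bf16 (the filter shown is illustrative)
    module_filter_fn=lambda mod, fqn: fqn != "3",
)

# In a distributed run, FSDP2 sharding would be applied here, e.g.
#   from torch.distributed._composable.fsdp import fully_shard  # path varies by version
#   fully_shard(model)  # requires an initialized process group
model = torch.compile(model)  # compile to recover maximum performance

out = model(torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda"))
```

Attention itself stays in bf16 via SDPA, as noted above; only the linear-layer compute and the weight communication move to float8.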

## Experiments

We perform various experiments to demonstrate the benefits of float8 training. The first is to ensure that model quality is not sacrificed. To verify this, we train an 8B model and a 70B model for a few thousand steps and compare the loss curves between the float8 and bf16 training runs. Our experiments are performed on three different H100 clusters with 128, 256, and 512 H100 GPU configurations, in very different environments, to demonstrate reproducibility. The first is a customized cluster built on [Grand Teton](https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/) at Meta with a 400Gbps custom interconnect, the second is an IBM research cluster with a 3.2Tbps InfiniBand interconnect, and the third is an IBM Cloud cluster with a 3.2Tbps RoCE interconnect for GPU-to-GPU communication.

First, we plot the loss curve comparisons for both these models in the figures below to demonstrate loss parity for a few thousand steps.

![Figure 1: (a) 8B model loss parity for 2k steps, (b) 70B loss parity for 1k steps](/assets/images/training-using-float8-fsdp2/fg1.png){:style="width:100%"}

![Figure 1: (a) 8B model loss parity for 2k steps, (b) 70B loss parity for 1k steps](/assets/images/training-using-float8-fsdp2/fg2.png){:style="width:100%"}

*Figure 1: (a) 8B model loss parity for 2k steps, (b) 70B loss parity for 1k steps*

We observe that across these different models and in different environments, we obtain loss parity for this small scale of tokens. Next, we characterize the throughput gains for four different model sizes ranging from 1.8B to 405B. We explored the best batch size and activation checkpointing schemes for both the float8 and bf16 training runs to determine the tokens/sec/GPU (wps) metric and report the performance gain. For the 405B model, we leveraged `DTensor` for tensor parallel training with FSDP2. We use a sequence length of 8K for all our measurements.
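
For reference, the wps numbers we report are tokens processed per optimizer step divided by the step time and GPU count; the values plugged into this small sketch are illustrative placeholders, not measurements from our runs.

```python
def tokens_per_sec_per_gpu(global_batch_size: int, seq_len: int,
                           step_time_s: float, num_gpus: int) -> float:
    """wps = tokens per optimizer step / (step time in seconds * number of GPUs)."""
    return global_batch_size * seq_len / (step_time_s * num_gpus)

# Illustrative placeholder numbers only (not the measurements in Table 1):
print(tokens_per_sec_per_gpu(global_batch_size=16, seq_len=8192,
                             step_time_s=2.0, num_gpus=64))  # 1024.0
```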

<table class="table table-bordered">
  <tr>
    <td><strong>Model size</strong></td>
    <td><strong>wps (bf16)</strong></td>
    <td><strong>wps (float8)</strong></td>
    <td><strong>Percent gain</strong></td>
  </tr>
  <tr>
    <td>1.8B</td>
    <td>29K</td>
    <td>35K</td>
    <td>18%</td>
  </tr>
  <tr>
    <td>8B</td>
    <td>8K</td>
    <td>10K</td>
    <td>28%</td>
  </tr>
  <tr>
    <td>70B</td>
    <td>956</td>
    <td>1430</td>
    <td>50%</td>
  </tr>
  <tr>
    <td>405B (TP4)</td>
    <td>149</td>
    <td>227</td>
    <td>52%</td>
  </tr>
</table>

*Table 1: Performance gains over bf16 (both bf16 and float8 use torch.compile)*

We observe from Table 1 that the gains for the larger models (70B and 405B) reach up to 50%, while the smaller models see gains of roughly 20–30%. In further experiments, we observed that the addition of `float8` `all_gather` enables a boost of ~5% beyond the `float8` compute itself, which is in line with the observations in this [blog](https://aws.amazon.com/blogs/machine-learning/efficient-pre-training-of-llama-3-like-model-architectures-using-torchtitan-on-amazon-sagemaker/).

Second, to demonstrate the effectiveness of an FP8 model, we trained a 3B model following the Llama3 architecture for 1T tokens using the FineWeb-edu dataset from Hugging Face. We performed evaluations using the `lm-eval-harness` framework and present a small portion of these results in the table below. We observe that the `bf16` performance is marginally better than the `float8` scores (by about one percent). While some scores are significantly better with `bf16` (e.g., MMLU is 3 pts higher), we expect these gaps to vanish when the right hyperparameters are chosen and across larger-scale training runs (e.g., the `bf16` run had half the batch size, and it is well known that smaller-batch-size runs can improve evaluation scores).

<table class="table table-bordered">
  <tr>
    <td><strong>Benchmark</strong></td>
    <td><strong>Score (float8)</strong></td>
    <td><strong>Score (bf16)</strong></td>
  </tr>
  <tr>
    <td>MMLU (5-shot)</td>
    <td>0.26</td>
    <td>0.29</td>
  </tr>
  <tr>
    <td>ARC-e</td>
    <td>0.73</td>
    <td>0.73</td>
  </tr>
  <tr>
    <td>ARC-c</td>
    <td>0.43</td>
    <td>0.46</td>
  </tr>
  <tr>
    <td>Hellaswag</td>
    <td>0.65</td>
    <td>0.67</td>
  </tr>
  <tr>
    <td>sciq</td>
    <td>0.89</td>
    <td>0.88</td>
  </tr>
  <tr>
    <td>OpenBook QA</td>
    <td>0.43</td>
    <td>0.43</td>
  </tr>
  <tr>
    <td>PIQA</td>
    <td>0.76</td>
    <td>0.76</td>
  </tr>
  <tr>
    <td>Winogrande</td>
    <td>0.60</td>
    <td>0.65</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>0.59</strong></td>
    <td><strong>0.60</strong></td>
  </tr>
</table>

*Table 2: Benchmark scores for the float8-trained model evaluated in FP16 (at 1T tokens of FineWeb pre-training).*

Finally, we scale our experiments to 512 H100 GPUs on the IBM Cloud cluster. We were able to recreate the results and speedups we observed even at 512-GPU scale. We summarize these results only for the large models (70B and 405B) in the table below.

<table class="table table-bordered">
  <tr>
    <td><strong>Model size</strong></td>
    <td><strong>wps (bf16)</strong></td>
    <td><strong>wps (float8)</strong></td>
    <td><strong>Percent gain</strong></td>
  </tr>
  <tr>
    <td>70B</td>
    <td>960</td>
    <td>1448</td>
    <td>51%</td>
  </tr>
  <tr>
    <td>405B (TP4)</td>
    <td>152</td>
    <td>217</td>
    <td>43%</td>
  </tr>
</table>

*Table 3: Performance gains over bf16 (both bf16 and float8 use torch.compile) at 512-GPU scale*

## Future work

We are also evaluating other forms of parallelism, such as context parallelism, and plan to demonstrate how these features compose so that practitioners can make informed choices when training large-scale models.

## Acknowledgements

We thank Davis Wertheimer from IBM Research for enabling the data loader for torchtitan runs, which allowed us to replay data in the same order across multiple runs. We also thank IBM Cloud for providing early test access to the H100 cluster.

Diff for: ‎_posts/2024-12-02-hadacore.md

+207
@@ -0,0 +1,207 @@
---
layout: blog_detail
title: "HadaCore: Tensor Core Accelerated Hadamard Transform Kernel"
author: "IBM and Meta"
excerpt: "Quantization is a method for improving model inference speeds by compressing model weights and performing (faster) computation in lower precision data types. However, quantization can result in accuracy loss due to the presence of outliers."
---

**IBM**: Krish Agarwal, Rishi Astra, Adnan Hoque, Mudhakar Srivatsa, Raghu Ganti
**Meta**: Less Wright, Sijia Chen

Quantization is a method for improving model inference speeds by compressing model weights and performing (faster) computation in lower precision data types. However, quantization can result in accuracy loss due to the presence of outliers. Recent works like [QuaRot](https://arxiv.org/abs/2404.00456), [SpinQuant](https://arxiv.org/abs/2405.16406), and [FlashAttention-3](https://arxiv.org/pdf/2407.08608) introduce methods to increase the numerical accuracy of INT4, INT8 and FP8 quantization in LLMs. These methods rely on [Hadamard Transforms](https://en.wikipedia.org/wiki/Hadamard_transform). In this blog, we present HadaCore, a Hadamard Transform CUDA kernel that achieves state-of-the-art performance on NVIDIA A100 and H100 GPUs. Our kernel achieves speedups of **1.1–1.4x** on A100 and **1.0–1.3x** on H100 over Dao AI Lab's [Fast Hadamard Transform Kernel](https://github.com/Dao-AILab/fast-hadamard-transform), with peak gains of **3.5x** and **3.6x**, respectively. We leverage a hardware-aware work decomposition that benefits from Tensor Core acceleration while maintaining quantization error reduction.

![Figure 1: Speedup of HadaCore vs the Dao AI Hadamard CUDA kernel. A peak gain of 3.46x on the A100 is achieved with a size-128 Hadamard applied to 8.4M elements.](/assets/images/hadacore/fg1.png){:style="width:100%"}

*Figure 1: Speedup of HadaCore vs the Dao AI Hadamard CUDA kernel. A peak gain of 3.46x on the A100 is achieved with a size-128 Hadamard applied to 8.4M elements.*

The [HadaCore kernel is publicly available](https://github.com/pytorch-labs/applied-ai/tree/main/kernels/cuda/inference/hadamard_transform).

## Background

[QuaRot](https://arxiv.org/abs/2404.00456) and [SpinQuant](https://arxiv.org/abs/2405.16406) both propose methods to increase the numerical accuracy of INT4 and INT8 quantization in LLMs. Both methods rotate model activations, since rotations are statistically likely to reduce the magnitude of outliers: a rotation "distributes" extreme values across other (less extreme) dimensions, and it is easily inverted using the inverse of the rotation matrix. These methods can also improve FP8 inference accuracy, such as in [FlashAttention-3](https://arxiv.org/pdf/2407.08608).
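
As a toy illustration of the idea (our sketch, not code from QuaRot or SpinQuant): rotating a vector that contains one large outlier by an orthonormal Hadamard matrix spreads the outlier's energy across all dimensions, shrinking the dynamic range a quantizer has to cover, while the transpose recovers the original vector exactly.

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two),
    normalized so that it is an orthonormal rotation."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5

d = 128
x = torch.randn(d)
x[0] = 50.0                       # inject an activation outlier
H = hadamard(d)
x_rot = H @ x

print(x.abs().max().item())       # ~50: range dominated by the outlier
print(x_rot.abs().max().item())   # much smaller: outlier energy is spread out
print(torch.allclose(H.T @ x_rot, x, atol=1e-4))  # rotation is exactly invertible
```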

![Figure 2. Transformer block showing online (red) and offline rotations (blue) in QuaRot](/assets/images/hadacore/fg2.png){:style="width:100%"}

*Figure 2. Transformer block showing online (red) and offline rotations (blue) in QuaRot*

Applying these rotation matrices introduces model runtime overhead due to the online operations shown in Figure 2. These rotations can be applied through matrix multiplication, but the added overhead would diminish the benefits from quantization. Therefore, QuaRot and SpinQuant opt to use Walsh-Hadamard matrices, a special type of rotation matrix that can be applied faster than matrix multiplication using the [Fast Walsh-Hadamard Transform](https://en.wikipedia.org/wiki/Fast_Walsh%E2%80%93Hadamard_transform) algorithm. HadaCore is an optimized implementation of this algorithm for NVIDIA GPUs that support Tensor Cores.
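
For reference, the Fast Walsh-Hadamard Transform replaces the O(n^2) matrix multiply with O(n log n) additions and subtractions; a short, unoptimized PyTorch sketch of the butterfly (ours, for illustration only, not HadaCore's implementation) is shown below.

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal Fast Walsh-Hadamard Transform over the last dimension.

    Unoptimized reference: O(n log n) adds/subtracts per vector instead of
    the O(n^2) cost of multiplying by an explicit Hadamard matrix.
    """
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    y = x.clone()
    h = 1
    while h < n:
        y = y.reshape(*x.shape[:-1], n // (2 * h), 2 * h)
        a, b = y[..., :h], y[..., h:]          # butterfly pairs within each block
        y = torch.cat([a + b, a - b], dim=-1)  # combine the pairs
        y = y.reshape(*x.shape[:-1], n)
        h *= 2
    return y / n ** 0.5                        # orthonormal scaling

x = torch.randn(4, 256)                        # four rows of length 256
print(torch.allclose(fwht(fwht(x)), x, atol=1e-4))  # the transform is its own inverse
```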

## Tensor Core Accelerated Hadamard Transform

HadaCore leverages [NVIDIA Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/), which are specialized compute units on NVIDIA GPUs optimized for matrix multiplication. To achieve this, our kernel performs a hardware-aware work decomposition of the Fast Walsh-Hadamard algorithm. This work decomposition ensures that we can utilize the [MMA PTX instructions](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=mma#multiply-and-accumulate-instruction-mma) that execute on the Tensor Cores. HadaCore applies a 16×16 Hadamard transform to chunks of the input data. The computation can then be offloaded to the FP16 Tensor Cores using the [mma.m16n8k16](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=mma#matrix-fragments-for-mma-m16n8k16-with-floating-point-type) instruction. The warp-level parallelism for HadaCore is shown below.

![Figure 3: HadaCore Parallelization, 1x256 vectors (rows) being rotated by a size 256 Hadamard.](/assets/images/hadacore/fg3.png){:style="width:100%"}

*Figure 3: HadaCore Parallelization, 1x256 vectors (rows) being rotated by a size 256 Hadamard.*

We process fragments of 256 elements in parallel using warp-level Tensor Core operations to achieve up to a 256-size Hadamard transform. For larger sizes, we shuffle data between warps and repeat.
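
This decomposition works because of the Kronecker structure of Hadamard matrices: H256 = H16 ⊗ H16, so a 256-point transform can be computed as two rounds of 16×16 matrix multiplies over a reshaped fragment, which is exactly the shape of work the MMA instruction consumes. A small verification sketch (ours, using the Sylvester ordering rather than the kernel's actual data layout or precision):

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Unnormalized Sylvester Hadamard matrix (n a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H

H16, H256 = hadamard(16), hadamard(256)
assert torch.equal(H256, torch.kron(H16, H16))    # Kronecker structure

x = torch.randn(256)
direct = H256 @ x                                 # one 256x256 multiply

# Two 16x16 stages: reshape the 256-element fragment to 16x16 and apply
# the small Hadamard along each axis (one matrix multiply per axis).
X = x.view(16, 16)
staged = (H16 @ X @ H16.T).reshape(256)

print(torch.allclose(direct, staged, atol=1e-4))  # True
```

The same idea extends beyond 256 elements: additional stages are applied after shuffling data between warps, as described above.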

## Microbenchmarks

We benchmark HadaCore against the [Dao AI Lab Hadamard Kernel](https://github.com/Dao-AILab) on both NVIDIA H100 and A100 GPUs across varying Hadamard and input tensor sizes.

![Figure 4: HadaCore Kernel Speedup on NVIDIA A100 over Dao AI Lab Fast Hadamard Kernel](/assets/images/hadacore/fg4.png){:style="width:100%"}

*Figure 4: HadaCore Kernel Speedup on NVIDIA A100 over Dao AI Lab Fast Hadamard Kernel*

![Color coded Speedup Table for NVIDIA A100, Green = Speedup over Baseline](/assets/images/hadacore/fg5.png){:style="width:100%; margin-top: 35px;"}

*Color coded Speedup Table for NVIDIA A100, Green = Speedup over Baseline*

![Figure 5: HadaCore Kernel Speedup on NVIDIA H100 over Dao AI Lab Fast Hadamard Kernel](/assets/images/hadacore/fg6.png){:style="width:100%; margin-top: 35px;"}

*Figure 5: HadaCore Kernel Speedup on NVIDIA H100 over Dao AI Lab Fast Hadamard Kernel*

![Color coded Speedup Table for NVIDIA H100, Green = Speedup over Baseline](/assets/images/hadacore/fg7.png){:style="width:100%; margin-top: 35px;"}

*Color coded Speedup Table for NVIDIA H100, Green = Speedup over Baseline*

We showcase our speedup as the input tensor size (labeled element count) in our charts increases. Element count is the number of elements in the target matrix we are rotating. For example, in multi-head attention:

The queries (Q), keys (K) and values (V) tensors are 4D tensors of size:

`(batch_size, seq_len, n_heads, head_dim)`

A Hadamard matrix of size `head_dim` is applied to these activation tensors, so we refer to this as using a Hadamard size of `head_dim` with an element count of:

`batch_size*seq_len*n_heads*head_dim`
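
Concretely, here is a small sketch (ours, shapes only, with an explicit matmul standing in for the fused kernel) that reproduces the Llama-2 70B prefill element count from the table below and shows where the size-`head_dim` Hadamard is applied:

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Orthonormal Sylvester Hadamard matrix (n a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5

# Llama-2 70B prefill-like query shape: 1 batch, 4096 tokens, 64 heads, head_dim 128
batch_size, seq_len, n_heads, head_dim = 1, 4096, 64, 128
q = torch.randn(batch_size, seq_len, n_heads, head_dim)

print(q.numel())                  # 33554432 elements, Hadamard size 128

# The rotation multiplies every head_dim-length row by a head_dim x head_dim
# Hadamard matrix; HadaCore fuses this, here it is shown as a plain matmul.
H = hadamard(head_dim)
q_rot = q @ H.T                   # rotate along the last (head_dim) dimension
```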

Common element counts for query rotations in an attention block:

<table class="table table-bordered">
  <tr>
    <td><strong>Model \ Tokens</strong></td>
    <td><strong>Prefill</strong></td>
    <td><strong>Decoding</strong></td>
  </tr>
  <tr>
    <td><strong>Llama-2 70B</strong></td>
    <td>33,554,432 elements<br>128 Hadamard size<br>(1 batch * 64 heads * 4096 tokens * 128-dimensional embeddings per head per token)</td>
    <td>8,192 elements<br>128 Hadamard size<br>(1 batch * 64 heads * 1 token * 128-dimensional embeddings per head per token)</td>
  </tr>
  <tr>
    <td><strong>Llama-3 8B</strong></td>
    <td>33,554,432 elements<br>128 Hadamard size<br>(1 batch * 32 heads * 8192 tokens * 128-dimensional embeddings per head per token)</td>
    <td>4,096 elements<br>128 Hadamard size<br>(1 batch * 32 heads * 1 token * 128-dimensional embeddings per head per token)</td>
  </tr>
</table>

HadaCore achieves a **1.1–1.4x** speedup on A100 and a **1.0–1.3x** speedup on H100 over Dao AI Lab's Fast Hadamard kernel, with peak gains of **3.5x and 3.6x**, respectively. For smaller sizes on H100, HadaCore's gain decreases. For future work, we plan to incorporate Hopper-specific features such as TMA and WGMMA to improve H100 performance.

## MMLU Benchmarks

We evaluated MMLU scores on a [Llama 3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) inference workload where the FlashAttention computation was performed in FP8. Newer generation [NVIDIA Hopper GPUs](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/) come equipped with FP8 Tensor Cores that deliver substantial compute gains over FP16.

Our results show the benefit of using HadaCore for accuracy preservation when combined with optimizations such as FP8 FlashAttention.

<table class="table table-bordered">
  <tr>
    <td><strong>Format</strong></td>
    <td><strong>Method</strong></td>
    <td><strong>Llama3.1-8B</strong><br><strong>Avg. 5-Shot MMLU Accuracy</strong></td>
  </tr>
  <tr>
    <td><strong>Q, K, V: FP16</strong><br><strong>FlashAttention: FP16</strong></td>
    <td>N/A</td>
    <td>65.38</td>
  </tr>
  <tr>
    <td><strong>Q, K, V: FP16</strong><br><strong>FlashAttention: FP8</strong></td>
    <td>No Hadamard</td>
    <td>64.40</td>
  </tr>
  <tr>
    <td><strong>Q, K, V: FP8</strong><br><strong>FlashAttention: FP8</strong></td>
    <td>HadaCore</td>
    <td>65.09</td>
  </tr>
  <tr>
    <td><strong>Q, K, V: FP8</strong><br><strong>FlashAttention: FP8</strong></td>
    <td>Dao AI Fast Hadamard Kernel</td>
    <td>65.45</td>
  </tr>
</table>

*Table 1: MMLU scores for Llama3.1-8B with an FP16 baseline and FP8 attention using Hadamard transforms, comparing an implementation with explicit Hadamard matrix multiplications vs. HadaCore (**higher is better**)*

From the above MMLU scores, we note that for Llama3.1-8B inference with FP8 attention, HadaCore mitigates the quantization error introduced by computing attention in lower precision.

## Conclusion

We showcased the speedups achieved by moving the Fast Walsh-Hadamard algorithm into a CUDA kernel that leverages Tensor Core acceleration, achieving peak speedups of **3.5x** and **3.6x** over the Dao AI Fast Hadamard kernel on NVIDIA A100 and H100, respectively.

Further, we showed on the MMLU benchmark that rotating with HadaCore maintains similar quantization error reduction to the Fast Hadamard kernel while providing computational acceleration.

## Future Work

We plan to implement a Triton version of our kernel and experiment with more advanced techniques such as kernel fusion to support fused Hadamard transform and quantization. Further, we plan to extend our kernel to support BF16 Tensor Core compute.

Diff for: assets/images/hadacore/fg1.png (247 KB)

Diff for: assets/images/hadacore/fg2.png (59.7 KB)

Diff for: assets/images/hadacore/fg3.png (22.1 KB)

Diff for: assets/images/hadacore/fg4.png (247 KB)

Diff for: assets/images/hadacore/fg5.png (81.4 KB)

Diff for: assets/images/hadacore/fg6.png (236 KB)

Diff for: assets/images/hadacore/fg7.png (85 KB)

Diff for: assets/images/training-using-float8-fsdp2/fg1.png (47.6 KB)

Diff for: assets/images/training-using-float8-fsdp2/fg2.png (47.6 KB)

Diff for: ‎events.html

+1-4
Original file line numberDiff line numberDiff line change
@@ -21,17 +21,14 @@ <h1>Events</h1>
2121
<nav class="navbar navbar-expand-lg navbar-light main-content-menu">
2222
<ul class="navbar-nav events-nav">
2323
<li id="events" class="nav-item nav-select">
24-
<span class="nav-link events-nav-link">Live Events</span>
24+
<span class="nav-link events-nav-link">Events</span>
2525
</li>
2626
<li id="live-streams" class="nav-item">
2727
<span class="nav-link events-nav-link">Webinars</span>
2828
</li>
2929
<li id="podcasts" class="nav-item">
3030
<span class="nav-link events-nav-link">Podcasts</span>
3131
</li>
32-
<li id="community-events" class="nav-item">
33-
<span class="nav-link events-nav-link">Community Events</span>
34-
</li>
3532
</ul>
3633
</nav>
3734
</div>
