Suggestion Description
Hello,
We are currently optimizing DeepSeek with Expert Parallelism (EP) for relatively high concurrency (around 2K). In the process, we identified opportunities to improve the current aiter MoE assembly kernel and would like to submit a feature request.
We are experimenting with a8w8 mode and observed that the following kernel is invoked:
`fmoe_bf16_pertokenFp8_g1u1_silu_1tg_ps_32x512`
With a workload of 2048 tokens per GPU, this kernel achieved the following performance (inter dim = 2048, hidden dim = 7168):
- Execution time: ~484 µs
- Throughput: ~372 TFLOPS
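As a cross-check, the measured TFLOPS figure follows directly from the problem dimensions. A minimal sketch in Python, assuming the g1u1 layout fuses the gate and up projections (so the first GEMM's N dimension is 2 × inter dim) and that all 2048 routed tokens are processed:

```python
# Cross-check of the measured throughput (assumption: g1u1 = fused
# gate + up projections, so GEMM1's N dimension is 2 * inter_dim).
tokens = 2048      # routed tokens per GPU
hidden = 7168      # hidden dim
inter = 2048       # inter dim

# GEMM1: [tokens, hidden] x [hidden, 2*inter]
# GEMM2: [tokens, inter]  x [inter, hidden]
flops = 2 * tokens * hidden * (2 * inter) + 2 * tokens * inter * hidden
exec_time_s = 484e-6  # measured execution time

print(f"{flops / exec_time_s / 1e12:.1f} TFLOPS")  # -> ~372.7
```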
The modification we would like to request mainly concerns the block size. The current kernel operates with `block_size_m = 32`. We hypothesize that increasing this size could improve MoE performance, especially in high-concurrency cases. Since this operation behaves like a GEMM, a larger `block_size_m` is expected to reduce redundant I/O and improve effective TFLOPS, as sketched below.
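To make the redundant-I/O argument concrete, here is a minimal sketch of the weight-traffic model we have in mind (a hypothetical estimator, not aiter code): every tile of `block_size_m` token rows has to stream the full expert weights from global memory once, so weight traffic scales with ceil(tokens_per_expert / block_size_m).

```python
import math

def expert_weight_traffic_gb(num_experts, tokens_per_expert,
                             inter_dim, hidden_dim, block_size_m,
                             bytes_per_weight=1):  # fp8 weights in a8w8
    """Global-memory bytes spent reading expert weights.

    Assumption: each tile of `block_size_m` token rows streams the full
    expert weights once: w1 [2*inter, hidden] (gate + up) and
    w2 [hidden, inter].
    """
    weight_bytes = (2 * inter_dim + inter_dim) * hidden_dim * bytes_per_weight
    m_tiles = math.ceil(tokens_per_expert / block_size_m)
    return num_experts * m_tiles * weight_bytes / 1e9

for bm in (32, 64):
    print(bm, expert_weight_traffic_gb(32, 64, 2048, 7168, bm))
# -> ~2.82 GB at block_size_m=32 (close to the 2.81 GB in the table
#    below) and ~1.41 GB at block_size_m=64; the table additionally
#    accounts for token traffic, so its numbers differ slightly.
```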
Estimated Performance Scaling
Below is a simple estimate of how I/O traffic and performance may improve with different block sizes. The assumptions and rationale behind this estimate are as follows:
- When calculating I/O, we did not simply use the tensor sizes. Instead, we estimated the actual total amount of global memory accessed, taking into account the tiling of the matrix multiplication; in other words, the estimate reflects the fact that smaller tiles lead to more frequent reloads of the same data.
- To simplify the estimation, we focused mainly on the total I/O involved in reading the expert weights. For tokens, the access pattern can vary depending on implementation details (e.g., in a two-stage fused MoE, intermediate tensors may not be written to global memory), and these differences are not expected to significantly affect the overall estimate.
- We assume a uniform distribution of tokens across all experts.
- The first row shows the measured execution time and TFLOPS from the actual kernel run. We believe the kernel is bound by memory bandwidth in this case.
- The other two rows are estimates. We assumed that (1) the total amount of I/O changes with the block size and (2) the kernel operates at the same effective memory bandwidth as in the first row. Under these assumptions, higher TFLOPS are expected.
- Note that this estimation is not intended to predict exact TFLOPS, but to show that the potential performance ceiling increases.
| Experts/GPU | Tokens/Expert | inter dim | hidden dim | block_size_m | block_size_n | Input + Output I/O (GB) | Exec Time (µs) | TFLOPS |
|---|---|---|---|---|---|---|---|---|
| 32 | 64 | 2048 | 7168 | 32 | 512 | ***2.81*** | 484 | 372.70 |
| 32 | 64 | 2048 | 7168 | 64 | 256 | ***1.66*** | ***286*** | ***630.22*** |
| 16 | 128 | 2048 | 7168 | 128 | 128 | ***1.33*** | ***230*** | ***785.24*** |
Values shown in bold italics are estimated.
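Under the bandwidth-bound assumption, the estimated rows follow from the measured row by scaling the execution time with the I/O ratio and recomputing TFLOPS at the fixed total FLOP count. A minimal sketch (I/O values taken from the table; small deviations from the table come from rounding):

```python
flops = 180.39e9              # total MoE FLOPs for 2048 tokens (see above)
t_ref, io_ref = 484e-6, 2.81  # measured row: exec time (s), I/O (GB)

for io_gb in (1.66, 1.33):    # I/O of the estimated rows
    t_est = t_ref * io_gb / io_ref  # same effective bandwidth as measured
    print(f"{t_est * 1e6:.0f} us, {flops / t_est / 1e12:.0f} TFLOPS")
# -> ~286 us / ~631 TFLOPS and ~229 us / ~787 TFLOPS, matching the
#    estimated rows within rounding.
```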
Feature Request
For the reasons described above, we believe that a kernel supporting a `block_size_m` larger than 32 is needed. In this regard:
- It would be great if the aiter team could add an assembly kernel with a larger block size.
- Alternatively, if there is a kernel generation script that would allow us to experiment with different block sizes, sharing it would be very helpful; we could then contribute directly to improving the functionality and coverage of aiter.
This issue is submitted on behalf of Moreh Inc.
Operating System
No response
GPU
MI300
ROCm Component
No response