Suggestion Description
Hello,
We are currently optimizing DeepSeek with Expert Parallelism (EP) for relatively high concurrency (around 2K). In the process, we identified opportunities to improve the current aiter MoE assembly kernel and would like to submit a feature request.
We are experimenting with a8w8 mode and observed that the following kernel is invoked:
`fmoe_bf16_pertokenFp8_g1u1_silu_1tg_ps_32x512`
With a workload of 2048 tokens per GPU, this kernel achieved the following performance (inter dim = 2048, hidden dim = 7168):
- Execution time: ~484 µs
- Throughput: ~372 TFLOPS
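As a cross-check, the measured TFLOPS figure follows directly from the problem dimensions. A minimal sketch in Python, assuming the g1u1 layout fuses the gate and up projections (so the first GEMM's N dimension is 2 × inter dim) and that all 2048 routed tokens are processed:

```python
# Cross-check of the measured throughput (assumption: g1u1 = fused
# gate + up projections, so GEMM1's N dimension is 2 * inter_dim).
tokens = 2048      # routed tokens per GPU
hidden = 7168      # hidden dim
inter = 2048       # inter dim

# GEMM1: [tokens, hidden] x [hidden, 2*inter]
# GEMM2: [tokens, inter]  x [inter, hidden]
flops = 2 * tokens * hidden * (2 * inter) + 2 * tokens * inter * hidden
exec_time_s = 484e-6  # measured execution time

print(f"{flops / exec_time_s / 1e12:.1f} TFLOPS")  # -> ~372.7
```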
The modification we would like to request mainly concerns the block size. The current kernel operates with `block_size_m = 32`. We hypothesize that increasing this size could improve MoE performance, especially in high-concurrency cases. Since this operation behaves like a GEMM, a larger `block_size_m` is expected to reduce redundant I/O and improve effective TFLOPS, as sketched below.
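To make the redundant-I/O argument concrete, here is a minimal sketch of the weight-traffic model we have in mind (a hypothetical estimator, not aiter code): every tile of `block_size_m` token rows has to stream the full expert weights from global memory once, so weight traffic scales with ceil(tokens_per_expert / block_size_m).

```python
import math

def expert_weight_traffic_gb(num_experts, tokens_per_expert,
                             inter_dim, hidden_dim, block_size_m,
                             bytes_per_weight=1):  # fp8 weights in a8w8
    """Global-memory bytes spent reading expert weights.

    Assumption: each tile of `block_size_m` token rows streams the full
    expert weights once: w1 [2*inter, hidden] (gate + up) and
    w2 [hidden, inter].
    """
    weight_bytes = (2 * inter_dim + inter_dim) * hidden_dim * bytes_per_weight
    m_tiles = math.ceil(tokens_per_expert / block_size_m)
    return num_experts * m_tiles * weight_bytes / 1e9

for bm in (32, 64):
    print(bm, expert_weight_traffic_gb(32, 64, 2048, 7168, bm))
# -> ~2.82 GB at block_size_m=32 (close to the 2.81 GB in the table
#    below) and ~1.41 GB at block_size_m=64; the table additionally
#    accounts for token traffic, so its numbers differ slightly.
```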
Estimated Performance Scaling
Below is a simple estimate of how I/O traffic and performance may improve with different block sizes. The assumptions and rationale behind this estimate are as follows:
- When calculating I/O, we did not simply use the tensor sizes. Instead, we estimated the actual total amount of global memory accessed, taking into account the tiling of the matrix multiplication; in other words, the estimate reflects the fact that smaller tiles lead to more frequent reloads of the same data.
- To simplify the estimation, we focused mainly on the total I/O involved in reading the expert weights. For tokens, the access pattern can vary depending on implementation details (e.g., in a two-stage fused MoE, intermediate tensors may not be written to global memory), and these differences are not expected to significantly affect the overall estimate.
- We assume a uniform distribution of tokens across all experts.
- The first row shows the measured execution time and TFLOPS from the actual kernel run. We believe the kernel is bound by memory bandwidth in this case.
- The other two rows are estimates. We assumed that (1) the total amount of I/O changes with the block size and (2) the kernel operates at the same effective memory bandwidth as in the first row. Under these assumptions, higher TFLOPS are expected.
- Note that this estimation is not intended to predict exact TFLOPS, but to show that the potential performance ceiling increases.
| Experts/GPU | Tokens/Expert | inter dim | hidden dim | block_size_m | block_size_n | Input + Output I/O (GB) | Exec Time (µs) | TFLOPS |
|---|---|---|---|---|---|---|---|---|
| 32 | 64 | 2048 | 7168 | 32 | 512 | ***2.81*** | 484 | 372.70 |
| 32 | 64 | 2048 | 7168 | 64 | 256 | ***1.66*** | ***286*** | ***630.22*** |
| 16 | 128 | 2048 | 7168 | 128 | 128 | ***1.33*** | ***230*** | ***785.24*** |
Values shown in bold italics are estimated.
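Under the bandwidth-bound assumption, the estimated rows follow from the measured row by scaling the execution time with the I/O ratio and recomputing TFLOPS at the fixed total FLOP count. A minimal sketch (I/O values taken from the table; small deviations from the table come from rounding):

```python
flops = 180.39e9              # total MoE FLOPs for 2048 tokens (see above)
t_ref, io_ref = 484e-6, 2.81  # measured row: exec time (s), I/O (GB)

for io_gb in (1.66, 1.33):    # I/O of the estimated rows
    t_est = t_ref * io_gb / io_ref  # same effective bandwidth as measured
    print(f"{t_est * 1e6:.0f} us, {flops / t_est / 1e12:.0f} TFLOPS")
# -> ~286 us / ~631 TFLOPS and ~229 us / ~787 TFLOPS, matching the
#    estimated rows within rounding.
```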
Feature Request
For the reasons described above, we believe that a kernel supporting a `block_size_m` larger than 32 is needed. In this regard:
- It would be great if the aiter team could add an assembly kernel with a larger block size.
- Alternatively, if there is a kernel generation script that would allow us to experiment with different block sizes, sharing it would be very helpful; we could then contribute directly to improving the functionality and coverage of aiter.
This issue is submitted on behalf of Moreh Inc.
Operating System
No response
GPU
MI300
ROCm Component
No response