[Issue]: asm MoE fails to call hipModuleLaunchKernel( kernel_func, gdx, gdy, gdz, bdx, 1, 1, 0, stream, nullptr, (void**)&config) ---> [HIP error](invalid argument) #1220

@b8zhong

Problem Description

We encountered this issue while running SGLang:

python3 -m sglang.bench_one_batch --batch-size 256 --input-len 1024 --output-len 1024 --model-path deepseek-ai/DeepSeek-V3.1 --load-format dummy --tp 8

[2025-10-17 23:21:26 TP4] shape is M:262144, N:512, K:7168, not found tuned config in CKGEMM, will use default config!
[2025-10-17 23:21:26 TP4] shape is M:262144, N:7168, K:256, not found tuned config in CKGEMM, will use default config!
(the same two CKGEMM messages repeat on TP5)
[2025-10-17 23:21:26 TP6] type hints mismatch, override to --> moe_fused_gate(input: torch.Tensor, bias: torch.Tensor, topk_weights: torch.Tensor, topk_ids: torch.Tensor, num_expert_group: int, topk_group: int, topk: int, n_share_experts_fusion: int, routed_scaling_factor: float = 1.0) -> list[torch.Tensor]
(the same type-hints message repeats on TP4, TP5, and TP7)
[2025-10-17 23:21:26 TP6] [fused_moe] using 1stage (kernelName1='_ZN5aiter47fmoe_bf16_blockscaleFp8_g1u1_vs_silu_1tg_32x256E', kernelName2='Null') for (256, 1024, 7168, 256, 256, 8, 'ActivationType.Silu', 'torch.bfloat16', 'torch.float8_e4m3fn', 'torch.float8_e4m3fn', 'QuantType.per_1x128', True, False)
(the same [fused_moe] message repeats on TP4, TP5, and TP7)

[AITER] /sgl-workspace/aiter/aiter/jit/build/module_moe_asm/build/srcs/asm_fmoe.hip:250 fail to call hipModuleLaunchKernel( kernel_func, gdx, gdy, gdz, bdx, 1, 1, 0, stream, nullptr, (void**)&config) ---> [HIP error](invalid argument)
(the [AITER] failure above is emitted four times, once after each rank's [fused_moe] message; every line is also duplicated with a bare [aiter] prefix, and those duplicates are omitted here)
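For context on the error itself: an "invalid argument" (hipErrorInvalidValue) return from hipModuleLaunchKernel is typically raised when a grid or block dimension is zero or exceeds a device limit, or when the "extra" launch-config array is malformed. Below is a minimal, hypothetical diagnostic sketch, not the aiter source: checkedLaunch, config_args, and config_size are illustrative names, and the HIP_LAUNCH_PARAM marshalling is an assumption about how the config pointer at asm_fmoe.hip:250 is packed.

#include <hip/hip_runtime.h>
#include <cstdio>

// Hypothetical helper (illustrative, not the aiter code): sanity-check the
// arguments that the failing call site passes to hipModuleLaunchKernel,
// then launch with the HIP_LAUNCH_PARAM "extra" convention.
hipError_t checkedLaunch(hipFunction_t kernel_func,
                         unsigned gdx, unsigned gdy, unsigned gdz,
                         unsigned bdx, hipStream_t stream,
                         void* config_args, size_t config_size) {
    int maxGridX = 0, maxBlockX = 0;
    hipDeviceGetAttribute(&maxGridX, hipDeviceAttributeMaxGridDimX, 0);
    hipDeviceGetAttribute(&maxBlockX, hipDeviceAttributeMaxBlockDimX, 0);
    if (gdx == 0 || gdx > (unsigned)maxGridX)
        fprintf(stderr, "suspect grid dim: gdx=%u (device max %d)\n", gdx, maxGridX);
    if (bdx == 0 || bdx > (unsigned)maxBlockX)
        fprintf(stderr, "suspect block dim: bdx=%u (device max %d)\n", bdx, maxBlockX);

    // kernelParams is nullptr, as in the failing call; the kernel arguments
    // travel in the "extra" array, which must end with HIP_LAUNCH_PARAM_END.
    void* extra[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, config_args,
                     HIP_LAUNCH_PARAM_BUFFER_SIZE,    &config_size,
                     HIP_LAUNCH_PARAM_END};
    return hipModuleLaunchKernel(kernel_func, gdx, gdy, gdz,
                                 bdx, 1, 1, 0, stream, nullptr, extra);
}

This is only a triage aid; the actual root cause may lie in how the launch parameters are computed for large M rather than in the launch call itself.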

Operating System

NAME="Ubuntu" VERSION="22.04.5 LTS (Jammy Jellyfish)"

CPU

AMD EPYC 9575F 64-Core Processor

GPU

MI355X x 8

ROCm Version

7.0.0

ROCm Component

No response

Steps to Reproduce

Use the latest SGLang Docker image for ROCm. I found that this issue is encountered at BS > 256. (Note that with --batch-size 256 and --input-len 1024, the prefill GEMMs in the log run at M = 256 × 1024 = 262144.)

python3 -m sglang.bench_one_batch --batch-size 256 --input-len 1024 --output-len 1024 --model-path deepseek-ai/DeepSeek-V3.1 --load-format dummy --tp 8
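If useful for triage, the HIP runtime's own logging may show which launch argument is being rejected. AMD_LOG_LEVEL is the standard ROCm runtime logging knob (3 = info level); the rest of the command is just the reproducer above:

AMD_LOG_LEVEL=3 python3 -m sglang.bench_one_batch --batch-size 256 --input-len 1024 --output-len 1024 --model-path deepseek-ai/DeepSeek-V3.1 --load-format dummy --tp 8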

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
