Question about Nsight Compute results for sm100_fp8_fp4_mega_moe_impl on B300/Blackwell
Hi, thanks for providing the scripts/run_ncu_mega_moe.sh script.
I am trying to profile the Mega-MoE kernel on B300/Blackwell with Nsight Compute. Starting from scripts/run_ncu_mega_moe.sh, I modified ncu_args by adding --set full to collect more metrics.
The relevant NCU arguments are:
ncu_args=(
--config-file off
--force-overwrite
--kernel-name sm100_fp8_fp4_mega_moe_impl
--import-source yes
--replay-mode application
--section PmSampling
--section SourceCounters
--rule LocalMemoryUsage
--set full
--launch-skip 0
--launch-count 1
--lockstep-kernel-launch
--communicator tcp
--clock-control none
--pm-sampling-interval 1000
--pm-sampling-max-passes 1
--disable-pm-warp-sampling
--communicator-tcp-num-peers "$num_processes"
--kill yes
--app-replay-buffer memory
)
However, the generated NCU report shows very low utilization:
- SM throughput is only around 1%
- Memory throughput is only around 7%
The detailed throughput numbers are also very low, as shown in the screenshots below.


I am not sure whether this result is expected.
My understanding is that this kernel should use Blackwell-specific features such as:
- UMMA / tcgen05.mma for matrix computation
- TMA for memory movement
- Tensor Memory / TMEM as part of the Blackwell MMA pipeline
So I would not expect the overall SM and memory utilization to be this low, unless the standard Nsight Compute SpeedOfLight metrics do not properly reflect UMMA/TMA/TMEM activity, or unless I am collecting the wrong metrics.
Could you please clarify the following?
- Is it expected that the standard NCU SpeedOfLight page reports such low SM and memory utilization for this Mega-MoE kernel on B300/Blackwell?
- Does --set full include the right metrics for analyzing Blackwell UMMA/TMA/TMEM usage, or should I explicitly collect metrics such as:
sm__inst_executed_pipe_tc.sum
sm__inst_executed_pipe_tc.avg.pct_of_peak_sustained_elapsed
sm__inst_executed_pipe_tma.sum
sm__inst_executed_pipe_tma.avg.pct_of_peak_sustained_elapsed
sm__inst_executed_pipe_tmem.sum
sm__inst_executed_pipe_tmem.avg.pct_of_peak_sustained_elapsed
sass__inst_executed_per_opcode
sass__inst_executed_per_opcode_pipeline
- Is there a recommended Nsight Compute command or metric set for profiling this kernel on B300/Blackwell, especially for checking UMMA, TMA, Tensor Memory, and NVLink/communication behavior?
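For reference, this is a minimal sketch of how I would pass the explicit metrics from the list above, using ncu's `--metrics` option, which takes a comma-separated list. The metric names are copied from my question and I have not confirmed that they all exist on B300; `ncu --query-metrics` should show which ones are actually valid. The variable names `candidate_metrics` and `metrics_arg` are my own and are not part of the provided script.

```shell
# Candidate metric names copied from the question above. Whether each one
# exists on B300 is an assumption; `ncu --query-metrics` lists valid names.
candidate_metrics='
sm__inst_executed_pipe_tc.sum
sm__inst_executed_pipe_tc.avg.pct_of_peak_sustained_elapsed
sm__inst_executed_pipe_tma.sum
sm__inst_executed_pipe_tma.avg.pct_of_peak_sustained_elapsed
sm__inst_executed_pipe_tmem.sum
sm__inst_executed_pipe_tmem.avg.pct_of_peak_sustained_elapsed
sass__inst_executed_per_opcode
sass__inst_executed_per_opcode_pipeline
'

# Drop the empty lines, then join the remaining names with commas,
# which is the format the --metrics option expects.
metrics_arg=$(printf '%s\n' "$candidate_metrics" | sed '/^$/d' | paste -sd, -)

# Print the extra arguments so they can be inspected before profiling.
echo "--metrics $metrics_arg"

# In scripts/run_ncu_mega_moe.sh this would be appended to the existing array:
#   ncu_args+=(--metrics "$metrics_arg")
```

If any of these names is not available on this chip, I would expect ncu to report it as unavailable, which is part of what I am asking about.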
I would like to understand whether the low utilization numbers indicate a real performance issue, or whether they are simply not the right metrics for this Blackwell Mega-MoE kernel.
Thanks!