NCCL Test Multi-node Bus Bandwidth Tuning issue #283

Open
LXLei opened this issue Jan 18, 2025 · 0 comments

LXLei commented Jan 18, 2025

There are 248 GPU servers in the environment, each equipped with 8 H200 GPUs, running an all_reduce_perf test with the default NCCL_ALGO. During extended testing, the bus bandwidth has been unstable: it typically measures around 350 GB/s but occasionally drops below 300 GB/s. What could be the reasons for this? Additionally, at this node scale, what bus bandwidth should be expected with NCCL_ALGO=Ring and with NCCL_ALGO=NVLSTree, respectively?
Here are the test command parameters:

mpirun --mca btl_tcp_if_include bond0 --allow-run-as-root -np 1984 \
  -hostfile hostlist --map-by node \
  -x NCCL_TOPO_DUMP_FILE=topo.xml \
  -x NCCL_SOCKET_IFNAME=bond0 \
  -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
  -x UCX_TLS=rc,sm \
  -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x LD_LIBRARY_PATH \
  -x PATH \
  -bind-to numa --gmca btl tcp,self \
  ../build/all_reduce_perf --minbytes 512M --maxbytes 512M -f2 -g1 -n200000000 -c0 -i10 -w 200
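
One way to compare the two algorithms directly is to rerun the same test twice with the algorithm pinned and NCCL's tuning decisions logged. The sketch below assumes the same hostfile, interfaces, and binary path as above, and shortens the iteration count for a quick A/B comparison (swap NCCL_ALGO=Ring for NCCL_ALGO=NVLSTree on the second run):

mpirun --mca btl_tcp_if_include bond0 --allow-run-as-root -np 1984 \
  -hostfile hostlist --map-by node \
  -x NCCL_SOCKET_IFNAME=bond0 \
  -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
  -x NCCL_ALGO=Ring \
  -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=TUNING \
  -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x LD_LIBRARY_PATH -x PATH \
  -bind-to numa --gmca btl tcp,self \
  ../build/all_reduce_perf --minbytes 512M --maxbytes 512M -f2 -g1 -n 100 -w 20 -c0

Running the same 512M message size once per pinned algorithm and comparing the busbw column helps separate the algorithm choice from run-to-run network variation; NCCL_DEBUG=INFO with NCCL_DEBUG_SUBSYS=TUNING should also report which algorithm and protocol NCCL actually selects when NCCL_ALGO is left at its default.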
