There are 248 GPU servers in the environment, each equipped with 8 H200 GPUs, running an all_reduce_perf test with the default NCCL_ALGO. During extended testing, the bus bandwidth has been unstable: it typically measures around 350 GB/s but occasionally drops below 300 GB/s. What could cause this? Additionally, at this node scale, what bus bandwidth should be expected with NCCL_ALGO=Ring and with NCCL_ALGO=NVLSTree, respectively?

Here are the test command parameters:
```shell
mpirun --mca btl_tcp_if_include bond0 --allow-run-as-root -np 1984 \
  -hostfile hostlist --map-by node \
  -x NCCL_TOPO_DUMP_FILE=topo.xml \
  -x NCCL_SOCKET_IFNAME=bond0 \
  -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
  -x UCX_TLS=rc,sm \
  -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x LD_LIBRARY_PATH \
  -x PATH \
  -bind-to numa --gmca btl tcp,self \
  ../build/all_reduce_perf --minbytes 512M --maxbytes 512M -f 2 -g 1 -n 200000000 -c 0 -i 10 -w 200
```
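For reference, the bus bandwidth that all_reduce_perf reports is derived from the measured algorithm bandwidth via the factor 2(n−1)/n, as documented in nccl-tests. A minimal sketch of that calculation (the 3 ms completion time below is a hypothetical illustration, not a measured value from this environment):

```python
def all_reduce_bus_bw(size_bytes: int, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth (B/s) as nccl-tests computes it for all_reduce:
    busbw = algbw * 2*(n-1)/n, where algbw = size / time."""
    alg_bw = size_bytes / time_s
    return alg_bw * 2 * (n_ranks - 1) / n_ranks

# 512 MiB message across 1984 ranks (248 nodes x 8 GPUs), with a
# hypothetical 3 ms completion time:
size = 512 * 1024**2
ranks = 248 * 8
print(round(all_reduce_bus_bw(size, 3e-3, ranks) / 1e9, 2))  # → 357.73 (GB/s)
```

At 1984 ranks, 2(n−1)/n ≈ 2, so the reported bus bandwidth is essentially twice the algorithm bandwidth; for a ring, it approximates the per-GPU throughput of the bottleneck link.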