NCCL Test Multi-node Bus Bandwidth Tuning issue #283

Open
LXLei opened this issue Jan 18, 2025 · 0 comments

LXLei commented Jan 18, 2025

There are 248 GPU servers in the environment, each equipped with 8 H200 GPUs, running an all_reduce_perf test with the default NCCL_ALGO. During extended testing, the bus bandwidth has been unstable: it typically measures around 350 GB/s but occasionally drops below 300 GB/s. What could be the reasons for this? Additionally, at this node scale, what bus bandwidth should be expected with NCCL_ALGO=Ring and with NCCL_ALGO=NVLSTree, respectively?
Here are the test command parameters:

mpirun --mca btl_tcp_if_include bond0 --allow-run-as-root -np 1984 \
  -hostfile hostlist --map-by node \
  -x NCCL_TOPO_DUMP_FILE=topo.xml \
  -x NCCL_SOCKET_IFNAME=bond0 \
  -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
  -x UCX_TLS=rc,sm \
  -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x LD_LIBRARY_PATH \
  -x PATH \
  -bind-to numa --gmca btl tcp,self \
  ../build/all_reduce_perf --minbytes 512M --maxbytes 512M -f2 -g1 -n200000000 -c0 -i10 -w 200
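
One way to compare the two algorithms directly is to rerun the same test twice with the algorithm pinned and NCCL's tuning decisions logged. The sketch below assumes the same hostfile, interfaces, and binary path as above, and shortens the iteration count for a quick A/B comparison (swap NCCL_ALGO=Ring for NCCL_ALGO=NVLSTree on the second run):

mpirun --mca btl_tcp_if_include bond0 --allow-run-as-root -np 1984 \
  -hostfile hostlist --map-by node \
  -x NCCL_SOCKET_IFNAME=bond0 \
  -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
  -x NCCL_ALGO=Ring \
  -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=TUNING \
  -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x LD_LIBRARY_PATH -x PATH \
  -bind-to numa --gmca btl tcp,self \
  ../build/all_reduce_perf --minbytes 512M --maxbytes 512M -f2 -g1 -n 100 -w 20 -c0

Running the same 512M message size once per pinned algorithm and comparing the busbw column helps separate the algorithm choice from run-to-run network variation; NCCL_DEBUG=INFO with NCCL_DEBUG_SUBSYS=TUNING should also report which algorithm and protocol NCCL actually selects when NCCL_ALGO is left at its default.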
