Why are only some of the NICs used during allreduce, and why are inter-node connections not on the same rail even when I explicitly specify using all NICs?
#1600
Open
FortPercent opened this issue
Feb 11, 2025
· 2 comments
When using NCCL, I found that the inter-node RDMA connections only utilized a subset of the NICs instead of all available ones. Additionally, the inter-node connections did not appear to be on the same rail. As a result, when testing with 16 H100 machines, the allreduce bandwidth was only 10 GB/s.
Below are our topology diagram and a log captured with NCCL_DEBUG=INFO on 2 nodes.
Here are the environment variables we used:
NCCL_NET_GDR_LEVEL=2
NCCL_IBEXT_DISABLE=1
NCCL_SHARP_DISABLE=1
I suggest adding NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,TUNING to gain more insight into the topology that NCCL is seeing and the collective algorithm/protocol it chooses.
The log you attached is from just one node -- based on it we can't be sure what the situation is on the other node.
NCCL will normally prefer rail-optimized connectivity but by default that's not actually enforced. See NCCL_CROSS_NIC for the list of available options.
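Putting the suggestions above together, a debug re-run could look like the sketch below. The NCCL_CROSS_NIC value meanings are summarized from the NCCL environment-variable docs and should be verified against the docs for your NCCL version; the launch line is a placeholder, not the reporter's actual command.

```shell
# Enable verbose topology/tuning output, as suggested above.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,TUNING

# Keep the variables already in use from the original report.
export NCCL_NET_GDR_LEVEL=2
export NCCL_IBEXT_DISABLE=1
export NCCL_SHARP_DISABLE=1

# Per the NCCL docs (verify for your version): 0 tries to keep each
# ring/tree on the same NIC across nodes (rail-aligned), 1 allows
# crossing rails, 2 (the default) lets NCCL decide.
export NCCL_CROSS_NIC=0

# Hypothetical launch of nccl-tests across 2 nodes (paths/hosts are placeholders):
# mpirun -np 16 -H host1:8,host2:8 ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1
```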
Thanks for the suggestion! I'll add NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,TUNING for more insights and check NCCL_CROSS_NIC for connectivity options. I'll also collect logs from both nodes for a full picture. Appreciate your input!
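When comparing logs from both nodes, it can help to tally which NIC indices actually carry inter-node channels. Below is a minimal sketch that counts NIC usage from NCCL INFO output; it assumes channel-setup lines of the form `... via NET/IB/<n>`, which is typical of NCCL_DEBUG=INFO output but may differ across NCCL versions, and the sample excerpt is fabricated for illustration (it is not from the attached test.log).

```python
import re
from collections import Counter

def nic_usage(log_text: str) -> Counter:
    """Count how often each NIC index appears in NCCL channel-setup lines.

    Assumes lines containing 'via NET/IB/<n>' as emitted by NCCL_DEBUG=INFO;
    adjust the pattern if your NCCL version formats these lines differently.
    """
    pattern = re.compile(r"via NET/IB/(\d+)")
    return Counter(int(m.group(1)) for m in pattern.finditer(log_text))

# Hypothetical excerpt for illustration only:
sample = """\
node0:1:1 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0
node0:1:1 [1] NCCL INFO Channel 01/0 : 1[1] -> 9[1] [send] via NET/IB/0
node0:1:1 [2] NCCL INFO Channel 02/0 : 2[2] -> 10[2] [send] via NET/IB/2
"""
print(nic_usage(sample))  # NICs 0 and 2 appear; a missing NIC index means it went unused
```

If only a few indices show up across all channels, that matches the symptom described in the original report of a subset of NICs being used.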
Attachments: test.log, topo.txt