Why are only some of the NICs used during allreduce, and why are inter-node connections not on the same rail even when I explicitly specify using all NICs?
#1600
Open
FortPercent opened this issue
Feb 11, 2025
· 2 comments
When using NCCL, I found that the inter-node RDMA connections only utilized a subset of the NICs instead of all available ones. Additionally, the inter-node connections did not appear to be on the same rail. As a result, when testing with 16 H100 machines, the allreduce bandwidth was only 10 GB/s.
Below are our topology diagram and a log captured with NCCL_DEBUG=INFO on 2 nodes.
Here are the environment variables we used:
NCCL_NET_GDR_LEVEL=2
NCCL_IBEXT_DISABLE=1
NCCL_SHARP_DISABLE=1
I suggest adding NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,TUNING to gain more insight into the topology that NCCL is seeing and the collective algorithm/protocol it chooses.
The log you attached is from just one node -- based on it we can't be sure what the situation is on the other node.
NCCL will normally prefer rail-optimized connectivity but by default that's not actually enforced. See NCCL_CROSS_NIC for the list of available options.
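Putting the suggestions above together, a debug re-run could look like the sketch below. The NCCL_CROSS_NIC value meanings are summarized from the NCCL environment-variable docs and should be verified against the docs for your NCCL version; the launch line is a placeholder, not the reporter's actual command.

```shell
# Enable verbose topology/tuning output, as suggested above.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,TUNING

# Keep the variables already in use from the original report.
export NCCL_NET_GDR_LEVEL=2
export NCCL_IBEXT_DISABLE=1
export NCCL_SHARP_DISABLE=1

# Per the NCCL docs (verify for your version): 0 tries to keep each
# ring/tree on the same NIC across nodes (rail-aligned), 1 allows
# crossing rails, 2 (the default) lets NCCL decide.
export NCCL_CROSS_NIC=0

# Hypothetical launch of nccl-tests across 2 nodes (paths/hosts are placeholders):
# mpirun -np 16 -H host1:8,host2:8 ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1
```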
Thanks for the suggestion! I'll add NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,TUNING for more insights and check NCCL_CROSS_NIC for connectivity options. I'll also collect logs from both nodes for a full picture. Appreciate your input!
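When comparing logs from both nodes, it can help to tally which NIC indices actually carry inter-node channels. Below is a minimal sketch that counts NIC usage from NCCL INFO output; it assumes channel-setup lines of the form `... via NET/IB/<n>`, which is typical of NCCL_DEBUG=INFO output but may differ across NCCL versions, and the sample excerpt is fabricated for illustration (it is not from the attached test.log).

```python
import re
from collections import Counter

def nic_usage(log_text: str) -> Counter:
    """Count how often each NIC index appears in NCCL channel-setup lines.

    Assumes lines containing 'via NET/IB/<n>' as emitted by NCCL_DEBUG=INFO;
    adjust the pattern if your NCCL version formats these lines differently.
    """
    pattern = re.compile(r"via NET/IB/(\d+)")
    return Counter(int(m.group(1)) for m in pattern.finditer(log_text))

# Hypothetical excerpt for illustration only:
sample = """\
node0:1:1 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0
node0:1:1 [1] NCCL INFO Channel 01/0 : 1[1] -> 9[1] [send] via NET/IB/0
node0:1:1 [2] NCCL INFO Channel 02/0 : 2[2] -> 10[2] [send] via NET/IB/2
"""
print(nic_usage(sample))  # NICs 0 and 2 appear; a missing NIC index means it went unused
```

If only a few indices show up across all channels, that matches the symptom described in the original report of a subset of NICs being used.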
Attachments: test.log, topo.txt