Determine which dim results in the fastest CCL operation #2318

tapspatel · 2025-02-27T16:15:11Z

Currently, when we decompose all_reduce into reduce_scatter and all_gather, we arbitrarily select the tensor dimension to split along: we take a dimension whose size is evenly divisible by the number of devices along the cluster_axis the CCL is performed on, and throw an error if no such dimension exists. Instead, we should select the tensor dim that results in the best performance.
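
To make the proposal concrete, here is a rough Python sketch of the selection step, not the actual compiler code. The names `candidate_split_dims`, `select_split_dim`, and `cost_fn` are hypothetical, and modelling the current "arbitrary" choice as "first divisible dim" is an assumption. The cost function is left as a parameter because how to rank dims (measured timings, a heuristic, etc.) is exactly what this issue needs to determine.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical signature: (shape, dim, num_devices) -> estimated cost (lower is better).
CostFn = Callable[[Sequence[int], int, int], float]


def candidate_split_dims(shape: Sequence[int], num_devices: int) -> List[int]:
    """Return every tensor dim whose size is evenly divisible by num_devices."""
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    return [dim for dim, size in enumerate(shape) if size % num_devices == 0]


def select_split_dim(
    shape: Sequence[int],
    num_devices: int,
    cost_fn: Optional[CostFn] = None,
) -> int:
    """Pick the dim to split along for reduce_scatter + all_gather.

    Without a cost_fn this returns the first divisible dim (an assumed
    stand-in for the current arbitrary choice). With a cost_fn it returns
    the divisible dim with the lowest estimated cost.
    """
    candidates = candidate_split_dims(shape, num_devices)
    if not candidates:
        # Mirrors the behavior described above: no valid dim is an error.
        raise ValueError(
            f"no dim of shape {tuple(shape)} is divisible by {num_devices} devices"
        )
    if cost_fn is None:
        return candidates[0]
    return min(candidates, key=lambda dim: cost_fn(shape, dim, num_devices))
```

For example, with a shape of `(1, 32, 128, 30)` and 8 devices along the cluster axis, dims 1 and 2 are the only candidates, and a cost function built from benchmark data would decide between them.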

tapspatel added the enhancement New feature or request label Feb 27, 2025

tapspatel added this to the [Multi Device 1] milestone Feb 27, 2025

tapspatel mentioned this issue Feb 27, 2025

#2065: Updated all reduce code to handle 0 or 1 mesh cluster axis #2215

Merged
