Place args for all gather/reduce on devices before the op to avoid CSE and excessive copying (#171)
We encode the device/shard information in the flow.tensor.transfer / transfer_to_logical_device operation. If we then do an all-gather or an all-reduce, CSE happily collapses the otherwise-identical per-device expressions into one, so the all-gather/all-reduce is performed on a single device and the result is copied to the rest. We want each device to perform its own all-gather/all-reduce, so the arguments are now placed on their logical devices before the op.
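Here is a minimal Python-level sketch of the idea, assuming a `transfer_to_logical_device(tensor, ordinal)` helper analogous to the op above; the function names and signatures are illustrative, not the actual sharktank API:

```python
import torch


def transfer_to_logical_device(t: torch.Tensor, ordinal: int) -> torch.Tensor:
    # Stand-in: in the real flow this lowers to flow.tensor.transfer and
    # tags the tensor with its target logical device.
    return t.clone()


def sharded_all_gather(shards: list[torch.Tensor], dim: int = 0) -> list[torch.Tensor]:
    """Gather all shards onto every device."""
    results = []
    for device_ordinal in range(len(shards)):
        # Transfer every argument onto this device *before* the gather.
        # Each device's expression now references its own transfer ops,
        # so CSE cannot collapse the per-device gathers into a single one.
        on_device = [transfer_to_logical_device(s, device_ordinal) for s in shards]
        results.append(torch.cat(on_device, dim=dim))
    return results
```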
There is no easy way to test the desired effect directly, but we at least test for correctness at the PyTorch level.
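A sketch of the kind of PyTorch-level correctness check meant here, using the illustrative helpers above:

```python
def test_all_gather_matches_unsharded():
    torch.manual_seed(0)
    shards = [torch.rand(2, 3) for _ in range(4)]
    expected = torch.cat(shards, dim=0)
    # Every device should end up with the same, correct gathered tensor.
    for per_device_result in sharded_all_gather(shards, dim=0):
        torch.testing.assert_close(per_device_result, expected)
```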
This change also adds an all_reduce op, which is currently not used anywhere.
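As a sketch, an all_reduce in the same style (assuming a sum reduction; the real op may support other reduction kinds) would likewise place its arguments on each device before reducing:

```python
def sharded_all_reduce(shards: list[torch.Tensor]) -> list[torch.Tensor]:
    """Reduce (sum) all shards onto every device."""
    results = []
    for device_ordinal in range(len(shards)):
        on_device = [transfer_to_logical_device(s, device_ordinal) for s in shards]
        reduced = on_device[0]
        for t in on_device[1:]:
            reduced = reduced + t
        results.append(reduced)
    return results
```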
It also expands the elementwise op to support a variable number of tensor arguments.
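Illustrative only (the actual sharktank signature may differ), a variadic elementwise op applies the callable shard-by-shard across however many sharded tensor arguments it receives:

```python
def sharded_elementwise(fn, *sharded_args: list[torch.Tensor]) -> list[torch.Tensor]:
    shard_count = len(sharded_args[0])
    assert all(len(arg) == shard_count for arg in sharded_args)
    # e.g. fn = torch.mul for two arguments, or torch.addcmul for three.
    return [fn(*(arg[i] for arg in sharded_args)) for i in range(shard_count)]
```

For example, `sharded_elementwise(torch.mul, a_shards, b_shards)` multiplies two sharded tensors shard-wise.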