New chaining/partitioning algorithm for async_scheduling for inference (pytorch#11957)
Summary:
Pull Request resolved: pytorch#11957
For distributed inference, we want to use the async_scheduling net to run the net, since we need its asynchronous execution support. However, profiling shows that async_scheduling has significant overhead from dispatching tasks onto worker threads. This diff reduces that overhead by generating a smaller number of chains/tasks: sync ops that can run in one shot are grouped into a single chain.
Note that it also schedules each individual async op as its own chain, because unlike GPU ops, RPC ops are not guaranteed to be linearized at the remote site. For example, given two RPC ops `op1->op2`, op2 won't implicitly block until op1 finishes. Therefore each async op must be its own chain, since async_scheduling only syncs on the tail of a chain.
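A minimal sketch of the chain-grouping idea described above, not the actual Caffe2 implementation: it assumes ops arrive in a topologically sorted list, and the `Op` type with its `is_async` flag is a hypothetical stand-in. Consecutive sync ops are merged into one chain; every async op terminates a chain of its own, so the executor's tail-only synchronization still observes its completion.

```python
# Sketch only: grouping sync ops into chains while isolating async ops.
# `Op` and `is_async` are illustrative stand-ins, not Caffe2 APIs.
from dataclasses import dataclass
from typing import List

@dataclass
class Op:
    name: str
    is_async: bool  # e.g. an RPC op that may complete out of order remotely

def partition_into_chains(ops: List[Op]) -> List[List[Op]]:
    """Group consecutive sync ops into one chain; isolate each async op.

    The executor only synchronizes on the tail op of each chain, so an
    async op must end its own chain; otherwise a later op could be
    scheduled before the async op's remote work has finished.
    """
    chains: List[List[Op]] = []
    current: List[Op] = []
    for op in ops:
        if op.is_async:
            if current:            # flush the pending sync chain
                chains.append(current)
                current = []
            chains.append([op])    # async op forms a singleton chain
        else:
            current.append(op)     # keep accumulating sync ops
    if current:
        chains.append(current)
    return chains

# Example: three sync ops around two RPC ops yield four chains (tasks)
# instead of five single-op tasks.
ops = [Op("fc1", False), Op("relu", False),
       Op("rpc_send", True), Op("rpc_recv", True), Op("softmax", False)]
print([[o.name for o in c] for c in partition_into_chains(ops)])
# [['fc1', 'relu'], ['rpc_send'], ['rpc_recv'], ['softmax']]
```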
For all-sync-op nets, this change makes async_scheduling `1.5X` slower than simple_net, whereas without it, async_scheduling is `7X` slower.
The next step is to make task scheduling in the executor faster, and to add a fallback path that runs ops inline when the net is all-sync.
Reviewed By: ilia-cher
Differential Revision: D9874140
fbshipit-source-id: fcd45328698c29211f2c06ee3287194acda12227