Initial pipeline parallelism support #1008

Draft
wants to merge 49 commits into base: main
Conversation

@Alex-Vasile (Contributor) commented Feb 25, 2025

Goal: Introduce pipeline parallelism without requiring a change to the weight irpa files or the forward passes for the different layers (see PPFFN.forward in the example file).

Changes

  • ShardedTensor now explicitly stores the device each of its shards should live on in a .devices attribute. Previously the implicit convention was that shard i lived on device i.
  • A ShardedTensor can also be pinned to specific devices via a .pinned attribute (e.g. for weights), or left unpinned to signal that its shards may be moved if needed.
  • Binary operators call a helper function that checks whether either tensor needs to be transferred so that all shards end up on matching devices. E.g. ops.foo(t1 on devices [1,2,3], t2 pinned on devices [5,6,7]) would transfer the shards of t1 onto devices [5,6,7] before performing the operation (see the sketch after this list).
  • Several helper functions in ops can take a torch.Tensor and therefore won't know what devices to place the shards on, e.g. def replicate(input: AnyTensor, count: int) -> ShardedTensor:. I've added devices and pinned as extra parameters, with defaults that keep the current behaviour unchanged.
  • Added a wrapper to all functions imported from ops.signatures into ops to handle transfer and pinning automatically when called with ShardedTensor subclasses, making device parallelism work mostly invisibly without needing to modify the functions in sharded_impls.py.
    • One downside is that IDEs no longer appear to offer tab completion for ops.___
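
For concreteness, here is a minimal sketch of the placement model described above. It uses a simplified stand-in class rather than the real ShardedTensor API; the class name, fields, and the metadata-only "transfer" are illustrative assumptions, not the PR's actual code.

from dataclasses import dataclass, replace
from typing import Tuple

import torch


@dataclass(frozen=True)
class PlacedShards:
    """Simplified stand-in for a ShardedTensor: shard data plus placement metadata."""

    shards: Tuple[torch.Tensor, ...]
    devices: Tuple[int, ...]  # devices[i] is the logical device shard i should live on
    pinned: bool  # True for weights; False means the shards may be moved if needed


def transfer_if_needed(a: PlacedShards, b: PlacedShards) -> Tuple[PlacedShards, PlacedShards]:
    """Reconcile placement so both operands end up on matching devices.

    A real implementation would also issue the device-to-device copies; this
    sketch only updates the placement metadata to stay hardware-agnostic.
    """
    if a.devices == b.devices:
        return a, b
    if a.pinned and b.pinned:
        raise ValueError("Both operands are pinned to different devices.")
    if a.pinned:
        # b is free to move: bring it onto a's devices.
        return a, replace(b, devices=a.devices)
    # a is free to move (covers both "b pinned" and "neither pinned").
    return replace(a, devices=b.devices), b

With this rule, the ops.foo example above resolves as expected: t1 (unpinned, on devices [1, 2, 3]) is moved onto t2's devices [5, 6, 7] before the binary op runs, while two tensors pinned to different devices raise an error rather than being silently copied.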

Discussion points

  • Overall thoughts on approach?
  • Both devices and pinned are currently required parameters. Should either, especially pinned, be optional with a default?
  • Exactly how the different unary ops like ops.replicate should handle the extra parameters needs more thought.
  • Should is_deep_equal() consider .devices and .pinned? Enabling causes a few tests to fail, such as testReplicatedLhsShardedParallelDimRhs

TODOs

  • Better names
  • Change transfer_if_needed into a decorator to automatically perform the transfers
  • Add support for all ops
  • Add tests based on sharded tests
  • Change the signatures of several helper functions in ops to accept the new parameters, keeping the current behaviour as the default
  • Test if it works in eager mode, not just AOT

Comment on lines 667 to 674
b = torch.rand(3, 6, dtype=torch.float32)
shard_count = 3
unsharded_result = torch.matmul(a, b)
expected_result = ops.reshard_split(unsharded_result, dim=2, count=shard_count)  # TODO: How to know this should also not be pinned
b_sharded = ops.reshard_split(b, dim=1, count=shard_count)
a_sharded = ops.replicate(a, count=shard_count)
actual_result = ops.matmul(a_sharded, b_sharded)  # GOOD: Should NOT be pinned
assert ops.equal(expected_result, actual_result)
Contributor Author:

Adding .pinned and .devices to is_deep_equal, which ops.equal calls, causes this test to fail.

actual_result.pinned == False, which is correct. But expected_result.pinned would end up being True if any heuristic is used to convert the default None value into a bool: it's a concrete torch.Tensor and would be indistinguishable from one loaded from a file, i.e. a weight.

How should this be handled? Should is_deep_equal be looking at .devices and .pinned? I feel like it should, given its docstring and current behaviour.

Contributor:

Blah, I can see both cases. We may want to add an option to include placement comparisons. ops.equal is such a special case that I can see us wanting to compare both numerics and placement information. For our tests, though, we mostly just want to compare numerics, and the pinning comparison is mostly metadata rather than value. If this is the only case we can always just compare manually.
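
For illustration, one way the option mentioned above could look. This is a hedged sketch, not the existing is_deep_equal signature: the compare_placement flag and the ShardedTensor accessors (.shards, .shard_count, .devices, .pinned) are assumptions, and torch, ShardedTensor, and unbox_tensor are assumed to already be in scope as in sharded_impls.py.

def is_deep_equal(a: ShardedTensor, b: ShardedTensor, *, compare_placement: bool = False) -> bool:
    # Placement metadata is only compared when the caller opts in, so
    # numeric-only comparisons in tests keep passing.
    if compare_placement and (a.devices != b.devices or a.pinned != b.pinned):
        return False
    if a.shard_count != b.shard_count:
        return False
    return all(
        torch.equal(unbox_tensor(sa), unbox_tensor(sb))
        for sa, sb in zip(a.shards, b.shards)
    )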

@@ -33,3 +33,5 @@
# Comment this out to completely disable optimized quantized implementations.
from . import qconv_impls
from . import qlinear_impls

from .sharded_impls import transfer_if_needed # TODO: Hack just to get tests running, figure out properly later
Contributor Author:

TODO

@@ -30,6 +30,100 @@
from .shape import broadcast_dims, broadcast_dim, unbroadcast_dim
from ..utils import longest_equal_range

def copy_w_new_shards_and_devices(tensor: ShardedTensor, new_shards: List[torch.Tensor], new_devices: Tuple[int]) -> ShardedTensor:
    # TODO: What does transform_globals need from this function?
Contributor Author:

TODO

Comment on lines 825 to 839
@replicate.trampoline
def _replicate_trampoline(
    d: SignatureDispatcher, input: AnyTensor, count: int, devices: Tuple[int] | None = None, pinned: bool | None = None
) -> ShardedTensor:
    tensors = (input,)
    if isinstance(input, torch.Tensor):
        devices = devices if devices is not None else tuple(range(count))
        pinned = pinned if pinned is not None else False
    else:
        # TODO: Is this correct? Will use data on `input`.
        assert devices is None
        assert pinned is None

    for override in d.find_overrides(tensors):
        result = override(input, count=count, devices=devices, pinned=pinned)
Contributor Author (@Alex-Vasile, Feb 27, 2025):

How do we handle these helper functions correctly? A plain torch.Tensor can be passed in, so we have no placement information for it.
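
For illustration, how the defaults in the diff above play out at the call site (the values follow from the diff, not from verified behaviour):

w = torch.rand(4, 4)

# Plain torch.Tensor input: the trampoline has to invent placement, so it
# defaults to devices (0, ..., count - 1) and pinned=False unless overridden.
r_default = ops.replicate(w, count=4)
r_pinned = ops.replicate(w, count=4, devices=(4, 5, 6, 7), pinned=True)

# ShardedTensor input: placement comes from the input itself, so the trampoline
# asserts that devices/pinned were not also passed explicitly.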

Comment on lines 191 to 192
# TODO: Tests needed
# 1. Pinned input for unary ops should return a pinned result.
Contributor Author:

TODO


Comment on lines 69 to 70
if hasattr(f, "override"):  # Needed for ops like .gelu_tanh_approximation
    wrapper.override = f.override
Contributor Author:

Should this be in here or applied on the output at the call site?

@@ -18,7 +18,75 @@

from . import _registry
from ..types.tensors import unbox_tensor
from .signatures import *

def import_and_wrap_signatures():
Contributor Author:

This ended up having to go in the __init__ file so that any import of ops would wrap the functions.

One problem with the approach: a subsequent run of from .signatures import * will override the wrapped versions.
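
A rough sketch of the wrapping approach described above; the helper name, the is_sharded predicate, and the transfer callable are assumptions, not the PR's actual code.

import functools
import types
from typing import Callable


def wrap_module_ops(
    module: types.ModuleType,
    namespace: dict,
    is_sharded: Callable,
    transfer: Callable,
) -> None:
    """Re-export every public callable from `module` into `namespace`, wrapped so
    sharded operands are moved onto matching devices before dispatch."""
    public = getattr(module, "__all__", [n for n in dir(module) if not n.startswith("_")])
    for name in public:
        f = getattr(module, name)
        if not callable(f):
            namespace[name] = f
            continue

        @functools.wraps(f)
        def wrapper(*args, __f=f, **kwargs):
            # Reconcile placement across sharded positional args; leave the rest untouched.
            idx = [i for i, a in enumerate(args) if is_sharded(a)]
            if len(idx) > 1:
                moved = transfer(*(args[i] for i in idx))
                args = list(args)
                for i, t in zip(idx, moved):
                    args[i] = t
            return __f(*args, **kwargs)

        if hasattr(f, "override"):  # keep dispatcher hooks such as .override reachable
            wrapper.override = f.override
        namespace[name] = wrapper

This would be invoked from ops/__init__.py as something like wrap_module_ops(signatures, globals(), lambda t: isinstance(t, ShardedTensor), transfer_if_needed); as noted above, a later from .signatures import * would clobber the wrapped names again.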

Contributor:

I am pretty certain we don't want to do this. Only sharded_impls should consider transfer information; this is specific to the sharded implementations. We only want to wrap the imports for the sharded cases.

Contributor:

Also, try to name this import_sharded_signatures or something to that effect.

@@ -31,6 +31,57 @@
from ..utils import longest_equal_range


def transfer_if_needed(*tensors: Tuple[ShardedTensor]) -> List[ShardedTensor]:
Contributor Author:

I would like to put this in __init__ as well, since this is not the only place it's being used. But the top-level type hints for this function would make the imports messy.

Contributor:

We want to keep this in the sharding code, as all of this wrapping behaviour should be sharding-specific.

Comment on lines +336 to +339
post = pre.T
assert all(d_pre == d_post for d_pre, d_post in zip(pre.devices, post.devices))
# TODO: post gets pinned since resulting ShardedTensor is made with torch.Tensor shards which are assumed to always be pinned
# assert post.pinned == pre.pinned
Contributor Author:

This may be an issue.

expected_result = ops.reshard_split(
    unsharded_result, dim=2, count=shard_count
)  # TODO: How to know this should also not be pinned
Contributor:

Remove the TODOs here and below
