Separate mFSDP v2 unit tests#5640
Open
wujingyue wants to merge 4 commits into
Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
/ok to test f3cd82f
Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
/ok to test 86c7085
/ok to test 1e4e6c3
Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
/ok to test 9b8be36
wujingyue commented Jul 4, 2026
Alternatively, I can put these tests under tests/unit_tests/distributed/megatron_fsdp/experimental. I don't think the extra level of nesting is worthwhile just to reuse conftest.py, but I'm happy to do that if you prefer.
wujingyue commented Jul 4, 2026
Comment on lines +58 to +59

    # Pass the device explicitly to suppress PyTorch's NCCL barrier warning.
    dist.barrier(device_ids=[device.index])

This file is copied from distributed/megatron_fsdp; the only changes are these two lines.
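
The two changed lines can be sketched in isolation. The helper below is hypothetical (it is not part of this PR) and only builds keyword arguments, so it runs without PyTorch installed. `device_ids` is an existing parameter of `torch.distributed.barrier`; the fallback to a plain barrier for non-CUDA devices is an assumption made for illustration.

```python
from typing import Any, Dict, Optional


def barrier_kwargs(device_type: str, device_index: Optional[int]) -> Dict[str, Any]:
    """Build keyword arguments for torch.distributed.barrier (hypothetical helper)."""
    # Pass the device index only when a CUDA device with a known index is in use.
    if device_type == "cuda" and device_index is not None:
        return {"device_ids": [device_index]}
    # Assumption: any other device type falls back to a plain barrier() call.
    return {}
```

In a cleanup fixture this could be used as `dist.barrier(**barrier_kwargs(device.type, device.index))`, which for a CUDA device should amount to the same call as the `dist.barrier(device_ids=[device.index])` line in the diff.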
Summary
Move the mFSDP v2 / experimental unit tests from `tests/unit_tests/distributed/megatron_fsdp/` into a separate `tests/unit_tests/distributed/mfsdp_v2/` bucket, with a duplicated local `conftest.py` and a dedicated H100 unit-test recipe entry.

Moved tests:
- `test_cuda_graph.py`
- `test_dbuffer.py`
- `test_fully_shard.py` (renamed from `test_experimental_fully_shard.py` because the folder now carries the v2 context)
- `test_symmetric_memory.py`

This also scopes mFSDP v2 distributed cleanup to the new folder and passes CUDA devices explicitly to PyTorch barriers, which removes the PyTorch NCCL barrier-warning noise from this new bucket.
Why
The biggest motivation is CI signal: the existing v1 bucket is much slower and much noisier than the v2 tests. In the final split run, the v1 bucket took `14m19s` wall time and reported `574 warnings`, while the standalone v2 bucket took `3m28s` wall time and reported `43 warnings`.

The old `tests/unit_tests/distributed/megatron_fsdp/**/*.py - latest` CI bucket mixed two different kinds of coverage:
- v1 tests (`test_mfsdp_fully_shard.py`, `test_mfsdp_uneven_dtensor.py`, `test_mcore_fully_sharded_data_parallel.py`, etc.)
- v2 / experimental tests (`test_dbuffer.py`, `test_experimental_fully_shard.py`, `test_cuda_graph.py`, `test_symmetric_memory.py`)

Keeping v2 coverage in the same bucket makes warning growth and runtime changes harder to attribute. One old combined-bucket sample attributed a `145.20s teardown` to v2 `test_symmetric_memory.py`, but that test had run in the same distributed pytest invocation as the v1 files. The final split-bucket CI result below shows that this attribution was not representative of standalone v2 runtime.

Final split-bucket CI data from run `28675683046`:

| Bucket | Job | Wall time | Pytest summary | Teardown duration |
| --- | --- | --- | --- | --- |
| `tests/unit_tests/distributed/mfsdp_v2/**/*.py - latest` | `85055242029` | `3m28s` | `36 passed, 43 warnings in 12.66s` | `2.86s teardown tests/unit_tests/distributed/mfsdp_v2/test_symmetric_memory.py::test_fully_shard_symmetric_memory_matches_default_and_profiles_nccl[3]` |
| `tests/unit_tests/distributed/megatron_fsdp/**/*.py - latest` | `85055242017` | `14m19s` | `251 passed, 121 skipped, 21 deselected, 6 xfailed, 574 warnings in 656.01s (0:10:56)` | `147.32s teardown tests/unit_tests/distributed/megatron_fsdp/test_mfsdp_uneven_dtensor.py::test_split_dtensor_zero_local_shard` |

Current v2 warning state:
- The CUDA device is passed explicitly to `dist.barrier(...)` in both function-scoped and session-scoped cleanup, removing the repeated PyTorch NCCL barrier-warning noise.
- CI run `28675683046`, job `85055242029`, validated the final cleanup with `36 passed, 43 warnings in 12.66s`.
- The `43 warnings` are `34` import/collection baseline warnings, `8` PyTorch module-backward-hook warnings from FSDP training-style v2 tests, and `1` PyTorch profiler cycle warning from `test_symmetric_memory.py`. No `torch.distributed` barrier warning remains in the final v2 bucket run.

Splitting the bucket does not fix the existing v1 warnings or teardown behavior. It makes ownership and regression signal explicit: v1 warnings/teardown remain in the v1 bucket, while v2 warnings/teardown show up in the new `mfsdp_v2` bucket with separate CI timing.

This PR only separates the test bucket and cleans up the barrier warning introduced by the new v2 fixture boundary. It does not change test logic.
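
As a quick sanity check on the figures quoted in this section, the arithmetic can be reproduced with a short script. This is only a sketch: the helper name `wall_time_seconds` and the dictionary keys are illustrative, and the inputs are the wall times and warning counts stated in this description.

```python
import re


def wall_time_seconds(text: str) -> int:
    """Convert a CI wall time such as '14m19s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized wall time: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + int(match.group(2))


# Wall times quoted for the final split run.
v1_seconds = wall_time_seconds("14m19s")
v2_seconds = wall_time_seconds("3m28s")
speed_ratio = v1_seconds / v2_seconds

# Breakdown of the v2 warnings as listed in this description.
v2_warning_breakdown = {
    "import/collection baseline": 34,
    "module backward hook": 8,
    "profiler cycle": 1,
}
v2_warning_total = sum(v2_warning_breakdown.values())

print(f"v1: {v1_seconds}s, v2: {v2_seconds}s, ratio: {speed_ratio:.2f}x")
print(f"v2 warnings accounted for: {v2_warning_total}")
```

With these inputs the v1 bucket comes out at 859 seconds against 208 seconds for v2, roughly a 4.13x difference in wall time, and the three warning categories sum to the reported 43.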
Validation
- `uv run --no-sync python -m pytest --collect-only -q tests/unit_tests/distributed/mfsdp_v2`
- `python tests/unit_tests/find_test_cases.py 'tests/unit_tests/**/*.py' h100 | rg 'megatron_fsdp|mfsdp_v2'`
- `python tests/unit_tests/find_test_cases.py 'tests/unit_tests/distributed/megatron_fsdp/**/*.py' h100 && python tests/unit_tests/find_test_cases.py 'tests/unit_tests/distributed/mfsdp_v2/**/*.py' h100`
- CI run `28675683046`:
  - `tests/unit_tests/distributed/mfsdp_v2/**/*.py - latest`: success after session-scoped barrier cleanup, with `43 warnings`
  - `tests/unit_tests/distributed/megatron_fsdp/**/*.py - latest`: success
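
To compare bucket results across runs, the pytest summary lines quoted in this PR can be turned into counts. The parser below is a minimal sketch, not something this PR adds; `parse_pytest_summary` is a hypothetical name, and it assumes the summary has the `N label, N label in <time>` shape seen in the CI output quoted here.

```python
import re
from typing import Dict


def parse_pytest_summary(line: str) -> Dict[str, int]:
    """Extract outcome counts from a pytest summary line (hypothetical helper)."""
    # Drop the trailing " in 12.66s" timing so only the counts remain.
    counts_part = line.split(" in ")[0]
    return {
        label: int(number)
        for number, label in re.findall(r"(\d+) ([a-z]+)", counts_part)
    }


v2_summary = parse_pytest_summary("36 passed, 43 warnings in 12.66s")
v1_summary = parse_pytest_summary(
    "251 passed, 121 skipped, 21 deselected, 6 xfailed, "
    "574 warnings in 656.01s (0:10:56)"
)
print(v2_summary)
print(v1_summary)
```

A follow-up check could compare `v2_summary["warnings"]` against the 43-warning baseline to flag warning growth in the new bucket, though whether to gate CI on that is a separate decision.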