Skip to content

Separate mFSDP v2 unit tests#5640

Open
wujingyue wants to merge 4 commits into
NVIDIA:mainfrom
wujingyue:codex/separate-mfsdp-v2-tests
Open

Separate mFSDP v2 unit tests#5640
wujingyue wants to merge 4 commits into
NVIDIA:mainfrom
wujingyue:codex/separate-mfsdp-v2-tests

Conversation

@wujingyue

@wujingyue wujingyue commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Move the mFSDP v2 / experimental unit tests from tests/unit_tests/distributed/megatron_fsdp/ into a separate tests/unit_tests/distributed/mfsdp_v2/ bucket, with a duplicated local conftest.py and a dedicated H100 unit-test recipe entry.

Moved tests:

  • test_cuda_graph.py
  • test_dbuffer.py
  • test_fully_shard.py (renamed from test_experimental_fully_shard.py because the folder now carries the v2 context)
  • test_symmetric_memory.py

This also scopes mFSDP v2 distributed cleanup to the new folder and passes CUDA devices explicitly to PyTorch barriers, which removes the PyTorch NCCL barrier-warning noise from this new bucket.

Why

The biggest motivation is CI signal: the existing v1 bucket is much slower and much noisier than the v2 tests. In the final split run, the v1 bucket took 14m19s wall time and reported 574 warnings, while the standalone v2 bucket took 3m28s wall time and reported 43 warnings.

The old tests/unit_tests/distributed/megatron_fsdp/**/*.py - latest CI bucket mixed two different kinds of coverage:

  • v1 mFSDP tests (test_mfsdp_fully_shard.py, test_mfsdp_uneven_dtensor.py, test_mcore_fully_sharded_data_parallel.py, etc.)
  • v2 / experimental FSDP tests (test_dbuffer.py, test_experimental_fully_shard.py, test_cuda_graph.py, test_symmetric_memory.py)

Keeping v2 coverage in the same bucket makes warning growth and runtime changes harder to attribute. One old combined-bucket sample attributed a 145.20s teardown to v2 test_symmetric_memory.py, but that test had run in the same distributed pytest invocation as the v1 files. The final split-bucket CI result below shows that attribution was not representative of standalone v2 runtime.

Final split-bucket CI data from run 28675683046:

Bucket Job Wall time Main pytest result Slowest duration
tests/unit_tests/distributed/mfsdp_v2/**/*.py - latest 85055242029 3m28s 36 passed, 43 warnings in 12.66s 2.86s teardown tests/unit_tests/distributed/mfsdp_v2/test_symmetric_memory.py::test_fully_shard_symmetric_memory_matches_default_and_profiles_nccl[3]
tests/unit_tests/distributed/megatron_fsdp/**/*.py - latest 85055242017 14m19s 251 passed, 121 skipped, 21 deselected, 6 xfailed, 574 warnings in 656.01s (0:10:56) 147.32s teardown tests/unit_tests/distributed/megatron_fsdp/test_mfsdp_uneven_dtensor.py::test_split_dtensor_zero_local_shard

Current v2 warning state:

  • The v2 bucket now passes CUDA devices explicitly to dist.barrier(...) in both function-scoped and session-scoped cleanup, removing the repeated PyTorch NCCL barrier-warning noise.
  • CI run 28675683046, job 85055242029, validated the final cleanup with 36 passed, 43 warnings in 12.66s.
  • The current 43 warnings are 34 import/collection baseline warnings, 8 PyTorch module-backward-hook warnings from FSDP training-style v2 tests, and 1 PyTorch profiler cycle warning from test_symmetric_memory.py. No torch.distributed barrier warning remains in the final v2 bucket run.

Splitting the bucket does not fix the existing v1 warnings or teardown behavior. It makes ownership and regression signal explicit: v1 warnings/teardown remain in the v1 bucket, while v2 warnings/teardown show up in the new mfsdp_v2 bucket with separate CI timing.

This PR only separates the test bucket and cleans the barrier warning introduced by the new v2 fixture boundary. It does not change test logic.

Validation

  • uv run --no-sync python -m pytest --collect-only -q tests/unit_tests/distributed/mfsdp_v2
    • collected 36 tests
  • python tests/unit_tests/find_test_cases.py 'tests/unit_tests/**/*.py' h100 | rg 'megatron_fsdp|mfsdp_v2'
    • confirmed the catch-all H100 unit-test bucket ignores both the v1 and v2 FSDP buckets
  • python tests/unit_tests/find_test_cases.py 'tests/unit_tests/distributed/megatron_fsdp/**/*.py' h100 && python tests/unit_tests/find_test_cases.py 'tests/unit_tests/distributed/mfsdp_v2/**/*.py' h100
    • confirmed neither dedicated bucket ignores files from itself
  • GitHub Actions run 28675683046
    • tests/unit_tests/distributed/mfsdp_v2/**/*.py - latest: success after session-scoped barrier cleanup, with 43 warnings
    • tests/unit_tests/distributed/megatron_fsdp/**/*.py - latest: success

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jul 3, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@wujingyue

Copy link
Copy Markdown
Contributor Author

/ok to test f3cd82f

wujingyue added 2 commits July 3, 2026 16:13
Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
@wujingyue

Copy link
Copy Markdown
Contributor Author

/ok to test 86c7085

@wujingyue

Copy link
Copy Markdown
Contributor Author

/ok to test 1e4e6c3

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
@wujingyue

Copy link
Copy Markdown
Contributor Author

/ok to test 9b8be36

@wujingyue wujingyue changed the title [codex] Separate mFSDP v2 unit tests Separate mFSDP v2 unit tests Jul 4, 2026
@wujingyue wujingyue marked this pull request as ready for review July 4, 2026 06:41
@wujingyue wujingyue requested a review from a team as a code owner July 4, 2026 06:41
@wujingyue wujingyue requested a review from a team July 4, 2026 06:42

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, I can put these tests under tests/unit_tests/distributed/megatron_fsdp/experimental. I don't think the extra level of nesting is worthwhile just to reuse conftest.py, but I'm happy to do that if you prefer.

Comment on lines +58 to +59
# Pass the device explicitly to suppress PyTorch's NCCL barrier warning.
dist.barrier(device_ids=[device.index])

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is copied from distributed/megatron_fsdp; the only changes are these two lines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants