Separate mFSDP v2 unit tests by wujingyue · Pull Request #5640 · NVIDIA/Megatron-LM

wujingyue · 2026-07-03T06:14:18Z

Summary

Move the mFSDP v2 / experimental unit tests from tests/unit_tests/distributed/megatron_fsdp/ into a separate tests/unit_tests/distributed/mfsdp_v2/ bucket, with a duplicated local conftest.py and a dedicated H100 unit-test recipe entry.

Moved tests:

test_cuda_graph.py
test_dbuffer.py
test_fully_shard.py (renamed from test_experimental_fully_shard.py because the folder now carries the v2 context)
test_symmetric_memory.py

This also scopes mFSDP v2 distributed cleanup to the new folder and passes CUDA devices explicitly to PyTorch barriers, which removes the PyTorch NCCL barrier-warning noise from this new bucket.

Why

The biggest motivation is CI signal: the existing v1 bucket is much slower and much noisier than the v2 tests. In the final split run, the v1 bucket took 14m19s wall time and reported 574 warnings, while the standalone v2 bucket took 3m28s wall time and reported 43 warnings.

The old tests/unit_tests/distributed/megatron_fsdp/**/*.py - latest CI bucket mixed two different kinds of coverage:

v1 mFSDP tests (test_mfsdp_fully_shard.py, test_mfsdp_uneven_dtensor.py, test_mcore_fully_sharded_data_parallel.py, etc.)
v2 / experimental FSDP tests (test_dbuffer.py, test_experimental_fully_shard.py, test_cuda_graph.py, test_symmetric_memory.py)

Keeping v2 coverage in the same bucket makes warning growth and runtime changes harder to attribute. One old combined-bucket sample attributed a 145.20s teardown to v2 test_symmetric_memory.py, but that test had run in the same distributed pytest invocation as the v1 files. The final split-bucket CI result below shows that attribution was not representative of standalone v2 runtime.

Final split-bucket CI data from run 28675683046:

Bucket	Job	Wall time	Main pytest result	Slowest duration
`tests/unit_tests/distributed/mfsdp_v2/*/.py - latest`	`85055242029`	`3m28s`	`36 passed, 43 warnings in 12.66s`	`2.86s teardown tests/unit_tests/distributed/mfsdp_v2/test_symmetric_memory.py::test_fully_shard_symmetric_memory_matches_default_and_profiles_nccl[3]`
`tests/unit_tests/distributed/megatron_fsdp/*/.py - latest`	`85055242017`	`14m19s`	`251 passed, 121 skipped, 21 deselected, 6 xfailed, 574 warnings in 656.01s (0:10:56)`	`147.32s teardown tests/unit_tests/distributed/megatron_fsdp/test_mfsdp_uneven_dtensor.py::test_split_dtensor_zero_local_shard`

Current v2 warning state:

The v2 bucket now passes CUDA devices explicitly to dist.barrier(...) in both function-scoped and session-scoped cleanup, removing the repeated PyTorch NCCL barrier-warning noise.
CI run 28675683046, job 85055242029, validated the final cleanup with 36 passed, 43 warnings in 12.66s.
The current 43 warnings are 34 import/collection baseline warnings, 8 PyTorch module-backward-hook warnings from FSDP training-style v2 tests, and 1 PyTorch profiler cycle warning from test_symmetric_memory.py. No torch.distributed barrier warning remains in the final v2 bucket run.

Splitting the bucket does not fix the existing v1 warnings or teardown behavior. It makes ownership and regression signal explicit: v1 warnings/teardown remain in the v1 bucket, while v2 warnings/teardown show up in the new mfsdp_v2 bucket with separate CI timing.

This PR only separates the test bucket and cleans the barrier warning introduced by the new v2 fixture boundary. It does not change test logic.

Validation

uv run --no-sync python -m pytest --collect-only -q tests/unit_tests/distributed/mfsdp_v2
- collected 36 tests
python tests/unit_tests/find_test_cases.py 'tests/unit_tests/**/*.py' h100 | rg 'megatron_fsdp|mfsdp_v2'
- confirmed the catch-all H100 unit-test bucket ignores both the v1 and v2 FSDP buckets
python tests/unit_tests/find_test_cases.py 'tests/unit_tests/distributed/megatron_fsdp/**/*.py' h100 && python tests/unit_tests/find_test_cases.py 'tests/unit_tests/distributed/mfsdp_v2/**/*.py' h100
- confirmed neither dedicated bucket ignores files from itself
GitHub Actions run 28675683046
- tests/unit_tests/distributed/mfsdp_v2/**/*.py - latest: success after session-scoped barrier cleanup, with 43 warnings
- tests/unit_tests/distributed/megatron_fsdp/**/*.py - latest: success

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>

copy-pr-bot · 2026-07-03T06:14:22Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

wujingyue · 2026-07-03T06:14:32Z

/ok to test f3cd82f

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>

wujingyue · 2026-07-03T16:15:09Z

/ok to test 86c7085

wujingyue · 2026-07-03T16:15:42Z

/ok to test 1e4e6c3

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>

wujingyue · 2026-07-03T17:41:47Z

/ok to test 9b8be36

wujingyue · 2026-07-04T16:02:28Z

Alternatively, I can put these tests under tests/unit_tests/distributed/megatron_fsdp/experimental. I don't think the extra level of nesting is worthwhile just to reuse conftest.py, but I'm happy to do that if you prefer.

wujingyue · 2026-07-04T16:03:46Z

+            # Pass the device explicitly to suppress PyTorch's NCCL barrier warning.
+            dist.barrier(device_ids=[device.index])


This file is copied from distributed/megatron_fsdp; the only changes are these two lines.

Separate mFSDP v2 unit tests

f3cd82f

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>

wujingyue added the Run tests label Jul 3, 2026

copy-pr-bot Bot temporarily deployed to public July 3, 2026 06:15 Inactive

copy-pr-bot Bot temporarily deployed to test July 3, 2026 06:15 Inactive

copy-pr-bot Bot temporarily deployed to public July 3, 2026 06:18 Inactive

copy-pr-bot Bot temporarily deployed to public July 3, 2026 06:27 Inactive

wujingyue added 2 commits July 3, 2026 16:13

Specify mFSDP v2 barrier device

86c7085

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>

Document mFSDP v2 barrier device selection

1e4e6c3

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>

copy-pr-bot Bot temporarily deployed to public July 3, 2026 16:16 Inactive

copy-pr-bot Bot temporarily deployed to test July 3, 2026 16:16 Inactive

copy-pr-bot Bot temporarily deployed to public July 3, 2026 16:19 Inactive

copy-pr-bot Bot temporarily deployed to public July 3, 2026 16:20 Inactive

copy-pr-bot Bot temporarily deployed to public July 3, 2026 16:28 Inactive

Silence mFSDP v2 cleanup barrier warning

9b8be36

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>

copy-pr-bot Bot temporarily deployed to public July 3, 2026 17:42 Inactive

copy-pr-bot Bot temporarily deployed to test July 3, 2026 17:43 Inactive

copy-pr-bot Bot temporarily deployed to public July 3, 2026 17:45 Inactive

copy-pr-bot Bot temporarily deployed to public July 3, 2026 17:54 Inactive

wujingyue changed the title ~~[codex] Separate mFSDP v2 unit tests~~ Separate mFSDP v2 unit tests Jul 4, 2026

wujingyue marked this pull request as ready for review July 4, 2026 06:41

wujingyue requested a review from a team as a code owner July 4, 2026 06:41

wujingyue requested a review from a team July 4, 2026 06:42

svcnvidia-nemo-ci added the complexity: low label Jul 4, 2026

wujingyue commented Jul 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Separate mFSDP v2 unit tests#5640

Separate mFSDP v2 unit tests#5640
wujingyue wants to merge 4 commits into
NVIDIA:mainfrom
wujingyue:codex/separate-mfsdp-v2-tests

wujingyue commented Jul 3, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jul 3, 2026

Uh oh!

wujingyue commented Jul 3, 2026

Uh oh!

wujingyue commented Jul 3, 2026

Uh oh!

wujingyue commented Jul 3, 2026

Uh oh!

wujingyue commented Jul 3, 2026

Uh oh!

wujingyue Jul 4, 2026

Uh oh!

wujingyue Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		# Pass the device explicitly to suppress PyTorch's NCCL barrier warning.
		dist.barrier(device_ids=[device.index])

Uh oh!

Conversation

wujingyue commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Validation

Uh oh!

copy-pr-bot Bot commented Jul 3, 2026

Uh oh!

wujingyue commented Jul 3, 2026

Uh oh!

wujingyue commented Jul 3, 2026

Uh oh!

wujingyue commented Jul 3, 2026

Uh oh!

wujingyue commented Jul 3, 2026

Uh oh!

wujingyue Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

wujingyue Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wujingyue commented Jul 3, 2026 •

edited

Loading