[Dev] Fix FSDP backward hooks for TE-fused experts by lhb8125 · Pull Request #5636 · NVIDIA/Megatron-LM

lhb8125 · 2026-07-03T01:45:46Z

Summary

forward the original GroupedLinear post-forward hooks to the TE op-fuser output
preserve Megatron-FSDP pre-backward parameter all-gathers when fused expert execution bypasses the original submodule forward() calls
cover both regular and with_kwargs=True post-forward hooks in the grouped-MLP unit tests
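The hook-forwarding idea in the summary can be illustrated with a minimal sketch in plain PyTorch. This is not the code from this PR: `run_fused_with_forwarded_post_hooks` is a hypothetical helper, `nn.Linear` stands in for GroupedLinear, and the lambda stands in for the TE op-fuser. The sketch also assumes the private PyTorch attributes `_forward_hooks` and `_forward_hooks_with_kwargs`, which exist in PyTorch 2.x but are not a public API.

```python
import torch
import torch.nn as nn


def run_fused_with_forwarded_post_hooks(module, fused_fn, *args, **kwargs):
    """Hypothetical helper: run `fused_fn` in place of `module.forward`, then
    replay the module's registered post-forward hooks on the fused output."""
    output = fused_fn(*args, **kwargs)
    # Assumption: these private dicts hold the registered hooks (PyTorch 2.x).
    for hook_id, hook in list(module._forward_hooks.items()):
        if hook_id in module._forward_hooks_with_kwargs:
            # Hook registered with with_kwargs=True: (module, args, kwargs, output)
            result = hook(module, args, kwargs, output)
        else:
            # Regular hook: (module, args, output)
            result = hook(module, args, output)
        if result is not None:
            # As in nn.Module.__call__, a non-None return replaces the output.
            output = result
    return output


linear = nn.Linear(4, 4)
calls = []


def plain_hook(mod, inputs, output):
    calls.append("plain")


def kwargs_hook(mod, inputs, kwargs, output):
    calls.append("with_kwargs")


linear.register_forward_hook(plain_hook)
linear.register_forward_hook(kwargs_hook, with_kwargs=True)

x = torch.randn(2, 4)
# Stand-in for a fused kernel that never goes through linear.__call__().
fused = lambda inp: torch.nn.functional.linear(inp, linear.weight, linear.bias)
y = run_fused_with_forwarded_post_hooks(linear, fused, x)
```

Both hook flavors fire in registration order even though `linear.forward()` was never invoked, which is the behavior the added grouped-MLP tests are described as covering.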

Root cause

The TE op-fuser path already forwards the original GroupedLinear pre-forward hooks, but it bypasses their post-forward hooks. Megatron-FSDP uses those post-forward hooks to attach pre-backward all-gathers. With optim_grads_params, expert parameters released after forward could therefore remain unavailable when deferred grouped-wgrad ran, resulting in an illegal CUDA memory access.

The non-op-fuser path continues to invoke GroupedLinear.forward() normally and is unaffected.
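The failure mode described above can be reproduced in miniature. The following is a toy sketch under stated assumptions, not Megatron-FSDP's implementation: the `gathered` flag is a made-up stand-in for parameter availability, `nn.Linear` stands in for GroupedLinear, and the "all-gather" is just a flag flip inside a tensor hook registered on the forward output.

```python
import torch
import torch.nn as nn

events = []


def post_forward_hook(module, inputs, output):
    # Toy behavior: release the parameters after forward and arrange for them
    # to be re-gathered right before backward reaches this module's output.
    module.gathered = False
    events.append("release")

    def pre_backward(grad):
        module.gathered = True
        events.append("all_gather")
        return grad

    output.register_hook(pre_backward)


layer = nn.Linear(4, 4)
layer.gathered = True  # hypothetical flag, not a real FSDP attribute
layer.register_forward_hook(post_forward_hook)

x = torch.randn(2, 4, requires_grad=True)

# Normal path: layer.__call__ fires the post-forward hook, so the
# pre-backward "all-gather" runs before the layer's gradients are computed.
layer(x).sum().backward()
normal_events = list(events)

# Bypassed path: the computation skips layer.__call__, so no post-forward
# hook fires and no pre-backward hook is ever attached to the output.
events.clear()
layer.gathered = False  # simulate a release performed elsewhere
out = torch.nn.functional.linear(x, layer.weight, layer.bias)
out.sum().backward()
bypass_events = list(events)
bypass_gathered = layer.gathered
```

In the normal path the events are a release followed by an all-gather. In the bypassed path nothing is recorded and `gathered` stays `False` through backward, which mirrors how released expert parameters could still be unavailable when the deferred weight-gradient computation ran.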

Test plan

CHECK_ONLY=true BASE_REF=dev bash tools/autoformat.sh (black, isort, pylint, and ruff passed)
python tools/check_copyright.py megatron/core/transformer/moe/experts.py tests/unit_tests/transformer/moe/test_grouped_mlp.py
Qwen3-235B, 94 layers, TP1/PP1/EP64, 128 GB200 GPUs, MBS=1, MXFP8, Megatron-FSDP optim_grads_params, paged stash, TE op fuser, and full-iteration CUDA graph: job 20260702-082849-85c1 completed 9 stable iterations with finite loss/grad norm, median 861.45 TFLOP/s/GPU, and 92,006 MB peak allocated memory

The CI-faithful target unit-test launch did not reach pytest because the Lyris login node could not import mcore-ci-dev:latest without registry/Enroot credentials; upstream CI should execute the added test.

Signed-off-by: hongbinl <hongbinl@nvidia.com>

copy-pr-bot · 2026-07-03T01:45:49Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


fix: restore FSDP backward hooks for fused experts

b98b528

Signed-off-by: hongbinl <hongbinl@nvidia.com>

lhb8125 requested review from a team as code owners July 3, 2026 01:45

lhb8125 requested a review from shjwudp July 3, 2026 03:02

lhb8125 mentioned this pull request Jul 3, 2026

Fix FSDP hooks bypassed by TE fused experts shjwudp/Megatron-LM#25

Draft

shjwudp approved these changes Jul 3, 2026


lhb8125 requested a review from yaox12 July 3, 2026 09:09

lhb8125 wants to merge 1 commit into NVIDIA:dev from lhb8125:denliu/fix-te-op-fuser-fsdp-hooks
