
[Dev] Fix FSDP backward hooks for TE-fused experts #5636

Open
lhb8125 wants to merge 1 commit into NVIDIA:dev from lhb8125:denliu/fix-te-op-fuser-fsdp-hooks

lhb8125 commented Jul 3, 2026

Summary

  • forward the original GroupedLinear post-forward hooks to the TE op-fuser output
  • preserve Megatron-FSDP pre-backward parameter all-gathers when fused expert execution bypasses the original submodule forward() calls
  • cover both regular and with_kwargs=True post-forward hooks in the grouped-MLP unit tests
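The hook forwarding described above can be sketched in plain Python. This is a minimal illustration, not the code in this PR. HookedModule and run_post_forward_hooks are hypothetical names. The dispatch rules are assumed to mirror PyTorch's nn.Module forward hooks: a hook registered with with_kwargs=True receives (module, args, kwargs, output), a regular hook receives (module, args, output), and a non-None return value replaces the output.

```python
from typing import Any, Callable, Dict, Tuple


class HookedModule:
    """Hypothetical stand-in for a module with PyTorch-style post-forward hooks."""

    def __init__(self) -> None:
        self._forward_hooks: Dict[int, Callable] = {}  # insertion-ordered
        self._forward_hooks_with_kwargs: Dict[int, bool] = {}
        self._next_id = 0

    def register_forward_hook(self, hook: Callable, *, with_kwargs: bool = False) -> int:
        hook_id = self._next_id
        self._next_id += 1
        self._forward_hooks[hook_id] = hook
        if with_kwargs:
            self._forward_hooks_with_kwargs[hook_id] = True
        return hook_id


def run_post_forward_hooks(
    module: HookedModule, args: Tuple, kwargs: Dict[str, Any], output: Any
) -> Any:
    """Apply `module`'s post-forward hooks to an output computed elsewhere
    (e.g. by a fused op that never called module.forward())."""
    for hook_id, hook in module._forward_hooks.items():
        if hook_id in module._forward_hooks_with_kwargs:
            result = hook(module, args, kwargs, output)
        else:
            result = hook(module, args, output)
        if result is not None:  # a hook may replace the output
            output = result
    return output


if __name__ == "__main__":
    calls = []
    fc1 = HookedModule()
    fc1.register_forward_hook(lambda mod, args, out: calls.append("regular"))
    fc1.register_forward_hook(
        lambda mod, args, kwargs, out: calls.append("with_kwargs"), with_kwargs=True
    )
    fused_output = 3.0  # stand-in for the op-fuser's result
    result = run_post_forward_hooks(fc1, (fused_output,), {}, fused_output)
    print(calls, result)  # ['regular', 'with_kwargs'] 3.0
```

Calling a helper like this for each original GroupedLinear with the op-fuser output lets hooks registered on the original submodule, such as those Megatron-FSDP relies on, run as they would inside GroupedLinear.forward(). Covering both hook signatures matches the two cases exercised in the unit tests.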

Root cause

The TE op-fuser path already forwards the original GroupedLinear pre-forward hooks, but it bypasses their post-forward hooks. Megatron-FSDP uses those post-forward hooks to attach pre-backward parameter all-gathers, so on the op-fuser path those all-gathers were never scheduled. With optim_grads_params, expert parameters released after the forward pass could therefore still be unavailable when the deferred grouped wgrad computation ran, resulting in an illegal CUDA memory access.

The non-op-fuser path continues to invoke GroupedLinear.forward() normally and is unaffected.
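The failure mode in the root cause can be reproduced with a toy model. This is a simplified sketch under stated assumptions, not Megatron-FSDP's implementation. ShardedParam, ToyExpert, and run_step are hypothetical. The sketch assumes the post-forward hook is the only place the pre-backward all-gather gets scheduled, and it models the post-forward parameter release as a separate step. run_step(forward_post_hooks=False) corresponds to the op-fuser path before this fix. run_step(forward_post_hooks=True) corresponds both to the fixed path and to the non-op-fuser path, where GroupedLinear.forward() fires the hooks itself.

```python
from typing import Callable, List


class ShardedParam:
    """Hypothetical parameter that is either fully gathered or released."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.gathered = False

    def all_gather(self) -> None:
        self.gathered = True

    def release(self) -> None:
        self.gathered = False


class ToyExpert:
    """Hypothetical expert whose post-forward hook is (by assumption) the only
    place the pre-backward all-gather is registered."""

    def __init__(self, param: ShardedParam, pre_backward: List[Callable[[], None]]) -> None:
        self.param = param
        self.pre_backward = pre_backward

    def pre_forward_hook(self) -> None:
        self.param.all_gather()

    def post_forward_hook(self) -> None:
        # Schedule the all-gather that must run before this expert's backward.
        self.pre_backward.append(self.param.all_gather)


def run_step(forward_post_hooks: bool) -> bool:
    """Return True if the parameter is gathered when the deferred wgrad runs."""
    pre_backward: List[Callable[[], None]] = []
    expert = ToyExpert(ShardedParam("fc1.weight"), pre_backward)
    expert.pre_forward_hook()  # forwarded on every path
    # ... fused grouped-MLP compute would happen here ...
    if forward_post_hooks:
        expert.post_forward_hook()  # what this fix restores on the op-fuser path
    expert.param.release()  # assumption: parameters are freed after forward
    for callback in pre_backward:  # backward begins
        callback()
    return expert.param.gathered  # the deferred wgrad reads the weight here


if __name__ == "__main__":
    print(run_step(forward_post_hooks=False), run_step(forward_post_hooks=True))  # False True
```

In the toy model the bypassed hook leaves the parameter released at backward time; on a GPU the equivalent is the deferred wgrad reading freed memory.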

Test plan

  • CHECK_ONLY=true BASE_REF=dev bash tools/autoformat.sh (black, isort, pylint, and ruff passed)
  • python tools/check_copyright.py megatron/core/transformer/moe/experts.py tests/unit_tests/transformer/moe/test_grouped_mlp.py
  • Qwen3-235B, 94 layers, TP1/PP1/EP64, 128 GB200 GPUs, MBS=1, MXFP8, Megatron-FSDP optim_grads_params, paged stash, TE op fuser, and full-iteration CUDA graph: job 20260702-082849-85c1 completed 9 stable iterations with finite loss/grad norm, median 861.45 TFLOP/s/GPU, and 92,006 MB peak allocated memory

The CI-faithful launch of the targeted unit test did not reach pytest because the Lyris login node could not import the mcore-ci-dev:latest image without registry/Enroot credentials; upstream CI should run the added test.

Signed-off-by: hongbinl <hongbinl@nvidia.com>
lhb8125 requested review from a team as code owners July 3, 2026 01:45

copy-pr-bot (bot) commented Jul 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

lhb8125 requested a review from yaox12 July 3, 2026 09:09