Fix FSDP hooks bypassed by TE fused experts by lhb8125 · Pull Request #25 · shjwudp/Megatron-LM

lhb8125 · 2026-07-02T15:53:59Z

Summary

Forward GroupedLinear pre-forward hooks through the TE fused expert implementation while preserving PyTorch with_kwargs signatures.
Forward submodule post-forward hooks to the fused MLP output so Megatron-FSDP can attach its pre-backward parameter all-gather.
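The two changes above can be illustrated with a toy model of hook dispatch. This is a minimal sketch, not the PR's code: `HookedModule`, `run_pre_hooks`, `run_post_hooks`, and `fused_forward` are hypothetical names, and the doubling "kernel" is a stand-in for the fused expert computation. The sketch assumes the documented PyTorch convention that a pre-forward hook registered with `with_kwargs=True` is called as `hook(module, args, kwargs)` and may return an `(args, kwargs)` tuple, while a plain pre-forward hook is called as `hook(module, args)`.

```python
class HookedModule:
    """Hypothetical stand-in for a submodule that carries forward hooks."""

    def __init__(self, name):
        self.name = name
        self.pre_hooks = []   # list of (hook, with_kwargs) pairs
        self.post_hooks = []  # list of hook(module, args, output)

    def register_pre_hook(self, hook, with_kwargs=False):
        self.pre_hooks.append((hook, with_kwargs))

    def register_post_hook(self, hook):
        self.post_hooks.append(hook)


def run_pre_hooks(module, args, kwargs):
    """Call each pre-hook with the signature it was registered for."""
    for hook, with_kwargs in module.pre_hooks:
        if with_kwargs:
            result = hook(module, args, kwargs)
            if result is not None:
                args, kwargs = result
        else:
            result = hook(module, args)
            if result is not None:
                args = result if isinstance(result, tuple) else (result,)
    return args, kwargs


def run_post_hooks(module, args, output):
    """Call each post-hook on the output; a non-None return replaces it."""
    for hook in module.post_hooks:
        result = hook(module, args, output)
        if result is not None:
            output = result
    return output


def fused_forward(submodules, x):
    """Toy fused path: the submodules are never called directly, so their
    hooks are dispatched by hand around the fused computation."""
    args, kwargs = (x,), {}
    for module in submodules:
        args, kwargs = run_pre_hooks(module, args, kwargs)
    output = args[0] * 2  # stand-in for the fused kernel
    for module in submodules:
        output = run_post_hooks(module, args, output)
    return output


events = []
fc1, fc2 = HookedModule("fc1"), HookedModule("fc2")
fc1.register_pre_hook(
    lambda m, a, kw: events.append(("pre", m.name, "with_kwargs")),
    with_kwargs=True,
)
fc2.register_pre_hook(lambda m, a: events.append(("pre", m.name, "args_only")))
fc2.register_post_hook(lambda m, a, out: events.append(("post", m.name, out)))
result = fused_forward([fc1, fc2], 3)
```

In this toy, a fused path that skipped the two dispatch loops would leave `events` empty. That is the failure mode the PR title describes: a hook-based wrapper such as Megatron-FSDP would never see the submodules run.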

This is the mfsdp_refactor backport of the core post-forward fix in NVIDIA#5636, plus the pre-forward hook-signature correction needed by with_kwargs=True fine-grained hooks on this branch. The PR contains production code only.

The rewrite removes the previous HybridEP static-budget fallback, token/probability/gradient row trimming, padding metadata/cache/zeroing, CUDA-graph handle retention, and related documentation/tests. The target paged-stash recipes already set moe_expert_rank_capacity_factor: 1.2; the automatic fallback was inactive there and lacked a complete overflow contract.

Validation

CHECK_ONLY=true BASE_REF=mfsdp_refactor bash tools/autoformat.sh: Black, isort, Pylint 10.00/10, and Ruff passed.
Copyright, Python 3.12 py_compile, and git diff --check passed.
Same-node 64-GPU GB200 A/B, TP1/PP1/EP64, MBS2/GBS512, seq 4096, paged stash, explicit capacity 1.2, activation offload, full-iteration CUDA graph; metrics skip 8 warmup iterations.

| Path | Control job / median TFLOP/s/GPU | Minimal job / median TFLOP/s/GPU | Delta | Peak allocated delta | Reserved/device-used delta |
| --- | --- | --- | --- | --- | --- |
| FSDP v2 | `20260702-202730-fdfa` / 1033.50 | `20260702-203255-400a` / 1033.15 | -0.034% | -988 MB | -60/-60 MB |
| FSDP v1 on v2 code | `20260702-203814-177d` / 1050.05 | `20260702-204330-f5ca` / 1045.70 | -0.41% | -989 MB | -4280/-4280 MB |

All four jobs completed 16/16 iterations with finite loss/grad norm, zero skipped iterations, and zero NaN iterations. Final control/candidate losses differed by only 1e-5 on each path.
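As a quick check on the Delta column, the sketch below recomputes the throughput change from the reported medians. It assumes the delta is defined as (minimal - control) / control, which the PR does not state explicitly but which reproduces the reported figures; the function name `relative_delta_percent` is illustrative only.

```python
def relative_delta_percent(control, candidate):
    """Relative change of candidate versus control, in percent."""
    return (candidate - control) / control * 100.0


# Median TFLOP/s/GPU as reported in the validation table: (control, minimal).
runs = {
    "FSDP v2": (1033.50, 1033.15),
    "FSDP v1 on v2 code": (1050.05, 1045.70),
}

deltas = {
    path: relative_delta_percent(control, minimal)
    for path, (control, minimal) in runs.items()
}

for path, delta in deltas.items():
    print(f"{path}: {delta:+.3f}%")
```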

The PR remains draft while the reduced scope is reviewed.

lhb8125 mentioned this pull request Jul 2, 2026

[Integration reference] [FSDP] Stabilize full-iteration CUDA graph training #23

Closed

lhb8125 marked this pull request as ready for review July 3, 2026 01:57

lhb8125 marked this pull request as draft July 3, 2026 03:10

lhb8125 force-pushed the codex/mfsdp-v2-split-2-hybridep branch from 828d588 to dcf38bd Compare July 3, 2026 03:49

lhb8125 changed the title ~~Stabilize HybridEP token shapes for full CUDA graphs~~ Fix FSDP hooks bypassed by TE fused experts Jul 3, 2026

fix(moe): forward fused expert module hooks

8063e74

lhb8125 force-pushed the codex/mfsdp-v2-split-2-hybridep branch from dcf38bd to 8063e74 Compare July 3, 2026 05:55

lhb8125 wants to merge 1 commit into shjwudp:mfsdp_refactor from lhb8125:codex/mfsdp-v2-split-2-hybridep
