Fix FSDP hooks bypassed by TE fused experts #25

Draft
lhb8125 wants to merge 1 commit into shjwudp:mfsdp_refactor from lhb8125:codex/mfsdp-v2-split-2-hybridep

@lhb8125 lhb8125 commented Jul 2, 2026

Summary

  • Forward GroupedLinear pre-forward hooks through the TE fused expert implementation while preserving PyTorch with_kwargs signatures.
  • Forward submodule post-forward hooks to the fused MLP output so Megatron-FSDP can attach its pre-backward parameter all-gather.
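The hook forwarding described in the two bullets above can be illustrated with a dependency-free sketch. It mimics the PyTorch convention in which a pre-forward hook registered with with_kwargs=True is called as hook(module, args, kwargs) and may return an (args, kwargs) pair, while a legacy hook is called as hook(module, args). Everything here (ToyModule, run_pre_hooks, run_post_hooks, fused_experts_forward) is a hypothetical stand-in for illustration, not the code in this PR or the Transformer Engine API.

```python
from typing import Any, Callable, Dict, List, Tuple


class ToyModule:
    """Hypothetical stand-in for a submodule such as a GroupedLinear."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.pre_hooks: List[Tuple[Callable, bool]] = []  # (hook, with_kwargs)
        self.post_hooks: List[Callable] = []

    def register_forward_pre_hook(self, hook: Callable, with_kwargs: bool = False) -> None:
        self.pre_hooks.append((hook, with_kwargs))

    def register_forward_hook(self, hook: Callable) -> None:
        self.post_hooks.append(hook)


def run_pre_hooks(module: ToyModule, args: Tuple[Any, ...], kwargs: Dict[str, Any]):
    """Call each pre-forward hook with the signature it was registered for."""
    for hook, with_kwargs in module.pre_hooks:
        if with_kwargs:
            result = hook(module, args, kwargs)
            if result is not None:
                args, kwargs = result  # hook may replace both args and kwargs
        else:
            result = hook(module, args)
            if result is not None:
                args = result if isinstance(result, tuple) else (result,)
    return args, kwargs


def run_post_hooks(module: ToyModule, args: Tuple[Any, ...], output: Any) -> Any:
    """Call each post-forward hook on the given output; a hook may replace it."""
    for hook in module.post_hooks:
        result = hook(module, args, output)
        if result is not None:
            output = result
    return output


def fused_experts_forward(fc1: ToyModule, fc2: ToyModule, x: List[float]) -> List[float]:
    """Toy fused path: fc1 and fc2 are never called as modules, so their
    hooks would be skipped unless they are replayed by hand."""
    args, kwargs = run_pre_hooks(fc1, (x,), {})
    # Assumption: the fused kernel exposes no separate fc2 input, so this toy
    # passes the same args to fc2's pre-forward hooks.
    args, kwargs = run_pre_hooks(fc2, args, kwargs)
    out = [2.0 * v for v in args[0]]  # placeholder for the fused kernel
    # Submodule post-forward hooks see the fused output.
    out = run_post_hooks(fc1, args, out)
    out = run_post_hooks(fc2, args, out)
    return out


events: List[str] = []
fc1, fc2 = ToyModule("fc1"), ToyModule("fc2")


def unshard(module, args, kwargs):
    events.append(f"pre:{module.name}")  # returns None: inputs unchanged


def attach_pre_backward(module, args, output):
    events.append(f"post:{module.name}")  # returns None: output unchanged


fc1.register_forward_pre_hook(unshard, with_kwargs=True)
fc2.register_forward_pre_hook(unshard, with_kwargs=True)
fc2.register_forward_hook(attach_pre_backward)
result = fused_experts_forward(fc1, fc2, [1.0, 2.0])
```

In this toy, dropping the with_kwargs flag when replaying a hook would call a three-argument hook with two arguments and raise a TypeError, which is the kind of signature mismatch the pre-forward correction is meant to avoid.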

This is the mfsdp_refactor backport of the core post-forward fix in NVIDIA#5636, plus the pre-forward hook-signature correction needed by with_kwargs=True fine-grained hooks on this branch. The PR contains production code only.

The rewrite removes the previous HybridEP static-budget fallback, token/probability/gradient row trimming, padding metadata/cache/zeroing, CUDA-graph handle retention, and related documentation/tests. The target paged-stash recipes already set moe_expert_rank_capacity_factor: 1.2; the automatic fallback was inactive there and lacked a complete overflow contract.

Validation

  • CHECK_ONLY=true BASE_REF=mfsdp_refactor bash tools/autoformat.sh: Black, isort, Pylint 10.00/10, and Ruff passed.
  • Copyright, Python 3.12 py_compile, and git diff --check passed.
  • Same-node 64-GPU GB200 A/B, TP1/PP1/EP64, MBS2/GBS512, seq 4096, paged stash, explicit capacity 1.2, activation offload, full-iteration CUDA graph; metrics skip 8 warmup iterations.

| Path | Control job / median TFLOP/s/GPU | Minimal job / median TFLOP/s/GPU | Delta | Peak allocated delta | Reserved/device-used delta |
| --- | --- | --- | --- | --- | --- |
| FSDP v2 | 20260702-202730-fdfa / 1033.50 | 20260702-203255-400a / 1033.15 | -0.034% | -988 MB | -60/-60 MB |
| FSDP v1 on v2 code | 20260702-203814-177d / 1050.05 | 20260702-204330-f5ca / 1045.70 | -0.41% | -989 MB | -4280/-4280 MB |

All four jobs completed 16/16 iterations with finite loss/grad norm, zero skipped iterations, and zero NaN iterations. Final control/candidate losses differed by only 1e-5 on each path.
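
The Delta column can be reproduced from the reported medians. The sketch below assumes the delta is the minimal job's median relative to the control job's median, which is consistent with the published -0.034% and -0.41% figures; the helper name relative_delta_pct is hypothetical.

```python
def relative_delta_pct(control: float, candidate: float) -> float:
    """Percentage change of the candidate median relative to the control median."""
    return (candidate - control) / control * 100.0


# Median TFLOP/s/GPU values taken from the validation table.
fsdp_v2_delta = relative_delta_pct(1033.50, 1033.15)
fsdp_v1_delta = relative_delta_pct(1050.05, 1045.70)

print(f"FSDP v2: {fsdp_v2_delta:.3f}%")
print(f"FSDP v1 on v2 code: {fsdp_v1_delta:.2f}%")
```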

The PR remains draft while the reduced scope is reviewed.

@lhb8125 lhb8125 marked this pull request as ready for review July 3, 2026 01:57
@lhb8125 lhb8125 marked this pull request as draft July 3, 2026 03:10
@lhb8125 lhb8125 force-pushed the codex/mfsdp-v2-split-2-hybridep branch from 828d588 to dcf38bd July 3, 2026 03:49
@lhb8125 lhb8125 changed the title from "Stabilize HybridEP token shapes for full CUDA graphs" to "Fix FSDP hooks bypassed by TE fused experts" Jul 3, 2026
@lhb8125 lhb8125 force-pushed the codex/mfsdp-v2-split-2-hybridep branch from dcf38bd to 8063e74 July 3, 2026 05:55