Fix FSDP hooks bypassed by TE fused experts #25
Draft
lhb8125 wants to merge 1 commit into
Summary

`GroupedLinear` pre-forward hooks through the TE fused expert implementation while preserving PyTorch `with_kwargs` signatures.

This is the `mfsdp_refactor` backport of the core post-forward fix in NVIDIA#5636, plus the pre-forward hook-signature correction needed by `with_kwargs=True` fine-grained hooks on this branch. The PR contains production code only.

The rewrite removes the previous HybridEP static-budget fallback, token/probability/gradient row trimming, padding metadata/cache/zeroing, CUDA-graph handle retention, and related documentation/tests. The target paged-stash recipes already set `moe_expert_rank_capacity_factor: 1.2`; the automatic fallback was inactive there and lacked a complete overflow contract.

Validation
- `CHECK_ONLY=true BASE_REF=mfsdp_refactor bash tools/autoformat.sh`: Black, isort, Pylint 10.00/10, and Ruff passed.
- `py_compile`, and `git diff --check` passed.

Run results:

- `20260702-202730-fdfa/` 1033.50
- `20260702-203255-400a/` 1033.15
- `20260702-203814-177d/` 1050.05
- `20260702-204330-f5ca/` 1045.70

All four jobs completed 16/16 iterations with finite loss/grad norm, zero skipped iterations, and zero NaN iterations. Final control/candidate losses differed by only `1e-5` on each path.

The PR remains draft while the reduced scope is reviewed.
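
For reviewers less familiar with the hook-signature issue named in the Summary, here is a minimal, torch-free sketch of the general failure mode. It is not the code in this PR. `TinyModule`, `run_pre_hooks`, `fused_bypassing`, and `fused_with_hooks` are hypothetical names invented for illustration. They only mimic PyTorch's documented pre-forward hook semantics: a hook registered with `with_kwargs=True` is called as `hook(module, args, kwargs)` and may return an `(args, kwargs)` tuple, while a plain hook is called as `hook(module, args)`.

```python
from typing import Any, Callable, Dict, List, Tuple


class TinyModule:
    """Hypothetical stand-in for a module with PyTorch-style pre-forward hooks."""

    def __init__(self) -> None:
        # Each entry is (hook, with_kwargs), mirroring the two hook signatures.
        self._pre_hooks: List[Tuple[Callable[..., Any], bool]] = []

    def register_forward_pre_hook(self, hook: Callable[..., Any], with_kwargs: bool = False) -> None:
        self._pre_hooks.append((hook, with_kwargs))

    def run_pre_hooks(self, args: Tuple[Any, ...], kwargs: Dict[str, Any]):
        """Call every hook with the signature it was registered for."""
        for hook, with_kwargs in self._pre_hooks:
            if with_kwargs:
                result = hook(self, args, kwargs)
                if result is not None:
                    args, kwargs = result  # hook returned (args, kwargs)
            else:
                result = hook(self, args)
                if result is not None:
                    args = result if isinstance(result, tuple) else (result,)
        return args, kwargs

    def forward(self, x, scale=1):
        return x * scale

    def __call__(self, *args, **kwargs):
        # The normal call path runs the hooks before forward.
        args, kwargs = self.run_pre_hooks(args, kwargs)
        return self.forward(*args, **kwargs)


def fused_bypassing(module: TinyModule, x, scale=1):
    # A fused path that calls forward directly: hooks never fire.
    return module.forward(x, scale=scale)


def fused_with_hooks(module: TinyModule, x, scale=1):
    # A fused path that dispatches the hooks itself, keeping their signatures.
    args, kwargs = module.run_pre_hooks((x,), {"scale": scale})
    return module.forward(*args, **kwargs)


calls: List[str] = []


def kwargs_hook(module, args, kwargs):
    # Records the call and rewrites a keyword argument.
    calls.append("kwargs_hook")
    return args, {**kwargs, "scale": kwargs.get("scale", 1) * 10}


m = TinyModule()
m.register_forward_pre_hook(kwargs_hook, with_kwargs=True)

bypassed = fused_bypassing(m, 2, scale=3)  # hook skipped: 2 * 3
fixed = fused_with_hooks(m, 2, scale=3)    # hook applied: 2 * 30
```

In this toy model, `bypassed` is 6 and `fixed` is 60, and the hook is recorded only for the second call. The real fix has to do the equivalent inside the TE fused expert path, which this sketch does not attempt to reproduce.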