[dev] partial cuda graph support for dynamic cp by HaochenYuan · Pull Request #5618 · NVIDIA/Megatron-LM

HaochenYuan · 2026-07-02T07:14:17Z

I, the PR author, have personally reviewed every line of this PR.

What does this PR do?

Summary

Enable layer-wise Transformer Engine CUDA Graphs with dynamic context parallelism by capturing and selecting a graph bank for each supported CP size.
Preserve dynamic-CP process groups and THD actual/padded metadata during capture and replay.
Keep MLA RoPE tensors alive for the graph lifetime and use a zero-valid dummy sequence so fused RoPE covers the full physical THD buffer.
Bound packed-sequence capacity and graph slots to prevent unsafe replay reuse.
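
For reviewers, the bank-per-CP-size selection and the capacity/slot bounds described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `GraphBank`, `GraphBankSelector`, `max_tokens`, and `num_slots` are hypothetical names, and returning `None` so the caller can decide what to do with an out-of-bounds request is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional, Tuple


@dataclass
class GraphBank:
    """Hypothetical container: the graphs captured for one CP size."""

    cp_size: int
    max_tokens: int  # packed-sequence capacity fixed at capture time
    num_slots: int   # number of replay slots captured for this bank
    # (layer index, slot) -> captured callable; left empty in this sketch.
    graphs: Dict[Tuple[int, int], Callable] = field(default_factory=dict)


class GraphBankSelector:
    """Hypothetical selector: picks the bank that matches the runtime CP size."""

    def __init__(self, supported_cp_sizes: Iterable[int], max_tokens: int, num_slots: int):
        self.banks = {
            cp: GraphBank(cp_size=cp, max_tokens=max_tokens, num_slots=num_slots)
            for cp in supported_cp_sizes
        }

    def select(self, cp_size: int, num_tokens: int, slot: int) -> Optional[GraphBank]:
        bank = self.banks.get(cp_size)
        if bank is None:
            return None  # no graphs were captured for this CP size
        if num_tokens > bank.max_tokens:
            return None  # batch exceeds the captured packed-sequence capacity
        if not 0 <= slot < bank.num_slots:
            return None  # slot was never captured, so replay would reuse buffers
        return bank


selector = GraphBankSelector(supported_cp_sizes=[1, 2, 4, 8], max_tokens=8192, num_slots=2)
bank = selector.select(cp_size=4, num_tokens=4096, slot=0)
```

Keeping one bank per CP size means a replay never runs a graph that was captured against a different process-group layout; the two bound checks reject any request whose shape or slot was not captured.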

Validation

16-GPU dynamic-CP E2E:
- Qwen3-8B with TP2/PP1 and TP2/PP2; eager and graph loss/grad metrics match.
- Moonlight with forced runtime CP1/2/4/8; eager and graph metrics match.
Moonlight 100-step graph soak completed without NaN or illegal memory access.
Static CP4 regression passed.
Targeted unit tests cover THD metadata, graph-bank selection, MLA RoPE lifetime, and CP partitioning.
Note: fused RoPE is unsafe only when a no-dummy configuration creates a hidden-only tail outside the padded THD metadata.
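
To make the dummy-sequence note concrete, here is a minimal sketch of actual/padded THD metadata for a fixed-capacity buffer. It assumes that actual lengths are accumulated into `cu_seqlens` and padded offsets into `cu_seqlens_padded`; `build_thd_metadata`, `uncovered_tail`, `pad_multiple`, and `capacity` are hypothetical names chosen for illustration and are not taken from this PR.

```python
from itertools import accumulate
from typing import List, Tuple


def build_thd_metadata(
    seq_lens: List[int], pad_multiple: int, capacity: int, add_dummy: bool = True
) -> Tuple[List[int], List[int]]:
    """Return (cu_seqlens, cu_seqlens_padded) for a buffer of `capacity` tokens."""
    # Round each sequence up to the padding multiple.
    padded = [-(-n // pad_multiple) * pad_multiple for n in seq_lens]
    used = sum(padded)
    if used > capacity:
        raise ValueError(f"padded length {used} exceeds capacity {capacity}")
    actual = list(seq_lens)
    tail = capacity - used
    if add_dummy and tail > 0:
        # Dummy sequence: zero valid tokens, padded length equal to the tail,
        # so the padded metadata spans the whole physical buffer.
        actual.append(0)
        padded.append(tail)
    return [0, *accumulate(actual)], [0, *accumulate(padded)]


def uncovered_tail(cu_seqlens_padded: List[int], capacity: int) -> int:
    """Number of buffer rows that the padded metadata does not describe."""
    return capacity - cu_seqlens_padded[-1]


cu, cu_padded = build_thd_metadata([5, 3], pad_multiple=4, capacity=16)
_, cu_padded_no_dummy = build_thd_metadata([5, 3], pad_multiple=4, capacity=16, add_dummy=False)
```

With the dummy sequence the padded offsets end at the buffer capacity and `uncovered_tail` is 0. Without it, four rows in this example sit beyond the last padded offset, which corresponds to the hidden-only tail mentioned in the note.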

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

Signed-off-by: HaochenYuan <haocheny@nvidia.com>

partial cuda graph support for dynamic cp

292388f

HaochenYuan requested review from a team as code owners July 2, 2026 07:14


HaochenYuan wants to merge 1 commit into NVIDIA:dev from HaochenYuan:dynamic_cp_cuda_graph
