Scatter embeddings for sequence parallelism in standalone LM forwards by kevalmorabia97 · Pull Request #5628 · NVIDIA/Megatron-LM

kevalmorabia97 · 2026-07-02T19:13:29Z

What

Fix a sequence-parallel correctness bug for standalone language-model forwards of models whose embedding is built with scatter_embedding_sequence_parallel=False.

Why

Some models build their GPTModel embedding with scatter_embedding_sequence_parallel=False so the embedding output stays un-scattered for a caller that merges/scatters it. The prime example is vision-language (and omni/audio) models: the outer multimodal model calls language_model.embedding() directly, merges the vision/audio embeddings with the text embeddings, and only then scatters the combined sequence for sequence parallelism (and passes it in as decoder_input).
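The merge-then-scatter order described above can be illustrated with a toy, single-process sketch. Everything in it is hypothetical (the IMAGE_TOKEN placeholder id, the tuple "embeddings", and all helper names), and plain lists stand in for tensors. It only shows why the outer model wants the un-scattered text embeddings before sharding the sequence; it is not the Megatron-LM implementation.

```python
# Toy stand-ins: Python lists play the role of [seq, ...] tensors.
IMAGE_TOKEN = -1  # hypothetical placeholder id marking image positions


def embed_text(input_ids):
    """Stand-in for language_model.embedding() built with scatter disabled."""
    return [("text", tok) for tok in input_ids]


def merge_modalities(text_emb, input_ids, vision_emb):
    """Replace placeholder positions with vision embeddings, in order."""
    vision_iter = iter(vision_emb)
    return [
        next(vision_iter) if tok == IMAGE_TOKEN else emb
        for emb, tok in zip(text_emb, input_ids)
    ]


def scatter_sequence(seq, tp_size, rank):
    """Keep only this rank's contiguous slice of the sequence dimension."""
    chunk = len(seq) // tp_size
    return seq[rank * chunk:(rank + 1) * chunk]


input_ids = [5, IMAGE_TOKEN, IMAGE_TOKEN, 7]
vision_emb = [("image", 0), ("image", 1)]

# Merge on the full sequence first, then shard the combined result per rank.
merged = merge_modalities(embed_text(input_ids), input_ids, vision_emb)
shards = [scatter_sequence(merged, tp_size=2, rank=r) for r in range(2)]
```

In this toy model the merge sees the whole sequence, and only the combined result is split across ranks, which is the role the outer multimodal model plays in the description above.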

When such a language model is run standalone, i.e. GPTModel.forward(input_ids=..., decoder_input=None), for example when distilling or applying post-training quantization (PTQ) to only the language-model tower of a VLM, that outer scatter is bypassed. Under sequence parallelism the embeddings then stay full-length on every TP rank, the decoder runs the full sequence on each rank, and the output-side sequence-parallel gather concatenates TP_size full-length copies, so the sequence comes out TP_size times too long (doubled at TP=2). Downstream this shows up as a TP_size × seq_length vs seq_length shape mismatch.

Concretely, ModelOpt language-model distillation of a VLM (Qwen3-VL/Qwen3.5-VL, Gemma3-VL, …) at TP=2 + SP fails in the KD loss-mask step:

RuntimeError: The size of tensor a (32) must match the size of tensor b (16) at non-singleton dimension 0

(32 = TP_size(2) × seq_length(16).) Plain TP (no SP) works; standard LMs work; only the standalone-LM + SP case of these scatter=False models is affected.
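The 32-vs-16 arithmetic can be reproduced with a small single-process simulation. This is a sketch under stated assumptions: lists stand in for sequence-first tensors, the "decoder" is treated as length-preserving, and the helper names (scatter_sequence, gather_sequence, standalone_forward) are hypothetical rather than Megatron-LM functions.

```python
def scatter_sequence(seq, tp_size, rank):
    """Keep only this rank's contiguous slice of the sequence dimension."""
    chunk = len(seq) // tp_size
    return seq[rank * chunk:(rank + 1) * chunk]


def gather_sequence(per_rank_chunks):
    """Concatenate every rank's local chunk along the sequence dimension."""
    gathered = []
    for chunk in per_rank_chunks:
        gathered.extend(chunk)
    return gathered


def standalone_forward(input_ids, tp_size, scatter_in_preprocess):
    """Simulate a standalone LM forward whose embedding does not scatter."""
    per_rank = []
    for rank in range(tp_size):
        # Embedding output is full-length on every rank (scatter disabled).
        hidden = list(input_ids)
        if scatter_in_preprocess:
            hidden = scatter_sequence(hidden, tp_size, rank)
        # The decoder is assumed to preserve the local sequence length.
        per_rank.append(hidden)
    # Output-side sequence-parallel gather.
    return gather_sequence(per_rank)


tokens = list(range(16))
buggy = standalone_forward(tokens, tp_size=2, scatter_in_preprocess=False)
fixed = standalone_forward(tokens, tp_size=2, scatter_in_preprocess=True)
print(len(buggy), len(fixed))  # 32 16
```

Without the scatter, each of the two ranks contributes all 16 positions to the gather, giving 32; with it, each rank contributes 8 and the original 16-position sequence is recovered.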

Fix

In GPTModel._preprocess, when the model embeds internally under sequence parallelism and the embedding was built not to scatter, scatter the sequence here:

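# No decoder_input was passed in, so the model embeds internally.
decoder_input = self.embedding(input_ids=input_ids, position_ids=position_ids)
# The embedding was built not to scatter, so scatter the sequence across TP ranks here.
if self.config.sequence_parallel and not self.embedding.scatter_to_sequence_parallel:
    decoder_input = tensor_parallel.scatter_to_sequence_parallel_region(decoder_input)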

This only affects models that set scatter_embedding_sequence_parallel=False; standard models (scatter=True) are unchanged (the guard is false), and the decoder_input-provided path (normal VLM inference) is untouched.
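The cases listed above can be written out as a small truth table. The helper below is a hypothetical restatement of the guard for illustration (its name and its decoder_input_provided argument are assumptions, not part of GPTModel); it only mirrors the condition described in this section.

```python
from itertools import product


def scatters_in_preprocess(sequence_parallel, embedding_scatters, decoder_input_provided):
    """Return True when the new scatter in _preprocess would run.

    Hypothetical helper mirroring the guard described in the PR text.
    """
    if decoder_input_provided:
        # An outer model already embedded (and scattered) the sequence.
        return False
    return sequence_parallel and not embedding_scatters


# Enumerate every flag combination and collect those where the scatter fires.
affected = [
    flags
    for flags in product([False, True], repeat=3)
    if scatters_in_preprocess(*flags)
]
print(affected)  # [(True, False, False)]
```

Only one of the eight combinations triggers the new scatter: sequence parallelism on, an embedding built not to scatter, and no externally supplied decoder_input.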

Testing

Validated on nvcr.io/nvidia/nemo:26.06 via ModelOpt/Megatron-Bridge language-model distillation: Gemma3-VL and Qwen3.5-VL both pass at TP=2 + SP with this change (previously both failed at the loss-mask step). Single-GPU and TP-without-SP were already passing and remain so.

🤖 Generated with Claude Code

Models can build their GPTModel embedding with scatter_embedding_sequence_parallel=False so that the embedding output stays un-scattered for a caller that merges/scatters it -- e.g. VLM language models, whose outer multimodal model calls .embedding() directly, merges vision/audio + text embeddings, and then scatters the combined sequence for sequence parallelism.

When such a language model is run standalone (GPTModel.forward with input_ids and no external decoder_input) -- for example distilling or quantizing only the language-model tower of a VLM -- the outer scatter is bypassed. Under sequence parallelism the embeddings then stay full-length on every TP rank, and the output-side sequence-parallel gather doubles the sequence, producing a tensor of length TP_size x seq_length vs seq_length downstream (observed as a shape mismatch in knowledge-distillation loss masking).

Scatter the internally-embedded sequence in _preprocess when sequence parallelism is on and the embedding was built not to scatter. This only affects models that set scatter_embedding_sequence_parallel=False; standard models (scatter=True) are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

github-actions · 2026-07-02T19:13:38Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

copy-pr-bot · 2026-07-02T19:13:40Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

kevalmorabia97 requested review from a team as code owners July 2, 2026 19:13

svcnvidia-nemo-ci marked this pull request as draft July 2, 2026 19:13

kevalmorabia97 mentioned this pull request Jul 2, 2026

fix: scatter Qwen3-VL text-model embeddings under SP for standalone LM forwards NVIDIA-NeMo/Megatron-Bridge#4629

Closed

copy-pr-bot Bot temporarily deployed to public July 2, 2026 19:14 Inactive

copy-pr-bot Bot temporarily deployed to test July 2, 2026 19:14 Inactive

kevalmorabia97 marked this pull request as ready for review July 2, 2026 19:18

copy-pr-bot Bot temporarily deployed to public July 2, 2026 19:18 Inactive

svcnvidia-nemo-ci added the complexity: low label Jul 2, 2026

copy-pr-bot Bot temporarily deployed to public July 2, 2026 19:18 Inactive

copy-pr-bot Bot temporarily deployed to public July 2, 2026 19:28 Inactive

kevalmorabia97 wants to merge 1 commit into NVIDIA:main from kevalmorabia97:kmorabia/vlm-lm-sp-embedding-scatter

2 participants
