Scatter embeddings for sequence parallelism in standalone LM forwards #5628

Open
kevalmorabia97 wants to merge 1 commit into NVIDIA:main from kevalmorabia97:kmorabia/vlm-lm-sp-embedding-scatter

@kevalmorabia97

What

Fix a sequence-parallel correctness bug for standalone language-model forwards of models whose embedding is built with scatter_embedding_sequence_parallel=False.

Why

Some models build their GPTModel embedding with scatter_embedding_sequence_parallel=False so the embedding output stays un-scattered for a caller that merges/scatters it. The prime example is vision-language (and omni/audio) models: the outer multimodal model calls language_model.embedding() directly, merges the vision/audio embeddings with the text embeddings, and only then scatters the combined sequence for sequence parallelism (and passes it in as decoder_input).

When such a language model is run standalone, i.e. GPTModel.forward(input_ids=..., decoder_input=None), for example when distilling or PTQ-ing only the language-model tower of a VLM, that outer scatter is bypassed. Under sequence parallelism the embeddings then stay full-length on every TP rank, the decoder runs the full sequence on each rank, and the output-side sequence-parallel gather multiplies the sequence length by TP_size (doubling it at TP=2). Downstream this shows up as a TP_size × seq_length vs seq_length shape mismatch.

Concretely, ModelOpt language-model distillation of a VLM (Qwen3-VL/Qwen3.5-VL, Gemma3-VL, …) at TP=2 + SP fails in the KD loss-mask step:

RuntimeError: The size of tensor a (32) must match the size of tensor b (16) at non-singleton dimension 0

(32 = TP_size(2) × seq_length(16).) Plain TP (no SP) works; standard LMs work; only the standalone-LM + SP case of these scatter=False models is affected.
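The length arithmetic can be reproduced with a toy, single-process sketch. It does not use Megatron: plain Python lists stand in for tensors, and `scatter_to_sp`, `gather_from_sp`, and `standalone_forward` are hypothetical helpers that only mimic the sequence-dimension bookkeeping described above.

```python
# Toy single-process model of the TP ranks; plain lists stand in for tensors.
# All helper names here are illustrative and are not Megatron APIs.
TP_SIZE = 2
SEQ_LEN = 16


def scatter_to_sp(full_seq, rank, tp_size):
    """Keep only this rank's contiguous chunk of the sequence dimension."""
    chunk = len(full_seq) // tp_size
    return full_seq[rank * chunk:(rank + 1) * chunk]


def gather_from_sp(per_rank_seqs):
    """Concatenate every rank's local sequence along the sequence dimension."""
    return [tok for local in per_rank_seqs for tok in local]


def standalone_forward(scatter_in_preprocess):
    """Return the sequence seen after the output-side sequence-parallel gather."""
    embedded = list(range(SEQ_LEN))  # full-length embedding output on every rank
    per_rank = []
    for rank in range(TP_SIZE):
        if scatter_in_preprocess:
            local = scatter_to_sp(embedded, rank, TP_SIZE)
        else:
            local = embedded  # scatter skipped: each rank keeps the full sequence
        per_rank.append(local)  # the decoder preserves the local sequence length
    return gather_from_sp(per_rank)


buggy_len = len(standalone_forward(scatter_in_preprocess=False))
fixed_len = len(standalone_forward(scatter_in_preprocess=True))
print(buggy_len, fixed_len)  # 32 16
```

Without the scatter, the gather concatenates TP_size full-length copies, which matches the 32 vs 16 mismatch in the error above; with the scatter, the gather restores the original 16-token sequence.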

Fix

In GPTModel._preprocess, when the model embeds internally under sequence parallelism and the embedding was built not to scatter, scatter the sequence here:

decoder_input = self.embedding(input_ids=input_ids, position_ids=position_ids)
if self.config.sequence_parallel and not self.embedding.scatter_to_sequence_parallel:
    decoder_input = tensor_parallel.scatter_to_sequence_parallel_region(decoder_input)

This only affects models that set scatter_embedding_sequence_parallel=False; standard models (scatter=True) are unchanged (the guard is false), and the decoder_input-provided path (normal VLM inference) is untouched.
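As a reading aid, the cases listed above can be enumerated with a small sketch. `scatters_in_preprocess` is a hypothetical helper that restates the guard from the snippet above as a pure function; it is not part of GPTModel, and it assumes the guard sits only on the branch where the model embeds `input_ids` itself.

```python
from itertools import product


def scatters_in_preprocess(sequence_parallel, embedding_scatters, decoder_input_given):
    """Hypothetical restatement of the guard: does _preprocess add a scatter?"""
    if decoder_input_given:
        # Caller supplied decoder_input (normal VLM path): the embedding
        # branch, and therefore the new scatter, is never reached.
        return False
    return sequence_parallel and not embedding_scatters


# Enumerate every combination of the three flags.
for sp, emb_scatter, given in product([False, True], repeat=3):
    extra = scatters_in_preprocess(sp, emb_scatter, given)
    print(f"SP={sp!s:5} embedding_scatters={emb_scatter!s:5} "
          f"decoder_input_given={given!s:5} -> extra scatter: {extra}")
```

Only one of the eight combinations triggers the extra scatter: sequence parallelism on, an embedding built not to scatter, and no caller-provided decoder_input.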

Testing

Validated on nvcr.io/nvidia/nemo:26.06 via ModelOpt/Megatron-Bridge language-model distillation: Gemma3-VL and Qwen3.5-VL both pass at TP=2 + SP with this change (previously both failed at the loss-mask step). Single-GPU and TP-without-SP were already passing and remain so.

🤖 Generated with Claude Code

Models can build their GPTModel embedding with scatter_embedding_sequence_parallel=False so that
the embedding output stays un-scattered for a caller that merges/scatters it -- e.g. VLM language
models, whose outer multimodal model calls .embedding() directly, merges vision/audio + text
embeddings, and then scatters the combined sequence for sequence parallelism.

When such a language model is run standalone (GPTModel.forward with input_ids and no external
decoder_input) -- for example distilling or quantizing only the language-model tower of a VLM --
the outer scatter is bypassed. Under sequence parallelism the embeddings then stay full-length on
every TP rank, and the output-side sequence-parallel gather doubles the sequence, producing a
tensor of length TP_size x seq_length vs seq_length downstream (observed as a shape mismatch in
knowledge-distillation loss masking).

Scatter the internally-embedded sequence in _preprocess when sequence parallelism is on and the
embedding was built not to scatter. This only affects models that set
scatter_embedding_sequence_parallel=False; standard models (scatter=True) are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners July 2, 2026 19:13
@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft July 2, 2026 19:13
@github-actions

github-actions Bot commented Jul 2, 2026

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@copy-pr-bot

copy-pr-bot Bot commented Jul 2, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.
