
Conversation

ys950902

When running on a non-CUDA device with 3D parallelism and DeepSpeed, you will get the following error:
[rank19]: File "/home/yisheng/anaconda3/envs/llm_pt_25/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 214, in init
[rank19]: self._build()
[rank19]: File "/home/yisheng/anaconda3/envs/llm_pt_25/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 270, in _build
[rank19]: module = layer.build()
[rank19]: File "/home/yisheng/anaconda3/envs/llm_pt_25/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 74, in build
[rank19]: return self.typename(*self.module_args, **self.module_kwargs)
[rank19]: TypeError: LayerNorm.__init__() got an unexpected keyword argument 'sequence_parallel'

The cause: Megatron-DeepSpeed passes a sequence_parallel argument when building its layernorm, but the current implementation uses `from torch.nn import LayerNorm` on non-CUDA devices, and torch.nn.LayerNorm has no sequence_parallel parameter, so initialization fails on non-CUDA devices.
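
A minimal sketch of the kind of fix this needs (illustrative only, not the exact patch in this PR): wrap torch.nn.LayerNorm so that it accepts the sequence_parallel keyword. The class name and the parameter tagging below are assumptions based on how Megatron typically marks layernorm parameters.

```python
import torch


class LayerNorm(torch.nn.LayerNorm):
    """torch.nn.LayerNorm that tolerates Megatron-DeepSpeed's sequence_parallel kwarg."""

    def __init__(self, normalized_shape, eps=1e-5, sequence_parallel=False, **kwargs):
        super().__init__(normalized_shape, eps=eps, **kwargs)
        self.sequence_parallel = sequence_parallel
        # Tag the affine parameters so a sequence-parallel gradient all-reduce
        # can find them later; the forward math itself is unchanged.
        for param in (self.weight, self.bias):
            if param is not None:
                setattr(param, "sequence_parallel", sequence_parallel)


# Construction with the extra kwarg no longer raises TypeError on non-CUDA devices:
ln = LayerNorm(1024, eps=1e-5, sequence_parallel=True)
```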

@ys950902
Author

Hi @tjruwase, I think we have discussed this question before.
1. "It is quite subtle since it does not show the connection to sequence-parallelism."
In Megatron-DeepSpeed, sequence_parallel is added to the layernorm; see below:
https://github.com/deepspeedai/Megatron-DeepSpeed/blob/main/megatron/model/gpt_model.py#L406
When running 3D parallelism with DeepSpeed, the keyword argument 'sequence_parallel' is passed to the layernorm constructor; if the layernorm on a non-CUDA device does not accept it, the error above is raised.
2. "It is unclear to me that the new LayerNorm is equivalent to torch.nn.LayerNorm for the non-sequence-parallel case. Maintaining parity with torch.nn.LayerNorm imposes an extra development burden."
It is the same; you can see in the fused_layer_norm used on CUDA that, when the fused kernel is not used, the computation is identical:
http://github.com/deepspeedai/Megatron-DeepSpeed/blob/main/megatron/model/fused_layer_norm.py#L96
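
For reference, a small self-contained check of that equivalence claim (my own sketch, not code from the repository): the non-fused fallback reduces to the same functional call that torch.nn.LayerNorm.forward performs, so the numerics match when sequence parallelism is not used.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 16, 1024)
ln = torch.nn.LayerNorm(1024)

# What the no-fused-kernel branch boils down to: the same functional call
# that torch.nn.LayerNorm.forward makes internally.
y_fallback = F.layer_norm(x, ln.normalized_shape, ln.weight, ln.bias, ln.eps)

assert torch.allclose(ln(x), y_fallback)
```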

@delock

delock commented Jun 30, 2025

Hi @tjruwase, is it possible to have this PR reviewed? This PR fixes a Megatron-DeepSpeed incompatibility with torch.nn.LayerNorm. Without it, Megatron-DeepSpeed does not work correctly on non-CUDA devices.

…run successfully with DeepSpeed

Signed-off-by: yisheng <[email protected]>
ys950902 requested a review from tjruwase July 4, 2025 06:25

@delock

delock commented Jul 16, 2025

Hi @tjruwase, this PR has been updated and should be ready for merge. Thanks!

@sfc-gh-truwase, in case you mainly use the other GitHub account.

tjruwase merged commit 4efb479 into deepspeedai:main Jul 16, 2025
5 checks passed
YJHMITWEB pushed a commit to YJHMITWEB/Megatron-DeepSpeed that referenced this pull request Aug 9, 2025
…run successfully with DeepSpeed (deepspeedai#468)

Signed-off-by: yisheng <[email protected]>
Signed-off-by: Jinghan Yao <[email protected]>
tjruwase pushed a commit that referenced this pull request Aug 14, 2025
…nabled (#479)

* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1

Signed-off-by: Jinghan Yao <[email protected]>

* add fused_rms_norm support on XPU device (#431)

Signed-off-by: Jinghan Yao <[email protected]>

* [LLaMa] Adding support converting checkpoint from mds to hf (#432)

* add support converting checkpoint from hf to mds

* Fix PP issue

* update

Signed-off-by: Jinghan Yao <[email protected]>

* add device check when import ipex (#436)

Signed-off-by: Jinghan Yao <[email protected]>

* fix TFLOPs calculation (#371)

* fix TFLOPs calculation

when GQA is used, we observe the correct TFLOPs after this fix.
when GQA is not used, the huge difference in TFLOPs is resolved with
selective recompute.
some other minor differences will also be observed, as logits MACs are also added.

* add copyrights

Signed-off-by: Jinghan Yao <[email protected]>

* fix nan issue when running megatron-deepspeed (#434)

Signed-off-by: Jinghan Yao <[email protected]>

* enable empty cache on XPU device (#438)

Signed-off-by: Jinghan Yao <[email protected]>

* [wandb] disable wandb more gracefully (#422)

Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Jinghan Yao <[email protected]>

* [Bug] Fix crash when logging optimizer state to tb (#417)

Signed-off-by: Jinghan Yao <[email protected]>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <[email protected]>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <[email protected]>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <[email protected]>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <[email protected]>

* remove unnecessary files

Signed-off-by: Jinghan Yao <[email protected]>

* set the warmup length to be FPDT chunk size if enabled

Signed-off-by: Jinghan Yao <[email protected]>

* Enable Sequence Parallelism (#429)

Signed-off-by: Jinghan Yao <[email protected]>

* grad_wei can't be NoneType when running with DeepSpeed, for zero3 will divided the gradient (#428)

Signed-off-by: Jinghan Yao <[email protected]>

* fix init issue for rms_norm in squence_parallel (#448)

Signed-off-by: Jinghan Yao <[email protected]>

* enable profiler for specific ranks (#451)

Signed-off-by: Jinghan Yao <[email protected]>

* fix init issue for silently ignoring the deepspeed config (#452)

Signed-off-by: Jinghan Yao <[email protected]>

* fix moe tflops (#445)

Signed-off-by: Jinghan Yao <[email protected]>

* [tool]GQA convert support (#454)

* [tools]GQA convert support

* fix readme

Signed-off-by: Jinghan Yao <[email protected]>

* Fix import error in `deepspeed_to_megatron.py` (#455)

Previously, `deepspeed_to_megatron.py` would raise an import error
due to the relative import.

This commit fixes this issue by changing from the relative import
to the absolute import like in `deepspeed_to_transformers.py`.

Signed-off-by: Jinghan Yao <[email protected]>

* Update references to new GitHub org (deepspeedai) (#462)

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: Jinghan Yao <[email protected]>

* add sequence_parallel in layernorm init to enable 3D parallelism can run successfully with DeepSpeed (#468)

Signed-off-by: yisheng <[email protected]>
Signed-off-by: Jinghan Yao <[email protected]>

* fix bug when FPDT is disabled but with original Ulysses

Signed-off-by: Jinghan Yao <[email protected]>
Signed-off-by: jinghan yao [email protected]
Signed-off-by: Jinghan Yao <[email protected]>

---------

Signed-off-by: Jinghan Yao <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: yisheng <[email protected]>
Signed-off-by: jinghan yao [email protected]
Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: YiSheng5 <[email protected]>
Co-authored-by: billishyahao <[email protected]>
Co-authored-by: Polisetty V R K Jyothendra Varma <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: ranzhejiang <[email protected]>
Co-authored-by: Xinyu Lian <[email protected]>
Co-authored-by: inkcherry <[email protected]>
Co-authored-by: hotsuyuki <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>