
Conversation

ys950902

When running on a non-CUDA device with 3D parallelism and DeepSpeed, you will get the following error:
[rank19]: File "/home/yisheng/anaconda3/envs/llm_pt_25/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 214, in init
[rank19]: self._build()
[rank19]: File "/home/yisheng/anaconda3/envs/llm_pt_25/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 270, in _build
[rank19]: module = layer.build()
[rank19]: File "/home/yisheng/anaconda3/envs/llm_pt_25/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 74, in build
[rank19]: return self.typename(*self.module_args, **self.module_kwargs)
[rank19]: TypeError: LayerNorm.__init__() got an unexpected keyword argument 'sequence_parallel'

The cause: Megatron-DeepSpeed passes a sequence_parallel argument when building its layernorm, but the current implementation uses `from torch.nn import LayerNorm` on non-CUDA devices, and torch.nn.LayerNorm has no sequence_parallel parameter, so initialization fails on non-CUDA devices.
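
A minimal sketch of the kind of fix this needs (illustrative only, not the exact patch in this PR): wrap torch.nn.LayerNorm so that it accepts the sequence_parallel keyword. The class name and the parameter tagging below are assumptions based on how Megatron typically marks layernorm parameters.

```python
import torch


class LayerNorm(torch.nn.LayerNorm):
    """torch.nn.LayerNorm that tolerates Megatron-DeepSpeed's sequence_parallel kwarg."""

    def __init__(self, normalized_shape, eps=1e-5, sequence_parallel=False, **kwargs):
        super().__init__(normalized_shape, eps=eps, **kwargs)
        self.sequence_parallel = sequence_parallel
        # Tag the affine parameters so a sequence-parallel gradient all-reduce
        # can find them later; the forward math itself is unchanged.
        for param in (self.weight, self.bias):
            if param is not None:
                setattr(param, "sequence_parallel", sequence_parallel)


# Construction with the extra kwarg no longer raises TypeError on non-CUDA devices:
ln = LayerNorm(1024, eps=1e-5, sequence_parallel=True)
```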

@ys950902
Author

Hi @tjruwase, I think we have discussed this question before.
1. "It is quite subtle since it does not show the connection to sequence-parallelism."
In Megatron-DeepSpeed, sequence_parallel is added to the layernorm; see below:
https://github.com/deepspeedai/Megatron-DeepSpeed/blob/main/megatron/model/gpt_model.py#L406
When running 3D parallelism with DeepSpeed, the keyword argument 'sequence_parallel' is passed to the layernorm constructor; if the layernorm on a non-CUDA device does not accept it, the error above is raised.
2. "It is unclear to me that the new LayerNorm is equivalent to torch.nn.LayerNorm for the non-sequence-parallel case. Maintaining parity with torch.nn.LayerNorm imposes an extra development burden."
It is the same; you can see in the fused_layer_norm used on CUDA that, when the fused kernel is not used, the computation is identical:
http://github.com/deepspeedai/Megatron-DeepSpeed/blob/main/megatron/model/fused_layer_norm.py#L96
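
For reference, a small self-contained check of that equivalence claim (my own sketch, not code from the repository): the non-fused fallback reduces to the same functional call that torch.nn.LayerNorm.forward performs, so the numerics match when sequence parallelism is not used.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 16, 1024)
ln = torch.nn.LayerNorm(1024)

# What the no-fused-kernel branch boils down to: the same functional call
# that torch.nn.LayerNorm.forward makes internally.
y_fallback = F.layer_norm(x, ln.normalized_shape, ln.weight, ln.bias, ln.eps)

assert torch.allclose(ln(x), y_fallback)
```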

@delock

delock commented Jun 30, 2025

Hi @tjruwase, is it possible to have this PR reviewed? This PR fixes a Megatron-DeepSpeed incompatibility with torch.nn.LayerNorm. Without it, Megatron-DeepSpeed does not work correctly on non-CUDA devices.

…run successfully with DeepSpeed

Signed-off-by: yisheng <[email protected]>
ys950902 requested a review from tjruwase July 4, 2025 06:25

@delock

delock commented Jul 16, 2025

Hi @tjruwase, this PR has been updated and should be ready for merge. Thanks!

@sfc-gh-truwase, in case you mainly use the other GitHub account.

tjruwase merged commit 4efb479 into deepspeedai:main Jul 16, 2025
5 checks passed
YJHMITWEB pushed a commit to YJHMITWEB/Megatron-DeepSpeed that referenced this pull request Aug 9, 2025
…run successfully with DeepSpeed (deepspeedai#468)

Signed-off-by: yisheng <[email protected]>
Signed-off-by: Jinghan Yao <[email protected]>
tjruwase pushed a commit that referenced this pull request Aug 14, 2025
…nabled (#479)

* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1

Signed-off-by: Jinghan Yao <[email protected]>

* add fused_rms_norm support on XPU device (#431)

Signed-off-by: Jinghan Yao <[email protected]>

* [LLaMa] Adding support converting checkpoint from mds to hf (#432)

* add support converting checkpoint from hf to mds

* Fix PP issue

* update

Signed-off-by: Jinghan Yao <[email protected]>

* add device check when import ipex (#436)

Signed-off-by: Jinghan Yao <[email protected]>

* fix TFLOPs calculation (#371)

* fix TFLOPs calculation

when GQA is used, we observe the correct TFLOPs after this fix.
when GQA is not used, the huge difference in TFLOPs is resolved with
selective recompute.
some other minor differences will also be observed, as logits MACs are also added.

* add copyrights

Signed-off-by: Jinghan Yao <[email protected]>

* fix nan issue when running megatron-deepspeed (#434)

Signed-off-by: Jinghan Yao <[email protected]>

* enable empty cache on XPU device (#438)

Signed-off-by: Jinghan Yao <[email protected]>

* [wandb] disable wandb more gracefully (#422)

Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Jinghan Yao <[email protected]>

* [Bug] Fix crash when logging optimizer state to tb (#417)

Signed-off-by: Jinghan Yao <[email protected]>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <[email protected]>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <[email protected]>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <[email protected]>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <[email protected]>

* remove unnecessary files

Signed-off-by: Jinghan Yao <[email protected]>

* set the warmup length to be FPDT chunk size if enabled

Signed-off-by: Jinghan Yao <[email protected]>

* Enable Sequence Parallelism (#429)

Signed-off-by: Jinghan Yao <[email protected]>

* grad_wei can't be NoneType when running with DeepSpeed, for zero3 will divided the gradient (#428)

Signed-off-by: Jinghan Yao <[email protected]>

* fix init issue for rms_norm in squence_parallel (#448)

Signed-off-by: Jinghan Yao <[email protected]>

* enable profiler for specific ranks (#451)

Signed-off-by: Jinghan Yao <[email protected]>

* fix init issue for silently ignoring the deepspeed config (#452)

Signed-off-by: Jinghan Yao <[email protected]>

* fix moe tflops (#445)

Signed-off-by: Jinghan Yao <[email protected]>

* [tool]GQA convert support (#454)

* [tools]GQA convert support

* fix readme

Signed-off-by: Jinghan Yao <[email protected]>

* Fix import error in `deepspeed_to_megatron.py` (#455)

Previously, `deepspeed_to_megatron.py` would raise an import error
due to the relative import.

This commit fixes this issue by changing from the relative import
to the absolute import like in `deepspeed_to_transformers.py`.

Signed-off-by: Jinghan Yao <[email protected]>

* Update references to new GitHub org (deepspeedai) (#462)

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: Jinghan Yao <[email protected]>

* add sequence_parallel in layernorm init to enable 3D parallelism can run successfully with DeepSpeed (#468)

Signed-off-by: yisheng <[email protected]>
Signed-off-by: Jinghan Yao <[email protected]>

* fix bug when FPDT is disabled but with original Ulysses

Signed-off-by: Jinghan Yao <[email protected]>
Signed-off-by: jinghan yao [email protected]
Signed-off-by: Jinghan Yao <[email protected]>

---------

Signed-off-by: Jinghan Yao <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: yisheng <[email protected]>
Signed-off-by: jinghan yao [email protected]
Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: YiSheng5 <[email protected]>
Co-authored-by: billishyahao <[email protected]>
Co-authored-by: Polisetty V R K Jyothendra Varma <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: ranzhejiang <[email protected]>
Co-authored-by: Xinyu Lian <[email protected]>
Co-authored-by: inkcherry <[email protected]>
Co-authored-by: hotsuyuki <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>