[Bug] Add sequence_parallel in layernorm init to enable 3D parallelism with DeepSpeed for non-CUDA device. #468
Conversation
44619fa to f099692
Hi @tjruwase, I think we have talked about this question before,
Hi @tjruwase, is it possible to have this PR reviewed? It fixes a Megatron-DeepSpeed incompatibility with torch.nn.LayerNorm; without it, Megatron-DeepSpeed does not work correctly on non-CUDA devices.
…run successfully with DeepSpeed Signed-off-by: yisheng <[email protected]>
Hi @tjruwase, this PR has been updated and should be ready for merge. Thanks!
@sfc-gh-truwase in case you mainly use the other GitHub account
…run successfully with DeepSpeed (deepspeedai#468) Signed-off-by: yisheng <[email protected]> Signed-off-by: Jinghan Yao <[email protected]>
…nabled (#479)
* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1 Signed-off-by: Jinghan Yao <[email protected]>
* add fused_rms_norm support on XPU device (#431) Signed-off-by: Jinghan Yao <[email protected]>
* [LLaMa] Adding support converting checkpoint from mds to hf (#432)
  * add support converting checkpoint from hf to mds
  * Fix PP issue
  * update
  Signed-off-by: Jinghan Yao <[email protected]>
* add device check when import ipex (#436) Signed-off-by: Jinghan Yao <[email protected]>
* fix TFLOPs calculation (#371)
  * fix TFLOPs calculation when GQA used, we observe right TFLOPs after this fix. when GQA is not used, huge difference in TFLOPs is solved with selective recompute. some other minor difference will also be observed as logits macs also added.
  * add copyrights
  Signed-off-by: Jinghan Yao <[email protected]>
* fix nan issue when running megatron-deepspeed (#434) Signed-off-by: Jinghan Yao <[email protected]>
* enable empty cache on XPU device (#438) Signed-off-by: Jinghan Yao <[email protected]>
* [wandb] disable wandb more gracefully (#422) Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Jinghan Yao <[email protected]>
* [Bug] Fix crash when logging optimizer state to tb (#417) Signed-off-by: Jinghan Yao <[email protected]>
* add FPDT support; add Ulysses rotary position embedding support Signed-off-by: Jinghan Yao <[email protected]>
* add FPDT support; add Ulysses rotary position embedding support Signed-off-by: Jinghan Yao <[email protected]>
* add FPDT support; add Ulysses rotary position embedding support Signed-off-by: Jinghan Yao <[email protected]>
* add FPDT support; add Ulysses rotary position embedding support Signed-off-by: Jinghan Yao <[email protected]>
* remove unnecessary files Signed-off-by: Jinghan Yao <[email protected]>
* set the warmup length to be FPDT chunk size if enabled Signed-off-by: Jinghan Yao <[email protected]>
* Enable Sequence Parallelism (#429) Signed-off-by: Jinghan Yao <[email protected]>
* grad_wei can't be NoneType when running with DeepSpeed, for zero3 will divided the gradient (#428) Signed-off-by: Jinghan Yao <[email protected]>
* fix init issue for rms_norm in squence_parallel (#448) Signed-off-by: Jinghan Yao <[email protected]>
* enable profiler for specific ranks (#451) Signed-off-by: Jinghan Yao <[email protected]>
* fix init issue for silently ignoring the deepspeed config (#452) Signed-off-by: Jinghan Yao <[email protected]>
* fix moe tflops (#445) Signed-off-by: Jinghan Yao <[email protected]>
* [tool]GQA convert support (#454)
  * [tools]GQA convert support
  * fix readme
  Signed-off-by: Jinghan Yao <[email protected]>
* Fix import error in `deepspeed_to_megatron.py` (#455) Previously, `deepspeed_to_megatron.py` would raise an import error due to the relative import. This commit fixes this issue by changing from the relative import to the absolute import like in `deepspeed_to_transformers.py`. Signed-off-by: Jinghan Yao <[email protected]>
* Update references to new GitHub org (deepspeedai) (#462) Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Jinghan Yao <[email protected]>
* add sequence_parallel in layernorm init to enable 3D parallelism can run successfully with DeepSpeed (#468) Signed-off-by: yisheng <[email protected]> Signed-off-by: Jinghan Yao <[email protected]>
* fix bug when FPDT is disabled but with original Ulysses Signed-off-by: Jinghan Yao <[email protected]> Signed-off-by: jinghan yao [email protected] Signed-off-by: Jinghan Yao <[email protected]>
---------
Signed-off-by: Jinghan Yao <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: yisheng <[email protected]>
Signed-off-by: jinghan yao [email protected]
Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: YiSheng5 <[email protected]>
Co-authored-by: billishyahao <[email protected]>
Co-authored-by: Polisetty V R K Jyothendra Varma <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: ranzhejiang <[email protected]>
Co-authored-by: Xinyu Lian <[email protected]>
Co-authored-by: inkcherry <[email protected]>
Co-authored-by: hotsuyuki <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>
When running on a non-CUDA device, 3D parallelism with DeepSpeed fails with the error shown below:
[rank19]: File "/home/yisheng/anaconda3/envs/llm_pt_25/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 214, in init
[rank19]: self._build()
[rank19]: File "/home/yisheng/anaconda3/envs/llm_pt_25/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 270, in _build
[rank19]: module = layer.build()
[rank19]: File "/home/yisheng/anaconda3/envs/llm_pt_25/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 74, in build
[rank19]: return self.typename(*self.module_args, **self.module_kwargs)
[rank19]: TypeError: LayerNorm.__init__() got an unexpected keyword argument 'sequence_parallel'
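The same TypeError can be reproduced outside the pipeline engine with a minimal snippet (a hypothetical standalone repro, not code from this repo):

```python
import torch
from torch.nn import LayerNorm

# torch.nn.LayerNorm knows nothing about Megatron's sequence_parallel flag,
# so passing it through the layer's kwargs at build time raises TypeError.
try:
    LayerNorm(1024, eps=1e-5, sequence_parallel=True)
except TypeError as err:
    # LayerNorm.__init__() got an unexpected keyword argument 'sequence_parallel'
    print(err)
```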
This happens because Megatron-DeepSpeed adds a sequence_parallel argument to its layernorm, while the current implementation for non-CUDA devices uses `from torch.nn import LayerNorm`. That class has no sequence_parallel attribute, so layernorm initialization fails on non-CUDA devices.
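The direction of the fix is to have the layernorm used on non-CUDA devices accept the extra flag instead of rejecting it. Below is a minimal sketch of that idea as a thin subclass of torch.nn.LayerNorm; the class name, defaults, and attribute tagging are illustrative assumptions, not necessarily the exact code merged in this PR:

```python
import torch


class LayerNorm(torch.nn.LayerNorm):
    """torch.nn.LayerNorm that tolerates Megatron's sequence_parallel kwarg.

    Megatron-DeepSpeed marks layernorm parameters with a .sequence_parallel
    attribute so their gradients can be all-reduced across the
    sequence-parallel group; plain torch.nn.LayerNorm rejects the kwarg.
    """

    def __init__(self, normalized_shape, eps=1e-5, sequence_parallel=False, **kwargs):
        # Consume sequence_parallel here instead of forwarding it to torch.nn.LayerNorm.
        super().__init__(normalized_shape, eps=eps, **kwargs)
        self.sequence_parallel = sequence_parallel
        # Tag the affine parameters so downstream grad-reduction logic can find them.
        if self.elementwise_affine:
            setattr(self.weight, "sequence_parallel", sequence_parallel)
            if self.bias is not None:
                setattr(self.bias, "sequence_parallel", sequence_parallel)
```

With a wrapper like this, DeepSpeed's pipeline layer builder can pass sequence_parallel through the layer kwargs without raising TypeError, mirroring how the fused layernorm path handles the flag on CUDA devices.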