
Commit aa2ca83

add example and update deepspeed/FSDP docs (huggingface#1489)
* add example and update deepspeed docs
* fixes
* fixes and update FSDP docs
* fixes and addressing comments
* fixes
* resolve comments
* Apply suggestions from code review

  Co-authored-by: Benjamin Bossan <[email protected]>

* address comments
* Update fsdp.md
* Update docs/source/accelerate/fsdp.md

  Co-authored-by: Benjamin Bossan <[email protected]>

* addressing comments
* address comments

---------

Co-authored-by: Benjamin Bossan <[email protected]>
1 parent 1b3b7b5 commit aa2ca83

15 files changed: +982 / -83 lines changed

docs/source/_toctree.yml

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@
  - title: 🤗 Accelerate integrations
    sections:
-   - local: accelerate/deepspeed-zero3-offload
+   - local: accelerate/deepspeed
      title: DeepSpeed
    - local: accelerate/fsdp
      title: Fully Sharded Data Parallel

docs/source/accelerate/deepspeed-zero3-offload.md renamed to docs/source/accelerate/deepspeed.md

Lines changed: 159 additions & 4 deletions
@@ -6,7 +6,158 @@ rendered properly in your Markdown viewer.

[DeepSpeed](https://www.deepspeed.ai/) is a library designed for speed and scale for distributed training of large models with billions of parameters. At its core is the Zero Redundancy Optimizer (ZeRO) that shards optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across data parallel processes. This drastically reduces memory usage, allowing you to scale your training to billion parameter models. To unlock even more memory efficiency, ZeRO-Offload reduces GPU compute and memory by leveraging CPU resources during optimization.

-Both of these features are supported in 🤗 Accelerate, and you can use them with 🤗 PEFT. This guide will help you learn how to use our DeepSpeed [training script](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py). You'll configure the script to train a large model for conditional generation with ZeRO-3 and ZeRO-Offload.
+Both of these features are supported in 🤗 Accelerate, and you can use them with 🤗 PEFT.

# Use PEFT and DeepSpeed with ZeRO-3 for finetuning large models on multiple GPUs and multiple nodes

This section of the guide will help you learn how to use our DeepSpeed [training script](https://github.com/huggingface/peft/blob/main/examples/sft/train.py) for performing SFT (supervised fine-tuning). You'll configure the script to fine-tune the Llama 70B model with LoRA and ZeRO-3 on 8x H100 80GB GPUs on a single machine. You can configure it to scale to multiple machines by changing the accelerate config.

## Configuration

Start by running the following command to [create a DeepSpeed configuration file](https://huggingface.co/docs/accelerate/quicktour#launching-your-distributed-script) with 🤗 Accelerate. The `--config_file` flag allows you to save the configuration file to a specific location, otherwise it is saved as a `default_config.yaml` file in the 🤗 Accelerate cache.

The configuration file is used to set the default options when you launch the training script.

```bash
accelerate config --config_file deepspeed_config.yaml
```

You'll be asked a few questions about your setup and to configure the following arguments. In this example, you'll use ZeRO-3, so make sure you pick those options.

```bash
`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them. Pass the same value as you would pass via the cmd argument, else you will encounter a mismatch error.
`gradient_clipping`: Enable gradient clipping with value. Don't set this here as you will be passing it via the cmd arguments.
`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2. Set this to `none` as we don't want to enable offloading.
`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3. Set this to `none` as we don't want to enable offloading.
`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3. Set this to `True`.
`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3. Set this to `True`.
`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training. Set this to `bf16`.
```

Once this is done, the corresponding config should look like the one below, and you can find it in the config folder at [deepspeed_config.yaml](https://github.com/huggingface/peft/blob/main/examples/sft/configs/deepspeed_config.yaml):

```yml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
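
If you prefer to configure things in code, the same settings can also be passed to 🤗 Accelerate programmatically via a `DeepSpeedPlugin`. The snippet below is an illustrative sketch (it is not part of the example scripts) that mirrors the YAML above:

```python
# Illustrative sketch only: select ZeRO-3 in code via accelerate's DeepSpeedPlugin,
# mirroring the YAML config above. The example scripts rely on the YAML file instead.
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,                     # shard optimizer states, gradients and parameters
    gradient_accumulation_steps=4,
    offload_optimizer_device="none",  # switch to "cpu" to enable ZeRO-Offload for optimizer states
    offload_param_device="none",      # switch to "cpu" to also offload parameters
    zero3_init_flag=True,             # use deepspeed.zero.Init when constructing the model
    zero3_save_16bit_model=True,      # save consolidated 16-bit weights at the end of training
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin, mixed_precision="bf16")
```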

## Launch command

The launch command is available at [run_peft_deepspeed.sh](https://github.com/huggingface/peft/blob/main/examples/sft/run_peft_deepspeed.sh) and it is also shown below:
```bash
accelerate launch --config_file "configs/deepspeed_config.yaml" train.py \
--seed 100 \
--model_name_or_path "meta-llama/Llama-2-70b-hf" \
--dataset_name "smangrul/ultrachat-10k-chatml" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "llama-sft-lora-deepspeed" \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 4 \
--gradient_checkpointing True \
--use_reentrant False \
--dataset_text_field "content" \
--use_flash_attn True \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization False
```

Notice that we are using LoRA with rank=8, alpha=16 and targeting all linear layers. We are passing the DeepSpeed config file and finetuning the 70B Llama model on a subset of the ultrachat dataset.
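
For reference, these LoRA flags translate roughly into the following PEFT `LoraConfig`. This is a sketch rather than the exact code from `train.py`, and the task type is assumed to be causal language modeling:

```python
# Approximate LoraConfig implied by the command-line flags above (illustrative sketch).
from peft import LoraConfig

peft_config = LoraConfig(
    r=8,                          # --lora_r
    lora_alpha=16,                # --lora_alpha
    lora_dropout=0.1,             # --lora_dropout
    target_modules="all-linear",  # --lora_target_modules: adapt all linear layers
    task_type="CAUSAL_LM",        # assumption: decoder-only Llama SFT
)
```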

## The important parts

Let's dive a little deeper into the script so you can see what's going on, and understand how it works.

The first thing to know is that the script uses DeepSpeed for distributed training, as the DeepSpeed config has been passed. The `SFTTrainer` class handles all the heavy lifting of creating the PEFT model using the peft config that is passed. After that, when you call `trainer.train()`, `SFTTrainer` internally uses 🤗 Accelerate to prepare the model, optimizer, and dataloaders using the DeepSpeed config to create the DeepSpeed engine, which is then trained. The main code snippet is below:

```python
# trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    packing=data_args.packing,
    dataset_kwargs={
        "append_concat_token": data_args.append_concat_token,
        "add_special_tokens": data_args.add_special_tokens,
    },
    dataset_text_field=data_args.dataset_text_field,
    max_seq_length=data_args.max_seq_length,
)
trainer.accelerator.print(f"{trainer.model}")

# train
checkpoint = None
if training_args.resume_from_checkpoint is not None:
    checkpoint = training_args.resume_from_checkpoint
trainer.train(resume_from_checkpoint=checkpoint)

# saving final model
trainer.save_model()
```

## Memory usage

In the above example, the memory consumed per GPU is 64 GB (80%), as seen in the screenshot below:

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/peft_deepspeed_mem_usage.png"/>
</div>
<small>GPU memory usage for the training run</small>
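
If you would rather log this from the script than read it off a monitoring tool, a minimal sketch (not part of the example) using PyTorch's CUDA memory statistics:

```python
# Illustrative sketch: report the peak GPU memory allocated by this process.
import torch

if torch.cuda.is_available():
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak GPU memory allocated on this rank: {peak_gib:.1f} GiB")
```

Note that this tracks PyTorch's own allocations, so it will typically read a bit lower than what `nvidia-smi` shows.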

## More resources
You can also refer to this blog post, [Falcon 180B Finetuning using 🤗 PEFT and DeepSpeed](https://medium.com/@sourabmangrulkar/falcon-180b-finetuning-using-peft-and-deepspeed-b92643091d99), on how to finetune the 180B Falcon model on 16 A100 GPUs across 2 machines.

# Use PEFT and DeepSpeed with ZeRO-3 and CPU Offloading for finetuning large models on a single GPU
This section of the guide will help you learn how to use our DeepSpeed [training script](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py). You'll configure the script to train a large model for conditional generation with ZeRO-3 and CPU Offload.

<Tip>

@@ -24,7 +175,7 @@ The configuration file is used to set the default options when you launch the training script.
accelerate config --config_file ds_zero3_cpu.yaml
```

-You'll be asked a few questions about your setup, and configure the following arguments. In this example, you'll use ZeRO-3 and ZeRO-Offload so make sure you pick those options.
+You'll be asked a few questions about your setup, and configure the following arguments. In this example, you'll use ZeRO-3 along with CPU-Offload so make sure you pick those options.

```bash
`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
@@ -105,7 +256,7 @@ model, train_dataloader, eval_dataloader, test_dataloader, optimizer, lr_scheduler
)
```

-The next bit of code checks whether the DeepSpeed plugin is used in the `Accelerator`, and if the plugin exists, then the `Accelerator` uses ZeRO-3 as specified in the configuration file:
+The next bit of code checks whether the DeepSpeed plugin is used in the `Accelerator`, and if the plugin exists, then we check whether we are using ZeRO-3. This conditional flag is then used when calling the `generate` function during inference to sync GPUs when the model parameters are sharded:

```py
is_ds_zero_3 = False
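# The diff hunk is cut off here. As an illustrative sketch of how the script
# typically continues (assumed, not copied from the diff above): the flag is
# derived from the Accelerator's DeepSpeed plugin state ...
if getattr(accelerator.state, "deepspeed_plugin", None):
    is_ds_zero_3 = accelerator.state.deepspeed_plugin.zero_stage == 3

# ... and later passed to `generate` so every rank participates in gathering the
# sharded ZeRO-3 parameters during inference (generation kwargs are illustrative).
outputs = accelerator.unwrap_model(model).generate(
    **batch, synced_gpus=is_ds_zero_3, max_new_tokens=10
)
```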
@@ -164,4 +315,8 @@ CPU Total Peak Memory consumed during the eval (max): 19411
accuracy=100.0
eval_preds[:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
dataset['train'][label_column][:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
```

# Caveats
1. Merging when using PEFT and DeepSpeed is currently unsupported and will raise an error.
2. When using CPU offloading, the major gains from using PEFT to shrink the optimizer states and gradients to the size of the adapter weights are realized in CPU RAM; there won't be savings with respect to GPU memory.
