PaddlePaddle/LanguageModeling/BERT/README.md (+39 −25)
````diff
@@ -437,6 +437,7 @@ Advanced Training:
   --use-dynamic-loss-scaling
                         Enable dynamic loss scaling in AMP training, only applied when --amp is set. (default: False)
   --use-pure-fp16       Enable pure FP16 training, only applied when --amp is set. (default: False)
+  --fuse-mha            Enable multihead attention fusion. Requires cuDNN version >= 8.9.1.
 ```
 
````
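For orientation, here is a hedged sketch of how the new flag combines with the AMP options above. The `run_pretraining.py` entry point and the minimal argument set are assumptions based on this README's training section, not taken from this diff:

```bash
# Sketch only: enable AMP with dynamic loss scaling plus the new cuDNN
# multihead-attention fusion. The entry point name (run_pretraining.py)
# is assumed; other required arguments (data paths, config) are omitted.
python run_pretraining.py \
    --amp \
    --use-dynamic-loss-scaling \
    --fuse-mha
```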
````diff
@@ -463,6 +464,7 @@ Default arguments are listed below in the order `scripts/run_squad.sh` expects:
 - Enable benchmark - The default is `false`.
 - Benchmark steps - The default is `100`.
 - Benchmark warmup steps - The default is `100`.
+- Fuse MHA - The default is `true`.
 
 The script saves the final checkpoint to the `/results/bert-large-uncased/squad` folder.
 
````
````diff
@@ -593,7 +595,8 @@ bash run_pretraining.sh \
     <bert_config_file> \
     <enable_benchmark> \
     <benchmark_steps> \
-    <benchmark_warmup_steps>
+    <benchmark_warmup_steps> \
+    <fuse_mha>
 ```
 
 Where:
````
````diff
@@ -627,6 +630,7 @@ Where:
 - `masking` LDDL supports both static and dynamic masking. Refer to [LDDL's README](https://github.com/NVIDIA/LDDL/blob/main/README.md) for more information.
 - `<bert_config_file>` is the path to the bert config file.
 - `<enable_benchmark>` a flag to enable benchmark. The train process will warm up for `<benchmark_warmup_steps>` and then measure the throughput of the following `<benchmark_steps>`.
+- `<fuse_mha>` a flag to enable cuDNN MHA fusion.
 
 Note that:
 - If users follow the [Quick Start Guide](#quick-start-guide) to set up the container and dataset, there is no need to set any parameters. For example:
````
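The example itself was collapsed out of this diff view; a minimal sketch consistent with the sentence above, assuming the bare, all-defaults invocation:

```bash
# All arguments fall back to their defaults from the Quick Start setup.
bash run_pretraining.sh
```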
By default, the `mode` argument is set to `train eval`. Refer to the [Quick Start Guide](#quick-start-guide) for explanations of each positional argument.
To benchmark the training performance on a specific batch size for SQuAD, refer to [Fine-tuning](#fine-tuning) and turn on the `<benchmark>` flags. An example call to run training for 200 steps (100 warmup steps and 100 measured steps) and generate throughput numbers:
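The concrete command was also collapsed out of this diff view. A plausible reconstruction, assuming the same positional layout as the `eval` invocation shown below with the mode switched to `train`:

```bash
# Assumed layout: mirrors the eval example later in this diff with the mode
# set to train. The tail arguments are <enable_benchmark>=true,
# <benchmark_steps>=100, <benchmark_warmup_steps>=100, <fuse_mha>=true.
bash scripts/run_squad.sh \
    results/checkpoints \
    train \
    bert_configs/bert-large-uncased.json \
    -1 true 100 100 true
```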
An example call to run inference and generate throughput numbers:
````diff
@@ -854,7 +861,7 @@ bash scripts/run_squad.sh \
     results/checkpoints \
     eval \
     bert_configs/bert-large-uncased.json \
-    -1 true 100 100
+    -1 true 100 100 true
 ```
 
````
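Reading the updated call against the default-argument list earlier in this diff: `true 100 100` are the benchmark enable flag, benchmark steps, and warmup steps, and the appended trailing `true` is the new fuse-MHA toggle.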
````diff
@@ -870,7 +877,7 @@ Our results were obtained by running the `scripts/run_squad.sh` and `scripts/run
 
 | DGX System | GPUs / Node | Precision | Accumulated Batch size / GPU (Phase 1 and Phase 2) | Accumulation steps (Phase 1 and Phase 2) | Final Loss | Time to train (hours) | Time to train speedup (TF32 to mixed precision) |
````