
Commit c088b9a

add csmsc tacotron2
1 parent fb238d8 commit c088b9a


43 files changed (+3335 −134 lines)

examples/aishell3/tts3/conf/default.yaml

+2 −3

@@ -64,14 +64,14 @@ model:
  pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor
  pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch
  pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch
- stop_gradient_from_pitch_predictor: true # whether to stop the gradient from pitch predictor to encoder
+ stop_gradient_from_pitch_predictor: True # whether to stop the gradient from pitch predictor to encoder
  energy_predictor_layers: 2 # number of conv layers in energy predictor
  energy_predictor_chans: 256 # number of channels of conv layers in energy predictor
  energy_predictor_kernel_size: 3 # kernel size of conv leyers in energy predictor
  energy_predictor_dropout: 0.5 # dropout rate in energy predictor
  energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy
  energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy
- stop_gradient_from_energy_predictor: false # whether to stop the gradient from energy predictor to encoder
+ stop_gradient_from_energy_predictor: False # whether to stop the gradient from energy predictor to encoder
  spk_embed_dim: 256 # speaker embedding dimension
  spk_embed_integration_type: concat # speaker embedding integration type

@@ -84,7 +84,6 @@ updater:
  use_masking: True # whether to apply masking for padded part in loss calculation


-
  ###########################################################
  # OPTIMIZER SETTING #
  ###########################################################

examples/aishell3/vc1/conf/default.yaml

+2 −2

@@ -64,14 +64,14 @@ model:
  pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor
  pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch
  pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch
- stop_gradient_from_pitch_predictor: true # whether to stop the gradient from pitch predictor to encoder
+ stop_gradient_from_pitch_predictor: True # whether to stop the gradient from pitch predictor to encoder
  energy_predictor_layers: 2 # number of conv layers in energy predictor
  energy_predictor_chans: 256 # number of channels of conv layers in energy predictor
  energy_predictor_kernel_size: 3 # kernel size of conv leyers in energy predictor
  energy_predictor_dropout: 0.5 # dropout rate in energy predictor
  energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy
  energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy
- stop_gradient_from_energy_predictor: false # whether to stop the gradient from energy predictor to encoder
+ stop_gradient_from_energy_predictor: False # whether to stop the gradient from energy predictor to encoder
  spk_embed_dim: 256 # speaker embedding dimension
  spk_embed_integration_type: concat # speaker embedding integration type

examples/aishell3/voc1/conf/default.yaml

+3 −3

@@ -33,7 +33,7 @@ generator_params:
  aux_context_window: 2 # Context window size for auxiliary feature.
  # If set to 2, previous 2 and future 2 frames will be considered.
  dropout: 0.0 # Dropout rate. 0.0 means no dropout applied.
- use_weight_norm: true # Whether to use weight norm.
+ use_weight_norm: True # Whether to use weight norm.
  # If set to true, it will be applied to all of the conv layers.
  upsample_scales: [4, 5, 3, 5] # Upsampling scales. prod(upsample_scales) == n_shift

@@ -46,8 +46,8 @@ discriminator_params:
  kernel_size: 3 # Number of output channels.
  layers: 10 # Number of conv layers.
  conv_channels: 64 # Number of chnn layers.
- bias: true # Whether to use bias parameter in conv.
- use_weight_norm: true # Whether to use weight norm.
+ bias: True # Whether to use bias parameter in conv.
+ use_weight_norm: True # Whether to use weight norm.
  # If set to true, it will be applied to all of the conv layers.
  nonlinear_activation: "leakyrelu" # Nonlinear function after each conv.
  nonlinear_activation_params: # Nonlinear function parameters

examples/csmsc/tts0/README.md

+264
@@ -0,0 +1,264 @@
# FastSpeech2 with CSMSC
This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).

## Dataset
### Download and Extract
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/source).

### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.

## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - synthesize waveform from a text file.
5. inference using the static model.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.

```text
dump
├── dev
│   ├── norm
│   └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── energy_stats.npy
    ├── norm
    ├── pitch_stats.npy
    ├── raw
    └── speech_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains the speech, pitch, and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`.

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, the speaker, and the id of each utterance.
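For reference, `metadata.jsonl` stores one JSON object per line. A hypothetical entry is sketched below; the exact key names and values are illustrative assumptions, not copied from a real dump:
```text
{"utt_id": "000001", "phones": ["k", "a2", "er2", "pu3"], "text_lengths": 4, "speech_lengths": 210, "durations": [3, 5, 7, 6], "speech": "dump/train/raw/000001_speech.npy", "pitch": "dump/train/raw/000001_pitch.npy", "energy": "dump/train/raw/000001_energy.npy", "speaker": "baker"}
```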
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--ngpu NGPU] [--phones-dict PHONES_DICT]
                [--speaker-dict SPEAKER_DICT] [--voice-cloning VOICE_CLONING]

Train a FastSpeech2 model.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       fastspeech2 config file.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu=0, use cpu.
  --phones-dict PHONES_DICT
                        phone vocabulary file.
  --speaker-dict SPEAKER_DICT
                        speaker id map file for multiple speaker model.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
5. `--phones-dict` is the path of the phone vocabulary file.
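For orientation, the sketch below shows how these flags might be combined in a direct call to `train.py`. `./local/train.sh` normally assembles this command for you; the paths here simply follow the `dump` layout above and are assumptions, not the script's literal contents.
```bash
# Hypothetical direct invocation; ./local/train.sh normally wires these flags up.
python3 ${BIN_DIR}/train.py \
    --config=conf/default.yaml \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --output-dir=exp/default \
    --ngpu=1 \
    --phones-dict=dump/phone_id_map.txt
```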
### Synthesizing
We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1) as the neural vocoder.
Download the pretrained parallel wavegan model from [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip) and unzip it.
```bash
unzip pwg_baker_ckpt_0.4.zip
```
The Parallel WaveGAN checkpoint contains the files listed below.
```text
pwg_baker_ckpt_0.4
├── pwg_default.yaml               # default config used to train parallel wavegan
├── pwg_snapshot_iter_400000.pdz   # model parameters of parallel wavegan
└── pwg_stats.npy                  # statistics used to normalize spectrogram when training parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h]
                     [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
                     [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                     [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                     [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
                     [--voice-cloning VOICE_CLONING]
                     [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                     [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                     [--voc_stat VOC_STAT] [--ngpu NGPU]
                     [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use deault config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use deault config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --test_metadata TEST_METADATA
                        test metadata.
  --output_dir OUTPUT_DIR
                        output dir.
```
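For orientation, a direct call to `synthesize.py` might look like the following sketch. `./local/synthesize.sh` normally fills these flags in; the checkpoint name and output path below are illustrative assumptions, while the flag names come from the help message above.
```bash
# Hypothetical direct invocation; ./local/synthesize.sh normally assembles these flags.
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
    --am=fastspeech2_csmsc \
    --am_config=conf/default.yaml \
    --am_ckpt=exp/default/checkpoints/snapshot_iter_76000.pdz \
    --am_stat=dump/train/speech_stats.npy \
    --voc=pwgan_csmsc \
    --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
    --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
    --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
    --test_metadata=dump/test/norm/metadata.jsonl \
    --output_dir=exp/default/test \
    --phones_dict=dump/phone_id_map.txt
```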
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from a text file.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h]
                         [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
                         [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                         [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                         [--tones_dict TONES_DICT]
                         [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
                         [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                         [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                         [--voc_stat VOC_STAT] [--lang LANG]
                         [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
                         [--text TEXT] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use deault config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --spk_id SPK_ID       spk id for multi speaker acoustic model
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use deault config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --lang LANG           Choose model language. zh or en
  --inference_dir INFERENCE_DIR
                        dir to save inference models
  --ngpu NGPU           if ngpu == 0, use cpu.
  --text TEXT           text to synthesize, a 'utt_id sentence' pair per line.
  --output_dir OUTPUT_DIR
                        output dir.
```
1. `--am` is the acoustic model type, with the format {model_name}_{dataset}.
2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model.
3. `--voc` is the vocoder type, with the format {model_name}_{dataset}.
4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize (one `utt_id sentence` pair per line; see the example after this list).
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
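As referenced in item 7 above, the file passed to `--text` holds one `utt_id sentence` pair per line. A made-up two-line example for `--lang=zh` (the ids and sentences are purely illustrative):
```text
001 欢迎使用语音合成系统。
002 今天是个好天气。
```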
### Inferencing
After synthesizing, we will get static models of fastspeech2 and pwgan in `${train_output_path}/inference`.
`./local/inference.sh` calls `${BIN_DIR}/inference.py`, which provides a paddle static model inference example for fastspeech2 + pwgan synthesis.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
```
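For reference, a Paddle static model (typically exported via `paddle.jit.save`) consists of a `*.pdmodel` graph file plus a `*.pdiparams` weights file. The base names produced under `${train_output_path}/inference` are assumptions here, not taken from the export logs:
```text
${train_output_path}/inference
├── fastspeech2_csmsc.pdmodel      # acoustic model graph (base name assumed)
├── fastspeech2_csmsc.pdiparams    # acoustic model weights (base name assumed)
├── pwgan_csmsc.pdmodel            # vocoder graph (base name assumed)
└── pwgan_csmsc.pdiparams          # vocoder weights (base name assumed)
```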
## Pretrained Model
Pretrained FastSpeech2 models (trained with no silence at the edges of the audio):
- [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)
- [fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)

The static model can be downloaded here: [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip).

Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss | eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
default| 2(gpu) x 76000|1.0991|0.59132|0.035815|0.31915|0.15287|
conformer| 2(gpu) x 76000|1.0675|0.56103|0.035869|0.31553|0.15509|

The FastSpeech2 checkpoint contains the files listed below.
```text
fastspeech2_nosil_baker_ckpt_0.4
├── default.yaml            # default config used to train fastspeech2
├── phone_id_map.txt        # phone vocabulary file when training fastspeech2
├── snapshot_iter_76000.pdz # model parameters and optimizer states
└── speech_stats.npy        # statistics used to normalize spectrogram when training fastspeech2
```
You can use the following script to synthesize the sentences in `${BIN_DIR}/../sentences.txt` using the pretrained fastspeech2 and parallel wavegan models.
```bash
source path.sh

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
    --am=fastspeech2_csmsc \
    --am_config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
    --am_ckpt=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
    --am_stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
    --voc=pwgan_csmsc \
    --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
    --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
    --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
    --lang=zh \
    --text=${BIN_DIR}/../sentences.txt \
    --output_dir=exp/default/test_e2e \
    --inference_dir=exp/default/inference \
    --phones_dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
```
