Commit 35c37ac: change nprocs to ngpu, add aishell3/voc1
Parent: 58b24aa

96 files changed: +643 -715 lines


examples/aishell3/tts3/README.md (+12 -15)
@@ -67,8 +67,8 @@ Here's the complete help message.
 ```text
 usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                 [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
-                [--device DEVICE] [--nprocs NPROCS] [--verbose VERBOSE]
-                [--phones-dict PHONES_DICT] [--speaker-dict SPEAKER_DICT]
+                [--ngpu NGPU] [--verbose VERBOSE] [--phones-dict PHONES_DICT]
+                [--speaker-dict SPEAKER_DICT]
 
 Train a FastSpeech2 model.
 
@@ -81,8 +81,7 @@ optional arguments:
                         dev data.
   --output-dir OUTPUT_DIR
                         output dir.
-  --device DEVICE       device type to use.
-  --nprocs NPROCS       number of processes.
+  --ngpu NGPU           if ngpu=0, use cpu.
   --verbose VERBOSE     verbose.
   --phones-dict PHONES_DICT
                         phone vocabulary file.
@@ -92,10 +91,9 @@ optional arguments:
 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
 2. `--train-metadata` and `--dev-metadata` should be the metadata files in the normalized subfolders of `train` and `dev` in the `dump` folder.
 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
-4. `--device` is the type of the device to run the experiment, 'cpu' or 'gpu' are supported.
-5. `--nprocs` is the number of processes to run in parallel, note that nprocs > 1 is only supported when `--device` is 'gpu'.
-6. `--phones-dict` is the path of the phone vocabulary file.
-7. `--speaker-dict` is the path of the speaker id map file when training a multi-speaker FastSpeech2.
+4. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
+5. `--phones-dict` is the path of the phone vocabulary file.
+6. `--speaker-dict` is the path of the speaker id map file when training a multi-speaker FastSpeech2.
 
 ### Synthesize
 We use [parallel wavegan](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/voc1) as the neural vocoder.
@@ -122,7 +120,7 @@ usage: synthesize.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG]
                      [--pwg-checkpoint PWG_CHECKPOINT] [--pwg-stat PWG_STAT]
                      [--phones-dict PHONES_DICT] [--speaker-dict SPEAKER_DICT]
                      [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR]
-                     [--device DEVICE] [--verbose VERBOSE]
+                     [--ngpu NGPU] [--verbose VERBOSE]
 
 Synthesize with fastspeech2 & parallel wavegan.
 
@@ -149,8 +147,8 @@ optional arguments:
                         test metadata.
   --output-dir OUTPUT_DIR
                         output dir.
-  --device DEVICE       device type to use.
-  --verbose VERBOSE     verbose.
+  --ngpu NGPU           if ngpu == 0, use cpu.
+  --verbose VERBOSE     verbose.
 ```
 `./local/synthesize_e2e.sh` calls `${BIN_DIR}/multi_spk_synthesize_e2e.py`, which can synthesize waveforms from a text file.
 ```bash
@@ -166,7 +164,7 @@ usage: multi_spk_synthesize_e2e.py [-h]
                                    [--pwg-stat PWG_STAT]
                                    [--phones-dict PHONES_DICT]
                                    [--speaker-dict SPEAKER_DICT] [--text TEXT]
-                                   [--output-dir OUTPUT_DIR] [--device DEVICE]
+                                   [--output-dir OUTPUT_DIR] [--ngpu NGPU]
                                    [--verbose VERBOSE]
 
 Synthesize with fastspeech2 & parallel wavegan.
@@ -193,15 +191,15 @@ optional arguments:
   --text TEXT           text to synthesize, a 'utt_id sentence' pair per line.
   --output-dir OUTPUT_DIR
                         output dir.
-  --device DEVICE       device type to use.
+  --ngpu NGPU           if ngpu == 0, use cpu.
   --verbose VERBOSE     verbose.
 ```
 1. `--fastspeech2-config`, `--fastspeech2-checkpoint`, `--fastspeech2-stat`, `--phones-dict` and `--speaker-dict` are arguments for fastspeech2, which correspond to the 5 files in the fastspeech2 pretrained model.
 2. `--pwg-config`, `--pwg-checkpoint`, `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model.
 3. `--test-metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
 4. `--text` is the text file, which contains sentences to synthesize.
 5. `--output-dir` is the directory to save synthesized audio files.
-6. `--device` is the type of device to run synthesis, 'cpu' and 'gpu' are supported. 'gpu' is recommended for faster synthesis.
+6. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
 
 ## Pretrained Model
 Pretrained FastSpeech2 model with no silence at the edges of audio. [fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_aishell3_ckpt_0.4.zip)
@@ -231,7 +229,6 @@ python3 ${BIN_DIR}/multi_spk_synthesize_e2e.py \
   --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
   --text=${BIN_DIR}/../sentences.txt \
   --output-dir=exp/default/test_e2e \
-  --device="gpu" \
   --phones-dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \
   --speaker-dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt
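For scripts outside this repo that still pass the removed flags, here is a minimal migration sketch (illustrative values based on the `train.sh` change below, not part of the commit):

```bash
# Old: --device="gpu" --nprocs=2   ->   New: --ngpu=2
# Old: --device="cpu"              ->   New: --ngpu=0
python3 ${BIN_DIR}/train.py \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --config=conf/default.yaml \
    --output-dir=exp/default \
    --ngpu=2 \
    --phones-dict=dump/phone_id_map.txt \
    --speaker-dict=dump/speaker_id_map.txt
```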

examples/aishell3/tts3/local/synthesize.sh (-1)

@@ -15,6 +15,5 @@ python3 ${BIN_DIR}/synthesize.py \
   --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
   --test-metadata=dump/test/norm/metadata.jsonl \
   --output-dir=${train_output_path}/test \
-  --device="gpu" \
   --phones-dict=dump/phone_id_map.txt \
   --speaker-dict=dump/speaker_id_map.txt

examples/aishell3/tts3/local/synthesize_e2e.sh (-1)

@@ -15,6 +15,5 @@ python3 ${BIN_DIR}/multi_spk_synthesize_e2e.py \
   --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
   --text=${BIN_DIR}/../sentences.txt \
   --output-dir=${train_output_path}/test_e2e \
-  --device="gpu" \
   --phones-dict=dump/phone_id_map.txt \
   --speaker-dict=dump/speaker_id_map.txt

examples/aishell3/tts3/local/train.sh (+1 -1)

@@ -8,6 +8,6 @@ python3 ${BIN_DIR}/train.py \
   --dev-metadata=dump/dev/norm/metadata.jsonl \
   --config=${config_path} \
   --output-dir=${train_output_path} \
-  --nprocs=2 \
+  --ngpu=2 \
   --phones-dict=dump/phone_id_map.txt \
   --speaker-dict=dump/speaker_id_map.txt
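With `--ngpu=2` the trainer runs data-parallel on two devices; which physical GPUs are used is decided by `CUDA_VISIBLE_DEVICES` at launch time. A sketch of the usual invocation, assuming this example's `run.sh` follows the same pattern as `voc1/run.sh` further down:

```bash
# Train on GPUs 0 and 1; the script itself only sees --ngpu=2.
CUDA_VISIBLE_DEVICES=0,1 ./local/train.sh conf/default.yaml exp/default
```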

examples/aishell3/vc0/README.md (+1 -1)

@@ -28,7 +28,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
     python3 ${BIN_DIR}/../ge2e/inference.py \
         --input=${input} \
         --output=${preprocess_path}/embed \
-        --device="gpu" \
+        --ngpu=1 \
         --checkpoint_path=${ge2e_ckpt_path}
 fi
 ```

examples/aishell3/vc0/local/preprocess.sh (-1)

@@ -12,7 +12,6 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
     python3 ${BIN_DIR}/../../ge2e/inference.py \
         --input=${input} \
         --output=${preprocess_path}/embed \
-        --device="gpu" \
         --checkpoint_path=${ge2e_ckpt_path}
 fi
1817

examples/aishell3/vc0/local/train.sh (+1 -1)

@@ -6,4 +6,4 @@ train_output_path=$2
 python3 ${BIN_DIR}/train.py \
     --data=${preprocess_path} \
     --output=${train_output_path} \
-    --device="gpu"
+    --ngpu=1
examples/aishell3/voc1/conf/default.yaml (+115, new file; name inferred from `conf_path=conf/default.yaml` in run.sh below)

# This is the hyperparameter configuration file for Parallel WaveGAN.
# Please make sure this is adjusted for the AISHELL-3 dataset. If you want to
# apply it to another dataset, you might need to carefully change some parameters.
# This configuration requires 12 GB GPU memory and takes ~3 days on RTX TITAN.

###########################################################
#              FEATURE EXTRACTION SETTING                 #
###########################################################
fs: 24000          # Sampling rate.
n_fft: 2048        # FFT size (in samples).
n_shift: 300       # Hop size (in samples).
win_length: 1200   # Window length (in samples).
                   # If set to null, it will be the same as fft_size.
window: "hann"     # Window function.
n_mels: 80         # Number of mel basis.
fmin: 80           # Minimum frequency in mel basis calculation (Hz).
fmax: 7600         # Maximum frequency in mel basis calculation (Hz).

###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
generator_params:
    in_channels: 1                 # Number of input channels.
    out_channels: 1                # Number of output channels.
    kernel_size: 3                 # Kernel size of dilated convolution.
    layers: 30                     # Number of residual block layers.
    stacks: 3                      # Number of stacks, i.e., dilation cycles.
    residual_channels: 64          # Number of channels in residual conv.
    gate_channels: 128             # Number of channels in gated conv.
    skip_channels: 64              # Number of channels in skip conv.
    aux_channels: 80               # Number of channels for auxiliary feature conv.
                                   # Must be the same as n_mels.
    aux_context_window: 2          # Context window size for auxiliary feature.
                                   # If set to 2, previous 2 and future 2 frames will be considered.
    dropout: 0.0                   # Dropout rate. 0.0 means no dropout applied.
    use_weight_norm: true          # Whether to use weight norm.
                                   # If set to true, it will be applied to all of the conv layers.
    upsample_scales: [4, 5, 3, 5]  # Upsampling scales. Product of these must equal the hop size.

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
discriminator_params:
    in_channels: 1         # Number of input channels.
    out_channels: 1        # Number of output channels.
    kernel_size: 3         # Kernel size of conv layers.
    layers: 10             # Number of conv layers.
    conv_channels: 64      # Number of channels in conv layers.
    bias: true             # Whether to use bias parameters in conv.
    use_weight_norm: true  # Whether to use weight norm.
                           # If set to true, it will be applied to all of the conv layers.
    nonlinear_activation: "LeakyReLU"  # Nonlinear function after each conv.
    nonlinear_activation_params:       # Nonlinear function parameters.
        negative_slope: 0.2            # Alpha in LeakyReLU.

###########################################################
#                   STFT LOSS SETTING                     #
###########################################################
stft_loss_params:
    fft_sizes: [1024, 2048, 512]   # List of FFT sizes for STFT-based loss.
    hop_sizes: [120, 240, 50]      # List of hop sizes for STFT-based loss.
    win_lengths: [600, 1200, 240]  # List of window lengths for STFT-based loss.
    window: "hann"                 # Window function for STFT-based loss.

###########################################################
#               ADVERSARIAL LOSS SETTING                  #
###########################################################
lambda_adv: 4.0  # Loss balancing coefficient.

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 8               # Batch size.
batch_max_steps: 24000      # Length of each audio in batch. Make sure it is divisible by the hop size.
pin_memory: true            # Whether to pin memory in DataLoader.
num_workers: 4              # Number of workers in DataLoader.
remove_short_samples: true  # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true           # Whether to allow cache in dataset. If true, it requires cpu memory.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_params:
    epsilon: 1.0e-6         # Generator's epsilon.
    weight_decay: 0.0       # Generator's weight decay coefficient.
generator_scheduler_params:
    learning_rate: 0.0001   # Generator's learning rate.
    step_size: 200000       # Generator's scheduler step size.
    gamma: 0.5              # Generator's scheduler gamma.
                            # At each step size, lr will be multiplied by this parameter.
generator_grad_norm: 10     # Generator's gradient norm.
discriminator_optimizer_params:
    epsilon: 1.0e-6         # Discriminator's epsilon.
    weight_decay: 0.0       # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
    learning_rate: 0.00005  # Discriminator's learning rate.
    step_size: 200000       # Discriminator's scheduler step size.
    gamma: 0.5              # Discriminator's scheduler gamma.
                            # At each step size, lr will be multiplied by this parameter.
discriminator_grad_norm: 1  # Discriminator's gradient norm.

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
discriminator_train_start_steps: 100000  # Number of steps at which to start training the discriminator.
train_max_steps: 1000000                 # Number of training steps.
save_interval_steps: 5000                # Interval steps to save checkpoint.
eval_interval_steps: 1000                # Interval steps to evaluate the network.

###########################################################
#                     OTHER SETTING                       #
###########################################################
num_save_intermediate_results: 4  # Number of results to be saved as intermediate results.
num_snapshots: 10                 # Max number of snapshots to keep while training.
seed: 42                          # Random seed for paddle, random, and np.random.
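One internal constraint in this config is worth checking explicitly: the product of the generator's `upsample_scales` must equal the hop size, and here it does. A quick shell check (not part of the commit):

```bash
# 4 * 5 * 3 * 5 = 300, which matches n_shift: 300
echo $((4 * 5 * 3 * 5))
```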
examples/aishell3/voc1/local/preprocess.sh (+55, new file; name inferred from the ./local/preprocess.sh call in run.sh below)

#!/bin/bash

stage=0
stop_stage=100

config_path=$1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # get durations from MFA's result
    echo "Generate durations.txt from MFA results ..."
    python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
        --inputdir=./aishell3_alignment_tone \
        --output=durations.txt \
        --config=${config_path}
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # extract features
    echo "Extract features ..."
    python3 ${BIN_DIR}/../preprocess.py \
        --rootdir=~/datasets/data_aishell3/ \
        --dataset=aishell3 \
        --dumpdir=dump \
        --dur-file=durations.txt \
        --config=${config_path} \
        --cut-sil=True \
        --num-cpu=20
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # get features' stats (mean and std)
    echo "Get features' stats ..."
    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="feats"
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # normalize; dev and test should use train's stats
    echo "Normalize ..."

    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --dumpdir=dump/train/norm \
        --stats=dump/train/feats_stats.npy
    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump/dev/raw/metadata.jsonl \
        --dumpdir=dump/dev/norm \
        --stats=dump/train/feats_stats.npy

    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump/test/raw/metadata.jsonl \
        --dumpdir=dump/test/norm \
        --stats=dump/train/feats_stats.npy
fi
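The dump layout this script builds and consumes, reconstructed from the paths above (the stats file location is shown as read by the normalize stage; the exact output path of `compute_statistics.py` is assumed):

```text
dump/
├── train/
│   ├── raw/metadata.jsonl    # stage 1: extracted features
│   ├── feats_stats.npy       # stage 2: train-set mean/std
│   └── norm/metadata.jsonl   # stage 3: normalized with train stats
├── dev/
│   ├── raw/metadata.jsonl
│   └── norm/metadata.jsonl   # normalized with train stats
└── test/
    ├── raw/metadata.jsonl
    └── norm/metadata.jsonl   # normalized with train stats
```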
examples/aishell3/voc1/local/synthesize.sh (+13, new file; name inferred from the ./local/synthesize.sh call in run.sh below)

#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
    --config=${config_path} \
    --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
    --test-metadata=dump/test/norm/metadata.jsonl \
    --output-dir=${train_output_path}/test
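As wired up in run.sh below, the script takes a config path, a training output dir, and a checkpoint name:

```bash
CUDA_VISIBLE_DEVICES=0 ./local/synthesize.sh conf/default.yaml exp/default snapshot_iter_5000.pdz
```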

examples/aishell3/voc1/local/train.sh (+13, new file)

#!/bin/bash

config_path=$1
train_output_path=$2

FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --config=${config_path} \
    --output-dir=${train_output_path} \
    --ngpu=1
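The two `FLAGS_*` variables are PaddlePaddle runtime flags, set per-invocation so they apply only to this process: `FLAGS_cudnn_exhaustive_search` enables exhaustive cuDNN algorithm search, and `FLAGS_conv_workspace_size_limit` caps the cuDNN convolution workspace size (in MB).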

examples/aishell3/voc1/path.sh (+13, new file)

#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`

export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C

export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}

MODEL=parallelwave_gan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}

examples/aishell3/voc1/run.sh (+32, new file)

#!/bin/bash

set -e
source path.sh

gpus=0
stage=0
stop_stage=100

conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_5000.pdz

# With the following command, you can choose the stage range you want to run,
# such as `./run.sh --stage 0 --stop-stage 0`.
# This cannot be mixed with positional arguments `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # prepare data
    ./local/preprocess.sh ${conf_path} || exit -1
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # train model; all `ckpt` files are saved under the `train_output_path/checkpoints/` dir
    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # synthesize
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
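The stage flags make partial runs straightforward; note that the default `ckpt_name` matches the config's `save_interval_steps: 5000`:

```bash
# Preprocess only:
./run.sh --stage 0 --stop-stage 0
# Train, then synthesize with the saved checkpoint:
./run.sh --stage 1 --stop-stage 2
```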
