
Commit 408235a

Add final training configs as well as release 16 KHz model (#19)
* adding final configs for all models
* changes for 16khz
* add latest version for 16khz model
* update package version

---------

Co-authored-by: Ishaan Kumar <[email protected]>
1 parent 06e8049 commit 408235a


10 files changed: +384 -8 lines changed


README.md

Lines changed: 2 additions & 1 deletion
@@ -32,12 +32,13 @@ pip install git+https://github.com/descriptinc/descript-audio-codec

 ### Weights
 Weights are released as part of this repo under MIT license.
-We release weights for models that can natively support 24kHz and 44.1kHz sampling rates.
+We release weights for models that can natively support 16 kHz, 24kHz, and 44.1kHz sampling rates.
 Weights are automatically downloaded when you first run `encode` or `decode` command. You can cache them using one of the following commands
 ```bash
 python3 -m dac download # downloads the default 44kHz variant
 python3 -m dac download --model_type 44khz # downloads the 44kHz variant
 python3 -m dac download --model_type 24khz # downloads the 24kHz variant
+python3 -m dac download --model_type 16khz # downloads the 16kHz variant
 ```
 We provide a Dockerfile that installs all required dependencies for encoding and decoding. The build process caches the default model weights inside the image. This allows the image to be used without an internet connection. [Please refer to instructions below.](#docker-image)
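As a rough illustration of how the three `--model_type` tags relate to sample rates, here is a minimal sketch that builds the download command for a given rate. `MODEL_TYPES` and `download_command` are hypothetical names introduced only for this example and are not part of the `dac` package; only the command-line form itself comes from the README above.

```python
from typing import List

# Hypothetical lookup table (not part of the dac package): sample rate in Hz
# mapped to the --model_type tag shown in the README download commands.
MODEL_TYPES = {16000: "16khz", 24000: "24khz", 44100: "44khz"}


def download_command(sample_rate: int) -> List[str]:
    """Return the argv list for caching the weights that match sample_rate."""
    if sample_rate not in MODEL_TYPES:
        raise ValueError(f"no released weights for {sample_rate} Hz")
    return ["python3", "-m", "dac", "download", "--model_type", MODEL_TYPES[sample_rate]]


# Print the command for the newly released 16 kHz variant.
print(" ".join(download_command(16000)))
```

The returned list could be passed to `subprocess.run` if the command should actually be executed; the sketch only builds it.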

conf/final/16khz.yml

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
+# Model setup
+DAC.sample_rate: 16000
+DAC.encoder_dim: 64
+DAC.encoder_rates: [2, 4, 5, 8]
+DAC.decoder_dim: 1536
+DAC.decoder_rates: [8, 5, 4, 2]
+
+# Quantization
+DAC.n_codebooks: 12
+DAC.codebook_size: 1024
+DAC.codebook_dim: 8
+DAC.quantizer_dropout: 0.5
+
+# Discriminator
+Discriminator.sample_rate: 16000
+Discriminator.rates: []
+Discriminator.periods: [2, 3, 5, 7, 11]
+Discriminator.fft_sizes: [2048, 1024, 512]
+Discriminator.bands:
+  - [0.0, 0.1]
+  - [0.1, 0.25]
+  - [0.25, 0.5]
+  - [0.5, 0.75]
+  - [0.75, 1.0]
+
+# Optimization
+AdamW.betas: [0.8, 0.99]
+AdamW.lr: 0.0001
+ExponentialLR.gamma: 0.999996
+
+amp: false
+val_batch_size: 100
+device: cuda
+num_iters: 400000
+save_iters: [10000, 50000, 100000, 200000]
+valid_freq: 1000
+sample_freq: 10000
+num_workers: 32
+val_idx: [0, 1, 2, 3, 4, 5, 6, 7]
+seed: 0
+lambdas:
+  mel/loss: 15.0
+  adv/feat_loss: 2.0
+  adv/gen_loss: 1.0
+  vq/commitment_loss: 0.25
+  vq/codebook_loss: 1.0
+
+VolumeNorm.db: [const, -16]
+
+# Transforms
+build_transform.preprocess:
+  - Identity
+build_transform.augment_prob: 0.0
+build_transform.augment:
+  - Identity
+build_transform.postprocess:
+  - VolumeNorm
+  - RescaleAudio
+  - ShiftPhase
+
+# Loss setup
+MultiScaleSTFTLoss.window_lengths: [2048, 512]
+MelSpectrogramLoss.n_mels: [5, 10, 20, 40, 80, 160, 320]
+MelSpectrogramLoss.window_lengths: [32, 64, 128, 256, 512, 1024, 2048]
+MelSpectrogramLoss.mel_fmin: [0, 0, 0, 0, 0, 0, 0]
+MelSpectrogramLoss.mel_fmax: [null, null, null, null, null, null, null]
+MelSpectrogramLoss.pow: 1.0
+MelSpectrogramLoss.clamp_eps: 1.0e-5
+MelSpectrogramLoss.mag_weight: 0.0
+
+# Data
+batch_size: 72
+train/AudioDataset.duration: 0.38
+train/AudioDataset.n_examples: 10000000
+
+val/AudioDataset.duration: 5.0
+val/build_transform.augment_prob: 1.0
+val/AudioDataset.n_examples: 250
+
+test/AudioDataset.duration: 10.0
+test/build_transform.augment_prob: 1.0
+test/AudioDataset.n_examples: 1000
+
+AudioLoader.shuffle: true
+AudioDataset.without_replacement: true
+
+train/build_dataset.folders:
+  speech_fb:
+    - /data/daps/train
+  speech_hq:
+    - /data/vctk
+    - /data/vocalset
+    - /data/read_speech
+    - /data/french_speech
+  speech_uq:
+    - /data/emotional_speech/
+    - /data/common_voice/
+    - /data/german_speech/
+    - /data/russian_speech/
+    - /data/spanish_speech/
+  music_hq:
+    - /data/musdb/train
+  music_uq:
+    - /data/jamendo
+  general:
+    - /data/audioset/data/unbalanced_train_segments/
+    - /data/audioset/data/balanced_train_segments/
+
+val/build_dataset.folders:
+  speech_hq:
+    - /data/daps/val
+  music_hq:
+    - /data/musdb/test
+  general:
+    - /data/audioset/data/eval_segments/
+
+test/build_dataset.folders:
+  speech_hq:
+    - /data/daps/test
+  music_hq:
+    - /data/musdb/test
+  general:
+    - /data/audioset/data/eval_segments/
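The model and quantization settings in this 16 kHz config imply a frame rate and a bitrate, which can be worked out with a little arithmetic. The sketch below assumes that the encoder's total downsampling factor is the product of `DAC.encoder_rates`, and that every codebook contributes `log2(DAC.codebook_size)` bits per latent frame when all codebooks are used. `codec_stats` is a hypothetical helper written for this example, not a function from the repository.

```python
import math


def codec_stats(sample_rate, encoder_rates, n_codebooks, codebook_size):
    """Derive hop length, latent frame rate (Hz), and bitrate (bits/s).

    Assumption: total downsampling is the product of the encoder strides,
    and all codebooks are transmitted for every frame.
    """
    hop_length = math.prod(encoder_rates)
    frame_rate = sample_rate / hop_length
    bits_per_frame = n_codebooks * math.log2(codebook_size)
    return hop_length, frame_rate, frame_rate * bits_per_frame


# Values copied from conf/final/16khz.yml.
hop, fps, bps = codec_stats(16000, [2, 4, 5, 8], 12, 1024)
print(hop, fps, bps)  # 320 50.0 6000.0
```

Under these assumptions the 16 kHz model produces 50 latent frames per second at about 6 kbps. Because `DAC.quantizer_dropout` is 0.5, training also exposes the model to fewer codebooks, so lower bitrates are plausible at inference time.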

conf/final/24khz.yml

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
+# Model setup
+DAC.sample_rate: 24000
+DAC.encoder_dim: 64
+DAC.encoder_rates: [2, 4, 5, 8]
+DAC.decoder_dim: 1536
+DAC.decoder_rates: [8, 5, 4, 2]
+
+# Quantization
+DAC.n_codebooks: 32
+DAC.codebook_size: 1024
+DAC.codebook_dim: 8
+DAC.quantizer_dropout: 0.5
+
+# Discriminator
+Discriminator.sample_rate: 24000
+Discriminator.rates: []
+Discriminator.periods: [2, 3, 5, 7, 11]
+Discriminator.fft_sizes: [2048, 1024, 512]
+Discriminator.bands:
+  - [0.0, 0.1]
+  - [0.1, 0.25]
+  - [0.25, 0.5]
+  - [0.5, 0.75]
+  - [0.75, 1.0]
+
+# Optimization
+AdamW.betas: [0.8, 0.99]
+AdamW.lr: 0.0001
+ExponentialLR.gamma: 0.999996
+
+amp: false
+val_batch_size: 100
+device: cuda
+num_iters: 400000
+save_iters: [10000, 50000, 100000, 200000]
+valid_freq: 1000
+sample_freq: 10000
+num_workers: 32
+val_idx: [0, 1, 2, 3, 4, 5, 6, 7]
+seed: 0
+lambdas:
+  mel/loss: 15.0
+  adv/feat_loss: 2.0
+  adv/gen_loss: 1.0
+  vq/commitment_loss: 0.25
+  vq/codebook_loss: 1.0
+
+VolumeNorm.db: [const, -16]
+
+# Transforms
+build_transform.preprocess:
+  - Identity
+build_transform.augment_prob: 0.0
+build_transform.augment:
+  - Identity
+build_transform.postprocess:
+  - VolumeNorm
+  - RescaleAudio
+  - ShiftPhase
+
+# Loss setup
+MultiScaleSTFTLoss.window_lengths: [2048, 512]
+MelSpectrogramLoss.n_mels: [5, 10, 20, 40, 80, 160, 320]
+MelSpectrogramLoss.window_lengths: [32, 64, 128, 256, 512, 1024, 2048]
+MelSpectrogramLoss.mel_fmin: [0, 0, 0, 0, 0, 0, 0]
+MelSpectrogramLoss.mel_fmax: [null, null, null, null, null, null, null]
+MelSpectrogramLoss.pow: 1.0
+MelSpectrogramLoss.clamp_eps: 1.0e-5
+MelSpectrogramLoss.mag_weight: 0.0
+
+# Data
+batch_size: 72
+train/AudioDataset.duration: 0.38
+train/AudioDataset.n_examples: 10000000
+
+val/AudioDataset.duration: 5.0
+val/build_transform.augment_prob: 1.0
+val/AudioDataset.n_examples: 250
+
+test/AudioDataset.duration: 10.0
+test/build_transform.augment_prob: 1.0
+test/AudioDataset.n_examples: 1000
+
+AudioLoader.shuffle: true
+AudioDataset.without_replacement: true
+
+train/build_dataset.folders:
+  speech_fb:
+    - /data/daps/train
+  speech_hq:
+    - /data/vctk
+    - /data/vocalset
+    - /data/read_speech
+    - /data/french_speech
+  speech_uq:
+    - /data/emotional_speech/
+    - /data/common_voice/
+    - /data/german_speech/
+    - /data/russian_speech/
+    - /data/spanish_speech/
+  music_hq:
+    - /data/musdb/train
+  music_uq:
+    - /data/jamendo
+  general:
+    - /data/audioset/data/unbalanced_train_segments/
+    - /data/audioset/data/balanced_train_segments/
+
+val/build_dataset.folders:
+  speech_hq:
+    - /data/daps/val
+  music_hq:
+    - /data/musdb/test
+  general:
+    - /data/audioset/data/eval_segments/
+
+test/build_dataset.folders:
+  speech_hq:
+    - /data/daps/test
+  music_hq:
+    - /data/musdb/test
+  general:
+    - /data/audioset/data/eval_segments/
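The optimization settings pair `AdamW.lr: 0.0001` with `ExponentialLR.gamma: 0.999996` over `num_iters: 400000`. An exponential schedule multiplies the learning rate by `gamma` on every scheduler step, so the decay can be estimated in closed form. The sketch below assumes the scheduler is stepped once per training iteration, which the config alone does not confirm. `lr_at` is a hypothetical helper for illustration.

```python
def lr_at(step, base_lr=1e-4, gamma=0.999996):
    """Learning rate after `step` scheduler steps of exponential decay.

    Assumption: one scheduler step per training iteration.
    """
    return base_lr * gamma ** step


# Inspect the schedule at a few checkpoints up to num_iters.
for step in (0, 100_000, 200_000, 400_000):
    print(step, lr_at(step))
```

Under that assumption the learning rate ends near 2e-5, roughly one fifth of its starting value, since 0.999996 raised to the 400000th power is close to exp(-1.6).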

conf/final/44khz.yml

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
+# Model setup
+DAC.sample_rate: 44100
+DAC.encoder_dim: 64
+DAC.encoder_rates: [2, 4, 8, 8]
+DAC.decoder_dim: 1536
+DAC.decoder_rates: [8, 8, 4, 2]
+
+# Quantization
+DAC.n_codebooks: 9
+DAC.codebook_size: 1024
+DAC.codebook_dim: 8
+DAC.quantizer_dropout: 0.5
+
+# Discriminator
+Discriminator.sample_rate: 44100
+Discriminator.rates: []
+Discriminator.periods: [2, 3, 5, 7, 11]
+Discriminator.fft_sizes: [2048, 1024, 512]
+Discriminator.bands:
+  - [0.0, 0.1]
+  - [0.1, 0.25]
+  - [0.25, 0.5]
+  - [0.5, 0.75]
+  - [0.75, 1.0]
+
+# Optimization
+AdamW.betas: [0.8, 0.99]
+AdamW.lr: 0.0001
+ExponentialLR.gamma: 0.999996
+
+amp: false
+val_batch_size: 100
+device: cuda
+num_iters: 400000
+save_iters: [10000, 50000, 100000, 200000]
+valid_freq: 1000
+sample_freq: 10000
+num_workers: 32
+val_idx: [0, 1, 2, 3, 4, 5, 6, 7]
+seed: 0
+lambdas:
+  mel/loss: 15.0
+  adv/feat_loss: 2.0
+  adv/gen_loss: 1.0
+  vq/commitment_loss: 0.25
+  vq/codebook_loss: 1.0
+
+VolumeNorm.db: [const, -16]
+
+# Transforms
+build_transform.preprocess:
+  - Identity
+build_transform.augment_prob: 0.0
+build_transform.augment:
+  - Identity
+build_transform.postprocess:
+  - VolumeNorm
+  - RescaleAudio
+  - ShiftPhase
+
+# Loss setup
+MultiScaleSTFTLoss.window_lengths: [2048, 512]
+MelSpectrogramLoss.n_mels: [5, 10, 20, 40, 80, 160, 320]
+MelSpectrogramLoss.window_lengths: [32, 64, 128, 256, 512, 1024, 2048]
+MelSpectrogramLoss.mel_fmin: [0, 0, 0, 0, 0, 0, 0]
+MelSpectrogramLoss.mel_fmax: [null, null, null, null, null, null, null]
+MelSpectrogramLoss.pow: 1.0
+MelSpectrogramLoss.clamp_eps: 1.0e-5
+MelSpectrogramLoss.mag_weight: 0.0
+
+# Data
+batch_size: 72
+train/AudioDataset.duration: 0.38
+train/AudioDataset.n_examples: 10000000
+
+val/AudioDataset.duration: 5.0
+val/build_transform.augment_prob: 1.0
+val/AudioDataset.n_examples: 250
+
+test/AudioDataset.duration: 10.0
+test/build_transform.augment_prob: 1.0
+test/AudioDataset.n_examples: 1000
+
+AudioLoader.shuffle: true
+AudioDataset.without_replacement: true
+
+train/build_dataset.folders:
+  speech_fb:
+    - /data/daps/train
+  speech_hq:
+    - /data/vctk
+    - /data/vocalset
+    - /data/read_speech
+    - /data/french_speech
+  speech_uq:
+    - /data/emotional_speech/
+    - /data/common_voice/
+    - /data/german_speech/
+    - /data/russian_speech/
+    - /data/spanish_speech/
+  music_hq:
+    - /data/musdb/train
+  music_uq:
+    - /data/jamendo
+  general:
+    - /data/audioset/data/unbalanced_train_segments/
+    - /data/audioset/data/balanced_train_segments/
+
+val/build_dataset.folders:
+  speech_hq:
+    - /data/daps/val
+  music_hq:
+    - /data/musdb/test
+  general:
+    - /data/audioset/data/eval_segments/
+
+test/build_dataset.folders:
+  speech_hq:
+    - /data/daps/test
+  music_hq:
+    - /data/musdb/test
+  general:
+    - /data/audioset/data/eval_segments/
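The `lambdas` block in this config assigns a weight to each named loss term. A common way to use such weights is a weighted sum over whichever terms were computed. The sketch below shows that idea under the assumption that the training loop combines losses this way; `total_loss` is a hypothetical helper, and the per-term loss values are made-up placeholders rather than real training numbers.

```python
# Weights copied from the `lambdas` block of conf/final/44khz.yml.
lambdas = {
    "mel/loss": 15.0,
    "adv/feat_loss": 2.0,
    "adv/gen_loss": 1.0,
    "vq/commitment_loss": 0.25,
    "vq/codebook_loss": 1.0,
}


def total_loss(losses, weights):
    """Weighted sum of the loss terms that appear in both dicts."""
    return sum(w * losses[name] for name, w in weights.items() if name in losses)


# Placeholder loss values (all 1.0) chosen only to make the weighting visible.
example_losses = {name: 1.0 for name in lambdas}
print(total_loss(example_losses, lambdas))  # 19.25
```

With equal raw losses, the mel term would account for 15 of the 19.25 total, which shows how strongly these weights favor the mel reconstruction loss.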

dac/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-__version__ = "0.0.4"
+__version__ = "0.0.5"

 # preserved here for legacy reasons
 __model_version__ = "latest"
