Commit c7cfc5d

pseeth, prem, and eeishaan authored
Chunked inference for codec (#22)
* Adding some devtools.
* Adding delay calculation
* Chunked inference for codec.
* Version bump
* Removing prod.yml, updating to recent main.
* Turning padding off only when chunking codes.
* Updating README, removing unused things.
* Missed a padding.
* Adding some checks to make sure pads are the same.
* Factoring out latent dim, backwards compatible.
* Adding latent dim, and the 44khz 16kbps model config.
* Ran pre-commit.
* Chunked vs unchunked inference.
* Fixing padding stuff.
* n quantizers back in encode
* don't load unsupported versions
* correct docstring
* bitrate config + 16kbps models
* update audiotools dep
* fix argbind issue
* minor correction
* bump version
* change model path
* update audiotools deps

---------

Co-authored-by: prem <[email protected]>
Co-authored-by: Ishaan Kumar <[email protected]>
1 parent 0202b4a commit c7cfc5d

17 files changed, +703 -337 lines changed

Dockerfile.dev

+13
@@ -0,0 +1,13 @@
+ARG IMAGE=pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
+ARG GITHUB_TOKEN=none
+
+FROM $IMAGE
+
+RUN echo machine github.com login ${GITHUB_TOKEN} > ~/.netrc
+
+COPY requirements.txt /requirements.txt
+
+RUN apt update && apt install -y git
+
+# install the package
+RUN pip install --upgrade -r /requirements.txt

README.md

+46 -15
@@ -66,33 +66,42 @@ for more options.
 ### Programmatic Usage
 ```py
 import dac
-from dac.utils import load_model
-from dac.model import DAC
-
-from dac.utils.encode import process as encode
-from dac.utils.decode import process as decode
-
 from audiotools import AudioSignal

-# Init an empty model
-model = DAC()
+# Download a model
+model_path = dac.utils.download(model_type="44khz")
+model = dac.DAC.load(model_path)

-# Load compatible pre-trained model
-model = load_model(tag="latest", model_type="44khz")
-model.eval()
 model.to('cuda')

 # Load audio signal file
 signal = AudioSignal('input.wav')

-# Encode audio signal
-encoded_out = encode(signal, 'cuda', model)
+# Encode audio signal as one long file
+# (may run out of GPU memory on long files)
+signal.to(model.device)
+
+x = model.preprocess(signal.audio_data, signal.sample_rate)
+z, codes, latents, _, _ = model.encode(x)

 # Decode audio signal
-recon = decode(encoded_out, 'cuda', model, preserve_sample_rate=True)
+y = model.decode(z)
+
+# Alternatively, use the `compress` and `decompress` functions
+# to compress long files.
+
+signal = signal.cpu()
+x = model.compress(signal)
+
+# Save and load to and from disk
+x.save("compressed.dac")
+x = dac.DACFile.load("compressed.dac")
+
+# Decompress it back to an AudioSignal
+y = model.decompress(x)

 # Write to file
-recon.write('recon.wav')
+y.write('output.wav')
 ```

 ### Docker image
@@ -131,6 +140,28 @@ Please install the correct dependencies
 pip install -e ".[dev]"
 ```

+## Environment setup
+
+We have provided a Dockerfile and docker compose setup that makes running experiments easy.
+
+To build the docker image do:
+
+```
+docker compose build
+```
+
+Then, to launch a container, do:
+
+```
+docker compose run -p 8888:8888 -p 6006:6006 dev
+```
+
+The port arguments (`-p`) are optional, but useful if you want to launch a Jupyter and Tensorboard instances within the container. The
+default password for Jupyter is `password`, and the current directory
+is mounted to `/u/home/src`, which also becomes the working directory.
+
+Then, run your training command.
+

 ### Single GPU training
 ```
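
The README change above notes that encoding a long file in one pass may run out of GPU memory, and points to `compress` and `decompress` for long files. One general way to bound memory is to process fixed-size windows one at a time and concatenate the outputs. The following is a minimal sketch of that idea only, not DAC's implementation: `process_in_chunks` and `halve` are hypothetical names, and `halve` is a trivial stand-in for a per-chunk model call.

```python
from typing import Callable, List, Sequence


def process_in_chunks(
    samples: Sequence[float],
    chunk_size: int,
    fn: Callable[[Sequence[float]], List[float]],
) -> List[float]:
    """Apply `fn` to consecutive windows of `samples` and concatenate the outputs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out: List[float] = []
    for start in range(0, len(samples), chunk_size):
        # Only one window is handed to `fn` at a time, which bounds peak memory.
        out.extend(fn(samples[start:start + chunk_size]))
    return out


def halve(chunk: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a per-chunk model call (a pointwise gain)."""
    return [0.5 * s for s in chunk]


signal = [float(i) for i in range(10)]
chunked = process_in_chunks(signal, 4, halve)  # windows of 4, 4, and 2 samples
unchunked = halve(signal)                      # the whole signal in one pass
```

For a pointwise operation like `halve`, chunked and unchunked outputs are identical. A convolutional codec also depends on neighbouring samples at chunk edges, which is presumably why the commit message lists separate work on padding, delay calculation, and a chunked-versus-unchunked comparison.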

conf/final/44khz-16kbps.yml

+124
@@ -0,0 +1,124 @@
+# Model setup
+DAC.sample_rate: 44100
+DAC.encoder_dim: 64
+DAC.encoder_rates: [2, 4, 8, 8]
+DAC.latent_dim: 128
+DAC.decoder_dim: 1536
+DAC.decoder_rates: [8, 8, 4, 2]
+
+# Quantization
+DAC.n_codebooks: 18 # Max bitrate of 16kbps
+DAC.codebook_size: 1024
+DAC.codebook_dim: 8
+DAC.quantizer_dropout: 0.5
+
+# Discriminator
+Discriminator.sample_rate: 44100
+Discriminator.rates: []
+Discriminator.periods: [2, 3, 5, 7, 11]
+Discriminator.fft_sizes: [2048, 1024, 512]
+Discriminator.bands:
+  - [0.0, 0.1]
+  - [0.1, 0.25]
+  - [0.25, 0.5]
+  - [0.5, 0.75]
+  - [0.75, 1.0]
+
+# Optimization
+AdamW.betas: [0.8, 0.99]
+AdamW.lr: 0.0001
+ExponentialLR.gamma: 0.999996
+
+amp: false
+val_batch_size: 100
+device: cuda
+num_iters: 400000
+save_iters: [10000, 50000, 100000, 200000]
+valid_freq: 1000
+sample_freq: 10000
+num_workers: 32
+val_idx: [0, 1, 2, 3, 4, 5, 6, 7]
+seed: 0
+lambdas:
+  mel/loss: 15.0
+  adv/feat_loss: 2.0
+  adv/gen_loss: 1.0
+  vq/commitment_loss: 0.25
+  vq/codebook_loss: 1.0
+
+VolumeNorm.db: [const, -16]
+
+# Transforms
+build_transform.preprocess:
+  - Identity
+build_transform.augment_prob: 0.0
+build_transform.augment:
+  - Identity
+build_transform.postprocess:
+  - VolumeNorm
+  - RescaleAudio
+  - ShiftPhase
+
+# Loss setup
+MultiScaleSTFTLoss.window_lengths: [2048, 512]
+MelSpectrogramLoss.n_mels: [5, 10, 20, 40, 80, 160, 320]
+MelSpectrogramLoss.window_lengths: [32, 64, 128, 256, 512, 1024, 2048]
+MelSpectrogramLoss.mel_fmin: [0, 0, 0, 0, 0, 0, 0]
+MelSpectrogramLoss.mel_fmax: [null, null, null, null, null, null, null]
+MelSpectrogramLoss.pow: 1.0
+MelSpectrogramLoss.clamp_eps: 1.0e-5
+MelSpectrogramLoss.mag_weight: 0.0
+
+# Data
+batch_size: 72
+train/AudioDataset.duration: 0.38
+train/AudioDataset.n_examples: 10000000
+
+val/AudioDataset.duration: 5.0
+val/build_transform.augment_prob: 1.0
+val/AudioDataset.n_examples: 250
+
+test/AudioDataset.duration: 10.0
+test/build_transform.augment_prob: 1.0
+test/AudioDataset.n_examples: 1000
+
+AudioLoader.shuffle: true
+AudioDataset.without_replacement: true
+
+train/build_dataset.folders:
+  speech_fb:
+    - /data/daps/train
+  speech_hq:
+    - /data/vctk
+    - /data/vocalset
+    - /data/read_speech
+    - /data/french_speech
+  speech_uq:
+    - /data/emotional_speech/
+    - /data/common_voice/
+    - /data/german_speech/
+    - /data/russian_speech/
+    - /data/spanish_speech/
+  music_hq:
+    - /data/musdb/train
+  music_uq:
+    - /data/jamendo
+  general:
+    - /data/audioset/data/unbalanced_train_segments/
+    - /data/audioset/data/balanced_train_segments/
+
+val/build_dataset.folders:
+  speech_hq:
+    - /data/daps/val
+  music_hq:
+    - /data/musdb/test
+  general:
+    - /data/audioset/data/eval_segments/
+
+test/build_dataset.folders:
+  speech_hq:
+    - /data/daps/test
+  music_hq:
+    - /data/musdb/test
+  general:
+    - /data/audioset/data/eval_segments/
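
The config above labels 18 codebooks as a "Max bitrate of 16kbps". A rough check of that figure can be worked out from the config values alone. This sketch assumes the usual accounting for a residual vector quantizer: one code per codebook per latent frame, `log2(codebook_size)` bits per code, and a frame rate equal to the sample rate divided by the product of `DAC.encoder_rates`. The variable names are illustrative and are not taken from the repository.

```python
import math

# Values copied from conf/final/44khz-16kbps.yml
sample_rate = 44100
encoder_rates = [2, 4, 8, 8]
n_codebooks = 18
codebook_size = 1024

# Assumption: total downsampling factor is the product of the encoder strides.
hop_length = math.prod(encoder_rates)

# Latent frames produced per second of audio.
frame_rate = sample_rate / hop_length

# Bits needed to index one entry of a codebook.
bits_per_code = math.log2(codebook_size)

# Assumption: every codebook emits one code per frame.
bitrate_bps = n_codebooks * bits_per_code * frame_rate

print(f"hop length: {hop_length} samples")
print(f"frame rate: {frame_rate:.2f} Hz")
print(f"bitrate:    {bitrate_bps / 1000:.2f} kbps")
```

Under these assumptions the result is about 15.5 kbps, which is consistent with the nominal 16 kbps label in the config comment.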

dac/__init__.py

+4 -1
@@ -1,4 +1,4 @@
-__version__ = "0.0.5"
+__version__ = "1.0.0"

 # preserved here for legacy reasons
 __model_version__ = "latest"
@@ -11,3 +11,6 @@

 from . import nn
 from . import model
+from . import utils
+from .model import DAC
+from .model import DACFile

dac/__main__.py

+1 -1
@@ -2,7 +2,7 @@

 import argbind

-from dac.utils import ensure_default_model as download
+from dac.utils import download
 from dac.utils.decode import decode
 from dac.utils.encode import encode

dac/model/__init__.py

+1
@@ -1,3 +1,4 @@
 from .base import CodecMixin
+from .base import DACFile
 from .dac import DAC
 from .discriminator import Discriminator
