
DeepMind's settings #229

Open · wants to merge 3 commits into master
Conversation

@nakosung (Contributor) commented Feb 23, 2017

#227

residual ch: 512
skip_ch: 256 (edited)
dilations: 1,..,512 * 3

ibab#227

residual ch: 512
skip_ch: 512
dilations: 1,..,1024 * 3
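
For concreteness, a minimal sketch of how the first block of settings above might look in the params file, using the key names that appear later in this thread (filter_width and dilation_channels are my assumptions, not values given in this PR):

```python
# Hypothetical wavenet_params-style sketch of the first block above
# (residual ch 512, skip_ch 256, 3 stacks of dilations 1..512).
# Only residual_channels, skip_channels and dilations come from this
# thread; filter_width and dilation_channels are assumptions.
deepmind_like_params = {
    "filter_width": 2,            # assumed 2-tap causal filter
    "residual_channels": 512,
    "dilation_channels": 512,     # assumption, not stated in the PR text
    "skip_channels": 256,
    "dilations": [2 ** i for _ in range(3) for i in range(10)],  # 3 x (1..512)
}
```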
@akademi4eg (Collaborator) left a comment

Have you tried to train with these params? Are the results noticeably better in quality?

@nakosung (Contributor, Author)

I have not. 512 channels don't fit on my GPU (Titan Xp).

@veqtor commented Feb 24, 2017

So, if these are DeepMind's settings, they must have parallelised the entire graph, or perhaps it's possible to run on a TPU (I don't know their exact performance relative to GPUs like the Titan Xp)?
Perhaps we could settle for an intermediate 256 residual channels?

@veqtor left a comment

Feedback

"residual_channels": 32,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

Forgot a comma at the end of the dilations list.

"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
"residual_channels": 512,

What about 256 residual channels? Or 128? 256 is still an 8x increase in channels.

@jyegerlehner (Contributor) commented Feb 24, 2017

That 512 channels was a real eye-opener.

"they must have parallelised the entire graph"

I don't think so. They may have done what we can now do, thanks to some of the more recent changes (mainly koz4k's VALID conv change). If you look at how sample_size is used, we can set it as low as we want. Since the receptive field is now filled (if I'm remembering how that change worked), we can do so without degrading the quality of training. The default 100000 makes our activation tensors 10 times as large as if we specify --sample_size=10000. The effect would be that we march through a single .wav file in smaller chunks. Each training step will involve fewer audio samples and will presumably run faster, and it will take more steps to get through a single file. And the memory required for the activation and gradient tensors will be 10 times smaller. The parameter tensors would stay the same size of course, but the memory usage of the activation tensors dwarfs that of the parameters. You can see that if you look at the size of our checkpoint files; I forget the numbers, but it's some MB, so most of the memory on a 12 GB GPU must be consumed by the activations and gradients, which are proportional in size to sample_size.

So perhaps we could go for the full 512 channels by changing our sample_size default to a smaller number (e.g. 10000). The branch I'm working in doesn't have koz4k's change (at least not yet), so I haven't tried any of this.
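
To make the scaling concrete, a rough back-of-the-envelope sketch (the tensor count per layer and byte sizes are guesses on my part; only the linear proportionality to sample_size is the point):

```python
# Back-of-the-envelope for the activation-memory argument above.
# Tensor count per layer and dtype size are guesses; only the linear
# scaling with sample_size matters.
def rough_activation_bytes(sample_size, layers=30, channels=512,
                           tensors_per_layer=4, bytes_per_value=4):
    return sample_size * layers * channels * tensors_per_layer * bytes_per_value

full = rough_activation_bytes(100_000)   # default sample_size
small = rough_activation_bytes(10_000)   # e.g. --sample_size=10000
print(f"{full / 2**30:.1f} GiB vs {small / 2**30:.1f} GiB ({full // small}x)")
# -> roughly 22.9 GiB vs 2.3 GiB, a 10x difference
```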

"but perhaps it's possible to run on a TPU...?"

I'm pretty sure the TPU is just for inference (to run the deployed nets at lower precision in production, more energy-efficiently). I don't think they train on them.

@nakosung (Contributor, Author)

Also, he mentioned that scalar regression wasn't good enough, so we could eliminate the scalar option to keep the code brief.

@lmartak commented Feb 26, 2017

@nakosung I don't think the scalar option adds enough code for it to be worth eliminating. When training on music data I found it somewhat useful: I noticed faster and overall better convergence compared to one-hot in some of my training sessions.

I also noticed that in the case of scalar == True the input is not mu-law encoded, which is not completely in conformance with the paper. Any thoughts on this?
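
For reference, the mu-law companding from the paper, as a minimal NumPy sketch (the function name and the mu=255 default are mine; the formula is the standard one):

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Mu-law companding from the WaveNet paper:
    f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), for x in [-1, 1]."""
    audio = np.clip(audio, -1.0, 1.0)
    return np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)

# A scalar (non-one-hot) input could still be fed the companded signal
# rather than the raw waveform, which is the discrepancy raised above.
print(mu_law_encode(np.array([-1.0, -0.1, 0.0, 0.1, 1.0])))
# small amplitudes are expanded, so there is more resolution near zero
```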

@dannybtran

Per @jyegerlehner's idea I reduced my sample size to 6200 (a little over two receptive field lengths) and I was able to run it with 512 res channels, 512 dilation channels, and 256 skip channels with 3 stacks of (1,...,512) dilations.

It was on a very small dataset though.
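
A quick sanity check on that "a little over two receptive field lengths" figure, assuming filter width 2 and ignoring the initial filter:

```python
# Receptive field of a stack of 2-tap dilated causal convolutions is
# 1 + sum(dilations); the initial (non-dilated) filter is ignored here.
dilations = [2 ** i for _ in range(3) for i in range(10)]   # 3 x (1..512)
receptive_field = 1 + sum(dilations)
print(receptive_field, 2 * receptive_field)   # 3070 6140, so 6200 is just over 2x
```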

@nakosung (Contributor, Author) commented Mar 1, 2017

Baidu's settings: (https://arxiv.org/pdf/1702.07825.pdf)

20 layers, 32 residual channels, 128 skip channels.
40 layers, 64 residual channels, 256 skip channels.

@cbquillen commented Mar 4, 2017

@Nimeas I've also seen faster and better ultimate convergence with scalar input in my own re-implementation of this. You shouldn't need to quantize the input---that effectively is the same as adding quantization noise, and you would not expect any benefit from that in training, at least not until very late in the process. (There already is plenty of noise---the difference between the prediction of the next sample and the next sample input is effectively the noise the system is seeing during training. It's basically a denoising auto-encoder.)

It's possible to implement training so that there is no penalty from going to very short utterance lengths. You need to carry state for all the convolutions across utterance chunks properly; you then effectively end up training on an infinite utterance length. The downside is that you don't end up training a clean state to start from in generation, because you only see an utterance start once in training. One way of getting around that is to run a big block of zeros through the system to "warm it up" before generating. That way you generate from a reasonable starting state.
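
A minimal NumPy sketch of the state-carrying idea for a single dilated layer with filter width 2 (names and shapes are mine, not from the repo; a real implementation would do this for every layer, in TensorFlow):

```python
import numpy as np

def dilated_conv_step(x, w, dilation, state=None):
    """One causal dilated conv (filter width 2) over a chunk x of shape
    (time, channels_in), carrying the last `dilation` input samples in
    `state` so that chunk boundaries don't truncate the context.
    w has shape (2, channels_in, channels_out)."""
    if state is None:
        state = np.zeros((dilation, x.shape[1]))   # silence before the first chunk
    padded = np.concatenate([state, x], axis=0)
    # y[t] = w[0] . x[t - dilation] + w[1] . x[t]
    y = padded[:x.shape[0]] @ w[0] + padded[dilation:] @ w[1]
    return y, padded[-dilation:]   # new state: tail to prepend to the next chunk

# Feeding a long signal through this chunk by chunk gives the same output
# as feeding it in one go, which is the "infinite utterance length" effect.
```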

So far my optimal settings are:

initial_filter_width = 2 (Similar to Baidu I think.) The 32 in the default parameters doesn't seem to help.
skip_channels = 256 (I was at 512 until I saw the Baidu paper. But 256 seems just as good.)
residual_channels = 64 (I got to that independently of Baidu, for what it's worth.)

I'm also not seeing any loss from dropping a lot of the different dilations, e.g. a stack like:
[[1, 2, 4, 8, 16, 32, 64],
[1, 2, 4, 8, 16, 32, 64, 128],
[1, 2, 4, 8, 16, 32, 64, 128, 256],
[1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
[4, 8, 16, 32, 64, 128, 256, 512],
[16, 32, 64, 128, 256, 512],
[32, 64, 128, 256, 512]]

seems to work fine. The reasoning is that early dilations would do things like simple extrapolations and not require much nonlinearity to work well. The output sees them directly through the skip network, so they provide all the short-term information that you need. Later layers get more nonlinearity and probably only need to see the bigger dilations. That's what I told myself anyway when I tried this, and it worked fine, at least in the cross-entropy sense.
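
For what it's worth, a small sketch flattening that pruned stack into the single flat "dilations" list the params file expects (assuming the nested grouping above is just notation), along with its total context length:

```python
# cbquillen's pruned stack flattened into a single "dilations" list.
stacks = [
    [1, 2, 4, 8, 16, 32, 64],
    [1, 2, 4, 8, 16, 32, 64, 128],
    [1, 2, 4, 8, 16, 32, 64, 128, 256],
    [1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
    [4, 8, 16, 32, 64, 128, 256, 512],
    [16, 32, 64, 128, 256, 512],
    [32, 64, 128, 256, 512],
]
dilations = [d for stack in stacks for d in stack]
print(len(dilations), 1 + sum(dilations))   # 53 layers, receptive field ~4937
```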

@lemonzi (Collaborator) commented Mar 8, 2017

Re: setting these DeepMind settings as the defaults, I think it would be better to leave as the default something that runs on most machines, so that people who clone the repo and want to try it out don't get memory errors or find that it takes forever, and then maybe have a separate file with a more "advanced" configuration.
