
DeepMind's settings #229

Open · wants to merge 3 commits into master
Conversation

@nakosung (Contributor) commented Feb 23, 2017

#227

residual ch: 512
skip_ch: 256 (edited)
dilations: 1,..,512 * 3

ibab#227

residual ch: 512
skip_ch: 512
dilations: 1,..,1024 * 3
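
For concreteness, a minimal sketch of how the first block of settings above might look in the params file, using the key names that appear later in this thread (filter_width and dilation_channels are my assumptions, not values given in this PR):

```python
# Hypothetical wavenet_params-style sketch of the first block above
# (residual ch 512, skip_ch 256, 3 stacks of dilations 1..512).
# Only residual_channels, skip_channels and dilations come from this
# thread; filter_width and dilation_channels are assumptions.
deepmind_like_params = {
    "filter_width": 2,            # assumed 2-tap causal filter
    "residual_channels": 512,
    "dilation_channels": 512,     # assumption, not stated in the PR text
    "skip_channels": 256,
    "dilations": [2 ** i for _ in range(3) for i in range(10)],  # 3 x (1..512)
}
```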
@akademi4eg (Collaborator) left a comment

Have you tried to train with these params? Are the results noticeably better in quality?

@nakosung (Contributor, Author)

I have not. 512 channels don't fit on my GPU (Titan Xp).

@veqtor commented Feb 24, 2017

So, if these are DeepMind's settings, they must have parallelised the entire graph, or perhaps it's possible to run on a TPU (I don't know their exact performance relative to GPUs like the Titan Xp)?
Perhaps we could settle for an intermediate 256 residual channels?

@veqtor left a comment

Feedback

"residual_channels": 32,
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

Forgot a comma at the end of the dilations list.

"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
"residual_channels": 512,

What about 256 residual channels? Or 128? 256 is still an 8x increase in channels.

@jyegerlehner (Contributor) commented Feb 24, 2017

That 512 channels was a real eye-opener.

"they must have parallelised the entire graph"

I don't think so. They may have done what we can now do, thanks to some of the more recent changes (mainly koz4k's VALID conv change). If you look at how sample_size is used, we can set it as low as we want. Since the receptive field is now filled (if I'm remembering how that change worked), we can do so without degrading the quality of training. The default 100000 makes our activation tensors 10 times as large as if we specify --sample_size=10000. The effect would be that we march through a single .wav file in smaller chunks. Each training step will involve fewer audio samples and will presumably run faster, and it will take more steps to get through a single file. And the memory required for the activation and gradient tensors will be 10 times smaller. The parameter tensors would stay the same size of course, but the memory usage of the activation tensors dwarfs that of the parameters. You can see that if you look at the size of our checkpoint files; I forget the numbers, but it's some MB, so most of the memory on a 12 GB GPU must be consumed by the activations and gradients, which are proportional in size to sample_size.

So perhaps we could go for the full 512 channels by changing our sample_size default to a smaller number (e.g. 10000). The branch I'm working in doesn't have koz4k's change (at least not yet), so I haven't tried any of this.
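
To make the scaling concrete, a rough back-of-the-envelope sketch (the tensor count per layer and byte sizes are guesses on my part; only the linear proportionality to sample_size is the point):

```python
# Back-of-the-envelope for the activation-memory argument above.
# Tensor count per layer and dtype size are guesses; only the linear
# scaling with sample_size matters.
def rough_activation_bytes(sample_size, layers=30, channels=512,
                           tensors_per_layer=4, bytes_per_value=4):
    return sample_size * layers * channels * tensors_per_layer * bytes_per_value

full = rough_activation_bytes(100_000)   # default sample_size
small = rough_activation_bytes(10_000)   # e.g. --sample_size=10000
print(f"{full / 2**30:.1f} GiB vs {small / 2**30:.1f} GiB ({full // small}x)")
# -> roughly 22.9 GiB vs 2.3 GiB, a 10x difference
```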

"but perhaps it's possible to run on a TPU...?"

I'm pretty sure the TPU is just for inference (to run the deployed nets at lower precision in production, more energy-efficiently). I don't think they train on them.

@nakosung (Contributor, Author)

Also, he mentioned that scalar regression wasn't good enough, so we could eliminate the scalar option to keep the code brief.

@lmartak commented Feb 26, 2017

@nakosung I don't think the scalar option adds enough code for it to be worth eliminating. When training on music data I found it somewhat useful: I noticed faster and overall better convergence compared to one-hot in some of my training sessions.

I also noticed that in the case of scalar == True the input is not mu-law encoded, which is not completely in conformance with the paper. Any thoughts on this?
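
For reference, the mu-law companding from the paper, as a minimal NumPy sketch (the function name and the mu=255 default are mine; the formula is the standard one):

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Mu-law companding from the WaveNet paper:
    f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), for x in [-1, 1]."""
    audio = np.clip(audio, -1.0, 1.0)
    return np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)

# A scalar (non-one-hot) input could still be fed the companded signal
# rather than the raw waveform, which is the discrepancy raised above.
print(mu_law_encode(np.array([-1.0, -0.1, 0.0, 0.1, 1.0])))
# small amplitudes are expanded, so there is more resolution near zero
```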

@dannybtran

Per @jyegerlehner's idea I reduced my sample size to 6200 (a little over two receptive field lengths) and I was able to run it with 512 res channels, 512 dilation channels, and 256 skip channels with 3 stacks of (1,...,512) dilations.

It was on a very small dataset though.
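
A quick sanity check on that "a little over two receptive field lengths" figure, assuming filter width 2 and ignoring the initial filter:

```python
# Receptive field of a stack of 2-tap dilated causal convolutions is
# 1 + sum(dilations); the initial (non-dilated) filter is ignored here.
dilations = [2 ** i for _ in range(3) for i in range(10)]   # 3 x (1..512)
receptive_field = 1 + sum(dilations)
print(receptive_field, 2 * receptive_field)   # 3070 6140, so 6200 is just over 2x
```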

@nakosung (Contributor, Author) commented Mar 1, 2017

Baidu's settings: (https://arxiv.org/pdf/1702.07825.pdf)

20 layers, 32 residual channels, 128 skip channels.
40 layers, 64 residual channels, 256 skip channels.

@cbquillen commented Mar 4, 2017

@Nimeas I've also seen faster and better ultimate convergence with scalar input in my own re-implementation of this. You shouldn't need to quantize the input---that effectively is the same as adding quantization noise, and you would not expect any benefit from that in training, at least not until very late in the process. (There already is plenty of noise---the difference between the prediction of the next sample and the next sample input is effectively the noise the system is seeing during training. It's basically a denoising auto-encoder.)

It's possible to implement training so that there is no penalty from going to very short utterance lengths. You need to carry state for all the convolutions across utterance chunks properly; you then effectively end up training on an infinite utterance length. The downside is that you don't end up training a clean state to start from in generation, because you only see an utterance start once in training. One way of getting around that is to run a big block of zeros through the system to "warm it up" before generating. That way you generate from a reasonable starting state.
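
A minimal NumPy sketch of the state-carrying idea for a single dilated layer with filter width 2 (names and shapes are mine, not from the repo; a real implementation would do this for every layer, in TensorFlow):

```python
import numpy as np

def dilated_conv_step(x, w, dilation, state=None):
    """One causal dilated conv (filter width 2) over a chunk x of shape
    (time, channels_in), carrying the last `dilation` input samples in
    `state` so that chunk boundaries don't truncate the context.
    w has shape (2, channels_in, channels_out)."""
    if state is None:
        state = np.zeros((dilation, x.shape[1]))   # silence before the first chunk
    padded = np.concatenate([state, x], axis=0)
    # y[t] = w[0] . x[t - dilation] + w[1] . x[t]
    y = padded[:x.shape[0]] @ w[0] + padded[dilation:] @ w[1]
    return y, padded[-dilation:]   # new state: tail to prepend to the next chunk

# Feeding a long signal through this chunk by chunk gives the same output
# as feeding it in one go, which is the "infinite utterance length" effect.
```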

So far my optimal settings are:

initial_filter_width = 2 (Similar to Baidu I think.) The 32 in the default parameters doesn't seem to help.
skip_channels = 256 (I was at 512 until I saw the Baidu paper. But 256 seems just as good.)
residual_channels = 64 (I got to that independently of Baidu, for what it's worth.)

I'm also not seeing any loss from dropping a lot of the different dilations, e.g. a stack like:
[[1, 2, 4, 8, 16, 32, 64],
[1, 2, 4, 8, 16, 32, 64, 128],
[1, 2, 4, 8, 16, 32, 64, 128, 256],
[1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
[4, 8, 16, 32, 64, 128, 256, 512],
[16, 32, 64, 128, 256, 512],
[32, 64, 128, 256, 512]]

seems to work fine. The reasoning is that early dilations would do things like simple extrapolations and not require much nonlinearity to work well. The output sees them directly through the skip network, so they provide all the short-term information that you need. Later layers get more nonlinearity and probably only need to see the bigger dilations. That's what I told myself anyway when I tried this, and it worked fine, at least in the cross-entropy sense.
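
For what it's worth, a small sketch flattening that pruned stack into the single flat "dilations" list the params file expects (assuming the nested grouping above is just notation), along with its total context length:

```python
# cbquillen's pruned stack flattened into a single "dilations" list.
stacks = [
    [1, 2, 4, 8, 16, 32, 64],
    [1, 2, 4, 8, 16, 32, 64, 128],
    [1, 2, 4, 8, 16, 32, 64, 128, 256],
    [1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
    [4, 8, 16, 32, 64, 128, 256, 512],
    [16, 32, 64, 128, 256, 512],
    [32, 64, 128, 256, 512],
]
dilations = [d for stack in stacks for d in stack]
print(len(dilations), 1 + sum(dilations))   # 53 layers, receptive field ~4937
```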

@lemonzi (Collaborator) commented Mar 8, 2017

Re: setting these DeepMind settings as the defaults, I think it would be better to leave as the default something that runs on most machines, so that people who clone the repo and want to try it out don't get memory errors or find that it takes forever, and then maybe have a separate file with a more "advanced" configuration.
