DeepMind's settings #229
base: master
Conversation
ibab#227
residual ch: 512
skip_ch: 512
dilations: 1,..,1024 * 3
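For reference, a sketch of what the proposed parameter file might look like, written out as a Python dict. Only residual_channels, skip_channels, and the dilation stack come from this PR description; the other keys and values are assumptions kept at what I believe are the repository's current defaults.

```python
# Sketch of the proposed wavenet_params.json as a Python dict; values not
# mentioned in the PR description are assumptions, not part of this PR.
proposed_params = {
    "filter_width": 2,                               # assumed default
    "sample_rate": 16000,                            # assumed default
    "quantization_channels": 256,                    # assumed default
    "dilations": [2 ** i for i in range(11)] * 3,    # 1, 2, ..., 1024, repeated 3 times
    "residual_channels": 512,                        # from this PR
    "dilation_channels": 32,                         # not mentioned in the PR; assumed
    "skip_channels": 512,                            # from this PR
    "use_biases": True,                              # assumed default
}
```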
Have you tried to train with these params? Are the results noticeably better in quality?
I have not. 512 channels don't fit on my GPU (TitanXP).
So if these are DeepMind's settings, they must have parallelised the entire graph. Or, I don't know their exact performance relative to GPUs like the TitanXP, but perhaps it's possible to run this on a TPU?
wavenet_params.json
Outdated
"residual_channels": 32, | ||
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, | ||
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, | ||
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024] |
You forgot a comma at the end of the dilations array.
wavenet_params.json
Outdated
"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, | ||
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, | ||
1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024] | ||
"residual_channels": 512, |
What about residual channels 256? 128?
It is still an 8x increase in channels.
That 512 channels was a real eye-opener.
I don't think so. They may have done what we can now do, thanks to some of the more recent changes (mainly koz4k's VALID conv change). If you look at how sample_size is used, we can set it as low as we want. Since the receptive field is now filled (if I'm remembering how that change worked), we can do so without degrading the quality of training. The default of 100000 makes our activation tensors ten times as large as they would be if we specified 10000. So perhaps we could go for the full 512 channels by changing our sample_size default to a smaller number (e.g. 10000). The branch I'm working in doesn't have koz4k's change (at least yet), so I haven't tried any of this.
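To make that concrete, a rough back-of-envelope estimate of activation memory (a sketch only; the tensors-per-layer and bytes-per-float factors are simplifying assumptions, not measurements of this code).

```python
def rough_activation_bytes(sample_size, channels, num_layers,
                           tensors_per_layer=3, bytes_per_float=4):
    # Very rough: each layer holds a few (timesteps x channels) float32
    # tensors (filter, gate, residual) per training example.
    return sample_size * channels * num_layers * tensors_per_layer * bytes_per_float

layers = 3 * 11  # three stacks of dilations 1, 2, ..., 1024
for sample_size in (100000, 10000):
    gb = rough_activation_bytes(sample_size, 512, layers) / 1e9
    print("sample_size=%d: ~%.0f GB of activations" % (sample_size, gb))
# sample_size=100000: ~20 GB; sample_size=10000: ~2 GB (illustrative only)
```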
I'm pretty sure the TPU is just for inference (to run the deployed nets at lower resolution in production, more energy-efficiently). I don't think they train on them.
Also, he mentioned that scalar regression wasn't good enough, so we could eliminate the scalar option to keep the code brief.
@nakosung I don't think the scalar option adds enough code to be worth eliminating. When training on music data I found it somewhat useful: I noticed faster and overall better convergence compared to one-hot in some of my training sessions. I also noticed that in the case of
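For anyone comparing the two input modes being discussed: one-hot input feeds a 256-way mu-law class per timestep, while scalar input feeds a single float amplitude. A minimal NumPy sketch of the difference (function and variable names are mine, not the repository's exact code):

```python
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    """Mu-law companding followed by quantization to integer classes."""
    mu = quantization_channels - 1
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int32)

audio = np.random.uniform(-1, 1, size=8)   # toy waveform in [-1, 1]

# One-hot input: 256 channels per timestep.
classes = mu_law_encode(audio)
one_hot = np.eye(256)[classes]             # shape (8, 256)

# Scalar input: a single channel per timestep.
scalar = audio[:, None]                    # shape (8, 1)
```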
Per @jyegerlehner's idea I reduced my sample size to 6200 (a little over two receptive field lengths) and was able to run it with 512 residual channels, 512 dilation channels, and 256 skip channels, with 3 stacks of (1, ..., 512) dilations. It was on a very small dataset, though.
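Checking that 6200 against the receptive field (a sketch assuming filter_width 2 plus the initial causal convolution):

```python
dilations = [2 ** i for i in range(10)] * 3   # three stacks of 1, 2, ..., 512
filter_width = 2

# Each dilated layer extends the receptive field by (filter_width - 1) * dilation;
# the initial causal convolution adds one more sample.
receptive_field = (filter_width - 1) * sum(dilations) + filter_width
print(receptive_field)      # 3071
print(2 * receptive_field)  # 6142, so 6200 is indeed a little over two receptive fields
```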
Baidu's settings (https://arxiv.org/pdf/1702.07825.pdf): 20 layers, 32 residual channels, 128 skip channels.
@Nimeas I've also seen faster and better ultimate convergence with scalar input in my own re-implementation of this. You shouldn't need to quantize the input: that is effectively the same as adding quantization noise, and you would not expect any benefit from that in training, at least not until very late in the process. (There is already plenty of noise; the difference between the prediction of the next sample and the actual next sample is effectively the noise the system sees during training. It's basically a denoising auto-encoder.)

It's possible to implement training so that there is no penalty for going to very short utterance lengths. You need to carry state for all the convolutions across utterance chunks properly, and you effectively end up training on an infinite utterance length. The downside is that you never train a clean state to start from in generation, because you only see an utterance start once in training. One way of getting around that is to run a big block of zeros through the system to "warm it up" before generating, so that you generate from a reasonable starting state.

So far my optimal settings include initial_filter_width = 2 (similar to Baidu, I think); the 32 in the default parameters doesn't seem to help.

I'm also not seeing any loss from dropping a lot of the different dilations; a stack that keeps only some of them seems to work fine. The reasoning is that the early dilations do things like simple extrapolation and don't require much nonlinearity to work well. The output sees them directly through the skip network, so they provide all the short-term information that you need. Later layers get more nonlinearity and probably only need to see the bigger dilations. That's what I told myself anyway when I tried this, and it worked fine, at least in the cross-entropy sense.
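As an illustration of that thinned-stack idea (a purely hypothetical example, not the stack the commenter actually used): keep one full ladder of dilations for the short-range information that reaches the output through the skip connections, and give the later stacks only the larger dilations.

```python
full_stack = [2 ** i for i in range(10)] * 3               # 30 layers, 1..512 x 3
thinned_stack = ([1, 2, 4, 8, 16, 32, 64, 128, 256, 512]   # one full ladder
                 + [64, 128, 256, 512] * 2)                 # later stacks: large dilations only

def receptive_field(dilations, filter_width=2):
    return (filter_width - 1) * sum(dilations) + filter_width

print(len(full_stack), receptive_field(full_stack))        # 30 layers, 3071 samples
print(len(thinned_stack), receptive_field(thinned_stack))  # 18 layers, 2945 samples
```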
Re: making these DeepMind settings the default, I think it would be better to keep as the default something that runs on most machines, so that people who clone the repo and want to try it out don't hit memory errors or wait forever, and then maybe have a separate file with a more "advanced" configuration.
#227
residual ch: 512
skip_ch: 256 (edited)
dilations: 1,..,512 * 3