Learning the initial weights of the heads leads to NaN #2
Comments
This may be due to the norm constraint on the weights (and the initial weights) being violated during training. The weights are required to sum to one, but the (vanilla) training procedure does not guarantee it. The sum-to-one constraint is critical, since other setups may lead to negative entries in the weights.
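For illustration, a minimal numpy sketch (not the ntm-lasagne code) of how a single unconstrained update can take the weights off the simplex; the gradient values here are made up:

```python
import numpy as np

# Toy illustration, not the ntm-lasagne code: one unconstrained SGD step
# can take the head weights off the probability simplex.
w = np.array([0.25, 0.25, 0.25, 0.25])   # valid weights: non-negative, sum to 1
grad = np.array([0.5, -0.1, -0.2, 0.3])  # hypothetical gradient from backprop
w = w - 1.0 * grad                        # plain update, no projection or renormalization

print(w)        # [-0.25  0.35  0.45 -0.05] -> a negative entry appears
print(w.sum())  # 0.5 -> the weights no longer sum to one
```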
Learning the initial weights might be something we'll eventually need. Initializing them to a uniform probability over all the addresses almost necessarily forces the first step to write in a distributed way (over multiple addresses, instead of hard addressing). Instead of learning the raw weights directly, we could learn an unconstrained parameter weight_init and derive the initial weights as w_0 = normalize(rectify(weight_init)), with an additional […]
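A small sketch of that parameterization in plain numpy (the repo itself builds this as a Theano graph; weight_init below is a hypothetical unconstrained parameter):

```python
import numpy as np

def rectify(x):
    # ReLU: clip negative entries to zero
    return np.maximum(x, 0.0)

def normalize(x, eps=1e-6):
    # rescale so the entries sum to one; eps guards against an all-zero vector
    x = x + eps
    return x / x.sum()

# hypothetical unconstrained parameter that the optimizer is free to update
weight_init = np.array([1.2, -0.7, 0.3, 0.0])

# derived initial weights: always non-negative and summing to one
w_0 = normalize(rectify(weight_init))
print(w_0, w_0.sum())
```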
Hi @tristandeleu, when learning the initial weights, are you making sure they are behind a softmax? In other words, are you learning the initial logits instead? If so, there is no problem if they take negative values. I had this problem in my NTM implementation as well.
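As a reference point, here is the logits-plus-softmax idea in a few lines of numpy (a sketch of the suggestion, not code from either implementation):

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# hypothetical learned logits: unconstrained, negative values are harmless
logits_init = np.array([0.0, -1.3, 2.1, 0.4])

w_0 = softmax(logits_init)
print(w_0, w_0.sum())   # non-negative entries, sum exactly one
```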
When I originally opened this issue I didn't, which was a mistake on my end. I haven't tried to learn the logits, but indeed you're right, I think this is the right solution (I only sketched the idea in this issue).
I'm new to your codebase; could you point me to where the initial weights are defined? I could try to check that out.
The initial weights are defined here: https://github.com/snipsco/ntm-lasagne/blob/master/ntm/heads.py#L102
When setting the learn_init=True parameter on the heads, the error and the parameters of the heads become NaN after a few iterations (not necessarily on the first one, but it can happen after 100+ iterations). How to reproduce it:
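The reproduction snippet itself is not preserved above. As a stand-in, here is a self-contained toy sketch (plain numpy, not the actual ntm-lasagne training loop) of the failure mode: unconstrained updates to the initial weights eventually produce a negative entry, and a sharpening-style step w ** gamma then yields NaN:

```python
import numpy as np

# Toy stand-in for the missing reproduction script, not the ntm-lasagne code:
# the initial head weights w0 are learned directly, with nothing keeping them
# on the probability simplex.
n = 8
w0 = np.full(n, 1.0 / n)             # starts as a valid uniform distribution
gamma = 2.5                           # sharpening exponent from NTM addressing
lr = 0.05
grad = np.linspace(-1.0, 1.0, n)      # fixed, made-up gradient direction

for step in range(1, 101):
    w0 = w0 - lr * grad               # unconstrained update on w0 itself
    sharpened = w0 ** gamma           # fractional power: NaN (and a RuntimeWarning)
                                      # as soon as an entry of w0 goes negative
    if np.isnan(sharpened).any():
        print("NaN in the head weights after", step, "iterations")
        break
```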
This is a non-blocking issue, since learning these weights may not actually make sense (we can just keep the equiprobable initialization as the first step).