Associative Recall task #10
Training the NTM (partial training)

In this experiment, I trained the NTM on sequences of 2 to 6 items, each of length 1 to 3. Earlier experiments on such data showed that the training procedure often gets stuck in a local minimum (which I suspect happens when the NTM figures out where to write its output). At test time, the NTM appears to have learned to write its inputs sequentially in memory this time, but it is still not able to perform consistently well on test data (even on data similar to what it was trained on).
Moreover, when the query corresponds to the 2nd item, the NTM starts reading the addresses corresponding to the 2nd item in memory; after the first few vectors of the query (maybe once the first 1 or 2 vectors of the query match those of the 2nd item), it reads the 127th address. We can notice the same behavior when the NTM first sees the 2nd item, so maybe that is how it remembers that the query really corresponds to the 2nd item. This seems to be confirmed by the generalization example (see below).

Learning curve

Parameters of the experiment

I made a lot of changes to #10 (comment) for this experiment.
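For reference, here is a minimal sketch of how such a training sample could be generated. The bit width, the two delimiter/query channels, and the exact encoding of the query are assumptions for illustration, not necessarily the format used in this repository.

```python
import numpy as np

def associative_recall_sample(num_items=None, item_length=None, width=6, rng=np.random):
    """Build one (input, target) pair for the associative recall task.

    Each item is `item_length` random binary vectors of size `width`.
    Items are announced by a delimiter channel, the query is framed by a
    query channel, and the target is the item that follows the queried one.
    (The channel layout is an assumption made for this sketch.)
    """
    num_items = num_items or rng.randint(2, 7)      # 2 to 6 items
    item_length = item_length or rng.randint(1, 4)  # 1 to 3 vectors per item
    items = rng.randint(0, 2, size=(num_items, item_length, width))

    rows = []
    for item in items:
        delim = np.zeros(width + 2)
        delim[width] = 1                            # item delimiter channel
        rows.append(delim)
        for vec in item:
            rows.append(np.concatenate([vec, [0, 0]]))

    # Query one of the first (num_items - 1) items, framed by the query channel.
    q = rng.randint(0, num_items - 1)
    query_flag = np.zeros(width + 2)
    query_flag[width + 1] = 1
    rows.append(query_flag)
    for vec in items[q]:
        rows.append(np.concatenate([vec, [0, 0]]))
    rows.append(query_flag)

    inputs = np.array(rows)
    target = items[q + 1]                           # the item that followed the query
    return inputs, target
```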
In [1], they suggest that the Associative Recall task is significantly harder to train than the Copy task, and does not always converge.
[1] Wei Zhang, Yang Yu, Structured Memory for Neural Turing Machines [arXiv]
Training the NTM with more iterations (partial training)

Initializing the model with the weights obtained in experiment #10 (comment), I kept the same setup and let the training procedure run longer than just 500k iterations. I let it run for 2M+ iterations to see whether the model is at the moment inherently unable to solve this task, or whether it is simply a matter of training time. The NTM was again trained on sequences of 2 to 6 items, each of length 1 to 3. Compared to the previous experiment, the performance here is far better (albeit with an improvement in the cross-entropy loss of only 1 order of magnitude, from ~1e-2 down to ~1e-3). Here are some examples.
In the NTM paper, DeepMind reports that the network writes a compressed representation of each item when the corresponding delimiter is presented. Here, the NTM does not seem to write any such compressed representations, but instead overwrites the last vector of the item when the delimiter is presented. In that case it may rely heavily on the read weights -- which are activated even while the items are presented -- and may not scale as well as the behavior reported by DeepMind. However, even though most of the test examples are almost perfectly predicted, it still happens that the NTM is unable to make correct predictions, even on test samples similar to the training examples (same range of number of items, same range of item lengths).
Learning curve

Parameters of the experiment

Same parameters as in #10 (comment), with the model initialized with the results given in #10 (comment) -- more or less as if I had left the training procedure running across the two experiments. The training stopped here at 2.1M iterations due to #11.
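For reference, chaining experiments like this only requires dumping and restoring the Lasagne parameter values. A minimal sketch, where the `.npz` filename and the helper names are mine, not the repository's:

```python
import numpy as np
import lasagne

def save_snapshot(output_layer, filename='snapshot.npz'):
    # Dump every parameter of the network to a single .npz file.
    values = lasagne.layers.get_all_param_values(output_layer)
    np.savez(filename, *values)

def load_snapshot(output_layer, filename='snapshot.npz'):
    # Restore the parameters saved by `save_snapshot` (order is preserved).
    with np.load(filename) as f:
        values = [f['arr_%d' % i] for i in range(len(f.files))]
    lasagne.layers.set_all_param_values(output_layer, values)
```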
Training the NTM with 'Copy task' initialization

Similarly to #10 (comment), I decided for this experiment to tune the initialization, and used a set of parameters learned on the Copy task (#6) as initial weights. The intuition is that in the previous experiment the NTM was overwriting the last element of every item with the delimiter, instead of creating a compressed representation of the whole item at another location in memory; the goal here was to encourage the NTM to write all its inputs sequentially in memory. I trained the model on sequences of 2 to 6 items, each of length 1 to 3. It converged quickly compared to previous experiments, even though it sometimes struggled to stay in good minima. The error decreased by around 2 orders of magnitude between the previous experiment and this one. The performance is a lot better than the one observed previously, and the behavior of the NTM matches the results from DeepMind. Here are some examples.

As it reads the query, the NTM knows almost instantly where to look in memory, reads both the query and the answer, and returns the answer. A particular behavior here (which seems to be consistent across all the tests) is that when the model first sees the delimiter for the query, it reads the memory at location 124 and then starts reading the query item in memory. I need to investigate what is really happening at that location. The generalization performance is very good along both dimensions (number of items and item length).
Performance metrics

I now have metrics to evaluate how well the NTM generalizes on this task. I only checked the performance in terms of the number of items for now, with items of length 3. As we can see, the model makes less than 1 bit-per-sequence error on sequences of up to 30 items. The error then explodes because we are reaching the limit on the number of locations in memory. We are actually outperforming DeepMind's results here, since they got a 1 bit-per-sequence error for sequences of 15 items and ~7 bits-per-sequence for sequences of 20 items.

Learning curve

The learning curve suggests that the learning rate is too high.

Parameters of the experiment

I switched back, for the most part, to the setup I used in #8.
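For reference, the bit-per-sequence figures can be computed by binarizing the network's output and counting wrong bits against the target. A minimal sketch, where the 0.5 threshold and the array shapes are assumptions:

```python
import numpy as np

def bits_per_sequence_error(predictions, targets, threshold=0.5):
    """Average number of wrongly predicted bits per sequence.

    `predictions` and `targets` have shape
    (num_sequences, sequence_length, num_bits); predictions are
    probabilities in [0, 1] and targets are binary.
    """
    binarized = (np.asarray(predictions) > threshold).astype(int)
    errors = np.abs(binarized - np.asarray(targets)).sum(axis=(1, 2))
    return errors.mean()
```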
The Adam optimizer seems to be crucial for making the model converge much faster on this task. I have not had any success getting the NTM to converge with RMSProp yet. A smaller learning rate for the Adam optimizer still led to a behavior similar to #10 (comment).
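Switching between the two optimizers is a one-line change with Lasagne's update rules. A minimal sketch, with a toy dense layer standing in for the NTM graph and purely illustrative learning rates:

```python
import theano.tensor as T
import lasagne

# Toy stand-in for the NTM graph, just to make the snippet self-contained.
x = T.matrix('x')
t = T.matrix('t')
l_in = lasagne.layers.InputLayer((None, 8), input_var=x)
l_out = lasagne.layers.DenseLayer(l_in, num_units=8,
                                  nonlinearity=lasagne.nonlinearities.sigmoid)
prediction = lasagne.layers.get_output(l_out)
loss = lasagne.objectives.binary_crossentropy(prediction, t).mean()
params = lasagne.layers.get_all_params(l_out, trainable=True)

# Adam made the model converge much faster in the experiments above.
updates = lasagne.updates.adam(loss, params, learning_rate=1e-3)

# RMSProp has not led to convergence on this task so far:
# updates = lasagne.updates.rmsprop(loss, params, learning_rate=1e-4)
```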
Training the NTM with 'Copy task' initialization (2)

I performed the same experiment as in #10 (comment), but this time I trained the whole model with

Learning curve
The above experiments use only 1 read head & 1 write head, whereas the original paper uses 4 heads (of each?).
Associative Recall task
I will gather all the progress on the Associative Recall task in this issue. I will likely update this issue regularly (hopefully), so you may want to unsubscribe from this issue if you don't want to get all the spam.