I accidentally changed the softmax dimension to -2 instead of -1 and got incredibly low losses on both the training and validation sets. However, when generating from the model, I get very low-quality results. What is the explanation?

My guess is that I'm somehow leaking information by taking the softmax in the wrong dimension, which may explain why the training loss is very low. However, I don't quite get why the validation loss would also be low.
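For context, the change is in the attention head's softmax. Here is a minimal sketch of that computation (paraphrased with illustrative sizes and tensor names, not the exact code from gpt.py):

```python
import torch
import torch.nn.functional as F

B, T, C = 4, 8, 32          # batch, time (block size), head size -- illustrative
q = torch.randn(B, T, C)
k = torch.randn(B, T, C)
tril = torch.tril(torch.ones(T, T))

wei = q @ k.transpose(-2, -1) * C**-0.5          # (B, T, T) attention scores
wei = wei.masked_fill(tril == 0, float('-inf'))  # causal mask: hide the future
wei = F.softmax(wei, dim=-1)  # correct: normalise each row (one query's weights)
# the accidental version: F.softmax(wei, dim=-2) normalises each *column* instead
```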
I can't fully explain the low loss; I tried it and got a low loss as well. But what is essentially happening is that you are forcing your attention to prioritise the last character. After applying the mask, if we apply softmax correctly, then in the case where the sequence length is 1, i.e. we only have one character, that character gets weight 1.
In your case of applying softmax incorrectly, when we have a sequence of size context_length (or block_size), you give weight 1 to the last character in the sequence, while the scores for the previous characters will be low. Essentially, you are forcing the model to prioritise the most recent characters, like a uni-directional RNN (last character, then second-to-last character, ..., first character). This is actually not that bad an idea for generalisation, as in most scenarios the last few characters are crucial (I think, lol).
This does not happen when we apply softmax correctly: there, the model actually has to learn from the data that perhaps it should pay more attention to the last few characters.
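A quick way to see this concretely is to apply both softmax directions to a small causally masked score matrix and compare (a standalone toy demo, not code from the repo):

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)
scores = scores.masked_fill(torch.tril(torch.ones(T, T)) == 0, float('-inf'))

row_sm = F.softmax(scores, dim=-1)  # correct: each row sums to 1
col_sm = F.softmax(scores, dim=-2)  # the bug: each *column* sums to 1

print(row_sm.sum(dim=-1))  # tensor([1., 1., 1., 1.]) -- valid attention weights
print(col_sm[:, -1])       # only the last row survives the mask in the last
                           # column, so it gets weight 1: tensor([0., 0., 0., 1.])
print(col_sm.sum(dim=-1))  # rows no longer sum to 1; the weight at position i is
                           # normalised using scores from rows > i, i.e. from
                           # *future* positions
```

Note also what the last print suggests: with dim=-2, each position's weights are normalised using scores computed from later tokens, so under teacher forcing, information from the future leaks into every prediction. That leakage is present in the validation batches just as much as in the training batches, which would explain why both losses look great while generation (where there is no future to leak from) is poor.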
ng-video-lecture/gpt.py, line 85 at commit 5220142
@karpathy Any idea why this is the case?