
Strange model behavior when taking the softmax in the wrong dimension #42

Cloud299 opened this issue Feb 14, 2024 · 1 comment

@Cloud299

Cloud299 commented Feb 14, 2024

wei = F.softmax(wei, dim=-1) # (B, T, T)

I accidentally changed the softmax dimension to -2 instead of -1 and got incredibly low losses on both the training and validation sets. However, when generating from the model, I get very low-quality results. What is the explanation?

My guess is that I'm somehow leaking information by taking the softmax over the wrong dimension, which may explain why the training loss is very low. However, I don't quite get why the validation loss would also be low.
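A quick way to see that something is off is to check the normalization. With `dim=-1`, each query's attention weights over its allowed keys sum to 1; with `dim=-2`, it is the columns that sum to 1, so the rows no longer do. A minimal sketch (assuming a causal mask over uniform scores, just for illustration):

```python
import torch
import torch.nn.functional as F

T = 3
# Toy attention scores with a causal mask, as in the attention head.
wei = torch.zeros(T, T)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
wei = wei.masked_fill(mask, float('-inf'))

correct = F.softmax(wei, dim=-1)    # normalizes over keys: rows sum to 1
incorrect = F.softmax(wei, dim=-2)  # normalizes over queries: columns sum to 1

print(correct.sum(dim=-1))    # each row sums to 1
print(incorrect.sum(dim=-1))  # rows no longer sum to 1
print(incorrect)              # last column puts weight 1 on the last position
```

Note the causal mask is still respected either way (masked entries stay 0), so future tokens are not directly visible; what changes is how the weights are distributed across positions.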

[image attachment]

@karpathy Any idea why this is the case?

@jaskirat-1998

jaskirat-1998 commented Jun 21, 2024

You have created somewhat of an RNN.

I can't fully explain the reason for the low loss (I tried it and got a low loss as well), but what is essentially happening is that you are forcing your attention to prioritise the last character. After applying the mask, if we apply the softmax correctly, then for the case where the sequence length is 1, i.e. we only have one character, that character gets weight 1.

In your case of applying the softmax incorrectly, when the sequence has length context_length (or block size), you give score 1 to the last character in the sequence, while the scores for the earlier characters will be low. Essentially you are forcing the model to prioritise the most recent characters, like a uni-directional RNN (last character, then second-last character, ..., first character). This is actually not that bad an idea for generalisation, as in most scenarios the last few characters will be the crucial ones (I think, lol).

This does not happen when we apply the softmax correctly, where the model actually has to learn from the data that it should perhaps pay more attention to the last few characters.

correct softmax (masked scores on the left, resulting attention weights on the right):
[0.33, 0.00, 0.00]   [1.00, 0.00, 0.00]
[0.33, 0.33, 0.00]   [0.50, 0.50, 0.00]
[0.33, 0.33, 0.33]   [0.33, 0.33, 0.33]

incorrect softmax (masked scores on the left, resulting attention weights on the right):
[0.33, 0.00, 0.00]   [0.33, 0.0, 0.0]
[0.33, 0.33, 0.00]   [0.33, 0.50, 0.00]
[0.33, 0.33, 0.33]   [0.33, 0.50, 1.00]
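The two weight matrices above can be reproduced directly. A small sketch, assuming uniform scores of 0.33 with future positions masked to -inf (the exact numbers are just this toy example):

```python
import torch
import torch.nn.functional as F

T = 3
# Uniform toy scores, with positions above the diagonal masked out.
scores = torch.full((T, T), 0.33)
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float('-inf'))

correct = F.softmax(scores, dim=-1)  # each row sums to 1: [1,0,0], [0.5,0.5,0], [0.33,0.33,0.33]
wrong = F.softmax(scores, dim=-2)    # each column sums to 1; last column is [0,0,1]

print(correct)
print(wrong)
```

Note how in the wrong version the last character's weight in the last column is exactly 1, which is the "forced to prioritise the last character" behaviour described above.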

This is one example attention result for one of the heads after training with the wrong-direction softmax:
[attention heatmap image]

This is an example attention result for one of the heads after training with the correct direction of softmax:
[attention heatmap image]
