
Strange model behavior when taking the softmax in the wrong dimension #42

Cloud299 opened this issue Feb 14, 2024 · 1 comment

@Cloud299

Cloud299 commented Feb 14, 2024

wei = F.softmax(wei, dim=-1) # (B, T, T)

I accidentally changed the softmax dimension to -2 instead of -1 and got incredibly low losses on both the training and validation sets. However, when generating from the model, I get very low-quality results. What is the explanation?

My guess is that I'm somehow leaking information by taking the softmax over the wrong dimension, which may explain why the training loss is very low. However, I don't quite get why the validation loss would also be low.
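A quick way to see that something is off is to check the normalization. With `dim=-1`, each query's attention weights over its allowed keys sum to 1; with `dim=-2`, it is the columns that sum to 1, so the rows no longer do. A minimal sketch (assuming a causal mask over uniform scores, just for illustration):

```python
import torch
import torch.nn.functional as F

T = 3
# Toy attention scores with a causal mask, as in the attention head.
wei = torch.zeros(T, T)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
wei = wei.masked_fill(mask, float('-inf'))

correct = F.softmax(wei, dim=-1)    # normalizes over keys: rows sum to 1
incorrect = F.softmax(wei, dim=-2)  # normalizes over queries: columns sum to 1

print(correct.sum(dim=-1))    # each row sums to 1
print(incorrect.sum(dim=-1))  # rows no longer sum to 1
print(incorrect)              # last column puts weight 1 on the last position
```

Note the causal mask is still respected either way (masked entries stay 0), so future tokens are not directly visible; what changes is how the weights are distributed across positions.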

[image attachment]

@karpathy Any idea why this is the case?

@jaskirat-1998

jaskirat-1998 commented Jun 21, 2024

You have created somewhat of an RNN.

I can't fully explain the reason for the low loss (I tried it and got a low loss as well), but what is essentially happening is that you are forcing your attention to prioritise the last character. After applying the mask, if we apply the softmax correctly, then for the case where the sequence length is 1, i.e. we only have one character, that character gets weight 1.

In your case of applying the softmax incorrectly, when the sequence has length context_length (or block size), you give score 1 to the last character in the sequence, while the scores for the earlier characters will be low. Essentially you are forcing the model to prioritise the most recent characters, like a uni-directional RNN (last character, then second-last character, ..., first character). This is actually not that bad an idea for generalisation, as in most scenarios the last few characters will be the crucial ones (I think, lol).

This does not happen when we apply the softmax correctly, where the model actually has to learn from the data that it should perhaps pay more attention to the last few characters.

correct softmax (masked scores on the left, resulting attention weights on the right):
[0.33, 0.00, 0.00]   [1.00, 0.00, 0.00]
[0.33, 0.33, 0.00]   [0.50, 0.50, 0.00]
[0.33, 0.33, 0.33]   [0.33, 0.33, 0.33]

incorrect softmax (masked scores on the left, resulting attention weights on the right):
[0.33, 0.00, 0.00]   [0.33, 0.0, 0.0]
[0.33, 0.33, 0.00]   [0.33, 0.50, 0.00]
[0.33, 0.33, 0.33]   [0.33, 0.50, 1.00]
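The two weight matrices above can be reproduced directly. A small sketch, assuming uniform scores of 0.33 with future positions masked to -inf (the exact numbers are just this toy example):

```python
import torch
import torch.nn.functional as F

T = 3
# Uniform toy scores, with positions above the diagonal masked out.
scores = torch.full((T, T), 0.33)
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float('-inf'))

correct = F.softmax(scores, dim=-1)  # each row sums to 1: [1,0,0], [0.5,0.5,0], [0.33,0.33,0.33]
wrong = F.softmax(scores, dim=-2)    # each column sums to 1; last column is [0,0,1]

print(correct)
print(wrong)
```

Note how in the wrong version the last character's weight in the last column is exactly 1, which is the "forced to prioritise the last character" behaviour described above.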

This is one example attention result for one of the heads after training with the wrong-direction softmax:
[attention heatmap image]

This is an example attention result for one of the heads after training with the correct direction of softmax:
[attention heatmap image]
