Tokens per second calculation #2296
@ebsmothers and I had this discussion before. It depends on what we want TPS to indicate. If TPS should indicate the raw speed at which the model processes tokens, regardless of whether they are masked in the loss or not, then yes, we should also include masked tokens in the calculation. If TPS should indicate model convergence speed, then we should keep it as is, since the masked tokens do not contribute to the actual weight updates. I don't have a strong opinion here; from my understanding the standard has been to keep masked tokens in the calculation, but I can see why you might want to see actual training speed. I'm personally on board with showing both active and total TPS like you suggested. I'll let @ebsmothers share his thoughts.
Yeah, I have ranted a bit about this before. I know HF and others tend to report total tokens (and we did that previously), but I think this is especially misleading when we look at the case of packed vs unpacked datasets.

To take a simple example: suppose that half the samples in our dataset have seq len 50 and the other half have seq len 100. If we are training with a batch size of 16 and padding to the max seq length in the batch, then nearly every batch gets padded to length 100 (since almost every batch contains at least one length-100 sample), so each length-50 sample is half padding and roughly a quarter of our throughput is padding. Now suppose we pack to a sequence length of 1000. Then most of our packs will have sequence length 950 or 1000 (if we packed optimally they would all have sequence length 1000, but currently we do it greedily), so with the packed dataset <5% of our tokens will be padding. This means the model will train noticeably faster, since padding is essentially wasted computation, and this is also reflected in end-to-end training time for a single epoch. But if we look at total tokens/sec, we are effectively inflating the unpacked dataset's number by its padding overhead (roughly 1.3x in this example). The example is a bit contrived, but the same thing happens in almost any case where we pack a dataset whose average sample length is much shorter than the packed sequence length. You can test this for yourself by running both types of tok/sec calculations on packed and unpacked versions of alpaca, which has pretty short sequences.

Anyways, sorry, I am quite opinionated on this one 😅. I definitely understand the value of considering all tokens for the purposes of raw throughput, but I claim that we on torchtune should care more about making models train faster than about hitting peak FLOPs, which is why the current definition is the preferable one. Happy to discuss whether it also makes sense to expose total toks/sec, though.
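For anyone who wants to sanity-check the padding overhead in that example, here is a rough, self-contained simulation (the 10k-sample dataset, the pack size of 1000, and the greedy packer below are illustrative stand-ins, not torchtune's actual packing code):

```python
import random

random.seed(0)
lengths = [random.choice([50, 100]) for _ in range(10_000)]  # half len-50, half len-100
real_tokens = sum(lengths)
batch_size = 16

# Unpacked: every batch is padded up to the longest sample in that batch.
unpacked_tokens = 0
for i in range(0, len(lengths), batch_size):
    batch = lengths[i:i + batch_size]
    unpacked_tokens += max(batch) * len(batch)

# Packed: greedily fill packs of 1000 tokens; leftover space in a pack becomes padding.
pack_size, num_packs, current = 1000, 0, 0
for length in lengths:
    if current + length > pack_size:
        num_packs += 1
        current = 0
    current += length
num_packs += 1  # final partially filled pack
packed_tokens = num_packs * pack_size

print(f"unpacked padding fraction: {1 - real_tokens / unpacked_tokens:.1%}")  # ~25%
print(f"packed padding fraction:   {1 - real_tokens / packed_tokens:.1%}")    # a few percent
```

Under these assumptions the unpacked run spends about a quarter of its tokens on padding, while the packed run spends only a few percent.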
Do you mean in terms of convergence? Because I think the throughput should be roughly similar between the two. For my own learning, how do you use the current tps in practice? Is it mostly to verify the difference between packed vs unpacked datasets?
I do think it's important to expose a raw throughput metric.
@EugenHotaj I mean in terms of progress through the dataset. If the dataset has N real tokens, the total number of tokens seen in the unpacked case will be about 1.3*N, while the total number of tokens seen in the packed case will be about 1.05*N. Assuming that both cases complete one epoch in the exact same amount of time (of course they don't), we would claim meaningfully higher throughput on the unpacked dataset just because of inefficient padding. To me this is kind of misleading. Now that I'm off my soapbox: ultimately, what matters more than how torchtune devs use tps is how people finetuning models use tps. So if you think total toks/sec is important to have, I am fine with bringing it back. We do still need the unpadded/unmasked token count for proper normalization of the cross-entropy loss, though. We could define a utility like the following:
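A minimal sketch of what such a utility could look like, assuming labels mark both masked and padded positions with an ignore index (the name count_tokens and the -100 default are placeholders, not torchtune's actual API):

```python
import torch

def count_tokens(labels: torch.Tensor, ignore_index: int = -100) -> tuple[int, int]:
    """Return (active, total) token counts for a batch of labels.

    active: tokens that actually contribute to the cross-entropy loss
            (used for loss normalization and the current toks/sec metric).
    total:  every token position the model processed, including masked/padded ones
            (usable for a raw-throughput toks/sec metric).
    """
    active = (labels != ignore_index).sum().item()
    total = labels.numel()
    return active, total
```

The active count would keep normalizing the loss and feed the current toks/sec metric, while the total count could feed a raw-throughput metric.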
Then we could add e.g. a config field that selects which definition gets logged.
I think there is a small tax to this, as we are adding a separate config for a relatively niche distinction. But it's probably not the end of the world.
If we don't want the extra tax in the config, do you think it makes sense to log both? E.g. we could introduce a new metric alongside the existing one.

Tangential: the motivation around padding makes sense; what I find slightly confusing is that masking makes such a dramatic difference to tps. Although we don't compute the loss on masked tokens, they definitely affect training, both by changing the conditional distribution of downstream tokens in the forward pass and by changing the gradients in the backward pass.
@EugenHotaj good point, I think logging both is reasonable. Do you want to open a PR for it? (Otherwise I can open one when I get some time.) I would lean towards keeping the existing name for the metric we already log and adding a new one for the total count.
I know it's kind of an aside to this issue, but can you elaborate on this point? I assume by "downstream" you mean subsequent tokens in the sequence, right? But in that case, if we always pad on the right side, there are no downstream tokens. Similarly for the grads: zeroing out any loss from padded positions should propagate back up so that the grad of each param is also independent of the padding tokens. I put together a hacky script to test this, and you can see that the losses are the same over the first few steps whether we add extra padding tokens or not.
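A minimal sketch of this kind of check (not the original script; the tiny model, pad id, and hyperparameters are all made up for illustration, and the losses should match up to floating-point noise) could look like:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, PAD_ID, IGNORE = 128, 0, -100

class TinyCausalLM(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, dropout=0.0, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Causal mask: real positions can never attend to the right-padding.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.head(self.encoder(self.emb(tokens), mask=mask))

def train_step(model, opt, tokens, labels):
    logits = model(tokens)
    loss = F.cross_entropy(logits.view(-1, VOCAB), labels.view(-1), ignore_index=IGNORE)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Same real sequence with and without 8 extra right-padding tokens.
real = torch.randint(1, VOCAB, (1, 16))
padded = torch.cat([real, torch.full((1, 8), PAD_ID)], dim=1)
labels_real = real.clone()  # labels are just the tokens themselves; shifting doesn't matter for the comparison
labels_padded = torch.cat([real, torch.full((1, 8), IGNORE)], dim=1)

model_a = TinyCausalLM()
model_b = copy.deepcopy(model_a)  # identical initialization
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)

for step in range(3):
    loss_a = train_step(model_a, opt_a, real, labels_real)
    loss_b = train_step(model_b, opt_b, padded, labels_padded)
    print(f"step {step}: no-pad loss {loss_a:.4f}  padded loss {loss_b:.4f}")
```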
Oh, I'm talking specifically about masking here, not padding tokens; agreed with you on the latter. E.g. take the same multi-turn sample under two different masking strategies: masking everything except the last assistant turn vs masking only the user queries. Although we're processing exactly the same data in both cases, the number of unmasked ("active") tokens, and therefore the reported tps, will be very different.
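To put rough numbers on this, here is a tiny sketch with invented turn lengths (purely illustrative):

```python
# One multi-turn chat sample; turn lengths in tokens are invented for illustration.
turns = {"user_1": 12, "assistant_1": 30, "user_2": 8, "assistant_2": 25}
total_tokens = sum(turns.values())  # 75 tokens go through the model either way

# Strategy A: train only on the final assistant turn.
active_last_turn_only = turns["assistant_2"]                        # 25
# Strategy B: train on every assistant turn (mask only the user queries).
active_all_assistant = turns["assistant_1"] + turns["assistant_2"]  # 55

print(total_tokens, active_last_turn_only, active_all_assistant)
# Same forward/backward compute in both cases, but the "active" toks/sec differs by >2x.
```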
Sure, happy to send one!
Ahh thanks for the example @EugenHotaj, I totally understand your point now. (And I can also see why counting all tokens may be more representative for this case.) Looking forward to the PR!
I've noticed that tokens_per_second fluctuates wildly depending on the dataset, padding, masking strategies, etc. I believe the reason for this is that we only consider unmasked tokens in the tps calculation (here). This is somewhat misleading because the masked tokens are still processed during the forward pass (and possibly the backward pass, though I'm not 100% certain), so we are expending FLOPs that are not being counted in the tps calculation.

This also leads to confusing situations where the exact same dataset, once masking is applied, sees its tps drop precipitously even though the same (or potentially even less) computation is happening under the hood. This makes the metric somewhat meaningless for understanding how fast we are actually training.

I'm happy to send a PR to update this if the team agrees with this take. Another option could be to add both numbers, e.g. active_tps and total_tps.