Tokens per second calculation #2296
@ebsmothers and I had this discussion before. It depends on what we want TPS to indicate. If TPS should indicate the raw speed at which the model processes tokens, regardless of whether they are masked in the loss or not, then yes, we should also include masked tokens in the calculation. If TPS should indicate model convergence speed, then we should keep it as is, since the masked tokens do not contribute to the actual weight updates. I don't have a strong opinion here; from my understanding the standard has been to keep masked tokens in the calculation, but I can see why you might want to see actual training speed. I'm personally on board with showing both active and total TPS like you suggested. I'll let @ebsmothers share his thoughts.
Yeah, I have ranted a bit about this before. I know HF and others tend to report total tokens (and we did that previously), but I think this is especially misleading when we look at the case of packed vs unpacked datasets.

To take a simple example: suppose that half the samples in our dataset have seq len 50 and the other half have seq len 100. If we are training with a batch size of 16 and padding to the max seq length in the batch, then nearly every batch gets padded to length 100 (since almost every batch contains at least one length-100 sample), so each length-50 sample is half padding and roughly a quarter of our throughput is padding. Now suppose we pack to a sequence length of 1000. Then most of our packs will have sequence length 950 or 1000 (if we packed optimally they would all have sequence length 1000, but currently we do it greedily), so with the packed dataset <5% of our tokens will be padding. This means the model will train noticeably faster, since padding is essentially wasted computation, and this is also reflected in end-to-end training time for a single epoch. But if we look at total tokens/sec, we are effectively inflating the unpacked dataset's number by its padding overhead (roughly 1.3x in this example). The example is a bit contrived, but the same thing happens in almost any case where we pack a dataset whose average sample length is much shorter than the packed sequence length. You can test this for yourself by running both types of tok/sec calculations on packed and unpacked versions of alpaca, which has pretty short sequences.

Anyways, sorry, I am quite opinionated on this one 😅. I definitely understand the value of considering all tokens for the purposes of raw throughput, but I claim that we on torchtune should care more about making models train faster than about hitting peak FLOPs, which is why the current definition is the preferable one. Happy to discuss whether it also makes sense to expose total toks/sec, though.
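For anyone who wants to sanity-check the padding overhead in that example, here is a rough, self-contained simulation (the 10k-sample dataset, the pack size of 1000, and the greedy packer below are illustrative stand-ins, not torchtune's actual packing code):

```python
import random

random.seed(0)
lengths = [random.choice([50, 100]) for _ in range(10_000)]  # half len-50, half len-100
real_tokens = sum(lengths)
batch_size = 16

# Unpacked: every batch is padded up to the longest sample in that batch.
unpacked_tokens = 0
for i in range(0, len(lengths), batch_size):
    batch = lengths[i:i + batch_size]
    unpacked_tokens += max(batch) * len(batch)

# Packed: greedily fill packs of 1000 tokens; leftover space in a pack becomes padding.
pack_size, num_packs, current = 1000, 0, 0
for length in lengths:
    if current + length > pack_size:
        num_packs += 1
        current = 0
    current += length
num_packs += 1  # final partially filled pack
packed_tokens = num_packs * pack_size

print(f"unpacked padding fraction: {1 - real_tokens / unpacked_tokens:.1%}")  # ~25%
print(f"packed padding fraction:   {1 - real_tokens / packed_tokens:.1%}")    # a few percent
```

Under these assumptions the unpacked run spends about a quarter of its tokens on padding, while the packed run spends only a few percent.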
Do you mean in terms of convergence? Because I think the throughput should be roughly similar between the two. For my own learning, how do you use the current tps in practice? Is it mostly to verify the difference between packed vs unpacked datasets?
I do think it's important to expose a raw throughput metric.
@EugenHotaj I mean in terms of progress through the dataset. If the dataset has N real tokens, the total number of tokens seen in the unpacked case will be about 1.3*N, while the total number of tokens seen in the packed case will be about 1.05*N. Assuming that both cases complete one epoch in the exact same amount of time (of course they don't), we would claim meaningfully higher throughput on the unpacked dataset just because of inefficient padding. To me this is kind of misleading. Now that I'm off my soapbox: ultimately, what matters more than how torchtune devs use tps is how people finetuning models use tps. So if you think total toks/sec is important to have, I am fine with bringing it back. We do still need the unpadded/unmasked token count for proper normalization of the cross-entropy loss, though. We could define a utility like the following:
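A minimal sketch of what such a utility could look like, assuming labels mark both masked and padded positions with an ignore index (the name count_tokens and the -100 default are placeholders, not torchtune's actual API):

```python
import torch

def count_tokens(labels: torch.Tensor, ignore_index: int = -100) -> tuple[int, int]:
    """Return (active, total) token counts for a batch of labels.

    active: tokens that actually contribute to the cross-entropy loss
            (used for loss normalization and the current toks/sec metric).
    total:  every token position the model processed, including masked/padded ones
            (usable for a raw-throughput toks/sec metric).
    """
    active = (labels != ignore_index).sum().item()
    total = labels.numel()
    return active, total
```

The active count would keep normalizing the loss and feed the current toks/sec metric, while the total count could feed a raw-throughput metric.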
Then we could add e.g. a config field that selects which definition gets logged.
I think there is a small tax to this, as we are adding a separate config for a relatively niche distinction. But it's probably not the end of the world.
If we don't want the extra tax in the config, do you think it makes sense to log both? E.g. we could introduce a new metric alongside the existing one.

Tangential: the motivation around padding makes sense; what I find slightly confusing is that masking makes such a dramatic difference to tps. Although we don't compute the loss on masked tokens, they definitely affect training, both by changing the conditional distribution of downstream tokens in the forward pass and by changing the gradients in the backward pass.
@EugenHotaj good point, I think logging both is reasonable. Do you want to open a PR for it? (Otherwise I can open one when I get some time.) I would lean towards keeping the existing name for the metric we already log and adding a new one for the total count.
I know it's kind of an aside to this issue, but can you elaborate on this point? I assume by "downstream" you mean subsequent tokens in the sequence, right? But in that case, if we always pad on the right side, there are no downstream tokens. Similarly for the grads: zeroing out any loss from padded positions should propagate back up so that the grad of each param is also independent of the padding tokens. I put together a hacky script to test this, and you can see that the losses are the same over the first few steps whether we add extra padding tokens or not.
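A minimal sketch of this kind of check (not the original script; the tiny model, pad id, and hyperparameters are all made up for illustration, and the losses should match up to floating-point noise) could look like:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, PAD_ID, IGNORE = 128, 0, -100

class TinyCausalLM(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, dropout=0.0, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Causal mask: real positions can never attend to the right-padding.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.head(self.encoder(self.emb(tokens), mask=mask))

def train_step(model, opt, tokens, labels):
    logits = model(tokens)
    loss = F.cross_entropy(logits.view(-1, VOCAB), labels.view(-1), ignore_index=IGNORE)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Same real sequence with and without 8 extra right-padding tokens.
real = torch.randint(1, VOCAB, (1, 16))
padded = torch.cat([real, torch.full((1, 8), PAD_ID)], dim=1)
labels_real = real.clone()  # labels are just the tokens themselves; shifting doesn't matter for the comparison
labels_padded = torch.cat([real, torch.full((1, 8), IGNORE)], dim=1)

model_a = TinyCausalLM()
model_b = copy.deepcopy(model_a)  # identical initialization
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)

for step in range(3):
    loss_a = train_step(model_a, opt_a, real, labels_real)
    loss_b = train_step(model_b, opt_b, padded, labels_padded)
    print(f"step {step}: no-pad loss {loss_a:.4f}  padded loss {loss_b:.4f}")
```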
Oh, I'm talking specifically about masking here, not padding tokens; agreed with you on the latter. E.g. take the same multi-turn sample under two different masking strategies: masking everything except the last assistant turn vs masking only the user queries. Although we're processing exactly the same data in both cases, the number of unmasked ("active") tokens, and therefore the reported tps, will be very different.
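To put rough numbers on this, here is a tiny sketch with invented turn lengths (purely illustrative):

```python
# One multi-turn chat sample; turn lengths in tokens are invented for illustration.
turns = {"user_1": 12, "assistant_1": 30, "user_2": 8, "assistant_2": 25}
total_tokens = sum(turns.values())  # 75 tokens go through the model either way

# Strategy A: train only on the final assistant turn.
active_last_turn_only = turns["assistant_2"]                        # 25
# Strategy B: train on every assistant turn (mask only the user queries).
active_all_assistant = turns["assistant_1"] + turns["assistant_2"]  # 55

print(total_tokens, active_last_turn_only, active_all_assistant)
# Same forward/backward compute in both cases, but the "active" toks/sec differs by >2x.
```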
Sure, happy to send one!
Ahh thanks for the example @EugenHotaj, I totally understand your point now. (And I can also see why counting all tokens may be more representative for this case.) Looking forward to the PR!
I've noticed that tokens_per_second fluctuates wildly depending on the dataset, padding, masking strategies, etc. I believe the reason for this is that we only consider unmasked tokens in the tps calculation (here). This is somewhat misleading because the masked tokens are still processed during the forward pass (and possibly the backward pass, though I'm not 100% certain), so we are expending FLOPs that are not being counted in the tps calculation.

This also leads to confusing situations where the exact same dataset, once masking is applied, sees its tps drop precipitously even though the same (or potentially even less) computation is happening under the hood. This makes the metric somewhat meaningless for understanding how fast we are actually training.

I'm happy to send a PR to update this if the team agrees with this take. Another option could be to add both numbers, e.g. active_tps and total_tps.