Hello,
I have a query about the metric definitions and settings used in the LLM benchmarks from MLCommons (https://mlcommons.org/benchmarks/inference-datacenter/).
For the LLM-Q/A task in the benchmark table at the above link:
- How are TPOT and TTFT calculated? Can you point me to the source code for them? For TPOT, over how many generated tokens is it measured?
- For the OpenOrca dataset, is the input prompt "system_prompt" + "question", or "question" only? The OpenOrca dataset has both columns.
- For the quality metrics in the table, is the ROUGE-1 value the precision, the recall, or the F-measure?
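To make the questions concrete, here is my current understanding of these metrics as a minimal sketch. The TTFT/TPOT formulas and the unigram ROUGE-1 computation below are my assumptions, not taken from the MLPerf source code; I would like to confirm whether they match the official definitions.

```python
from collections import Counter

def ttft(first_token_time: float, request_time: float) -> float:
    # Time To First Token (assumed): latency from issuing the request
    # to receiving the first generated token.
    return first_token_time - request_time

def tpot(total_latency: float, ttft_s: float, n_output_tokens: int) -> float:
    # Time Per Output Token (assumed): decode time averaged over all
    # output tokens after the first one.
    return (total_latency - ttft_s) / (n_output_tokens - 1)

def rouge1(reference: str, candidate: str) -> dict:
    # Unigram ROUGE-1 with per-token clipped overlap; returns all three
    # values so it is clear which one the benchmark table might report.
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": f}
```

For example, `rouge1("the cat sat on the mat", "the cat on the mat")` gives precision 1.0 but recall 5/6, so which of the three is reported materially changes the score.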