This document provides a summary and detailed explanation of all evaluation metrics used in the framework.
For more detailed documentation regarding which metrics can be used for which tasks and task categories, refer to Task Config Overview.
NOTE: For consistency across metrics, the final reported score of each supported metric is standardized to the range [0.0, 100.0] with 2-decimal precision.
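The standardization described in the note can be sketched as follows. `standardize` is a hypothetical helper (not part of the framework's public API) showing how record-level scores in a raw range would map to the reported [0.0, 100.0] scale:

```python
def standardize(record_scores, raw_max=1.0):
    """Map record-level scores in [0.0, raw_max] to a final score in
    [0.0, 100.0] with 2-decimal precision.

    Hypothetical helper illustrating the note above; the framework's
    actual aggregation code may differ.
    """
    if not record_scores:
        return 0.0
    mean = sum(record_scores) / len(record_scores)
    return round(mean / raw_max * 100.0, 2)

print(standardize([0.5, 1.0, 0.75]))  # 75.0
```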
| Metric Name | Description | Reported metric values |
|---|---|---|
| `word_error_rate_metrics` (↓) | Measures ASR errors via insertions, deletions, and substitutions | `average_sample_wer`, `overall_wer` |
| `diarization_metrics` (↓) | LLM-adaptive diarization-relevant metrics | `avg_sample_wder`, `overall_wder`, `avg_sample_cpwer`, `overall_cpwer`, `avg_speaker_count_absolute_error` |
| `llm_judge_binary` (↑) | Binary LLM-based correctness judgment | `llm_judge_binary` |
| `llm_judge_detailed` (↑) | Detailed scoring across multiple dimensions | `llm_judge_detailed` |
| `llm_judge_big_bench_audio` (↑) | LLM-based evaluations for BigBench-like tasks | `llm_judge_big_bench_audio` |
| `llm_judge_redteaming` (↑) | LLM-based evaluations for red-teaming/safety | `llm_judge_redteaming` |
| `mt_bench_llm_judge` (↑) | LLM-based evaluation for multi-turn systems (i.e. MT-Bench) | `mt_bench_llm_judge` |
| `bleu` (↑) | N-gram overlap score | `bleu` |
| `bertscore` (↑) | Semantic similarity using BERT embeddings | `bertscore` |
| `comet` (↑) | Semantic similarity measure for translation tasks | `comet` |
| `meteor` (↑) | Alignment-based score with synonym handling | `meteor` |
| `bfcl_match_score` (↑) | Structured logic form comparison | `bfcl_match_score` |
| `sql_score` (↑) | SQL correctness and execution match | `text2sql_score` |
| `instruction_following` (↑) | LLM-judged instruction-following capability | `final` |
| `multiple_choice_accuracy` (↑) | Accuracy of predicting the correct option letter in multiple-choice tasks | `multiple_choice_accuracy` |
| `gsm8k_exact_match` (↑) | Exact-match accuracy of the final numerical answer | `gsm8k_exact_match` |
| `joint_goal_accuracy` (↑) | Dialogue state tracking: all slots match | `joint_goal_accuracy` |
| `slot_accuracy` (↑) | Dialogue state tracking: per-slot accuracy | `slot_accuracy` |
| `slot_f1` (↑) | Dialogue state tracking: slot extraction F1 | `slot_f1` |
## word_error_rate_metrics (↓)

- Type: Speech recognition metric
- Description: Measures the correctness of the generated hypothesis against the reference transcript.
- Reported Values:
  - `average_sample_wer`: average of the per-sample WER across the evaluated dataset samples
  - `overall_wer`: (total deletions + insertions + substitutions) / total reference words
- Scoring (record-level): between 0.0 and 1.0 (lower is better)
- Used In: `asr`, `long_form_asr`, `code_switching_asr`
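A minimal sketch of the WER computation (word-level Levenshtein distance over the reference length); production implementations typically also normalize text (casing, punctuation) before scoring:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (deletions + insertions + substitutions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # one insertion over 3 reference words
```

`average_sample_wer` averages this value over samples, while `overall_wer` pools the error and word counts before dividing.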
## diarization_metrics (↓)

- Type: LLM-adaptive diarization-relevant metric
- Description: Measures diarization-relevant metrics for the LLM-generated hypothesis. Metrics are mostly computed on a who-spoke-what basis, avoiding the need for exact timestamp predictions.
- Reported Values:
  - `avg_sample_wder`: average of the per-sample WDER across the evaluated dataset samples
  - `overall_wder`: overall errors / overall total counts
  - `avg_sample_cpwer`: average of the per-sample cpWER across the evaluated dataset samples
  - `overall_cpwer`: overall errors / overall total counts
  - `avg_speaker_count_absolute_error`: mean absolute error (MAE) of the predicted number of speakers
- Scoring (record-level): between 0.0 and 1.0 (lower is better)
- Used In: `speaker_diarization`
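The cpWER idea can be sketched as follows: concatenate each speaker's words, then take the lowest error over all alignments of hypothesis speakers to reference speakers, so timestamps are never needed. This is an illustrative sketch, not the framework's implementation, and it simplifies the handling of mismatched speaker counts:

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (deletions + insertions + substitutions)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            prev, dp[j] = dp[j], min(prev + (ref[i - 1] != hyp[j - 1]),
                                     dp[j] + 1, dp[j - 1] + 1)
    return dp[len(hyp)]

def cpwer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation WER over speaker assignments.

    Extra hypothesis speakers are simply dropped by the permutation
    selection; a real implementation handles this more carefully.
    """
    refs = [r.split() for r in ref_by_speaker.values()]
    hyps = [h.split() for h in hyp_by_speaker.values()]
    while len(hyps) < len(refs):  # pad missing hypothesis speakers
        hyps.append([])
    n_ref_words = sum(len(r) for r in refs) or 1
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps, len(refs))
    )
    return best / n_ref_words

def speaker_count_absolute_error(ref_by_speaker, hyp_by_speaker):
    """Per-sample absolute error of the predicted speaker count."""
    return abs(len(ref_by_speaker) - len(hyp_by_speaker))
```

For example, a hypothesis that swaps speaker labels but transcribes every word correctly still achieves a cpWER of 0.0.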
## llm_judge_binary (↑)

- Type: Binary classification metric
- Description: Judges whether a model output is correct using an LLM.
- Scoring (record-level): `1` for correct, `0` for incorrect. Higher is better.
- Used In: `emotion_recognition`, `accent_recognition`, `gender_recognition`, `intent_classification`, `spoofing`
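A sketch of how binary judge verdicts might be parsed and aggregated. The judge prompt and its response format are framework internals not shown here; this assumes the judge is asked to reply "correct" or "incorrect":

```python
def parse_binary_verdict(judge_response: str) -> int:
    """Map a judge's free-text verdict to 1 (correct) or 0 (incorrect).

    Hypothetical parsing: checks 'incorrect' first so it is not
    mistaken for a match on the substring 'correct'.
    """
    text = judge_response.lower()
    return 1 if "incorrect" not in text and "correct" in text else 0

def llm_judge_binary(verdicts) -> float:
    """Aggregate record-level 0/1 judgments into the reported 0-100 score."""
    scores = [parse_binary_verdict(v) for v in verdicts]
    return round(100.0 * sum(scores) / max(len(scores), 1), 2)

print(llm_judge_binary(["Correct.", "Incorrect.", "correct"]))  # 66.67
```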
## llm_judge_detailed (↑)

- Type: Multi-dimensional judgment metric
- Description: Uses an LLM to assess output quality on attributes such as fluency, relevance, and completeness (with or without a ground-truth reference).
- Scoring (record-level): between `0` and `5` per sample. Higher is better.
- Used In: `spoken_dialogue_summrization`, `scene_understanding`
## llm_judge_big_bench_audio (↑)

- Type: LLM-based QA judgment metric
- Description: Evaluates performance on BigBench-like audio QA tasks.
- Scoring (record-level): `correct` or `incorrect`, based on different aspects of the QA task. Higher is better.
- Used In: `sqqa`
## llm_judge_redteaming (↑)

- Type: LLM-based judgment metric for red-teaming/safety
- Description: Evaluates the safety-related behavior of LALMs.
- Scoring (record-level): `1` for refusing to answer the given audio (correct), `0` for answering it (incorrect). Higher is better.
- Used In: `safety`
## mt_bench_llm_judge (↑)

- Type: LLM-based judgment metric for multi-turn systems
- Description: Evaluates performance on multi-turn conversation tasks (i.e. MT-Bench).
- Scoring (record-level): between `0` and `10` per sample. Higher is better.
- Used In: `mtbench`
## bleu (↑)

- Type: N-gram precision metric
- Description: Measures how many n-grams in the prediction match the reference.
- Scoring (record-level): between `0` and `100`. Higher is better.
- Used In: `translation`
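A simplified sentence-level BLEU sketch (single reference, add-one smoothing) to illustrate the idea; the framework more likely delegates to a standard implementation such as sacreBLEU, which also handles tokenization and corpus-level aggregation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Geometric mean of clipped n-gram precisions times a brevity penalty.

    Illustrative only: +1 smoothing keeps the score defined when an
    n-gram order has no match.
    """
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_n, hyp_n = ngrams(ref, n), ngrams(hyp, n)
        overlap = sum((ref_n & hyp_n).values())  # clipped matches
        total = sum(hyp_n.values())
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return round(100.0 * bp * math.exp(log_prec), 2)

print(sentence_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0
```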
## bertscore (↑)

- Type: Semantic similarity metric
- Description: Uses contextual BERT embeddings to match tokens semantically.
- Scoring (record-level): outputs F1 (between `0` and `1`). Higher is better.
- Used In: `translation`
## comet (↑)

- Type: Semantic similarity metric for translation tasks
- Description: Uses contextual embeddings to compute semantic similarity between the source and target language pair.
- Scoring (record-level): between `0` and `1`. Higher is better.
- Used In: `translation`
## meteor (↑)

- Type: Alignment metric
- Description: Improves on BLEU by considering synonyms, stemming, and paraphrase.
- Scoring (record-level): between `0` and `1`. Higher is better.
- Used In: `translation`
## bfcl_match_score (↑)

- Type: Function calling metric
- Description: Evaluates function calling capabilities via structured logic form comparison.
- Scoring (record-level): between `0` and `1`. Higher is better.
- Used In: Speech Function Calling (`bfcl`)
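The "structured comparison" idea can be sketched with Python's `ast` module: parse both call strings and compare the function name and keyword arguments structurally rather than textually. This is a toy version; the real BFCL matcher performs much more elaborate AST and type checking:

```python
import ast

def calls_match(predicted: str, gold: str) -> int:
    """1 if two function-call strings are structurally equivalent, else 0.

    Sketch only: positional arguments are ignored and values are
    compared via their AST dumps.
    """
    try:
        p = ast.parse(predicted, mode="eval").body
        g = ast.parse(gold, mode="eval").body
    except SyntaxError:
        return 0
    if not (isinstance(p, ast.Call) and isinstance(g, ast.Call)):
        return 0
    same_name = ast.dump(p.func) == ast.dump(g.func)
    p_kwargs = {k.arg: ast.dump(k.value) for k in p.keywords}
    g_kwargs = {k.arg: ast.dump(k.value) for k in g.keywords}
    return int(same_name and p_kwargs == g_kwargs)

print(calls_match("get_weather(city='Paris', unit='C')",
                  "get_weather(unit='C', city='Paris')"))  # 1: argument order is irrelevant
```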
## sql_score (↑)

- Type: Coding correctness metric
- Description: Evaluates the correctness of the generated SQL.
- Scoring (record-level): between `0` and `1`. Higher is better.
- Used In: Speech-to-SQL coding (`speech_to_sql`)
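One common way "execution match" is implemented: run the predicted and gold queries against the same database and compare their result sets. A sketch using SQLite; the framework's `text2sql_score` may additionally check exact or partial string matches:

```python
import sqlite3

def execution_match(predicted_sql: str, gold_sql: str, setup_sql: str) -> int:
    """1 if both queries return the same rows (order-insensitive), else 0.

    Hypothetical helper: builds a throwaway in-memory database from
    `setup_sql`, then executes both queries against it.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)
    try:
        pred_rows = sorted(conn.execute(predicted_sql).fetchall())
        gold_rows = sorted(conn.execute(gold_sql).fetchall())
    except sqlite3.Error:  # malformed predicted SQL counts as incorrect
        return 0
    finally:
        conn.close()
    return int(pred_rows == gold_rows)

setup = "CREATE TABLE t(name TEXT, age INT); INSERT INTO t VALUES ('ann', 30), ('bob', 25);"
print(execution_match("SELECT name FROM t WHERE age > 26",
                      "SELECT name FROM t WHERE age >= 30", setup))  # 1
```

Execution match accepts semantically equivalent queries that an exact-match check would reject, as in the example above.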
## instruction_following (↑)

- Type: Instruction-following evaluation metric
- Description: Measures the instruction-following capabilities of LALMs by averaging accuracy across (1) strict-prompt, (2) strict-instruction, (3) loose-prompt, and (4) loose-instruction.
- Scoring (record-level): between `0` and `1`. Higher is better.
- Used In: Audio Instruction Following (`ifeval`)
## multiple_choice_accuracy (↑)

- Type: Multiple-choice accuracy metric
- Description: Measures the accuracy of predicting the correct option letter in multiple-choice tasks. The correct option is expected in the format `Answer: A`.
- Scoring (record-level): between `0` and `100`. Higher is better.
- Used In: Audio GPQA Diamond (`gpqa_diamond`)
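A sketch of the expected `Answer: A` extraction and the resulting accuracy; the framework's actual parsing of model output may be more forgiving (e.g. case-insensitive or tolerant of extra formatting):

```python
import re

def extract_choice(response: str):
    """Pull the option letter from a response containing 'Answer: X'."""
    m = re.search(r"Answer:\s*([A-J])", response)
    return m.group(1) if m else None

def multiple_choice_accuracy(responses, gold_letters) -> float:
    """Percentage of samples whose extracted letter matches the gold letter."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold_letters))
    return round(100.0 * correct / max(len(gold_letters), 1), 2)

print(multiple_choice_accuracy(
    ["The reasoning points to B. Answer: B", "Answer: C"],
    ["B", "D"],
))  # 50.0
```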
## gsm8k_exact_match (↑)

- Type: Math correctness metric
- Description: Measures the exact-match accuracy of the final numerical answer (expected within `\boxed{}`) against the reference numerical answer.
- Scoring (record-level): between `0` and `100`. Higher is better.
- Used In: Math (`gsm8k`)
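The `\boxed{}` extraction and exact match can be sketched as follows. The normalization shown here (stripping commas and whitespace) is an assumption; the framework may normalize numbers more aggressively:

```python
import re

def extract_boxed(response: str):
    """Return the content of the last \\boxed{...} in a response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def gsm8k_exact_match(response: str, reference: str) -> int:
    """1 if the boxed final answer equals the reference number, else 0."""
    pred = extract_boxed(response)
    norm = lambda s: s.replace(",", "").strip()
    return int(pred is not None and norm(pred) == norm(reference))

print(gsm8k_exact_match(r"... so the total is \boxed{1,250}", "1250"))  # 1
```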
## joint_goal_accuracy (↑)

- Type: Dialogue state tracking metric
- Description: Evaluates whether all predicted slots exactly match the ground-truth dialogue state. A sample scores 1 only if every slot-value pair is correct.
- Scoring (record-level): `0` or `1`. Higher is better.
- Used In: Task-Oriented Dialogue (`spoken_dialogue`)
## slot_accuracy (↑)

- Type: Dialogue state tracking metric
- Description: Computes the proportion of individual slots correctly predicted across all samples.
- Scoring (record-level): between `0` and `1`. Higher is better.
- Used In: Task-Oriented Dialogue (`spoken_dialogue`)
## slot_f1 (↑)

- Type: Dialogue state tracking metric
- Description: Computes F1 score for slot value extraction, balancing precision and recall of predicted slot-value pairs.
- Scoring (record-level): between `0` and `1`. Higher is better.
- Used In: Task-Oriented Dialogue (`spoken_dialogue`)
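The three dialogue state tracking metrics can be sketched together, with each dialogue state represented as a `slot -> value` dict. These are the standard definitions; the framework's value normalization (casing, synonyms) may differ:

```python
def dst_metrics(pred_states, gold_states):
    """Return (joint_goal_accuracy, slot_accuracy, slot_f1).

    joint: a sample counts only if its whole state dict matches exactly.
    slot accuracy: fraction of gold slots predicted with the right value.
    slot F1: precision/recall balance over predicted slot-value pairs.
    """
    joint = sum(p == g for p, g in zip(pred_states, gold_states)) / len(gold_states)
    slot_correct = slot_total = 0
    tp = fp = fn = 0
    for p, g in zip(pred_states, gold_states):
        for slot, value in g.items():
            slot_total += 1
            slot_correct += p.get(slot) == value
        pred_pairs, gold_pairs = set(p.items()), set(g.items())
        tp += len(pred_pairs & gold_pairs)
        fp += len(pred_pairs - gold_pairs)
        fn += len(gold_pairs - pred_pairs)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return joint, slot_correct / max(slot_total, 1), f1
```

For instance, predicting one of two slots correctly yields joint goal accuracy 0 but slot accuracy (and here slot F1) of 0.5, which is why the three metrics are reported side by side.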