
📏 Metrics Overview

This document provides a summary and detailed explanation of all evaluation metrics used in the framework.
For more detailed documentation regarding which metrics can be used for which tasks and task categories, refer to Task Config Overview.

NOTE For consistency across metrics, the final reported score of each supported metric is standardized within the range of [0.0, 100.0] with 2-decimal precision.
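As a concrete illustration of this convention, a raw metric value on [0.0, 1.0] would be rescaled as follows (a minimal sketch; the helper name `standardize` is illustrative, not part of the framework):

```python
def standardize(raw_score):
    # Map a raw metric value on [0.0, 1.0] to the reported
    # [0.0, 100.0] scale, with 2-decimal precision.
    return round(100.0 * raw_score, 2)
```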


📊 Metric Overview Table

| Metric Name | Description | Reported metric values |
| --- | --- | --- |
| word_error_rate_metrics (↓) | Measures ASR errors via insertions, deletions, and substitutions | average_sample_wer<br>overall_wer |
| diarization_metrics (↓) | LLM-adaptive diarization-relevant metrics | avg_sample_wder<br>overall_wder<br>avg_sample_cpwer<br>overall_cpwer<br>avg_speaker_count_absolute_error |
| llm_judge_binary (↑) | Binary LLM-based correctness judgment | llm_judge_binary |
| llm_judge_detailed (↑) | Detailed scoring across multiple dimensions | llm_judge_detailed |
| llm_judge_big_bench_audio (↑) | LLM-based evaluations for BigBench-like tasks | llm_judge_big_bench_audio |
| llm_judge_redteaming (↑) | LLM-based evaluations for red-teaming / safety | llm_judge_redteaming |
| mt_bench_llm_judge (↑) | LLM-based evaluation for multi-turn systems (i.e. MT-Bench) | mt_bench_llm_judge |
| bleu (↑) | N-gram overlap score | bleu |
| bertscore (↑) | Semantic similarity using BERT embeddings | bertscore |
| comet (↑) | Semantic similarity measure for translation tasks | comet |
| meteor (↑) | Alignment-based score with synonym handling | meteor |
| bfcl_match_score (↑) | Structured logic form comparison | bfcl_match_score |
| sql_score (↑) | SQL correctness and execution match | text2sql_score |
| instruction_following (↑) | LLM-judged instruction-following capability | final |
| multiple_choice_accuracy (↑) | Accuracy of predicting the correct option letter in multiple-choice tasks | multiple_choice_accuracy |
| gsm8k_exact_match (↑) | Exact-match accuracy of the final numerical answer | gsm8k_exact_match |
| joint_goal_accuracy (↑) | Dialogue state tracking: all slots match | joint_goal_accuracy |
| slot_accuracy (↑) | Dialogue state tracking: per-slot accuracy | slot_accuracy |
| slot_f1 (↑) | Dialogue state tracking: slot extraction F1 | slot_f1 |

📋 Metric Details

word_error_rate_metrics

  • Type: Speech recognition metric
  • Description: Measures the correctness of the generated hypothesis against the reference transcript
  • Reported Value:
    • average_sample_wer: Average of the per-sample WER across the evaluated dataset
    • overall_wer: (Total deletions + insertions + substitutions) / Total reference words
  • Scoring (record-level) Score between 0.0 and 1.0 (lower is better)
  • Used In: asr, long_form_asr, code_switching_asr
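The two reported values above can be sketched with a word-level edit distance as follows (a self-contained illustration, not the framework's actual implementation; function names are hypothetical):

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum insertions + deletions + substitutions turning ref into hyp."""
    d = list(range(len(hyp_words) + 1))
    for i in range(1, len(ref_words) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp_words) + 1):
            cur = min(
                d[j] + 1,      # deletion
                d[j - 1] + 1,  # insertion
                prev + (ref_words[i - 1] != hyp_words[j - 1]),  # substitution
            )
            prev, d[j] = d[j], cur
    return d[-1]

def average_sample_wer(pairs):
    # Mean of per-sample WER over (reference, hypothesis) transcript pairs.
    return sum(
        word_edit_distance(r.split(), h.split()) / max(len(r.split()), 1)
        for r, h in pairs
    ) / len(pairs)

def overall_wer(pairs):
    # Pooled errors over pooled reference words across the whole dataset.
    errors = sum(word_edit_distance(r.split(), h.split()) for r, h in pairs)
    total_words = sum(len(r.split()) for r, _ in pairs)
    return errors / max(total_words, 1)
```

Note the difference: `average_sample_wer` weights every sample equally, while `overall_wer` weights samples by their reference length.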

diarization_error_rate_metrics

  • Type: LLM-adaptive diarization-relevant metric
  • Description: Measures diarization-relevant qualities of the LLM-generated hypothesis. Metrics are mostly computed on a who-spoke-what basis to avoid requiring exact timestamp predictions.
  • Reported Value:
    • avg_sample_wder: Average of the per-sample WDER across the evaluated dataset
    • overall_wder: Overall errors / Overall total word count
    • avg_sample_cpwer: Average of the per-sample cpWER across the evaluated dataset
    • overall_cpwer: Overall errors / Overall total word count
    • avg_speaker_count_absolute_error: Mean absolute error (MAE) of the predicted number of speakers
  • Scoring (record-level) Score between 0.0 and 1.0 (lower is better)
  • Used In: speaker_diarization

llm_judge_binary

  • Type: Binary classification metric
  • Description: Judges whether a model output is correct or not using an LLM.
  • Scoring (record-level) 1 for correct, 0 for incorrect. Higher is better.
  • Used In: emotion_recognition, accent_recognition, gender_recognition, intent_classification, spoofing

llm_judge_detailed

  • Type: Multi-dimensional judgment metric
  • Description: Uses an LLM to assess output quality based on attributes such as fluency, relevance, and completeness (with or without a ground-truth reference)
  • Scoring (record-level) Scoring between 0 and 5 for each sample. Higher is better.
  • Used In: spoken_dialogue_summrization, scene_understanding

llm_judge_big_bench_audio

  • Type: LLM-based QA judgment metric
  • Description: Evaluates performance on BigBench-like audio QA tasks.
  • Scoring (record-level) Scored correct or incorrect based on different aspects of the QA task. Higher is better.
  • Used In: sqqa

llm_judge_redteaming

  • Type: LLM-based judgment metric for red-teaming / safety
  • Description: Evaluates the safety-related behavior of LALMs.
  • Scoring (record-level) 1 if the model refuses to answer the given audio (correct), 0 if it answers it (incorrect). Higher is better.
  • Used In: safety

mt_bench_llm_judge

  • Type: LLM-based judgment metric for multi-turn systems
  • Description: Evaluates performance on multi-turn conversation tasks
  • Scoring (record-level) Scoring between 0 and 10 for each sample. Higher is better.
  • Used In: mtbench

bleu

  • Type: N-gram precision metric
  • Description: Measures how many n-grams in the prediction match the reference.
  • Scoring (record-level) Score between 0 and 100, higher is better.
  • Used In: translation
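For intuition, an unsmoothed sentence-level BLEU can be sketched with the standard library alone (illustrative only; real evaluations typically use a library such as sacreBLEU, which adds smoothing and standardized tokenization):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed sentence-level BLEU on the 0-100 scale."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        ref_ngrams = ngrams(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU = 0
        log_precision_sum += math.log(overlap / total) / max_n
    # Brevity penalty discourages hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * bp * math.exp(log_precision_sum)
```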

bertscore

  • Type: Semantic similarity
  • Description: Uses contextual BERT embeddings to match tokens semantically.
  • Scoring (record-level) Outputs F1 (between 0 and 1), higher is better.
  • Used In: translation

comet

  • Type: Semantic similarity for translation tasks
  • Description: Uses contextual embeddings to compute semantic similarity between source and target language pair
  • Scoring (record-level) Score between 0 and 1, higher is better.
  • Used In: translation

meteor

  • Type: Alignment metric
  • Description: Improves on BLEU by considering synonyms, stemming, and paraphrase.
  • Scoring (record-level) Score between 0 and 1, higher is better.
  • Used In: translation

bfcl_match_score

  • Type: Function calling metric
  • Description: Evaluates function-calling capabilities.
  • Scoring (record-level) Score between 0 and 1, higher is better.
  • Used In: Speech Function calling (bfcl)

sql_score

  • Type: Coding correctness metric
  • Description: Evaluates the correctness of the generated SQL.
  • Scoring (record-level) Score between 0 and 1, higher is better.
  • Used In: Speech-to-SQL-coding (speech_to_sql)
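An execution-match check of this kind can be sketched with `sqlite3` (a minimal sketch; the framework's actual SQL dialect, databases, and matching rules may differ):

```python
import sqlite3

def execution_match(schema_sql, gold_sql, pred_sql):
    """Return 1 if the predicted SQL yields the same rows as the gold SQL, else 0."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        # Compare as sets of rows, so row order does not matter.
        gold_rows = set(map(tuple, conn.execute(gold_sql).fetchall()))
        pred_rows = set(map(tuple, conn.execute(pred_sql).fetchall()))
    except sqlite3.Error:
        return 0  # predicted SQL failed to parse or execute
    finally:
        conn.close()
    return int(gold_rows == pred_rows)
```

Comparing result sets rather than SQL strings lets syntactically different but semantically equivalent queries score as matches.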

instruction_following

  • Type: Instruction following evaluation metric
  • Description: Measures the instruction-following capabilities of LALMs by averaging accuracy across (1) strict-prompt, (2) strict-instruction, (3) loose-prompt, and (4) loose-instruction.
  • Scoring (record-level) Score between 0 and 1, higher is better.
  • Used In: Audio Instruction Following (ifeval)
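Assuming the four variants are accuracies on [0, 1], the reported final value would be computed roughly as follows (hypothetical helper, shown only to make the averaging explicit):

```python
def instruction_following_final(strict_prompt, strict_instruction,
                                loose_prompt, loose_instruction):
    # "final" as the plain mean of the four accuracy variants,
    # each assumed to be on [0, 1], reported on [0, 100].
    mean = (strict_prompt + strict_instruction
            + loose_prompt + loose_instruction) / 4
    return round(100.0 * mean, 2)
```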

multiple_choice_accuracy

  • Type: Multiple choice accuracy metric
  • Description: Measures the accuracy of predicting the correct option letter in multiple-choice tasks. The correct option is expected in the format Answer: A
  • Scoring (record-level) Score between 0 and 100, higher is better.
  • Used In: Audio GPQA Diamond (gpqa_diamond)
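Parsing the expected Answer: A format and scoring accuracy can be sketched as follows (assuming options A-D; function names are illustrative, not the framework's API):

```python
import re

def extract_choice(model_output):
    # Assumes the "Answer: X" convention described above, options A-D.
    match = re.search(r"Answer:\s*([A-D])", model_output)
    return match.group(1) if match else None

def multiple_choice_accuracy(predictions, references):
    # Fraction of samples whose extracted letter matches the reference,
    # reported on the [0, 100] scale.
    correct = sum(extract_choice(p) == r for p, r in zip(predictions, references))
    return round(100.0 * correct / len(references), 2)
```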

gsm8k_exact_match

  • Type: Math correctness metric
  • Description: Measures the exact-match accuracy of the final numerical answer (expected within \boxed{}) against the reference numerical answer.
  • Scoring (record-level) Score between 0 and 100, higher is better.
  • Used In: Math (gsm8k)
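Extracting the \boxed{} answer and comparing it against the reference can be sketched as follows (illustrative; the framework's number normalization may be more involved):

```python
import re

def extract_boxed(text):
    # Pull the content of the first \boxed{...} expression, if any.
    match = re.search(r"\\boxed\{([^}]*)\}", text)
    return match.group(1).strip() if match else None

def gsm8k_exact_match(prediction, reference):
    """Return 1 if the boxed numerical answer exactly matches the reference, else 0."""
    answer = extract_boxed(prediction)
    if answer is None:
        return 0
    # Strip thousands separators before comparing.
    return int(answer.replace(",", "") == str(reference).replace(",", ""))
```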

joint_goal_accuracy

  • Type: Dialogue state tracking metric
  • Description: Evaluates whether all predicted slots exactly match the ground truth dialogue state. A sample scores 1 only if every slot-value pair is correct.
  • Scoring (record-level) Score 0 or 1, higher is better.
  • Used In: Task-Oriented Dialogue (spoken_dialogue)

slot_accuracy

  • Type: Dialogue state tracking metric
  • Description: Computes the proportion of individual slots correctly predicted across all samples.
  • Scoring (record-level) Score between 0 and 1, higher is better.
  • Used In: Task-Oriented Dialogue (spoken_dialogue)

slot_f1

  • Type: Dialogue state tracking metric
  • Description: Computes F1 score for slot value extraction, balancing precision and recall of predicted slot-value pairs.
  • Scoring (record-level) Score between 0 and 1, higher is better.
  • Used In: Task-Oriented Dialogue (spoken_dialogue)
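The three dialogue state tracking metrics above can be sketched over slot-value dictionaries (one common set of definitions; the framework's exact formulas, especially for slot_accuracy, may differ):

```python
def dst_metrics(pred_state, gold_state):
    """Per-sample DST metrics over slot -> value dictionaries."""
    # joint_goal_accuracy: 1 only if the entire predicted state matches exactly.
    jga = int(pred_state == gold_state)
    pred_pairs = set(pred_state.items())
    gold_pairs = set(gold_state.items())
    tp = len(pred_pairs & gold_pairs)
    # slot_accuracy here: fraction of gold slots predicted with the correct value.
    slot_acc = tp / len(gold_pairs) if gold_pairs else 1.0
    # slot_f1: balances precision and recall over predicted slot-value pairs.
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return jga, slot_acc, f1
```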

multiple_choice_accuracy

  • Type: Multiple choice accuracy metric
  • Description: Measures the accuracy of predicting the correct option letter in multiple-choice tasks. The correct option is expected in the format Answer: A
  • Scoring (record-level) Score between 0 and 100, higher is better.
  • Used In: Audio Instruction Following (ifeval)