This guide explains how to evaluate your model and submit results to the CASE Benchmark leaderboard.
- Install case-benchmark:

  ```bash
  pip install case-benchmark
  ```

- Download the benchmark data:

  ```bash
  case-benchmark download --output-dir /path/to/benchmark
  ```

- Verify the download:

  ```bash
  python -m case_benchmark.download --verify --output-dir /path/to/benchmark
  ```
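If you prefer to verify files programmatically, here is a minimal sketch. It assumes a hypothetical `checksums.json` manifest mapping relative paths to SHA-256 digests; this manifest name and layout are illustrative, not part of the package's documented API:

```python
import hashlib
import json
from pathlib import Path


def verify_benchmark(benchmark_dir: str, manifest_name: str = "checksums.json") -> list:
    """Return the files that are missing or fail their SHA-256 check.

    Assumes a manifest mapping relative file paths to hex digests
    (hypothetical layout -- adapt to whatever manifest you actually have).
    """
    root = Path(benchmark_dir)
    manifest = json.loads((root / manifest_name).read_text())
    failures = []
    for rel_path, expected in manifest.items():
        f = root / rel_path
        if not f.exists():
            failures.append(rel_path)
            continue
        digest = hashlib.sha256(f.read_bytes()).hexdigest()
        if digest != expected:
            failures.append(rel_path)
    return failures
```

An empty return value means every listed file is present and matches its digest.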
For supported models (SpeechBrain, WeSpeaker, pyannote, NeMo, Resemblyzer):

```bash
# Install model dependencies
pip install case-benchmark[speechbrain]

# Run evaluation
case-benchmark evaluate \
    --model speechbrain \
    --benchmark-dir /path/to/benchmark \
    --output-dir results/ \
    --device cpu
```

For other models, create a model wrapper implementing the EmbeddingModel interface:
```python
from pathlib import Path

import numpy as np

from case_benchmark import CASEBenchmark
from case_benchmark.models.base import EmbeddingModel


class MyModel(EmbeddingModel):
    def load(self, device: str = "cpu") -> None:
        # Load your model (load_my_model is a placeholder for your own code)
        self.model = load_my_model(device)
        self._device = device
        self._loaded = True

    def extract_embedding(self, audio_path: Path) -> np.ndarray:
        # Extract an embedding from a single audio file
        audio = load_audio(audio_path)  # Your audio loading
        embedding = self.model.encode(audio)
        return embedding.numpy()

    @property
    def embedding_dim(self) -> int:
        return 192  # Your embedding dimension

    @property
    def name(self) -> str:
        return "My Custom Model"


# Run evaluation
benchmark = CASEBenchmark("/path/to/benchmark")
model = MyModel()
model.load("cuda")
results = benchmark.evaluate(model)
results.print_summary()
results.save("results/my_model.json")
```

Your results JSON file should contain:
```json
{
  "model_name": "My Model",
  "clean_eer": 0.0058,
  "absolute_eer": 0.0301,
  "degradation_factor": 0.0243,
  "case_score_v1": 5.03,
  "config": {
    "benchmark_dir": "/path/to/benchmark",
    "device": "cuda"
  },
  "category_breakdown": {
    "clean": 0.0058,
    "codec": 0.0173,
    "mic": 0.0059,
    "noise": 0.0073,
    "reverb": 0.0588,
    "playback": 0.0857
  },
  "protocol_results": {
    "clean_clean": {"eer": 0.0058, "min_dcf": 0.018, "num_trials": 10000},
    "clean_codec_gsm": {"eer": 0.0210, "min_dcf": 0.198, "num_trials": 10000},
    ...
  }
}
```

Key metrics for leaderboard ranking:
- `clean_eer`: Baseline performance (lower is better)
- `degradation_factor`: Robustness to carrier effects (lower is better)

See Metrics for a full explanation of each metric.
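For intuition: the EER is the operating point where the false-accept and false-reject rates are equal, and, consistent with the sample JSON above (0.0301 − 0.0058 = 0.0243), the degradation factor appears to be simply the absolute EER minus the clean EER. A rough sketch of both follows; this is not the benchmark's internal implementation:

```python
import numpy as np


def compute_eer(genuine_scores: np.ndarray, impostor_scores: np.ndarray) -> float:
    """Equal error rate: the threshold where false-accept rate == false-reject rate."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # closest crossing point
    return float((far[idx] + frr[idx]) / 2)


# Degradation factor as implied by the sample results above
clean_eer, absolute_eer = 0.0058, 0.0301
degradation_factor = absolute_eer - clean_eer  # 0.0243
```

With perfectly separated score distributions the sketch returns an EER of 0; real systems sit somewhere above that.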
To submit to the leaderboard, you need:

- A JSON file with all protocol results:
  - Must include all 24 protocols
  - Generated by `case-benchmark evaluate` or compatible code
- A markdown file describing your model:
```markdown
# Model Name

## Architecture
- Type: ECAPA-TDNN / ResNet / Transformer / etc.
- Parameters: X million
- Embedding dimension: 192

## Training
- Data: VoxCeleb2 (no overlap with VoxCeleb1-O test set)
- Augmentations: [list augmentations used]
- Loss: AAM-Softmax / Contrastive / etc.
- Training time: X GPU-hours

## Preprocessing
- Sample rate: 16kHz
- Features: 80-dim mel spectrogram
- Duration: variable / fixed X seconds

## Reproducibility
- Code: [link to code if available]
- Checkpoint: [link to weights if available]
```

We require that:
- Your training data does NOT include VoxCeleb1-O test speakers
- Results are reproducible (we may re-run evaluation)
- Model card accurately describes the system
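Before submitting, you can sanity-check your results file yourself. A minimal sketch: the 24-protocol count comes from the requirement above, and the key names follow the sample JSON earlier in this guide; the benchmark may perform stricter checks of its own:

```python
import json

# Top-level keys taken from the sample results JSON in this guide
REQUIRED_KEYS = {"model_name", "clean_eer", "absolute_eer",
                 "degradation_factor", "protocol_results"}


def check_results(path: str, expected_protocols: int = 24) -> list:
    """Return a list of human-readable problems; an empty list means the file looks OK."""
    with open(path) as f:
        results = json.load(f)
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - results.keys())]
    protocols = results.get("protocol_results", {})
    if len(protocols) != expected_protocols:
        problems.append(f"expected {expected_protocols} protocols, found {len(protocols)}")
    for name, entry in protocols.items():
        if "eer" not in entry:
            problems.append(f"protocol {name} has no eer")
    return problems
```

Running this before opening a PR or issue catches the most common rejection reason: a results file with missing protocols.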
To submit via pull request:

- Fork the `gittb/case-benchmark` repository
- Add your results to `results/<model_name>/`:
  - `results.json` - evaluation results
  - `model_card.md` - model description
- Open a pull request
Alternatively, to submit via issue:

- Open an issue in the repository
- Attach your results JSON and model card
- Include contact information for verification
All submissions must follow these rules:

- No VoxCeleb1-O training: Models must not be trained on VoxCeleb1-O test set speakers
- Reproducibility: Results must be reproducible
- Single model: No ensembles (unless clearly labeled)
- No test-time augmentation: Standard inference only
- 16kHz input: All models must accept 16kHz audio
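Because all benchmark audio is 16 kHz, models built around other sample rates need a resampling step first. A minimal numpy-only sketch, using linear interpolation purely for illustration; a real pipeline should use a proper polyphase resampler such as scipy, torchaudio, or librosa:

```python
import numpy as np


def resample_to_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample a mono signal by linear interpolation.

    Fine as a sketch; prefer a polyphase resampler (scipy/torchaudio/librosa)
    for real submissions, since linear interpolation aliases high frequencies.
    """
    if orig_sr == target_sr:
        return audio
    n_out = int(round(len(audio) * target_sr / orig_sr))
    t_in = np.arange(len(audio))
    t_out = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(t_out, t_in, audio)


# Example: one second of 48 kHz audio becomes 16000 samples
audio_16k = resample_to_16k(np.random.randn(48000), 48000)
```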
**Can I train on additional data?** Yes, as long as it doesn't include VoxCeleb1-O test speakers.

**Can I train with augmentations that simulate carrier conditions?** Yes, and we encourage it! The CASE Benchmark specifically measures robustness to carrier conditions.

**What if my model expects a different sample rate?** Resample to 16kHz before evaluation. The benchmark audio is all 16kHz.

**Can I submit more than one model?** Yes, each model should be submitted separately with its own model card.

**How quickly is the leaderboard updated?** We aim to update within 1 week of submission verification.