
Submitting to the CASE Benchmark Leaderboard

This guide explains how to evaluate your model and submit results to the CASE Benchmark leaderboard.

Prerequisites

  1. Install case-benchmark:

    pip install case-benchmark
  2. Download benchmark data:

    case-benchmark download --output-dir /path/to/benchmark
  3. Verify download:

    python -m case_benchmark.download --verify --output-dir /path/to/benchmark

Running Evaluation

Using Built-in Model Wrappers

For supported models (SpeechBrain, WeSpeaker, pyannote, NeMo, Resemblyzer):

# Install model dependencies
pip install case-benchmark[speechbrain]

# Run evaluation
case-benchmark evaluate \
    --model speechbrain \
    --benchmark-dir /path/to/benchmark \
    --output-dir results/ \
    --device cpu

Using Custom Models

Create a model wrapper implementing the EmbeddingModel interface:

from case_benchmark.models.base import EmbeddingModel
from case_benchmark import CASEBenchmark
import numpy as np
from pathlib import Path

class MyModel(EmbeddingModel):
    def load(self, device: str = "cpu") -> None:
        # Replace load_my_model with your own model-loading code
        self.model = load_my_model(device)
        self._device = device
        self._loaded = True

    def extract_embedding(self, audio_path: Path) -> np.ndarray:
        # Replace load_audio with your own audio loading
        audio = load_audio(audio_path)
        embedding = self.model.encode(audio)
        # Return a 1-D NumPy array (convert from a tensor if needed)
        return embedding.numpy()

    @property
    def embedding_dim(self) -> int:
        return 192  # Your embedding dimension

    @property
    def name(self) -> str:
        return "My Custom Model"

# Run evaluation
benchmark = CASEBenchmark("/path/to/benchmark")
model = MyModel()
model.load("cuda")

results = benchmark.evaluate(model)
results.print_summary()
results.save("results/my_model.json")

Result Format

Your results JSON file should contain:

{
  "model_name": "My Model",
  "clean_eer": 0.0058,
  "absolute_eer": 0.0301,
  "degradation_factor": 0.0243,
  "case_score_v1": 5.03,
  "config": {
    "benchmark_dir": "/path/to/benchmark",
    "device": "cuda"
  },
  "category_breakdown": {
    "clean": 0.0058,
    "codec": 0.0173,
    "mic": 0.0059,
    "noise": 0.0073,
    "reverb": 0.0588,
    "playback": 0.0857
  },
  "protocol_results": {
    "clean_clean": {"eer": 0.0058, "min_dcf": 0.018, "num_trials": 10000},
    "clean_codec_gsm": {"eer": 0.0210, "min_dcf": 0.198, "num_trials": 10000},
    ...
  }
}

Key metrics for leaderboard ranking:

  • clean_eer: Baseline performance (lower is better)
  • degradation_factor: Robustness to carrier effects (lower is better)

See Metrics for a full explanation of each metric.
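The two ranking metrics can be sanity-checked against the sample values above. A minimal sketch, assuming `degradation_factor` is simply `absolute_eer` minus `clean_eer` — this matches the example numbers (0.0301 − 0.0058 = 0.0243), but confirm the exact definition on the Metrics page:

```python
# Hedged sketch: degradation_factor appears to equal absolute_eer - clean_eer,
# consistent with the sample results JSON above. Verify against the Metrics
# page before relying on this relationship.

def degradation_factor(clean_eer: float, absolute_eer: float) -> float:
    """Robustness gap between degraded-carrier and clean-carrier EER."""
    return absolute_eer - clean_eer

# Values from the sample results JSON above
print(round(degradation_factor(0.0058, 0.0301), 4))  # 0.0243
```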

Submission Requirements

To submit to the leaderboard, you need:

1. Results File

  • JSON file with all protocol results
  • Must include all 24 protocols
  • Generated by case-benchmark evaluate or compatible code
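Before opening a submission, you can sanity-check your results file against these requirements. A minimal sketch — `check_results` is a hypothetical helper, not part of `case-benchmark`; the key names follow the sample JSON above and the 24-protocol count comes from this section:

```python
import json

# Top-level keys taken from the sample results JSON in "Result Format"
REQUIRED_KEYS = {
    "model_name", "clean_eer", "absolute_eer",
    "degradation_factor", "case_score_v1",
    "category_breakdown", "protocol_results",
}

def check_results(path: str) -> None:
    """Raise ValueError if the results JSON is missing fields or protocols."""
    with open(path) as f:
        results = json.load(f)

    missing = REQUIRED_KEYS - results.keys()
    if missing:
        raise ValueError(f"missing top-level keys: {sorted(missing)}")

    n_protocols = len(results["protocol_results"])
    if n_protocols != 24:
        raise ValueError(f"expected 24 protocols, found {n_protocols}")

    # Each protocol entry should report EER, minDCF, and trial count
    for name, proto in results["protocol_results"].items():
        for field in ("eer", "min_dcf", "num_trials"):
            if field not in proto:
                raise ValueError(f"protocol {name!r} is missing {field!r}")

    print("results file looks complete")
```

Running this against `results/my_model.json` before submitting catches the most common rejection reasons (missing protocols or metrics) locally.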

2. Model Card

A markdown file describing your model:

# Model Name

## Architecture
- Type: ECAPA-TDNN / ResNet / Transformer / etc.
- Parameters: X million
- Embedding dimension: 192

## Training
- Data: VoxCeleb2 (no overlap with VoxCeleb1-O test set)
- Augmentations: [list augmentations used]
- Loss: AAM-Softmax / Contrastive / etc.
- Training time: X GPU-hours

## Preprocessing
- Sample rate: 16kHz
- Features: 80-dim mel spectrogram
- Duration: variable / fixed X seconds

## Reproducibility
- Code: [link to code if available]
- Checkpoint: [link to weights if available]

3. Verification

We require that:

  • Your training data does NOT include VoxCeleb1-O test speakers
  • Results are reproducible (we may re-run evaluation)
  • Model card accurately describes the system

How to Submit

Option 1: GitHub Pull Request

  1. Fork the gittb/case-benchmark repository
  2. Add your results to results/<model_name>/:
    • results.json - evaluation results
    • model_card.md - model description
  3. Open a pull request

Option 2: GitHub Issue

  1. Open an issue in the repository
  2. Attach your results JSON and model card
  3. Include contact information for verification

Leaderboard Rules

  1. No VoxCeleb1-O training: Models must not be trained on VoxCeleb1-O test set speakers
  2. Reproducibility: Results must be reproducible
  3. Single model: No ensembles (unless clearly labeled)
  4. No test-time augmentation: Standard inference only
  5. 16kHz input: All models must accept 16kHz audio

FAQ

Can I use external data for training?

Yes, as long as it doesn't include VoxCeleb1-O test speakers.

Can I use data augmentation during training?

Yes, and we encourage it! The CASE Benchmark specifically measures robustness to carrier conditions.

My model uses a different sample rate. What should I do?

Resample to 16kHz before evaluation. The benchmark audio is all 16kHz.
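For example, a 48 kHz waveform can be resampled before embedding extraction. A minimal sketch using SciPy's polyphase resampler (SciPy is an assumption here — use whatever audio stack your model already depends on):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz via polyphase filtering."""
    if orig_sr == 16000:
        return audio
    # Reduce the up/down ratio so the polyphase filter stays small
    g = gcd(16000, orig_sr)
    return resample_poly(audio, 16000 // g, orig_sr // g)

# One second of 48 kHz audio becomes one second of 16 kHz audio
x = np.random.randn(48000)
print(len(to_16k(x, 48000)))  # 16000
```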

Can I submit multiple models?

Yes, each model should be submitted separately with its own model card.

How often is the leaderboard updated?

We aim to update within 1 week of submission verification.