Systematic evaluation of tonal fidelity in facebook/omniASR-CTC-1B when processing Igbo, a tonal Niger-Congo language with ~45 million speakers.
This project reveals systematic tonal diacritic loss in a state-of-the-art multilingual ASR model:
- 75.5% diacritic loss on tonal markers (bootstrap 95% CI: [57.1%, 89.7%])
- Minimal pair collapse: Model cannot distinguish phonemically contrastive tones
- Orthographic bias: Model hallucinates tone marks on monotone speech
Key Insight: The model appears to generate diacritics probabilistically based on lexical priors rather than acoustic conditioning.
21 audio samples across 4 error categories:
- Cross-lingual Orthographic Interference (5 samples)
- Phonemic Tone Sensitivity (6 samples)
- Language Boundary Effects (5 samples)
- Domain-Specific Lexical Coverage (5 samples)
Audio files are included in this repository (M4A format). Click to play directly on GitHub:
Tonal Minimal Pairs:
- 06_tonal_akwa.m4a - 4 different words collapsed to random outputs
Monotone Hallucination:
- 09_tonal_flat.m4a - Flat speech, model ADDED tones that weren't spoken
Code-Switching:
- 11_codeswitch_en2ig.m4a - English perfect, Igbo loses tones
git clone https://github.com/chizkidd/igbo-asr-tonal-evaluation.git
cd igbo-asr-tonal-evaluation
pip install -r requirements.txt
jupyter notebook analysis.ipynb

igbo-asr-tonal-evaluation/
├── data/
│   ├── audio/
│   │   ├── 01_script_names.m4a        # Cross-lingual interference samples
│   │   ├── 02_script_formal.m4a
│   │   ├── 03_script_numbers.m4a
│   │   ├── 04_script_proverb.m4a
│   │   ├── 05_script_slow.m4a
│   │   ├── 06_tonal_akwa.m4a          # Tonal minimal pairs
│   │   ├── 07_tonal_oke.m4a
│   │   ├── 08_tonal_dense.m4a
│   │   ├── 09_tonal_flat.m4a          # Monotone control (key diagnostic)
│   │   ├── 10_tonal_yoruba.m4a
│   │   ├── 11_codeswitch_en2ig.m4a    # Code-switching samples
│   │   ├── 12_codeswitch_ig2en.m4a
│   │   ├── 13_codeswitch_alternate.m4a
│   │   ├── 14_codeswitch_embedded.m4a
│   │   ├── 15_codeswitch_pidgin.m4a
│   │   ├── 16_context_places.m4a      # Domain-specific samples
│   │   ├── 17_context_food.m4a
│   │   ├── 18_context_proverb.m4a
│   │   ├── 19_context_french.m4a
│   │   ├── 20_context_noise.m4a
│   │   ├── 21_tonal_yoruba_formal.m4a
│   │   ├── igbo_clean.m4a             # Test samples
│   │   ├── igbo_codeswitch.m4a
│   │   └── igbo_tonal.m4a
│   └── metadata.csv                   # Ground truth, model outputs, metrics
├── docs/
│   └── METHODOLOGY.md                 # Detailed research methodology
├── results/
│   └── visualizations/
│       ├── fig1_loss_by_category.png
│       ├── fig2_cer_vs_diacritic_loss.png
│       └── fig3_bootstrap_ci.png
├── src/
│   ├── evaluate.py                    # Evaluation metrics (DER, bootstrap CIs)
│   ├── visualize.py                   # Plotting functions
│   └── utils.py                       # Helper functions
├── .gitignore
├── analysis.ipynb                     # Full analysis notebook
├── LICENSE
├── README.md                          # This file
└── requirements.txt                   # Python dependencies
| Category | Samples | Diacritic Loss | Avg CER |
|---|---|---|---|
| Phonemic Tone Sensitivity | 6 | 75.5% | 50.6% |
| Cross-lingual Interference | 5 | -38.9% (hallucination) | 28.8% |
| Domain-Specific Coverage | 5 | 6.3% | 30.1% |
| Language Boundary Effects | 5 | 14.3% | 20.0% |
| Overall | 21 | 26.8% | 32.5% |
- Tonal category: 75.5% (95% CI: [57.1%, 89.7%])
- Overall: 52.6% (95% CI: [30.3%, 69.7%])
Even the most optimistic lower bound (57.1%) indicates severe tonal degradation.
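The repository's `bootstrap_ci` lives in `src/evaluate.py` and is not reproduced here; as an illustration of the procedure (10,000 resamples at the utterance level, percentile method assumed), a minimal sketch:

```python
import random

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of utterance-level scores.

    Resamples the utterances with replacement n_boot times and takes the
    empirical alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    """
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    lower = means[int(n_boot * alpha / 2)]
    upper = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper
```

With only 21 utterances the intervals are wide, which is exactly why the table above reports them explicitly.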
Input: "akwa, akwa, akwa. Akwà, akwà, akwà. Àkwà, àkwà, àkwà. Ákwá, ákwá, ákwá."
(4 distinct Igbo words with different meanings)
Model Output: "akua akua akua akua akwa akwa akwa akua akwa ọkua ọkua ọkua"
(Random variations, semantic distinctions lost)
Impact:
- akwà (cloth) → akwa (could mean "crying")
- àkwà (egg) → akwa (meaning lost)
- ákwá (bridge) → akua (wrong word)
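The collapse is easy to reproduce mechanically: once combining tone marks are stripped, all four words become the same string, which matches the model's toneless outputs. A small Python check (glosses taken from the list above):

```python
import unicodedata

# The four "akwa" minimal pairs and their glosses from the example above.
MINIMAL_PAIRS = {"akwa": "crying", "akwà": "cloth", "àkwà": "egg", "ákwá": "bridge"}

def strip_tones(word: str) -> str:
    """Remove combining diacritics (tone marks) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)

# Every word collapses to the same toneless form the model tends to emit.
collapsed = {strip_tones(w) for w in MINIMAL_PAIRS}
```

Four phonemically distinct words map to one surface string, so any downstream consumer of the transcript loses the semantic contrast entirely.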
- Model: facebook/omniASR-CTC-1B
- Parameters: 975M
- Architecture: CTC-based ASR (wav2vec2-style)
- Languages: 1,600+ (including Igbo)
- Speaker: Native Igbo speaker (Afikpo dialect, Ebonyi State)
- Device: iPhone SE 2nd Generation
- Format: M4A (AAC codec, original iPhone Voice Memos format)
- Duration: 4-15 seconds per sample
- DER (Diacritic Error Rate): Captures dropped + hallucinated tone marks
- Bootstrap CIs: 10,000 iterations at utterance level
- CER (Character Error Rate): Standard character-level transcription error metric
See METHODOLOGY.md for detailed research design.
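The exact implementations are in `src/evaluate.py` (not reproduced here); the core ideas can be sketched as follows, assuming tone marks are counted on NFD-decomposed text and CER is plain Levenshtein distance over characters:

```python
import unicodedata

# Combining grave (low tone) and acute (high tone) accents.
TONE_MARKS = {"\u0300", "\u0301"}

def count_tone_marks(text: str) -> int:
    """Count tonal diacritics after NFD decomposition."""
    return sum(ch in TONE_MARKS for ch in unicodedata.normalize("NFD", text))

def diacritic_error_rate(reference: str, hypothesis: str) -> float:
    """Net diacritic loss: positive = marks dropped, negative = hallucinated."""
    ref = count_tone_marks(reference)
    hyp = count_tone_marks(hypothesis)
    return (ref - hyp) / ref if ref else 0.0

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        cur = [i]
        for j, h in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / len(reference) if reference else float(bool(hypothesis))
```

A signed (net) DER is consistent with the negative value reported for the cross-lingual category above, where hallucinated marks outnumber dropped ones.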
jupyter notebook analysis.ipynb

from src.evaluate import compute_all_metrics, bootstrap_ci
from src.visualize import plot_loss_by_category
from src.utils import load_metadata
# Load data
df = load_metadata("data/metadata.csv")
# Compute metrics
df = compute_all_metrics(df)
# Generate visualizations
plot_loss_by_category(df, output_path="results/visualizations/fig1.png")

To regenerate all results from scratch:
jupyter notebook analysis.ipynb # Run all cells
# Results will be saved to results/

If you use the dataset, please cite:
@misc{obasi2026igbo,
  title={Igbo Blind Spot Dataset for omniASR-CTC-1B: Systematic Evaluation of Tonal Diacritic Loss},
  author={Obasi, Chizoba},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/datasets/chiz/omniASR-igbo-blindspots}},
  note={Model evaluated: facebook/omniASR-CTC-1B (975M parameters)}
}

- Dataset: HuggingFace Hub
- Model: omniASR-CTC-1B
- Paper: Meta AI - Omnilingual ASR (arXiv:2511.09690)
- Scale to multi-speaker evaluation (10+ speakers across dialects)
- Comparative model audit (Whisper, MMS, USM, Azure Speech)
- Fine-tuning intervention with tone-annotated data
- Downstream impact studies in voice assistants
- Code: MIT License
- Audio recordings: CC-BY-4.0 (attribution required)
- Metadata/annotations: CC0 (public domain)
See LICENSE for details.
Chizoba Obasi
HuggingFace | GitHub