Benchmarks comparing speed and accuracy of Python sentence tokenization libraries.
| Library | Type | Notes |
|---|---|---|
| BlingFire | Rule-based (C++) | Microsoft, fast |
| NLTK | Rule-based (Python) | Punkt tokenizer |
| pySBD | Rule-based (Python) | 22 languages |
| syntok | Rule-based (Python) | Also does word tokenization |
| wtpsplit | ML-based | SOTA accuracy, multilingual |
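For orientation, here is a rough sketch of how each library in the table is typically called, wrapped so that every one takes a string and returns a list of sentences. The wrapper names and the `SPLITTERS` dict are illustrative and not taken from `benchmark.py`; the wtpsplit model name `sat-3l-sm` is an assumption, since the model used for the benchmark is not stated here. Imports are deferred so the snippet loads even when only some libraries are installed.

```python
# Illustrative wrappers: each takes a string and returns a list of sentences.
# Names are hypothetical; they are not the ones used in benchmark.py.

def split_blingfire(text):
    from blingfire import text_to_sentences
    # BlingFire returns a single string with one sentence per line.
    return text_to_sentences(text).split("\n")

def split_nltk(text):
    # Requires the Punkt data to have been downloaded once via nltk.download().
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(text)

def split_pysbd(text):
    import pysbd
    # In real use, build the Segmenter once and reuse it.
    return pysbd.Segmenter(language="en", clean=False).segment(text)

def split_syntok(text):
    import syntok.segmenter as segmenter
    # syntok yields paragraphs of sentences of tokens; rejoin tokens into text.
    return [
        "".join(token.spacing + token.value for token in sentence).strip()
        for paragraph in segmenter.process(text)
        for sentence in paragraph
    ]

def split_wtpsplit(text):
    from wtpsplit import SaT
    # Assumed model name; loading is slow, so cache the model in real use.
    return SaT("sat-3l-sm").split(text)

SPLITTERS = {
    "BlingFire": split_blingfire,
    "NLTK": split_nltk,
    "pySBD": split_pysbd,
    "syntok": split_syntok,
    "wtpsplit": split_wtpsplit,
}
```

Calling `SPLITTERS["NLTK"](text)` then works the same way as any other entry, which keeps a benchmark loop free of per-library special cases.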
# Create venv and install core tokenizers
uv venv && uv sync
# Include wtpsplit (PyTorch)
uv sync --extra wtpsplit
# Include wtpsplit with ONNX Runtime (faster)
uv sync --extra wtpsplit-ort-gpu # GPU
uv sync --extra wtpsplit-ort-cpu # CPU only
# All optional dependencies
uv sync --all-extras

# Run all benchmarks (speed + edge case accuracy)
python benchmark.py
# Speed only
python benchmark.py --speed
# Edge case accuracy only
python benchmark.py --accuracy
# Corpus evaluation (NLTK treebank + UD English)
python benchmark.py --corpus
# Single corpus evaluation
python benchmark.py --corpus-only treebank
python benchmark.py --corpus-only ud
# Customize speed benchmark
python benchmark.py --speed -n 5000 -t simple
# Generate charts
python benchmark.py --speed --plot # Speed chart
python benchmark.py --corpus --plot # Corpus F1 chart
python benchmark.py --speed --corpus --plot # Both charts
python benchmark.py --plot --plot-dir ./results # Custom output dir

| Library | Time | Per text | vs BlingFire |
|---|---|---|---|
| BlingFire | 0.06s | 0.06ms | 1.0x |
| NLTK | 0.32s | 0.32ms | 5.2x slower |
| syntok | 0.39s | 0.39ms | 6.4x slower |
| pySBD | 1.21s | 1.21ms | 20x slower |
| wtpsplit (ORT GPU) | ~0.73s | ~0.73ms | ~12x slower |
| wtpsplit (PyTorch GPU) | ~1.04s | ~1.04ms | ~17x slower |
| wtpsplit (CPU) | ~27s | ~27ms | ~450x slower |
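The columns above (total time, time per text, slowdown relative to the fastest library) can be derived from a simple timing loop. The following is a minimal sketch of that arithmetic, not the code in `benchmark.py`; `time_splitter`, `summarize`, and the placeholder splitters are hypothetical names used only for illustration.

```python
import time

def time_splitter(split, texts):
    """Return the wall-clock seconds needed to split every text once."""
    start = time.perf_counter()
    for text in texts:
        split(text)
    return time.perf_counter() - start

def summarize(totals, n_texts, baseline):
    """Turn {library: total_seconds} into table rows, fastest first."""
    base = totals[baseline]
    rows = []
    for name, total in sorted(totals.items(), key=lambda item: item[1]):
        rows.append({
            "library": name,
            "total_s": total,
            "per_text_ms": total / n_texts * 1000,
            "vs_baseline": total / base,  # 1.0 for the baseline itself
        })
    return rows

# Placeholder splitters stand in for the real libraries.
texts = ["First sentence. Second sentence."] * 1000
totals = {
    "placeholder_a": time_splitter(lambda t: t.split(". "), texts),
    "placeholder_b": time_splitter(lambda t: t.replace("!", ".").split(". "), texts),
}
for row in summarize(totals, len(texts), baseline="placeholder_a"):
    print(row)
```

Note that a total of 0.06s alongside 0.06ms per text corresponds to 1,000 texts per run.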
Test: "Dr. Smith went to Washington D.C. on Jan. 5th. He met with Sen. Johnson at 3 p.m."
| Library | Result | Correct? |
|---|---|---|
| BlingFire | ["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."] | ✅ |
| NLTK | ["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."] | ✅ |
| pySBD | ["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."] | ✅ |
| syntok | ["Dr. Smith went to Washington D.C. on Jan.", "5th.", "He met with Sen. Johnson at 3 p.m."] | ❌ |
| wtpsplit | ["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."] | ✅ |
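To make "correct" concrete, one common way to score a split is to compare sentence boundaries against a gold segmentation. The sketch below does that for the syntok output from the table. It is an assumption about how such scoring can be done, not necessarily how `benchmark.py` computes its accuracy or F1 numbers; `boundary_offsets` and `boundary_f1` are hypothetical helpers.

```python
GOLD = [
    "Dr. Smith went to Washington D.C. on Jan. 5th.",
    "He met with Sen. Johnson at 3 p.m.",
]
SYNTOK_OUTPUT = [
    "Dr. Smith went to Washington D.C. on Jan.",
    "5th.",
    "He met with Sen. Johnson at 3 p.m.",
]

def boundary_offsets(sentences):
    """Positions of internal sentence breaks, counted in non-whitespace
    characters so that differences in spacing do not matter."""
    offsets, position = set(), 0
    for sentence in sentences[:-1]:  # the end of the text is not a decision
        position += len("".join(sentence.split()))
        offsets.add(position)
    return offsets

def boundary_f1(predicted, gold):
    """F1 over internal boundaries; 1.0 means an identical segmentation."""
    pred, ref = boundary_offsets(predicted), boundary_offsets(gold)
    if not pred and not ref:
        return 1.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)

print("exact match:", SYNTOK_OUTPUT == GOLD)
print("boundary F1:", round(boundary_f1(SYNTOK_OUTPUT, GOLD), 3))
```

Here syntok finds the one true boundary but adds a spurious one after "Jan.", so recall is 1.0 while precision is 0.5.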
- Best overall: BlingFire (fastest + accurate)
- Best fallback: NLTK (pure Python, no binary deps, good accuracy)
- Best accuracy: wtpsplit (ML-based, but much slower)
- Avoid: syntok (splits after common abbreviations; it broke the test sentence at "Jan.")
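The first two recommendations suggest a simple preference order: use BlingFire when its binary package is installed, otherwise fall back to NLTK. A minimal sketch of that selection follows; `choose_backend` and `PREFERENCE` are hypothetical names, and the availability check is passed in as a parameter only so the logic can be exercised without either library installed.

```python
from importlib.util import find_spec

# Preference order taken from the recommendations above.
PREFERENCE = ("blingfire", "nltk")

def is_installed(module_name):
    """True if the top-level module can be imported in this environment."""
    return find_spec(module_name) is not None

def choose_backend(available=is_installed):
    """Return the name of the first preferred library that is available."""
    for name in PREFERENCE:
        if available(name):
            return name
    raise RuntimeError("Neither blingfire nor nltk is installed")
```

The returned name can then be used to pick the matching splitter function at startup.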
License: MIT