429er/sentence-tokenizer-bench

Sentence Tokenizer Benchmarks

Benchmarks comparing speed and accuracy of Python sentence tokenization libraries.

Libraries Tested

| Library | Type | Notes |
| --- | --- | --- |
| BlingFire | Rule-based (C++) | Microsoft, fast |
| NLTK | Rule-based (Python) | Punkt tokenizer |
| pySBD | Rule-based (Python) | 22 languages |
| syntok | Rule-based (Python) | Also does word tokenization |
| wtpsplit | ML-based | SOTA accuracy, multilingual |
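
For orientation, here is a minimal sketch of how each library above can be called through a common `str -> list[str]` interface. It is not taken from `benchmark.py`: the function name `available_splitters`, the wrapper names, and the wtpsplit model `sat-3l-sm` are assumptions made for illustration. Libraries that are not installed are skipped, and each library is imported only when its wrapper is first called.

```python
import importlib.util
from typing import Callable, Dict, List

Splitter = Callable[[str], List[str]]


def _split_blingfire(text: str) -> List[str]:
    import blingfire
    # text_to_sentences returns a single string with one sentence per line.
    return blingfire.text_to_sentences(text).splitlines()


def _split_nltk(text: str) -> List[str]:
    # Requires the Punkt model data, which is downloaded separately.
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(text)


def _split_pysbd(text: str) -> List[str]:
    import pysbd
    segmenter = pysbd.Segmenter(language="en", clean=False)
    return [s.strip() for s in segmenter.segment(text)]


def _split_syntok(text: str) -> List[str]:
    import syntok.segmenter as segmenter
    # process() yields paragraphs, each a sequence of sentences made of tokens.
    return [
        "".join(tok.spacing + tok.value for tok in sentence).strip()
        for paragraph in segmenter.process(text)
        for sentence in paragraph
    ]


_sat_cache: dict = {}


def _split_wtpsplit(text: str) -> List[str]:
    # Model name is an assumption; the benchmark may use a different one.
    if "model" not in _sat_cache:
        from wtpsplit import SaT
        _sat_cache["model"] = SaT("sat-3l-sm")
    return list(_sat_cache["model"].split(text))


def available_splitters() -> Dict[str, Splitter]:
    """Map display name -> splitter for every library that is installed."""
    candidates = {
        "BlingFire": ("blingfire", _split_blingfire),
        "NLTK": ("nltk", _split_nltk),
        "pySBD": ("pysbd", _split_pysbd),
        "syntok": ("syntok", _split_syntok),
        "wtpsplit": ("wtpsplit", _split_wtpsplit),
    }
    return {
        name: fn
        for name, (module, fn) in candidates.items()
        if importlib.util.find_spec(module) is not None
    }


print(sorted(available_splitters()))
```

Keeping the imports inside the wrappers means a missing optional dependency (for example wtpsplit without its extra) only removes that entry instead of breaking the whole script.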

Installation

# Create venv and install core tokenizers
uv venv && uv sync

# Include wtpsplit (PyTorch)
uv sync --extra wtpsplit

# Include wtpsplit with ONNX Runtime (faster)
uv sync --extra wtpsplit-ort-gpu  # GPU
uv sync --extra wtpsplit-ort-cpu  # CPU only

# All optional dependencies
uv sync --all-extras

Usage

# Run all benchmarks (speed + edge case accuracy)
python benchmark.py

# Speed only
python benchmark.py --speed

# Edge case accuracy only
python benchmark.py --accuracy

# Corpus evaluation (NLTK treebank + UD English)
python benchmark.py --corpus

# Single corpus evaluation
python benchmark.py --corpus-only treebank
python benchmark.py --corpus-only ud

# Customize speed benchmark
python benchmark.py --speed -n 5000 -t simple

# Generate charts
python benchmark.py --speed --plot                    # Speed chart
python benchmark.py --corpus --plot                   # Corpus F1 chart
python benchmark.py --speed --corpus --plot           # Both charts
python benchmark.py --plot --plot-dir ./results       # Custom output dir
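
The README does not define the F1 score reported by the corpus evaluation. One common choice, shown here purely as an assumption, is to score predicted sentence-end positions against the gold ones. The helper names `boundary_offsets` and `boundary_f1` and the two-sentence example are hypothetical.

```python
from typing import List, Set, Tuple


def boundary_offsets(sentences: List[str]) -> Set[int]:
    """Sentence-end positions, counted in non-whitespace characters."""
    offsets: Set[int] = set()
    position = 0
    for sentence in sentences:
        # Ignore whitespace so tokenizers that trim differently still align.
        position += len("".join(sentence.split()))
        offsets.add(position)
    return offsets


def boundary_f1(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over sentence-end positions."""
    pred = boundary_offsets(predicted)
    ref = boundary_offsets(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = ["It rained.", "We stayed in."]
predicted = ["It rained. We stayed in."]  # tokenizer missed the first boundary
p, r, f1 = boundary_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=1.00 recall=0.50 f1=0.67
```

Note that this variant also counts the final boundary of the text, which every tokenizer gets right; an evaluation could just as reasonably exclude it.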

Results

Speed (1000 complex texts)

| Library | Time | Per text | vs BlingFire |
| --- | --- | --- | --- |
| BlingFire | 0.06s | 0.06ms | 1.0x |
| NLTK | 0.32s | 0.32ms | 5.2x slower |
| syntok | 0.39s | 0.39ms | 6.4x slower |
| pySBD | 1.21s | 1.21ms | 20x slower |
| wtpsplit (ORT GPU) | ~0.73s | ~0.73ms | ~12x slower |
| wtpsplit (PyTorch GPU) | ~1.04s | ~1.04ms | ~17x slower |
| wtpsplit (CPU) | ~27s | ~27ms | ~450x slower |
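
The columns above are related by simple arithmetic: with 1000 texts, total seconds and milliseconds per text are the same number, and the last column is the library's time divided by BlingFire's. A minimal sketch of such a measurement follows; it is not the code in `benchmark.py`. `naive_split` is a hypothetical stand-in tokenizer so the example runs with the standard library alone, and `time_splitter` and `slowdown` are illustrative names.

```python
import re
import time
from typing import Callable, List, Sequence, Tuple


def naive_split(text: str) -> List[str]:
    """Stand-in tokenizer: split after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def time_splitter(split: Callable[[str], List[str]],
                  texts: Sequence[str]) -> Tuple[float, float]:
    """Return (total seconds, milliseconds per text) for one pass over texts."""
    start = time.perf_counter()
    for text in texts:
        split(text)
    total = time.perf_counter() - start
    return total, total / len(texts) * 1000.0


def slowdown(total_seconds: float, baseline_seconds: float) -> float:
    """How many times slower a library is than the baseline."""
    return total_seconds / baseline_seconds


texts = ["Dr. Smith went to Washington D.C. on Jan. 5th. "
         "He met with Sen. Johnson at 3 p.m."] * 1000
total_s, per_text_ms = time_splitter(naive_split, texts)
print(f"naive_split: {total_s:.3f}s total, {per_text_ms:.4f}ms per text")

# Reproducing one ratio from the table: pySBD 1.21s vs BlingFire 0.06s.
print(f"pySBD: {slowdown(1.21, 0.06):.0f}x slower")  # pySBD: 20x slower
```

Ratios recomputed from the rounded times can differ slightly from the table (0.32 / 0.06 gives 5.3, not 5.2), presumably because the table was derived from unrounded measurements.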

Accuracy (Edge Cases)

Test: "Dr. Smith went to Washington D.C. on Jan. 5th. He met with Sen. Johnson at 3 p.m."

| Library | Result | Correct? |
| --- | --- | --- |
| BlingFire | ["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."] | Yes |
| NLTK | ["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."] | Yes |
| pySBD | ["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."] | Yes |
| syntok | ["Dr. Smith went to Washington D.C. on Jan.", "5th.", "He met with Sen. Johnson at 3 p.m."] | No |
| wtpsplit | ["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."] | Yes |
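
The check behind the "Correct?" column can be sketched as a comparison against the expected two-sentence split. The helper names `is_correct` and `spurious_sentences` are hypothetical and not taken from `benchmark.py`; the test sentence and the syntok output are copied from the results above.

```python
from typing import List

TEXT = ("Dr. Smith went to Washington D.C. on Jan. 5th. "
        "He met with Sen. Johnson at 3 p.m.")

# Expected segmentation: two sentences, no break after any abbreviation.
EXPECTED = [
    "Dr. Smith went to Washington D.C. on Jan. 5th.",
    "He met with Sen. Johnson at 3 p.m.",
]


def is_correct(result: List[str], expected: List[str] = EXPECTED) -> bool:
    """True when the output equals the expected sentences.

    Surrounding whitespace and empty strings are ignored, since libraries
    differ in how they trim their output.
    """
    cleaned = [s.strip() for s in result if s.strip()]
    return cleaned == expected


def spurious_sentences(result: List[str],
                       expected: List[str] = EXPECTED) -> List[str]:
    """Sentences in the output that do not appear in the expected list."""
    return [s.strip() for s in result
            if s.strip() and s.strip() not in expected]


syntok_result = [
    "Dr. Smith went to Washington D.C. on Jan.",
    "5th.",
    "He met with Sen. Johnson at 3 p.m.",
]

print(is_correct(EXPECTED))               # True
print(is_correct(syntok_result))          # False
print(spurious_sentences(syntok_result))
# ['Dr. Smith went to Washington D.C. on Jan.', '5th.']
```

An exact-match check like this is strict: a single extra split fails the whole case, which suits a small edge-case suite but says nothing about how close a wrong answer was.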

Recommendations

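  • Best overall: BlingFire (fastest, and correct on the edge-case test)
  • Best fallback: NLTK (pure Python, no binary dependencies, good accuracy)
  • Best accuracy: wtpsplit (ML-based, but roughly 12x to 450x slower than BlingFire depending on backend)
  • Avoid: syntok (fails on common abbreviation patterns, e.g. it splits after "Jan." in the edge case above)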

License

MIT
