Sentence Tokenizer Benchmarks

Benchmarks comparing speed and accuracy of Python sentence tokenization libraries.

Libraries Tested

Library	Type	Notes
BlingFire	Rule-based (C++)	Microsoft, fast
NLTK	Rule-based (Python)	Punkt tokenizer
pySBD	Rule-based (Python)	22 languages
syntok	Rule-based (Python)	Also does word tokenization
wtpsplit	ML-based	SOTA accuracy, multilingual
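
For orientation, here is a minimal sketch of how three of the rule-based libraries above are typically called from Python. It is not taken from benchmark.py; `load_splitters` is a hypothetical helper name, and the sketch simply skips any package that is not installed. wtpsplit is left out because its model choice is not stated in this README.

```python
from typing import Callable, Dict, List


def load_splitters() -> Dict[str, Callable[[str], List[str]]]:
    """Return a name -> splitter mapping for the packages that can be imported.

    Hypothetical helper for illustration; benchmark.py may wrap the
    libraries differently.
    """
    splitters: Dict[str, Callable[[str], List[str]]] = {}

    try:
        import blingfire

        # text_to_sentences returns a single string, one sentence per line.
        splitters["BlingFire"] = lambda text: blingfire.text_to_sentences(text).split("\n")
    except ImportError:
        pass

    try:
        import nltk

        # sent_tokenize needs the Punkt model data to be downloaded beforehand.
        splitters["NLTK"] = nltk.sent_tokenize
    except ImportError:
        pass

    try:
        import pysbd

        # clean=False keeps the original text untouched.
        segmenter = pysbd.Segmenter(language="en", clean=False)
        splitters["pySBD"] = segmenter.segment
    except ImportError:
        pass

    return splitters


if __name__ == "__main__":
    sample = "Dr. Smith went to Washington D.C. on Jan. 5th. He met with Sen. Johnson at 3 p.m."
    for name, split in load_splitters().items():
        print(name, split(sample))
```

Wrapping each library behind the same `str -> list[str]` callable makes it straightforward to run every tokenizer through one timing or accuracy loop.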

Installation

# Create venv and install core tokenizers
uv venv && uv sync

# Include wtpsplit (PyTorch)
uv sync --extra wtpsplit

# Include wtpsplit with ONNX Runtime (faster)
uv sync --extra wtpsplit-ort-gpu  # GPU
uv sync --extra wtpsplit-ort-cpu  # CPU only

# All optional dependencies
uv sync --all-extras

Usage

# Run all benchmarks (speed + edge case accuracy)
python benchmark.py

# Speed only
python benchmark.py --speed

# Edge case accuracy only
python benchmark.py --accuracy

# Corpus evaluation (NLTK treebank + UD English)
python benchmark.py --corpus

# Single corpus evaluation
python benchmark.py --corpus-only treebank
python benchmark.py --corpus-only ud

# Customize speed benchmark
python benchmark.py --speed -n 5000 -t simple

# Generate charts
python benchmark.py --speed --plot                    # Speed chart
python benchmark.py --corpus --plot                   # Corpus F1 chart
python benchmark.py --speed --corpus --plot           # Both charts
python benchmark.py --plot --plot-dir ./results       # Custom output dir
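
The corpus mode reports an F1 score. This README does not say how that score is computed, so the following is only one plausible sketch: a boundary-level F1 that compares where gold and predicted sentences end. The function names are hypothetical, and offsets are counted in non-whitespace characters so that differences in spacing between tokenizers do not matter.

```python
from typing import Iterable, Set


def boundary_offsets(sentences: Iterable[str]) -> Set[int]:
    """End position of each sentence, counted in non-whitespace characters."""
    offsets: Set[int] = set()
    position = 0
    for sentence in sentences:
        length = sum(1 for ch in sentence if not ch.isspace())
        if length == 0:
            continue  # ignore empty segments
        position += length
        offsets.add(position)
    return offsets


def boundary_f1(gold: Iterable[str], predicted: Iterable[str]) -> float:
    """F1 over sentence-end positions (assumed metric, not confirmed by the source)."""
    gold_ends = boundary_offsets(gold)
    pred_ends = boundary_offsets(predicted)
    true_positives = len(gold_ends & pred_ends)
    precision = true_positives / len(pred_ends) if pred_ends else 0.0
    recall = true_positives / len(gold_ends) if gold_ends else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [
    "Dr. Smith went to Washington D.C. on Jan. 5th.",
    "He met with Sen. Johnson at 3 p.m.",
]
over_split = [
    "Dr. Smith went to Washington D.C. on Jan.",
    "5th.",
    "He met with Sen. Johnson at 3 p.m.",
]
print(round(boundary_f1(gold, over_split), 2))  # 0.8
```

Note that the final boundary of a text always matches under this definition, which inflates scores on short inputs; some evaluations exclude it.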

Results

Speed (1000 complex texts)

Library	Time	Per text	vs BlingFire
BlingFire	0.06s	0.06ms	1.0x
NLTK	0.32s	0.32ms	5.2x slower
syntok	0.39s	0.39ms	6.4x slower
pySBD	1.21s	1.21ms	20x slower
wtpsplit (ORT GPU)	~0.73s	~0.73ms	~12x slower
wtpsplit (PyTorch GPU)	~1.04s	~1.04ms	~17x slower
wtpsplit (CPU)	~27s	~27ms	~450x slower
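
The columns above can be derived from one wall-clock total per library: per-text time is the total divided by the number of texts, and the last column is the total divided by the baseline's total. The sketch below shows that arithmetic with a small timing loop. It is an illustration, not the code in benchmark.py: `naive_split` is a stand-in regex splitter, and taking the best of several repeats is an assumption.

```python
import re
import time
from typing import Callable, Dict, List, Tuple

Splitter = Callable[[str], List[str]]


def naive_split(text: str) -> List[str]:
    """Stand-in splitter for the demo; not one of the benchmarked libraries."""
    return re.split(r"(?<=[.!?])\s+", text)


def time_splitter(split: Splitter, texts: List[str], repeats: int = 3) -> float:
    """Best-of-N wall-clock seconds to split every text once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            split(text)
        best = min(best, time.perf_counter() - start)
    return best


def speed_rows(
    splitters: Dict[str, Splitter], texts: List[str], baseline: str
) -> List[Tuple[str, float, float, float]]:
    """Rows of (name, total seconds, ms per text, ratio to baseline), fastest first."""
    totals = {name: time_splitter(fn, texts) for name, fn in splitters.items()}
    base = totals[baseline]
    rows = []
    for name, total in sorted(totals.items(), key=lambda item: item[1]):
        rows.append((name, total, total / len(texts) * 1000, total / base))
    return rows


if __name__ == "__main__":
    texts = ["Dr. Smith arrived at 3 p.m. He left early."] * 1000
    for name, total, per_text_ms, ratio in speed_rows({"naive": naive_split}, texts, "naive"):
        print(f"{name}\t{total:.2f}s\t{per_text_ms:.2f}ms\t{ratio:.1f}x")
```

With 1000 texts, a total in seconds and a per-text time in milliseconds share the same digits, which is why those two columns look identical in the table.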

Accuracy (Edge Cases)

Test: "Dr. Smith went to Washington D.C. on Jan. 5th. He met with Sen. Johnson at 3 p.m."

Library	Result	Correct?
BlingFire	`["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."]`	✅
NLTK	`["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."]`	✅
pySBD	`["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."]`	✅
syntok	`["Dr. Smith went to Washington D.C. on Jan.", "5th.", "He met with Sen. Johnson at 3 p.m."]`	❌
wtpsplit	`["Dr. Smith went to Washington D.C. on Jan. 5th.", "He met with Sen. Johnson at 3 p.m."]`	✅
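
A check like the one in the table can be expressed as an exact comparison between a splitter's output and the expected sentence list. The sketch below is a hypothetical helper (`check_case` is not a name from benchmark.py), demonstrated with a naive regex splitter that is not one of the benchmarked libraries.

```python
import re
from typing import Callable, List, Tuple

TEXT = "Dr. Smith went to Washington D.C. on Jan. 5th. He met with Sen. Johnson at 3 p.m."
EXPECTED = [
    "Dr. Smith went to Washington D.C. on Jan. 5th.",
    "He met with Sen. Johnson at 3 p.m.",
]


def check_case(
    split: Callable[[str], List[str]], text: str, expected: List[str]
) -> Tuple[bool, List[str]]:
    """Return (passed, normalized output) for one edge-case text."""
    # Strip surrounding whitespace and drop empty segments before comparing.
    got = [segment.strip() for segment in split(text) if segment.strip()]
    return got == expected, got


def regex_split(text: str) -> List[str]:
    """Naive baseline: split after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


passed, got = check_case(regex_split, TEXT, EXPECTED)
print(passed, len(got))  # False 6
```

The naive splitter breaks after every abbreviation that is followed by a space, which is the failure mode this edge case is designed to expose.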

Recommendations

Best overall: BlingFire (fastest + accurate)
Best fallback: NLTK (pure Python, no binary deps, good accuracy)
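Best accuracy: wtpsplit (ML-based, but roughly 12x to 450x slower than BlingFire depending on backend)
Avoid: syntok (splits after "Jan." in the edge-case test, so it mishandles a common abbreviation pattern)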

License

MIT
