Add tokenizer sweep digit variants by pc0618 · Pull Request #6571 · marin-community/marin

pc0618 · 2026-06-23T04:30:49Z

Summary

Add a Datakit tokenizer sweep DAG for issue #5821 (Experiment: Measure MoE sensitivity to tokenizer choice), covering the GPT-OSS and Llama HF-family tokenizers in both vanilla and place-aligned-digit variants.
Train 262k vocabularies on a deterministic 50B-token-equivalent sample, then derive 128k, 32k, and 8k BPE tokenizers by rank truncation.
Add the bounded numeric pre-tokenizer from issue #4915 (Number tokenization): isolate numeric runs, split them right-to-left into place-aligned 3-digit groups, and cap regex digit runs at 510 chars to avoid catastrophic backtracking.
Add retokenization/metadata plumbing for train and holdout caches.
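
The place-aligned digit split described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names MAX_DIGIT_RUN, group_digits, and pre_tokenize are hypothetical, the regex matches ASCII digits only, and the handling of runs longer than the 510-char cap is an assumption (the summary does not say what the real pre-tokenizer does with them).

```python
import re

# Cap taken from the PR summary; everything else here is an assumption.
MAX_DIGIT_RUN = 510

# A bounded quantifier keeps any single match to at most MAX_DIGIT_RUN digits.
DIGIT_RUN = re.compile(r"[0-9]{1,%d}" % MAX_DIGIT_RUN)


def group_digits(run: str, width: int = 3) -> list[str]:
    """Split a digit run into groups of `width`, aligned to the right edge."""
    head = len(run) % width  # leftover digits form a short leading group
    groups = [run[:head]] if head else []
    groups.extend(run[i:i + width] for i in range(head, len(run), width))
    return groups


def pre_tokenize(text: str) -> list[str]:
    """Isolate numeric runs and emit place-aligned groups; keep other text as-is."""
    pieces: list[str] = []
    pos = 0
    for match in DIGIT_RUN.finditer(text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])
        pieces.extend(group_digits(match.group()))
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


print(pre_tokenize("cost 1234567 usd"))
# ['cost ', '1', '234', '567', ' usd']
```

In this sketch a run longer than the cap is matched as consecutive chunks, so place alignment holds within each chunk rather than across the whole run. The split is lossless: joining the pieces reproduces the input.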

python3 -m py_compile experiments/datakit_testbed/tokenizer_sweep_20260526.py tests/datakit_testbed/test_tokenizer_sweep_20260526.py
git diff --cached --check before commit
uv run pytest tests/datakit_testbed/test_tokenizer_sweep_20260526.py is currently blocked locally: the repo conftest import fails because the unsynced environment cannot import the workspace package fray (ModuleNotFoundError: No module named 'fray').


pc0618 added 3 commits June 22, 2026 21:30

Add tokenizer sweep digit variants

51ce814

Generalize tokenizer sweep recipe

e562b63

Merge remote-tracking branch 'origin/main' into codex/tokenizer-sweep-digits-variants

922b68f