
Add tokenizer sweep digit variants#6571

Draft
pc0618 wants to merge 3 commits into main from codex/tokenizer-sweep-digits-variants

pc0618 commented Jun 23, 2026

Contributor

Summary

  • Add a Datakit tokenizer sweep DAG for issue #5821 (Experiment: Measure MoE sensitivity to tokenizer choice), covering the GPT-OSS and Llama HF-family tokenizers in both vanilla and place-aligned-digit variants.
  • Train 262k vocabularies on a deterministic 50B-token-equivalent sample, then derive 128k, 32k, and 8k BPE tokenizers by rank truncation.
  • Add the bounded numeric pre-tokenizer from issue #4915 (Number tokenization): isolate numeric runs, split each run right-to-left into place-aligned 3-digit groups, and cap regex digit runs at 510 characters to avoid catastrophic backtracking.
  • Add retokenization/metadata plumbing for train and holdout caches.
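
The place-aligned digit splitting described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the names `MAX_DIGIT_RUN`, `split_place_aligned`, and `pretokenize` are hypothetical, the sketch assumes ASCII digits, and it assumes a run longer than the cap is simply handled as independent capped chunks (the summary does not say how the PR treats that case).

```python
import re

# Cap taken from the PR summary; the constant name itself is hypothetical.
MAX_DIGIT_RUN = 510

# Bounded quantifier: a single match never spans more than MAX_DIGIT_RUN digits.
_DIGIT_RUN = re.compile(r"[0-9]{1,%d}" % MAX_DIGIT_RUN)


def split_place_aligned(digits: str) -> list[str]:
    """Split a digit string into 3-digit groups aligned to the right edge.

    "1234567" -> ["1", "234", "567"], so the last group is always the
    ones/tens/hundreds places, the one before it the thousands, and so on.
    """
    head = len(digits) % 3  # leftover digits that form a short leading group
    groups = [digits[:head]] if head else []
    groups.extend(digits[i:i + 3] for i in range(head, len(digits), 3))
    return groups


def pretokenize(text: str) -> list[str]:
    """Isolate numeric runs and replace each with its place-aligned groups."""
    pieces: list[str] = []
    pos = 0
    for match in _DIGIT_RUN.finditer(text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])  # non-numeric text, unchanged
        pieces.extend(split_place_aligned(match.group()))
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces
```

For example, `pretokenize("x 1234567 y")` yields `["x ", "1", "234", "567", " y"]`. In this sketch a run longer than 510 digits is aligned per chunk rather than against the true end of the number.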

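Deriving the smaller tokenizers by rank truncation can be illustrated with a minimal sketch, assuming a BPE tokenizer is represented as a base alphabet plus an ordered merge list (earliest merge = lowest rank). The function `truncate_bpe` and its arguments are hypothetical and are not taken from this PR.

```python
def truncate_bpe(
    base_tokens: list[str],
    merges: list[tuple[str, str]],
    target_vocab_size: int,
) -> tuple[dict[str, int], list[tuple[str, str]]]:
    """Keep only the lowest-rank merges that fit within target_vocab_size.

    Assumes every merge adds at most one token (the concatenation of its
    pair) and that `merges` is ordered by training rank.
    """
    n_merges = target_vocab_size - len(base_tokens)
    if n_merges < 0:
        raise ValueError("target_vocab_size is smaller than the base alphabet")

    kept = merges[:n_merges]  # a prefix of the ranked merge list
    vocab = {tok: idx for idx, tok in enumerate(base_tokens)}
    for left, right in kept:
        # setdefault guards against two merges producing the same string.
        vocab.setdefault(left + right, len(vocab))
    return vocab, kept
```

Because each truncation keeps a prefix of the same ranked merge list, the smaller vocabularies in this sketch are nested inside the larger ones, which is what makes a single 262k training run reusable for the 128k, 32k, and 8k variants.
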
Tests

  • python3 -m py_compile experiments/datakit_testbed/tokenizer_sweep_20260526.py tests/datakit_testbed/test_tokenizer_sweep_20260526.py
  • git diff --cached --check before commit
  • uv run pytest tests/datakit_testbed/test_tokenizer_sweep_20260526.py is currently blocked locally: the repo conftest import fails because the unsynced environment cannot import the workspace package fray (ModuleNotFoundError: No module named 'fray').
