A project aimed at understanding components of LLM tokenizers

mvish7/tokenizers

Understanding Tokenizers

This project contains rudimentary implementations of two tokenizers, BPE and SentencePiece. Its purpose is to implement the basic components of these tokenizers without any external libraries.

Possible key takeaways:

Using this project, you can:

  • Understand the building blocks of tokenizers
  • Play around with the basic components of a BPE tokenizer
  • See how a SentencePiece tokenizer can be trained

BPE Tokenizer

BPE (byte-pair encoding) is a popular tokenization algorithm. It is implemented in the tiktoken library and is most famously used by OpenAI in the GPT-series models. The notebook bpe_tokenizer.ipynb follows Karpathy's tokenizer lecture.
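
To make the core idea concrete, here is a minimal sketch of byte-level BPE training using only the standard library. It is not taken from bpe_tokenizer.ipynb; the function names (`get_pair_counts`, `merge`, `train_bpe`, `decode`) are illustrative assumptions. It starts from raw UTF-8 bytes and repeatedly merges the most frequent adjacent pair into a new token id.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from the 256 byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id_a, id_b) -> new id, kept in the order learned
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Map token ids back to text by expanding merges into bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dicts preserve insertion order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids, decode(ids, merges))
```

Each merge shortens the sequence and adds one vocabulary entry, so `num_merges` controls the trade-off between vocabulary size and sequence length.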

SentencePiece Tokenizer

SentencePiece is another popular tokenizer, famously used by the Llama and Mistral models and available in the SentencePiece library. Here, sentencepeace.py provides a clean and simple implementation of it.
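
Two ideas behind SentencePiece can be sketched in a few lines. Whitespace is escaped to the meta symbol "▁" (U+2581) so it is treated like any other character. With the unigram model, which is the SentencePiece library's default, text is split into the most probable sequence of pieces. The sketch below is not the code in sentencepeace.py, which may take a different approach. The helper names and the toy log-probabilities are assumptions made up for illustration; a real model learns them during training.

```python
import math

WS = "\u2581"  # "▁", the meta symbol that stands in for a space


def normalize(text):
    """Add a dummy prefix and escape spaces so whitespace is an ordinary symbol."""
    return WS + text.replace(" ", WS)


def viterbi_segment(text, log_probs):
    """Return the highest-scoring split of `text` into pieces from `log_probs`."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max(len(p) for p in log_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces = []
    i = n
    while i > 0:  # walk the back-pointers from the end of the text
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


def detokenize(pieces):
    """Join pieces, restore spaces, and drop the dummy prefix."""
    s = "".join(pieces).replace(WS, " ")
    return s[1:] if s.startswith(" ") else s


# Hypothetical toy vocabulary with made-up log-probabilities.
TOY_LOG_PROBS = {
    WS + "hello": -2.0, WS + "hell": -3.0, WS + "world": -2.5, WS: -4.0,
    "h": -5.0, "e": -5.0, "l": -5.0, "o": -5.0, "w": -5.0, "r": -5.0, "d": -5.0,
}

pieces = viterbi_segment(normalize("hello world"), TOY_LOG_PROBS)
print(pieces, detokenize(pieces))
```

Because spaces survive as "▁" inside the pieces, detokenization is a plain string join and replace, with no language-specific rules needed.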
