This repository provides a preprocessing pipeline for preparing English–Hindi parallel sentences for machine translation tasks.
It handles tokenization, vocabulary construction, and conversion of raw text into numerical IDs suitable for model training.
- Load bilingual dataset (English → Hindi).
- Tokenize English sentences using T5 tokenizer.
- Tokenize Hindi sentences using Indic NLP Library.
- Build vocabulary for Hindi tokens.
- Convert tokens into integer IDs for both languages.
- Ready to be used with PyTorch `Dataset` and `DataLoader`.
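As a rough illustration of the two tokenization steps, here is a minimal sketch. The `t5-small` checkpoint and the `trivial_tokenize` helper are assumptions for the example; the actual checkpoint and tokenization call used in the preprocessing code may differ.

```python
from transformers import AutoTokenizer
from indicnlp.tokenize import indic_tokenize

# English: subword tokenization with a pretrained T5 tokenizer
# ("t5-small" is an illustrative choice, not necessarily the one used here).
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
en_ids = t5_tokenizer("How are you?").input_ids
print(en_ids)  # integer token IDs from the T5 vocabulary

# Hindi: word-level tokenization with the Indic NLP Library.
hi_tokens = indic_tokenize.trivial_tokenize("तुम कैसे हो?", lang="hi")
print(hi_tokens)  # e.g. ['तुम', 'कैसे', 'हो', '?']
```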
```
├── data/
├── preprocess.ipynb
└── README.md
```
- Place your dataset (CSV) in the data/ directory. Example format:
  | SrcSent (English) | DstSent (Hindi) |
  |-------------------|-----------------|
  | How are you?      | तुम कैसे हो?     |
  | Thank you         | धन्यवाद          |
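For illustration, a CSV in this format could be loaded with pandas as sketched below; the filename `data/parallel.csv` is a placeholder for whatever file you place in `data/`.

```python
import pandas as pd

# "data/parallel.csv" is a placeholder path; point this at your own CSV.
df = pd.read_csv("data/parallel.csv")

src_sentences = df["SrcSent"].tolist()  # English source sentences
dst_sentences = df["DstSent"].tolist()  # Hindi target sentences
```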
- Run the preprocessing script / notebook:

```bash
python preprocess.py
```
- Outputs:
  - English sentences → token IDs (T5 vocabulary).
  - Hindi sentences → token IDs (custom vocabulary).
  - Hindi vocabulary dictionary saved for training.
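A sketch of how the Hindi vocabulary, ID conversion, and PyTorch `Dataset` might fit together. The special tokens, helper names, and saved-file path are illustrative assumptions, not the exact implementation in the notebook.

```python
import json
from collections import Counter

import torch
from torch.utils.data import DataLoader, Dataset

def build_vocab(token_lists, min_freq=1):
    """Map Hindi tokens to integer IDs (special tokens are an assumption)."""
    counter = Counter(tok for sent in token_lists for tok in sent)
    vocab = {"<pad>": 0, "<unk>": 1, "<sos>": 2, "<eos>": 3}
    for tok, freq in counter.items():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def tokens_to_ids(tokens, vocab):
    """Replace unknown tokens with the <unk> ID."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

class ParallelDataset(Dataset):
    """Pairs of (English T5 IDs, Hindi custom-vocab IDs)."""
    def __init__(self, src_ids, dst_ids):
        self.src_ids = src_ids
        self.dst_ids = dst_ids

    def __len__(self):
        return len(self.src_ids)

    def __getitem__(self, idx):
        return (torch.tensor(self.src_ids[idx], dtype=torch.long),
                torch.tensor(self.dst_ids[idx], dtype=torch.long))

# Save the Hindi vocabulary for reuse at training time
# ("data/hindi_vocab.json" is a placeholder path).
# with open("data/hindi_vocab.json", "w", encoding="utf-8") as f:
#     json.dump(vocab, f, ensure_ascii=False)
```

Because sentence lengths differ, batching such a dataset with `DataLoader` typically requires padding, e.g. via a custom `collate_fn`.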
This project is licensed under the MIT License.