Unnati-Gupta24/Building-NeuralNetwork

English–Hindi Sentence Preprocessing Pipeline

This repository provides a preprocessing pipeline for preparing English–Hindi parallel sentences for machine translation tasks.
It handles tokenization, vocabulary construction, and conversion of raw text into numerical IDs suitable for model training.


📌 Features

  • Load a bilingual dataset (English → Hindi).
  • Tokenize English sentences using the T5 tokenizer.
  • Tokenize Hindi sentences using the Indic NLP Library.
  • Build a vocabulary from the Hindi tokens.
  • Convert tokens into integer IDs for both languages.
  • Produce output ready for use with a PyTorch Dataset and DataLoader.
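
The Hindi side of the steps above (tokenize, build a vocabulary, convert to IDs) can be sketched roughly as follows. This is a minimal illustration, not the code in the notebook: `tokenize_hi` is a hypothetical regex-based stand-in for the Indic NLP Library tokenizer, and the special tokens (`<pad>`, `<unk>`, `<sos>`, `<eos>`), their IDs, and the helper names `build_vocab`, `encode`, and `pad_batch` are all assumptions made for the example. The English side is not shown because it relies on the pretrained T5 vocabulary.

```python
import re
from collections import Counter

# Special tokens and their order are assumptions for this sketch.
SPECIALS = ["<pad>", "<unk>", "<sos>", "<eos>"]


def tokenize_hi(sentence):
    """Hypothetical stand-in tokenizer: splits words and common punctuation."""
    return re.findall(r"[^\s?!,.।]+|[?!,.।]", sentence)


def build_vocab(tokenized_sentences, min_freq=1):
    """Map each token to an integer ID, special tokens first."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {tok: idx for idx, tok in enumerate(SPECIALS)}
    # Counter keeps first-seen order, so IDs are deterministic.
    for tok, freq in counts.items():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Convert tokens to IDs, wrapping with <sos>/<eos>; unknowns map to <unk>."""
    unk = vocab["<unk>"]
    body = [vocab.get(tok, unk) for tok in tokens]
    return [vocab["<sos>"]] + body + [vocab["<eos>"]]


def pad_batch(sequences, pad_id):
    """Right-pad ID lists to equal length, as a collate function might."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


sentences = ["तुम कैसे हो?", "धन्यवाद"]
tokenized = [tokenize_hi(s) for s in sentences]
vocab = build_vocab(tokenized)
ids = [encode(toks, vocab) for toks in tokenized]
padded = pad_batch(ids, vocab["<pad>"])
```

Padding is included only because batches passed through a DataLoader usually need equal-length sequences. Whether the notebook pads at this stage is not stated here.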

📂 Project Structure

├── data/                
├── preprocess.ipynb
└── README.md           

🚀 Usage

  1. Place your dataset (CSV) in the data/ directory. Example format:
SrcSent (English)	DstSent (Hindi)
How are you?	तुम कैसे हो?
Thank you	धन्यवाद
  2. Run the preprocessing notebook (preprocess.ipynb) or, if you have a script version of it, the script:
python preprocess.py
  3. Outputs:
  • English sentences → token IDs (T5 vocabulary).
  • Hindi sentences → token IDs (custom vocabulary).
  • Hindi vocabulary dictionary saved for training.
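
As a rough sketch of the input and output ends of this workflow, the snippet below reads sentence pairs from CSV text and serializes a vocabulary dictionary as JSON. The column names `SrcSent` and `DstSent`, the comma delimiter, the JSON format for the saved vocabulary, and the helper names `load_pairs` and `dump_vocab` are assumptions for illustration. The notebook may use different names or a different storage format.

```python
import csv
import io
import json

# In-memory stand-in for a file under data/ (assumed comma-separated).
SAMPLE_CSV = "SrcSent,DstSent\nHow are you?,तुम कैसे हो?\nThank you,धन्यवाद\n"


def load_pairs(handle, src_col="SrcSent", dst_col="DstSent"):
    """Read (English, Hindi) sentence pairs from an open CSV handle."""
    reader = csv.DictReader(handle)
    return [(row[src_col].strip(), row[dst_col].strip()) for row in reader]


def dump_vocab(vocab):
    """Serialize the vocabulary; ensure_ascii=False keeps Devanagari readable."""
    return json.dumps(vocab, ensure_ascii=False, indent=2)


pairs = load_pairs(io.StringIO(SAMPLE_CSV))

# Tiny hand-written vocabulary, used only to show the save/load round trip.
vocab = {"<pad>": 0, "<unk>": 1, "धन्यवाद": 2}
serialized = dump_vocab(vocab)
restored = json.loads(serialized)
```

When reading from disk instead of a string, open the file with `encoding="utf-8"` and `newline=""` so that Hindi text and quoted fields are handled correctly.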

📜 License

This project is licensed under the MIT License.
