This repository provides a preprocessing pipeline for preparing English–Hindi parallel sentences for machine translation tasks.
It handles tokenization, vocabulary construction, and conversion of raw text into numerical IDs suitable for model training.
- Load bilingual dataset (English → Hindi).
- Tokenize English sentences using T5 tokenizer.
- Tokenize Hindi sentences using Indic NLP Library.
- Build vocabulary for Hindi tokens.
- Convert tokens into integer IDs for both languages.
- Ready to be used with PyTorch `Dataset` and `DataLoader`.
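As a rough illustration of the two tokenization steps, here is a minimal sketch. The `t5-small` checkpoint and the `trivial_tokenize` helper are assumptions for the example; the actual checkpoint and tokenization call used in the preprocessing code may differ.

```python
from transformers import AutoTokenizer
from indicnlp.tokenize import indic_tokenize

# English: subword tokenization with a pretrained T5 tokenizer
# ("t5-small" is an illustrative choice, not necessarily the one used here).
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
en_ids = t5_tokenizer("How are you?").input_ids
print(en_ids)  # integer token IDs from the T5 vocabulary

# Hindi: word-level tokenization with the Indic NLP Library.
hi_tokens = indic_tokenize.trivial_tokenize("तुम कैसे हो?", lang="hi")
print(hi_tokens)  # e.g. ['तुम', 'कैसे', 'हो', '?']
```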
```
├── data/
├── preprocess.ipynb
└── README.md
```
- Place your dataset (CSV) in the data/ directory. Example format:
  | SrcSent (English) | DstSent (Hindi) |
  |-------------------|-----------------|
  | How are you?      | तुम कैसे हो?     |
  | Thank you         | धन्यवाद          |
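For illustration, a CSV in this format could be loaded with pandas as sketched below; the filename `data/parallel.csv` is a placeholder for whatever file you place in `data/`.

```python
import pandas as pd

# "data/parallel.csv" is a placeholder path; point this at your own CSV.
df = pd.read_csv("data/parallel.csv")

src_sentences = df["SrcSent"].tolist()  # English source sentences
dst_sentences = df["DstSent"].tolist()  # Hindi target sentences
```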
- Run the preprocessing script / notebook:

```bash
python preprocess.py
```
- Outputs:
  - English sentences → token IDs (T5 vocabulary).
  - Hindi sentences → token IDs (custom vocabulary).
  - Hindi vocabulary dictionary saved for training.
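A sketch of how the Hindi vocabulary, ID conversion, and PyTorch `Dataset` might fit together. The special tokens, helper names, and saved-file path are illustrative assumptions, not the exact implementation in the notebook.

```python
import json
from collections import Counter

import torch
from torch.utils.data import DataLoader, Dataset

def build_vocab(token_lists, min_freq=1):
    """Map Hindi tokens to integer IDs (special tokens are an assumption)."""
    counter = Counter(tok for sent in token_lists for tok in sent)
    vocab = {"<pad>": 0, "<unk>": 1, "<sos>": 2, "<eos>": 3}
    for tok, freq in counter.items():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def tokens_to_ids(tokens, vocab):
    """Replace unknown tokens with the <unk> ID."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

class ParallelDataset(Dataset):
    """Pairs of (English T5 IDs, Hindi custom-vocab IDs)."""
    def __init__(self, src_ids, dst_ids):
        self.src_ids = src_ids
        self.dst_ids = dst_ids

    def __len__(self):
        return len(self.src_ids)

    def __getitem__(self, idx):
        return (torch.tensor(self.src_ids[idx], dtype=torch.long),
                torch.tensor(self.dst_ids[idx], dtype=torch.long))

# Save the Hindi vocabulary for reuse at training time
# ("data/hindi_vocab.json" is a placeholder path).
# with open("data/hindi_vocab.json", "w", encoding="utf-8") as f:
#     json.dump(vocab, f, ensure_ascii=False)
```

Because sentence lengths differ, batching such a dataset with `DataLoader` typically requires padding, e.g. via a custom `collate_fn`.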
This project is licensed under the MIT License.