Comprehensive Awadhi dialect speech dataset for linguistic research and speech recognition enhancement
Welcome to the repository for the Awadhi_Speech_Dataset. Awadhi is an Indo-Aryan language, often classified as a dialect of Hindi, spoken primarily in the Awadh region of Uttar Pradesh, India, and in parts of Nepal. As of 2011, it had approximately 3.85 million speakers in India and 500,000 in Nepal. This dataset aims to facilitate linguistic research, natural language processing, and cultural preservation efforts related to the Awadhi language.
The dataset includes a variety of linguistic resources for Awadhi, such as:
- Speech recordings and transcriptions
- Text samples and translations
- Annotations for linguistic analysis
Dataset | Description |
---|---|
KMI Awadhi Corpus | A raw corpus of approximately 70,000 tokens and a POS-annotated corpus of approximately 20,000 tokens. The raw corpus is in the 'source' directory and the annotated corpus is in the 'annotation' directory. The annotations are in CoNLL-2000 format. |
VarDial 2018 Language Identification Dataset | Includes Awadhi as one of the languages in the dataset. |
Awadhi speech dataset | Contains the transcriptions of the speech data. |
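
Since the KMI Awadhi Corpus annotations are described as CoNLL-2000 style, the sketch below shows one way such a file might be read in Python. It assumes one token per line with whitespace-separated columns (token first, POS tag second) and blank lines between sentences; the actual column layout of the corpus files should be checked before relying on it. The function names `read_conll` and `tag_counts`, the placeholder tokens, and the tags in `SAMPLE` are illustrative and not taken from the corpus.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# A sentence is a list of (token, pos_tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from CoNLL-style lines.

    Assumption: column 1 is the token, column 2 is the POS tag,
    and a blank line marks a sentence boundary.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        if len(columns) < 2:
            # Skip lines that do not carry a tag.
            continue
        sentence.append((columns[0], columns[1]))
    # Emit the last sentence when the input does not end with a blank line.
    if sentence:
        yield sentence


def tag_counts(sentences: Iterable[Sentence]) -> Counter:
    """Count how often each POS tag occurs."""
    return Counter(tag for sent in sentences for _, tag in sent)


# Placeholder data for illustration only (not real corpus content).
SAMPLE = """\
w1 NN
w2 VM
. SYM

w3 PRP
w4 VM
"""

sentences = list(read_conll(SAMPLE.splitlines()))
print(len(sentences), "sentences")
print(tag_counts(sentences).most_common())
```

To run this against a real file, pass an open file handle instead of `SAMPLE.splitlines()`, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to a file in the 'annotation' directory.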
To use this dataset in your projects, clone this repository:
git clone https://github.com/yourusername/awadhi-language-dataset.git
Contributions to expand and enhance this dataset are welcome. Please refer to our Contribution.md for guidelines.
Join our community to discuss potential applications of this dataset and collaborate on Awadhi language projects. For support or inquiries, please open an issue in this repository.
We extend our gratitude to all contributors and researchers who have enriched this repository with valuable data and insights.