GitHub - s1998/progressiveTrainCodeSwitch: Code for the Findings of EMNLP 2022 paper "Progressive Sentiment Analysis for Code-Switched Text Data"

Progressive Sentiment Analysis for Code-Switched Text Data

Model
Training
Citation

Model

Most of the experiments have been carried out using bert-base-multilingual-cased as the backbone model.

The framework is present in the file models/ds_model.py.

Required inputs

External English data for pretraining should be placed in the data/english_data file.

For each code-switched dataset, create a folder named after the dataset inside data and add train.txt and validation.txt to it.

For example, for the sail dataset used in GLUECoS, files should be data/sail/train.txt and data/sail/validation.txt.

Each row should contain the text and its label (i.e. positive or negative). Words in Hindi/Tamil should be transliterated to Devanagari script.

You can obtain the Hindi-English (sail) or Spanish-English (enes) dataset from here and put it in the data folder. The Tamil-English (taen) dataset can be downloaded from here.
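
Before training, it can help to check that train.txt and validation.txt follow the layout described above. The following is a minimal sketch, not the repository's own loader in load_data.py: the tab separator and the helper names read_rows and dataset_files are assumptions, since this README does not say how the text and the label are delimited.

```python
import os
from typing import List, Tuple


def dataset_files(dataset: str, root: str = "data") -> List[str]:
    """Return the two paths the README asks for, e.g. data/sail/train.txt."""
    return [os.path.join(root, dataset, name) for name in ("train.txt", "validation.txt")]


def read_rows(path: str, sep: str = "\t") -> List[Tuple[str, str]]:
    """Read (text, label) pairs, one per line. `sep` is an assumed delimiter."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            # Split on the last separator so the text itself may contain `sep`.
            text, found, label = line.rpartition(sep)
            label = label.strip()
            if not found or not text.strip() or not label:
                raise ValueError(f"{path}:{line_no}: expected '<text><sep><label>'")
            rows.append((text, label))
    return rows
```

Calling read_rows on each path returned by dataset_files("sail") raises a ValueError at the first malformed row, which is easier to act on than a failure midway through training.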

Commands

main_ds.py accepts the following arguments:

arg_data selects the dataset; it currently accepts sail, enes or taen.
external_data_imbalance_fix specifies how to handle class imbalance in the source dataset used for pretraining (the example commands use upsample).
seed fixes the random seed across experiments.
zsl_ds_us_data_merged_multiple_m_half_data, zsl_ds_us_data_merged_multiple_m_half_data_many_runs or supervised selects a single run, multiple runs or a supervised run, respectively.
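
To make the shape of these arguments concrete, here is a hypothetical argparse declaration that accepts the flags listed above. It is a sketch for illustration, not the parser in main_ds.py: the defaults, the boolean (store_true) mode flags, and the mutually exclusive grouping are all assumptions.

```python
import argparse

# Mode flags named in this README; treating them as boolean switches is an assumption.
MODE_FLAGS = (
    "zsl_ds_us_data_merged_multiple_m_half_data",
    "zsl_ds_us_data_merged_multiple_m_half_data_many_runs",
    "supervised",
)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the arguments described in the README."""
    # allow_abbrev=False keeps the two long, similar flag names from being abbreviated.
    parser = argparse.ArgumentParser(
        description="Sketch of the main_ds.py command line", allow_abbrev=False
    )
    parser.add_argument("--arg_data", choices=["sail", "enes", "taen"], required=True,
                        help="code-switched dataset to use")
    # Only the value 'upsample' appears in the README; other values are undocumented.
    parser.add_argument("--external_data_imbalance_fix", default=None,
                        help="how to handle imbalance in the pretraining data")
    parser.add_argument("--seed", type=int, default=0,
                        help="random seed (default is an assumption)")
    mode = parser.add_mutually_exclusive_group()  # assumption: one mode per run
    for flag in MODE_FLAGS:
        mode.add_argument("--" + flag, action="store_true")
    return parser
```

With this sketch, build_parser().parse_args(["--arg_data", "sail", "--seed", "22", "--supervised"]) yields a namespace with arg_data set to "sail", seed set to 22 and supervised set to True.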

Example commands to run:

python main_ds.py --external_data_imbalance_fix upsample  --seed 22 --zsl_ds_us_data_merged_multiple_m_half_data_many_runs --arg_data sail > logs/sail_half_data_hrd_lbl_merged_bkts_ds_us_run22 &

python main_ds.py --external_data_imbalance_fix upsample  --seed 22 --zsl_ds_us_data_merged_multiple_m_half_data_many_runs --arg_data taen > logs/taen_half_data_hrd_lbl_merged_bkts_ds_us_run22 &

python main_ds.py --external_data_imbalance_fix upsample  --seed 22 --zsl_ds_us_data_merged_multiple_m_half_data_many_runs --arg_data enes > logs/enes_half_data_hrd_lbl_merged_bkts_ds_us_run22 &
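
The three commands above differ only in the dataset name, so they can also be generated from a small script. The sketch below rebuilds the same argument list and log path per dataset; the helper names build_run and run_all are hypothetical, and unlike the shell examples (which start each job in the background with &) it runs the datasets one after another.

```python
import os
import subprocess
import sys
from typing import List, Tuple

MODE_FLAG = "--zsl_ds_us_data_merged_multiple_m_half_data_many_runs"


def build_run(dataset: str, seed: int) -> Tuple[List[str], str]:
    """Return the argv and log path matching the example commands."""
    argv = [
        sys.executable, "main_ds.py",  # current interpreter instead of bare `python`
        "--external_data_imbalance_fix", "upsample",
        "--seed", str(seed),
        MODE_FLAG,
        "--arg_data", dataset,
    ]
    log_name = f"{dataset}_half_data_hrd_lbl_merged_bkts_ds_us_run{seed}"
    return argv, os.path.join("logs", log_name)


def run_all(datasets=("sail", "taen", "enes"), seed: int = 22) -> None:
    """Run each dataset sequentially, sending stdout to its log file."""
    os.makedirs("logs", exist_ok=True)  # the log directory must exist before writing
    for dataset in datasets:
        argv, log_path = build_run(dataset, seed)
        with open(log_path, "w") as log:
            subprocess.run(argv, stdout=log, check=True)
```

Calling run_all() from the repository root launches the three runs with seed 22; passing a different seed changes both the --seed value and the log file suffix.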

Requirements

This project is based on python==3.6.10. The dependencies are as follows:

torch==1.9.1
argparse
transformers==3.5.1
nltk==3.5
sklearn
ai4bharat==0.5.0.3

Citation

@misc{https://doi.org/10.48550/arxiv.2210.14380,
  doi = {10.48550/ARXIV.2210.14380},
  url = {https://arxiv.org/abs/2210.14380},
  author = {Ranjan, Sudhanshu and Mekala, Dheeraj and Shang, Jingbo},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Progressive Sentiment Analysis for Code-Switched Text Data},
  publisher = {arXiv},  
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
