
Propaganda Detection

This is a solution to tasks 1 and 2 of the Hack the News Datathon Case – Propaganda Detection.
The solution report, Propaganda Detection in Social Media, can be found here.

Environment setup

Python 3
TensorFlow >= 1.11.0
Install the spacy package
Install en_core_web_sm via python -m spacy download en_core_web_sm
Install vader_lexicon via nltk.download('vader_lexicon') (see the setup snippet after this list)
Download the BERT code into the bert folder
Download the BERT-Base, Uncased checkpoint into the checkpoint folder
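The two resource downloads can also be run from a single Python session; a minimal setup snippet covering the same commands as above (nothing here beyond the listed steps):

# One-time download of the NLP resources listed above.
import subprocess
import sys

import nltk

# Equivalent to: python -m spacy download en_core_web_sm
subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_sm"], check=True)

# Fetch the VADER sentiment lexicon used for the polarity features.
nltk.download('vader_lexicon')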

Datasets

The original datasets can be downloaded from the datathon website.
The files in the data folder have already been preprocessed into BERT input format.

Machine Learning models

task1_ML.ipynb: ML models (SVM, Logistic Regression, Random Forest and KNN) for task 1, using full articles and article summaries.
task2_ML.ipynb: the same ML models for task 2, using sentence text plus supplementary features (named entities and sentiment polarities); a feature-extraction sketch follows this list.
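For illustration, the supplementary features can be produced with spaCy NER and the NLTK VADER analyzer roughly as below; the function name and exact feature layout are assumptions for illustration, not the notebooks' code:

import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nlp = spacy.load("en_core_web_sm")
sia = SentimentIntensityAnalyzer()

def supplementary_features(sentence):
    """Named-entity counts plus VADER polarity scores for one sentence (illustrative)."""
    doc = nlp(sentence)
    feats = {"num_entities": len(doc.ents)}
    # One count per entity label (PERSON, ORG, GPE, ...).
    for ent in doc.ents:
        key = "ner_" + ent.label_
        feats[key] = feats.get(key, 0) + 1
    # VADER returns neg/neu/pos scores plus a compound score in [-1, 1].
    feats.update(sia.polarity_scores(sentence))
    return feats

print(supplementary_features("President Smith praised the wonderful new policy."))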

BERT-based models

  • BERT_classifier.py: plain BERT text classification, with no supplementary features.
    Example:
python BERT_classifier.py --data_dir=data/task_2 --bert_config_file=checkpoint/uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=checkpoint/uncased_L-12_H-768_A-12/bert_model.ckpt --vocab_file=checkpoint/uncased_L-12_H-768_A-12/vocab.txt --output_dir=./output/BERT_classifier --max_seq_length 128 --do_train --do_eval --do_predict 2>&1 | tee output/BERT_classifier/training.log
  • BERT_post_matching.py: integrates supplementary features (named entities and sentiment polarities) into BERT by post-matching; details are in the solution report. A sketch of the two post-matching modes follows the example command.
    Training parameters:
--data_dir=data/task_2
--post_matching=mean/concat (default: mean)
--use_ner=True/False (default: True)
--use_polarity=True/False (default: True)

Example:

python BERT_post_matching.py --data_dir=data/task_2 --bert_config_file=checkpoint/uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=checkpoint/uncased_L-12_H-768_A-12/bert_model.ckpt --vocab_file=checkpoint/uncased_L-12_H-768_A-12/vocab.txt --output_dir=./output/BERT_post_matching --max_seq_length 128 --do_train --do_eval --do_predict --post_matching=mean --use_ner --use_polarity 2>&1 | tee output/BERT_post_matching/training.log
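A rough sketch of what the two --post_matching modes do, assuming the named-entity and polarity features have already been projected to the BERT hidden size; names and shapes here are illustrative, not the script's exact code:

import tensorflow as tf  # TF 1.x, per the tensorflow >= 1.11.0 requirement

def post_match(pooled_output, ner_vec, polarity_vec, mode="mean"):
    """Combine BERT's [CLS] vector with feature vectors; all inputs [batch, hidden]."""
    if mode == "mean":
        # Average the three representations; the hidden size is unchanged.
        return tf.reduce_mean(tf.stack([pooled_output, ner_vec, polarity_vec]), axis=0)
    if mode == "concat":
        # Concatenate along the feature axis; the hidden size triples.
        return tf.concat([pooled_output, ner_vec, polarity_vec], axis=-1)
    raise ValueError("mode must be 'mean' or 'concat'")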
  • BERT_ner_embedding.py: integrates named entity features into BERT through the input embeddings; polarity features can optionally be added by post-matching. Details are in the solution report. An input-embedding sketch follows the example command.
    Training parameters:
--data_dir=data/task_2
--post_matching=mean/concat (default: mean)
--use_polarity=True/False (default: True)

Example:

python BERT_ner_embedding.py --data_dir=data/task_2 --bert_config_file=checkpoint/uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=checkpoint/uncased_L-12_H-768_A-12/bert_model.ckpt --vocab_file=checkpoint/uncased_L-12_H-768_A-12/vocab.txt --output_dir=./output/BERT_ner_embedding --max_seq_length 128 --do_train --do_eval --do_predict --post_matching=mean --use_polarity 2>&1 | tee output/BERT_ner_embedding/training.log
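An illustrative sketch of the input-embedding idea, assuming one NER tag id per WordPiece token: a learned embedding per tag is added to BERT's token/segment/position embeddings, much like segment embeddings. This is an assumption about the mechanism, not the script's exact code:

import tensorflow as tf  # TF 1.x

def add_ner_embeddings(input_embeddings, ner_tag_ids, num_ner_tags, hidden_size):
    """input_embeddings: [batch, seq_len, hidden]; ner_tag_ids: [batch, seq_len]."""
    ner_table = tf.get_variable(
        "ner_embedding_table", [num_ner_tags, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    # Look up one embedding per token tag and add it to the usual BERT embeddings.
    return input_embeddings + tf.nn.embedding_lookup(ner_table, ner_tag_ids)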
  • BERT_multitask.py: multi-task training that combines propaganda text classification, NER sequence labelling and sentiment polarity text classification. Details are in the solution report. A sketch of the combined loss follows the example command.
    Training parameters:
--data_dir=data/task_2_ner
--polarity_threshold=0.4 (threshold on the absolute value of the polarity compound score; default: 0.4)
--use_ner=True/False (default: True)
--use_polarity=True/False (default: True)

Example:

python BERT_multitask.py --data_dir=data/task_2_ner --bert_config_file=checkpoint/uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=checkpoint/uncased_L-12_H-768_A-12/bert_model.ckpt --vocab_file=checkpoint/uncased_L-12_H-768_A-12/vocab.txt --output_dir=./output/BERT_multitask --max_seq_length 128 --do_train --do_eval --do_predict --use_ner --use_polarity --polarity_threshold=0.4 2>&1 | tee output/BERT_multitask/training.log
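A sketch of how such a multi-task objective can be wired up, with one head per task and the losses summed; head names, label shapes and the simple unweighted sum are assumptions for illustration, not the script's exact code:

import tensorflow as tf  # TF 1.x

def multitask_loss(pooled_output, sequence_output, prop_labels, ner_labels,
                   polarity_labels, num_ner_tags, use_ner=True, use_polarity=True):
    # Task 1: propaganda vs. non-propaganda sentence classification on [CLS].
    prop_logits = tf.layers.dense(pooled_output, 2, name="prop_head")
    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=prop_labels, logits=prop_logits))
    if use_ner:
        # Task 2: NER sequence labelling, one tag prediction per token.
        ner_logits = tf.layers.dense(sequence_output, num_ner_tags, name="ner_head")
        loss += tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=ner_labels, logits=ner_logits))
    if use_polarity:
        # Task 3: sentiment polarity classification; the binary label can come
        # from thresholding the |compound| score at --polarity_threshold.
        pol_logits = tf.layers.dense(pooled_output, 2, name="polarity_head")
        loss += tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=polarity_labels, logits=pol_logits))
    return loss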
