
nlp-binary-event-detection-models

Code to fine-tune models for binary event detection on annotated documents on an HPC and evaluate locally.

This work builds on https://github.com/globalise-huygens/nlp-event-testset-experiment.

Fine-tuning models with 15-fold cross-validation

This repo contains code to fine-tune models on a 15-fold datasplit. For each datasplit, one annotated document is held out from the training set on which a model is fine-tuned. This document is referred to by its inventory number (as documented in the VOC archives) and serves as validation data in the evaluation step that happens later. The code in this repo fine-tunes six different models, each with five set seeds. The shell script to run this code is finetune.sh. This shell script runs finetune_with_click.py, which in turn relies on i) get_data_selection.py for creating the datasplits and storing metadata on each test file, ii) train_utils.py for initializing the appropriate tokenizer, and iii) utils.py for data (pre)processing.
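The cross-validation loop itself is simple. Below is a minimal sketch of it; the function names and the way documents are passed around are illustrative, not the actual interface of finetune_with_click.py. The seeds are the ones that appear in the table and folder names elsewhere in this repo.

```python
# Minimal sketch of the leave-one-document-out loop; names are illustrative,
# not the actual interface of finetune_with_click.py.
MODELS = ["GysBERT", "GysBERT-v2", "BERTje", "RobBERT", "XLM-RoBERTa", "mBERT"]
SEEDS = [553311, 888, 6834, 21102024, 23052024]

def run_all_folds(documents, fine_tune):
    """documents: dict mapping inventory number -> annotated document."""
    for model_name in MODELS:
        for seed in SEEDS:
            for held_out in documents:  # 15 folds, one per document
                train_docs = {inv: doc for inv, doc in documents.items()
                              if inv != held_out}
                # fine-tune on the remaining 14 documents, predict on the held-out one
                fine_tune(model_name, seed, train_docs, documents[held_out])
```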

The models used in this version of the code are 4 Dutch models and 2 multilingual ones, namely GysBERT, GysBERT-v2, BERTje, RobBERT, XLM-RoBERTa and Multilingual BERT. Changing some variables in the code enables you to use English models as well: BERT, RoBERTa and MacBERTh.
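For reference, a mapping from these model names to Hugging Face checkpoints could look as follows. Only xlm-roberta-base is confirmed by the result-table file names in this repo; the other checkpoint names are assumptions based on the commonly published versions of these models.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed Hugging Face checkpoints; only "xlm-roberta-base" is confirmed by the
# table file names in "tables", the others are the commonly published versions.
CHECKPOINTS = {
    "GysBERT": "emanjavacas/GysBERT",
    "GysBERT-v2": "emanjavacas/GysBERT-v2",
    "BERTje": "GroNLP/bert-base-dutch-cased",
    "RobBERT": "pdelobelle/robbert-v2-dutch-base",
    "XLM-RoBERTa": "xlm-roberta-base",
    "mBERT": "bert-base-multilingual-cased",
}

def load_model_and_tokenizer(name: str):
    """Binary (IO) token classification: two labels."""
    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=2)
    return model, tokenizer
```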

The data on which I fine-tune is stored in "data". get_data_selection.py uses the subfolder structure as well as the filenames in this directory to make the datasplits and to gather metadata on each test file.
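A sketch of what that collection step could look like is given below; the file extension, the naming convention and the returned metadata fields are assumptions, the actual logic lives in get_data_selection.py.

```python
from pathlib import Path

# Illustrative sketch only: the actual layout and naming conventions of "data"
# are handled by get_data_selection.py and may differ.
def collect_documents(data_dir="data"):
    documents = {}
    for path in sorted(Path(data_dir).rglob("*.json")):  # assumed file format
        inventory_number = path.stem.split("_")[0]       # assumed naming convention
        documents[inventory_number] = {
            "path": str(path),
            "subfolder": path.parent.name,  # metadata derived from the folder structure
        }
    return documents
```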

When running finetune.sh, 450 separate models are fine-tuned (6 models x 5 seeds x 15 datasplits), and the predictions of each are stored separately. finetune_with_click.py exports the predictions of each fine-tuned model on the held-out document (one of the 15 available) to a json file. This json file also stores metadata on the test set and on the training arguments. Predictions on each of the 15 documents are gathered in folders that represent a model + seed combination (also reflected in the folder name, e.g. "GysBERT-553311"). These folders with predictions can be found in the "output_in_batches_nov20" folder.
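The shape of such a prediction file could look roughly like this; the key names are hypothetical and only illustrate the kind of information that is stored.

```python
import json
from pathlib import Path

# Hypothetical sketch of the per-document prediction file written by
# finetune_with_click.py; the actual key names may differ.
def save_predictions(out_dir, model_name, seed, inventory_number,
                     predictions, test_metadata, training_args):
    record = {
        "model": model_name,
        "seed": seed,
        "test_document": inventory_number,
        "test_metadata": test_metadata,    # e.g. number of tokens, event density
        "training_args": training_args,    # e.g. learning rate, epochs, batch size
        "predictions": predictions,        # one IO label per token
    }
    folder = Path(out_dir) / f"{model_name}-{seed}"
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / f"{inventory_number}.json", "w") as f:
        json.dump(record, f, indent=2)
```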

Evaluating the models

evaluation_in_batches.py processes the predictions stored in "output_in_batches_nov20" and the corresponding gold data, and evaluates at the token level (as opposed to the subtoken level) and at the mention level. Token-level evaluation means binary token classification (IO). See the example underneath.

(Screenshot in the original README: an example of token-level IO labels.)
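For readers without the screenshot, a hypothetical stand-in example (using the phrase from the mention-level section below, with illustrative labels) looks like this:

```
token          gold    prediction
d'             O       O
ordonnantie    I       I
ende           O       O
last           O       I
```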

For token-level binary event detection, precision, recall, and f1 are calculated for the event class (I) as well as for the O class, plus a macro-average for each score type. Additionally, the same scores are calculated for a lexical baseline. This baseline uses lexicon_v4, which was developed in an iterative manner through analysis of annotations, the professional expertise of historians working with the VOC archives, and a word2vec model trained on the archives. The lexicon only matches single tokens to the event class. These scores are calculated for each model + seed combination and can be found in the "tables" folder. Each table shows the scores for each of the 15 datasplits, along with the number of tokens in each evaluated document and the event density in its gold data.
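A sketch of the token-level scoring and the lexical baseline, assuming scikit-learn and plain 'I'/'O' label lists; the actual implementation in evaluation_in_batches.py may differ.

```python
from sklearn.metrics import precision_recall_fscore_support

# Sketch of the token-level scores; evaluation_in_batches.py may compute them
# differently, but these are the reported quantities.
def token_level_scores(gold, predicted):
    """gold, predicted: lists of 'I'/'O' labels, one per token."""
    per_class = precision_recall_fscore_support(
        gold, predicted, labels=["I", "O"], zero_division=0)
    macro = precision_recall_fscore_support(
        gold, predicted, labels=["I", "O"], average="macro", zero_division=0)
    return per_class, macro

def lexical_baseline(tokens, lexicon):
    """Label a token 'I' if it occurs in the lexicon (single-token matches only)."""
    return ["I" if token.lower() in lexicon else "O" for token in tokens]
```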

Mention-level scores are calculated as an accuracy score. If one or more of the tokens within a gold event mention span (e.g. "ordonnantie" in "d'ordonnantie ende last") is recognized as an event token in the predictions, the mention counts as found. For the example data given above, the accuracy would be 100%.
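In code, that criterion could be expressed as follows; the span representation is an assumption.

```python
# Sketch of mention-level accuracy: a gold mention counts as found if at least
# one of its tokens is predicted as an event token ('I'). The (start, end) span
# representation is an assumption.
def mention_level_accuracy(gold_mention_spans, predicted_labels):
    """gold_mention_spans: list of (start, end) token-index spans of gold mentions."""
    if not gold_mention_spans:
        return None
    found = sum(
        1 for start, end in gold_mention_spans
        if any(label == "I" for label in predicted_labels[start:end])
    )
    return found / len(gold_mention_spans)
```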

The next step is analysing all results on model level. In the "results" folder, AVERAGES.csv shows the average score per model (i.e. averaged over seeds and datasplits). For example, the scores for xlm-r are the averages of the scores reported in the tables "table_xlm-roberta-base_6834.csv", "table_xlm-roberta-base_888.csv", "table_xlm-roberta-base_553311.csv", "table_xlm-roberta-base_21102024.csv" and "table_xlm-roberta-base_23052024.csv". In the averages table, only precision, recall and f1 for the event class are reported, as these are the scores that are truly of interest.
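Such an average can be reproduced with pandas along these lines; the column names for the event-class scores are assumptions.

```python
import glob
import pandas as pd

# Sketch of how the per-model averages in AVERAGES.csv could be derived from the
# per-seed tables; the column names ("precision_I", "recall_I", "f1_I") are assumptions.
def average_event_scores(model_id="xlm-roberta-base"):
    paths = glob.glob(f"tables/table_{model_id}_*.csv")  # one table per seed
    combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
    # averages over all seeds and all datasplits for this model
    return combined[["precision_I", "recall_I", "f1_I"]].mean()
```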
