
nlp-binary-event-detection-models

Code to fine-tune models for binary event detection on annotated documents on an HPC and evaluate locally.

This work builds on https://github.com/globalise-huygens/nlp-event-testset-experiment.

Fine-tuning models with 15-fold cross-validation

This repo contains code to fine-tune models on a 15-fold datasplit. For each datasplit, one annotated document is held out from the training set on which a model is fine-tuned. This document is referred to by its inventory number (as documented in the VOC archives) and serves as validation data in the evaluation step that happens later. The code in this repo fine-tunes six different models, each with five set seeds. The shell script to run this code is finetune.sh. This shell script runs finetune_with_click.py, which in turn relies on i) get_data_selection.py for creating the datasplits and storing metadata on each test file, ii) train_utils.py for initializing the appropriate tokenizer, and iii) utils.py for data (pre)processing.
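The cross-validation loop itself is simple. Below is a minimal sketch of it; the function names and the way documents are passed around are illustrative, not the actual interface of finetune_with_click.py. The seeds are the ones that appear in the table and folder names elsewhere in this repo.

```python
# Minimal sketch of the leave-one-document-out loop; names are illustrative,
# not the actual interface of finetune_with_click.py.
MODELS = ["GysBERT", "GysBERT-v2", "BERTje", "RobBERT", "XLM-RoBERTa", "mBERT"]
SEEDS = [553311, 888, 6834, 21102024, 23052024]

def run_all_folds(documents, fine_tune):
    """documents: dict mapping inventory number -> annotated document."""
    for model_name in MODELS:
        for seed in SEEDS:
            for held_out in documents:  # 15 folds, one per document
                train_docs = {inv: doc for inv, doc in documents.items()
                              if inv != held_out}
                # fine-tune on the remaining 14 documents, predict on the held-out one
                fine_tune(model_name, seed, train_docs, documents[held_out])
```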

The models used in this version of the code are 4 Dutch models and 2 multilingual ones, namely GysBERT, GysBERT-v2, BERTje, RobBERT, XLM-RoBERTa and Multilingual BERT. Changing some variables in the code enables you to use English models as well: BERT, RoBERTa and MacBERTh.
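For reference, a mapping from these model names to Hugging Face checkpoints could look as follows. Only xlm-roberta-base is confirmed by the result-table file names in this repo; the other checkpoint names are assumptions based on the commonly published versions of these models.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed Hugging Face checkpoints; only "xlm-roberta-base" is confirmed by the
# table file names in "tables", the others are the commonly published versions.
CHECKPOINTS = {
    "GysBERT": "emanjavacas/GysBERT",
    "GysBERT-v2": "emanjavacas/GysBERT-v2",
    "BERTje": "GroNLP/bert-base-dutch-cased",
    "RobBERT": "pdelobelle/robbert-v2-dutch-base",
    "XLM-RoBERTa": "xlm-roberta-base",
    "mBERT": "bert-base-multilingual-cased",
}

def load_model_and_tokenizer(name: str):
    """Binary (IO) token classification: two labels."""
    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=2)
    return model, tokenizer
```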

The data on which I fine-tune is stored in "data". get_data_selection.py uses the subfolder structure as well as the filenames in this directory to make the datasplits and to gather metadata on each test file.
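A sketch of what that collection step could look like is given below; the file extension, the naming convention and the returned metadata fields are assumptions, the actual logic lives in get_data_selection.py.

```python
from pathlib import Path

# Illustrative sketch only: the actual layout and naming conventions of "data"
# are handled by get_data_selection.py and may differ.
def collect_documents(data_dir="data"):
    documents = {}
    for path in sorted(Path(data_dir).rglob("*.json")):  # assumed file format
        inventory_number = path.stem.split("_")[0]       # assumed naming convention
        documents[inventory_number] = {
            "path": str(path),
            "subfolder": path.parent.name,  # metadata derived from the folder structure
        }
    return documents
```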

When running finetune.sh, 450 separate models are fine-tuned (6 models x 5 seeds x 15 datasplits), and the predictions of each are stored separately. finetune_with_click.py exports the predictions of each fine-tuned model on the held-out document (one of the 15 available) to a json file. This json file also stores metadata on the test set and on the training arguments. Predictions on each of the 15 documents are gathered in folders that represent a model + seed combination (also reflected in the folder name, e.g. "GysBERT-553311"). These folders with predictions can be found in the "output_in_batches_nov20" folder.
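The shape of such a prediction file could look roughly like this; the key names are hypothetical and only illustrate the kind of information that is stored.

```python
import json
from pathlib import Path

# Hypothetical sketch of the per-document prediction file written by
# finetune_with_click.py; the actual key names may differ.
def save_predictions(out_dir, model_name, seed, inventory_number,
                     predictions, test_metadata, training_args):
    record = {
        "model": model_name,
        "seed": seed,
        "test_document": inventory_number,
        "test_metadata": test_metadata,    # e.g. number of tokens, event density
        "training_args": training_args,    # e.g. learning rate, epochs, batch size
        "predictions": predictions,        # one IO label per token
    }
    folder = Path(out_dir) / f"{model_name}-{seed}"
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / f"{inventory_number}.json", "w") as f:
        json.dump(record, f, indent=2)
```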

Evaluating the models

evaluation_in_batches.py processes the predictions stored in "output_in_batches_nov20" and the corresponding gold data, and evaluates at the token level (as opposed to the subtoken level) and at the mention level. Token-level evaluation means binary token classification (IO). See the example underneath.

(Screenshot in the original README: an example of token-level IO labels.)
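For readers without the screenshot, a hypothetical stand-in example (using the phrase from the mention-level section below, with illustrative labels) looks like this:

```
token          gold    prediction
d'             O       O
ordonnantie    I       I
ende           O       O
last           O       I
```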

For token-level binary event detection, precision, recall, and f1 are calculated for the event class (I) as well as for the O class, plus a macro-average for each score type. Additionally, the same scores are calculated for a lexical baseline. This baseline uses lexicon_v4, which was developed in an iterative manner through analysis of annotations, the professional expertise of historians working with the VOC archives, and a word2vec model trained on the archives. The lexicon only matches single tokens to the event class. These scores are calculated for each model + seed combination and can be found in the "tables" folder. Each table shows the scores for each of the 15 datasplits, along with the number of tokens in each evaluated document and the event density in its gold data.
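A sketch of the token-level scoring and the lexical baseline, assuming scikit-learn and plain 'I'/'O' label lists; the actual implementation in evaluation_in_batches.py may differ.

```python
from sklearn.metrics import precision_recall_fscore_support

# Sketch of the token-level scores; evaluation_in_batches.py may compute them
# differently, but these are the reported quantities.
def token_level_scores(gold, predicted):
    """gold, predicted: lists of 'I'/'O' labels, one per token."""
    per_class = precision_recall_fscore_support(
        gold, predicted, labels=["I", "O"], zero_division=0)
    macro = precision_recall_fscore_support(
        gold, predicted, labels=["I", "O"], average="macro", zero_division=0)
    return per_class, macro

def lexical_baseline(tokens, lexicon):
    """Label a token 'I' if it occurs in the lexicon (single-token matches only)."""
    return ["I" if token.lower() in lexicon else "O" for token in tokens]
```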

Mention-level scores are calculated as an accuracy score. If one or more of the tokens within a gold event mention span (e.g. "ordonnantie" in "d'ordonnantie ende last") is recognized as an event token in the predictions, the mention counts as found. For the example data given above, the accuracy would be 100%.
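In code, that criterion could be expressed as follows; the span representation is an assumption.

```python
# Sketch of mention-level accuracy: a gold mention counts as found if at least
# one of its tokens is predicted as an event token ('I'). The (start, end) span
# representation is an assumption.
def mention_level_accuracy(gold_mention_spans, predicted_labels):
    """gold_mention_spans: list of (start, end) token-index spans of gold mentions."""
    if not gold_mention_spans:
        return None
    found = sum(
        1 for start, end in gold_mention_spans
        if any(label == "I" for label in predicted_labels[start:end])
    )
    return found / len(gold_mention_spans)
```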

The next step is analysing all results on model level. In the "results" folder, AVERAGES.csv shows the average score per model (i.e. averaged over seeds and datasplits). For example, the scores for xlm-r are the averages of the scores reported in the tables "table_xlm-roberta-base_6834.csv", "table_xlm-roberta-base_888.csv", "table_xlm-roberta-base_553311.csv", "table_xlm-roberta-base_21102024.csv" and "table_xlm-roberta-base_23052024.csv". In the averages table, only precision, recall and f1 for the event class are reported, as these are the scores that are truly of interest.
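Such an average can be reproduced with pandas along these lines; the column names for the event-class scores are assumptions.

```python
import glob
import pandas as pd

# Sketch of how the per-model averages in AVERAGES.csv could be derived from the
# per-seed tables; the column names ("precision_I", "recall_I", "f1_I") are assumptions.
def average_event_scores(model_id="xlm-roberta-base"):
    paths = glob.glob(f"tables/table_{model_id}_*.csv")  # one table per seed
    combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
    # averages over all seeds and all datasplits for this model
    return combined[["precision_I", "recall_I", "f1_I"]].mean()
```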
