# Scene Recognition Evaluation

This subdirectory evaluates the results of the scenes-with-text (SWT) classification task. While SWT returns both timepoint and timeframe annotations, this evaluation focuses on timepoints. The goal is to provide a simple way of comparing different sets of SWT results.

## Required Input

To run this evaluation script, you need the following:

- A set of predictions in MMIF format (either from the `preds` folder in this repo or generated by the SWT app)
- A set of golds in CSV format (either downloaded from the annotations repository using `goldretriever.py`, or your own set that exactly matches the format used in `aapb-annotations`)

There are three arguments when running the script: `--mmif-dir`, `--gold-dir`, and `--count-subtypes`. The first two are directories that contain the predictions and golds, respectively. The third is a boolean that determines whether the evaluation takes subtype labels into account.

- Our standard for naming prediction (MMIF) directories is `preds@app-swt-detection<VERSION-NUMBER>@<BATCH-NAME>` (for example, `preds@app-swt-detection5.0@example-batch`, where the version number and batch name are placeholders).

Note that only `--mmif-dir` is required: `--gold-dir` defaults to the set of golds downloaded (using goldretriever) from the `aapb-annotations` repo, and `--count-subtypes` defaults to `False`.
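
As a rough sketch of how these arguments and defaults could be wired up (illustrative only, not the actual `evaluate.py`; the help text and default handling here are assumptions):

```python
# Illustrative argument declaration only; not the actual evaluate.py.
import argparse

def str_to_bool(value: str) -> bool:
    """Interpret strings like 'True'/'False' as booleans."""
    return value.lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser(description="Evaluate SWT timepoint predictions")
parser.add_argument("--mmif-dir", required=True,
                    help="directory containing prediction MMIF files")
parser.add_argument("--gold-dir", default=None,
                    help="directory of gold CSVs; when omitted, golds are "
                         "downloaded from aapb-annotations via goldretriever")
parser.add_argument("--count-subtypes", type=str_to_bool, default=False,
                    help="whether subtype labels are taken into account")

args = parser.parse_args()
print(args.mmif_dir, args.gold_dir, args.count_subtypes)
```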

## Usage

To run the evaluation, run the following in the `sr-eval` directory:

```bash
python evaluate.py --mmif-dir <pred_directory> --gold-dir <gold_directory> --count-subtypes True
```
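
Since only `--mmif-dir` is required, a minimal run can rely on the defaults described above (the prediction directory name here is hypothetical):

```bash
python evaluate.py --mmif-dir preds@app-swt-detection5.0@example-batch
```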

## Output Format

Currently, the evaluation script produces a `{guid}.csv` file for each document in the set of predictions, plus a single `dataset-scores.csv`.

- `{guid}.csv` has the label scores for a given document, including a macro-average of the label scores.
- `dataset-scores.csv` has the total label scores across the dataset, including a final micro-average over all labels.

Both files report precision, recall, and F1 scores by label. In each per-document file, the first row is the negative label `-`; in `dataset-scores.csv`, the second row is the `all` label, which represents the final micro-average over all labels.
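
For reference, the sketch below shows one way per-label precision/recall/F1 and an `all` micro-average can be computed from aligned gold and predicted timepoint labels; it is illustrative only, not the code in `evaluate.py`, and the label strings are hypothetical.

```python
# Illustrative sketch of per-label precision/recall/F1 plus an "all"
# micro-average; not the actual logic in evaluate.py.
from collections import Counter

def label_scores(golds, preds):
    """Return {label: (precision, recall, f1)}, with an 'all' micro-average."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(golds, preds):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    def prf(t, false_pos, false_neg):
        p = t / (t + false_pos) if t + false_pos else 0.0
        r = t / (t + false_neg) if t + false_neg else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    scores = {label: prf(tp[label], fp[label], fn[label])
              for label in set(golds) | set(preds)}
    # Micro-average: pool the counts across every label.
    scores["all"] = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return scores

# Toy example with hypothetical labels ("-" is the negative label).
print(label_scores(["-", "S", "-", "I"], ["-", "S", "S", "-"]))
```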

The output files are placed in a directory whose name is derived from the final portion (split on `@`) of the prediction directory's basename. Using the format described in Required Input, the resulting name is `scores@<BATCH-NAME>`.
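
As a quick illustration of that naming rule (the prediction directory name below is hypothetical):

```python
# Deriving the output directory name from a prediction directory name,
# following the "split on @" convention described above.
import os

pred_dir = "preds@app-swt-detection5.0@example-batch"
batch_name = os.path.basename(pred_dir).split("@")[-1]
print(f"scores@{batch_name}")  # -> scores@example-batch
```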