# Scene Recognition Evaluation
This involves evaluating the results of the scenes-with-text classification task. While SWT returns both timepoint and
timeframe annotations, this subdirectory is focused on timepoints.
The goal is to have a simple way of comparing different results from SWT.

# Required Input
To run this evaluation script, you need the following:
* A set of predictions in MMIF format, i.e. the outputs of the SWT app you want to evaluate.
* A set of gold annotation files: these can be downloaded using goldretriever.py, or your own set that exactly matches the format present in those golds.

There are three arguments when running the script: `--mmif-dir`, `--gold-dir`, and `--count-subtypes`.
The first two are directories that contain the predictions and golds, respectively. The third is a boolean value that
determines if the evaluation takes into account subtype labels or not.
* Our standard for naming prediction (MMIF) directories is as follows:
  * `preds@app-swt-detection<VERSION-NUMBER>@<BATCH-NAME>`.

Note that only the first one is required, as `--gold-dir` defaults to the set of golds downloaded (using `goldretriever`)
from the [aapb-annotations](https://github.com/clamsproject/aapb-annotations/tree/main/scene-recognition/golds) repo,
and `--count-subtypes` defaults to `False`.
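If you want to fetch those golds yourself (for inspection, or to reuse them across runs), the snippet below is a minimal sketch; it assumes `goldretriever.py` exposes a `download_golds()` helper that takes the URL of a gold directory, so check the module's actual signature before relying on it.

```python
# Hypothetical sketch of fetching the golds manually. goldretriever.py lives
# one level up from sr-eval, and the download_golds() name/signature is an
# assumption -- check the module itself before using this.
from goldretriever import download_golds

GOLD_URL = ("https://github.com/clamsproject/aapb-annotations/"
            "tree/main/scene-recognition/golds")

gold_dir = download_golds(GOLD_URL)  # assumed to return the local directory of gold files
print(f"golds downloaded to {gold_dir}")
```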
# Usage
To run the evaluation, run the following in the `sr-eval` directory:
```
python evaluate.py --mmif-dir <pred_directory> --gold-dir <gold_directory> --count-subtypes True
```

# Output Format
Currently, the evaluation script produces a `{guid}.csv` file for each document in the set of predictions, and a
single `dataset-scores.csv`.
* `{guid}.csv` has the label scores for a given document, including a macro-average of label scores.
* `dataset-scores.csv` has the total label scores across the dataset, including a final micro-average of all labels.

These contain the precision, recall, and f1 scores by label. In each document file, the first row is the negative
label `-`. In `dataset-scores.csv`, the second row is the `all` label, which represents the final micro-average of
all the labels.
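Because these are plain CSV files, two SWT runs can be compared by loading their `dataset-scores.csv` files side by side. The snippet below is a minimal sketch; the column names (`label`, `precision`, `recall`, `f1`) and the directory names are assumptions, so adjust them to match the actual header row and locations of your output.

```python
# Minimal sketch for comparing two SWT evaluation runs side by side.
# Column and directory names are assumptions; adjust to match your output.
import pandas as pd

old_run = pd.read_csv("scores@batch-a/dataset-scores.csv")
new_run = pd.read_csv("scores@batch-b/dataset-scores.csv")

# Join on the label column and compute the change in f1 per label.
comparison = old_run.merge(new_run, on="label", suffixes=("_old", "_new"))
comparison["f1_delta"] = comparison["f1_new"] - comparison["f1_old"]
print(comparison[["label", "f1_old", "f1_new", "f1_delta"]])
```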
The output files are placed in a directory whose name is derived from the final portion (split on `@`) of the basename
of the given prediction directory. Using our format described in [Required Input](#required-input), this would result
in the name being `scores@<BATCH-NAME>`.
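As a rough illustration of that naming logic (a sketch, not the actual code in `evaluate.py`):

```python
# Sketch of the output-directory naming described above; see evaluate.py
# for the actual implementation.
import os

pred_dir = "preds@app-swt-detection<VERSION-NUMBER>@<BATCH-NAME>"
batch_name = os.path.basename(pred_dir.rstrip("/")).split("@")[-1]
output_dir = f"scores@{batch_name}"
print(output_dir)  # -> scores@<BATCH-NAME>
```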