
Commit 5d87995

Added score report and score outputs (+ directory), modified README & requirements.txt, and removed goldretriever.py, all addressing requested changes in PR
1 parent 253d4e3 commit 5d87995

6 files changed (+106 -68 lines)

sr-eval/README.md

+17-11
@@ -1,6 +1,7 @@
 # Scene Recognition Evaluation
-This involves evaluating the results of the scenes-with-text classification task.
-The goal is to have a simple way of comparing different results from SWT.
+This involves evaluating the results of the scenes-with-text classification task. While SWT returns both timepoint and
+timeframe annotations, this subdirectory is focused on timepoints.
+The goal is to have a simple way of comparing different results from SWT.
 
 # Required Input
 To run this evaluation script, you need the following:
@@ -13,24 +14,29 @@ using goldretriever.py, or your own set that exactly matches the format present
 There are three arguments when running the script: `-mmif-dir`, `-gold-dir`, and `count-subtypes`.
 The first two are directories that contain the predictions and golds, respectively. The third is a boolean value that
 determines if the evaluation takes into account subtype labels or not.
+* Our standard for naming prediction (mmif) directories is as follows:
+  * `preds@app-swt-detection<VERSION-NUMBER>@<BATCH-NAME>`.
+
 Note that only the first one is required, as `-gold-dir` defaults to the set of golds downloaded (using `goldretriever`)
 from the [aapb-annotations](https://github.com/clamsproject/aapb-annotations/tree/main/scene-recognition/golds) repo,
 and `count-subtypes` defaults to `False`.
 
 # Usage
 To run the evaluation, run the following in the `sr-eval` directory:
 ```
-python evaluate.py -mmif-dir <pred_directory> -gold-dir <gold_directory> -count-subtypes True
+python evaluate.py --mmif-dir <pred_directory> --gold-dir <gold_directory> --count-subtypes True
 ```
 
 # Output Format
-Currently, the evaluation script produces two output files: `document-scores.csv` and `dataset-scores.csv`
-* `document-scores.csv` has the label scores by document, including a macro-average of label scores.
-* `dataset-scores.csv` has the total label scores across the dataset, including micro-averaged results.
+Currently, the evaluation script produces a `{guid}.csv` file for each document in the set of predictions, and a
+`dataset-scores.csv`.
+* `{guid}.csv` has the label scores for a given document, including a macro-average of label scores.
+* `dataset-scores.csv` has the total label scores across the dataset, including a final micro-average of all labels.
 
-These contain the precision, recall, and f1 scores by label. At the moment, the scores themselves are outputted in a
-dictionary format, but this is subject to change.
+These contain the precision, recall, and f1 scores by label. In each document's file, the first row
+is the negative label `-`; `dataset-scores` additionally has the `all` label as its second row, which
+represents the final micro-average of all the labels.
 
-# Notes
-As mentioned previously, this is the first version of this evaluation script and some things are subject to change
-including output format and location.
+The output files are placed in a directory whose name is derived from the final portion (split on `@`) of the basename
+of the given prediction directory. Using our format described in [Required Input](#required-input), this would result
+in the name being `scores@<BATCH-NAME>`.
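The Output Format paragraph above derives the scores directory name by taking the basename of the prediction directory, splitting on `@`, keeping the final portion, and prefixing it with `scores@`. Below is a minimal sketch of that rule; the helper name `scores_dir_name` is illustrative only and is not taken from `evaluate.py`.

```python
from pathlib import Path


def scores_dir_name(pred_dir: str) -> str:
    """Derive the output directory name following the README convention:
    take the basename, split on '@', keep the last portion (the batch name),
    and prefix it with 'scores@'. Hypothetical helper, not the script's code."""
    batch_name = Path(pred_dir).name.split("@")[-1]
    return f"scores@{batch_name}"


# e.g. "preds@app-swt-detection5.0@240117-aapb-collaboration-27-d"
#      -> "scores@240117-aapb-collaboration-27-d"
print(scores_dir_name("preds@app-swt-detection5.0@240117-aapb-collaboration-27-d"))
```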

sr-eval/goldretriever.py

-55
This file was deleted.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
+# Scenes With Text Evaluation Report
+
+
+## Report Instance of Evaluation Information
+* 2024-07-03
+* [app-SWT-detection](https://github.com/clamsproject/app-swt-detection), version 5.0
+* [sr-eval golds](https://github.com/clamsproject/aapb-annotations/tree/main/scene-recognition/golds).
+* [preds@app-swt-detection5.0@240117-aapb-collaboration-27-d](https://github.com/clamsproject/aapb-evaluations/tree/sr-eval-final/sr-eval/preds%40app-swt-detection5.0%40240117-aapb-collaboration-27-d).
+* [sr-eval/evaluate.py](https://github.com/clamsproject/aapb-evaluations/blob/sr-eval-final/sr-eval/evaluate.py).
+* `python evaluate.py -m /path/to/repo/aapb-evaluations/sr-eval/preds@app-swt-detection5.0@240117-aapb-collaboration-27-d`
+
+## Metrics
+This dataset was evaluated using the typical `Precision, Recall, F1` scores for classification. Specifically, each
+predicted timepoint label is compared to the gold standard label annotation. Each document
+(in this case the single `cpb-aacip-259-wh2dcb8p.csv`) has a set of macro-averaged scores across all labels, and the
+`dataset_scores.csv` contains the total (micro-averaged across the entire dataset) scores per label, and a final overall
+(micro-averaged) set of scores (the `all` row).
+Note that labels which do not appear (on a per-document basis or otherwise) are not included in these averages.
+
+## Results
+`cpb-aacip-259-wh2dcb8p.csv`
+
+| label   | precision           | recall              | f1                  |
+|---------|---------------------|---------------------|---------------------|
+| -       | 0.8874425727411945  | 0.8827113480578828  | 0.8850706376479572  |
+| B       | 1.0                 | 1.0                 | 1.0                 |
+| S       | 1.0                 | 0.8571428571428571  | 0.923076923076923   |
+| O       | 0.0                 | 0.0                 | 0.0                 |
+| L       | 1.0                 | 0.3333333333333333  | 0.5                 |
+| M       | 0.32142857142857145 | 0.75                | 0.45000000000000007 |
+| G       | 0.390625            | 0.5208333333333334  | 0.44642857142857145 |
+| Y       | 0.0                 | 0.0                 | 0.0                 |
+| F       | 0.6818181818181818  | 0.45918367346938777 | 0.548780487804878   |
+| P       | 0.5296610169491526  | 0.5605381165919282  | 0.5446623093681917  |
+| T       | 0.0                 | 0.0                 | 0.0                 |
+| I       | 1.0                 | 0.7142857142857143  | 0.8333333333333333  |
+| E       | 0.0                 | 0.0                 | 0.0                 |
+| C       | 0.5306122448979592  | 0.65                | 0.5842696629213483  |
+| R       | 0.0                 | 0.0                 | 0.0                 |
+| average | 0.4894391725223373  | 0.4485352250809625  | 0.44770812837208024 |
+
+`dataset_scores.csv` has the same values for each label, as in this case there is only one document. However, the
+overall micro-averaged set of scores (from the `all` row) is as follows:
+
+| label | precision          | recall             | f1                 |
+|-------|--------------------|--------------------|--------------------|
+| all   | 0.7916666666666666 | 0.7916666666666666 | 0.7916666666666666 |
+
+
+## Limitations/Issues
+* This script seems to be somewhat inefficient in terms of runtime. This may be due to not utilizing numerical
+libraries (like numpy), and it might be worthwhile in the future to refactor the script so that those libraries are
+used.
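The Metrics section of the report above distinguishes the per-document macro-average (an unweighted mean over labels that actually occur) from the dataset-level micro-average reported in the `all` row. The sketch below illustrates that distinction on toy data; the label lists are invented for the example, and this is not the report's actual computation.

```python
def per_label_scores(gold, pred):
    """Precision/recall/F1 per label from parallel lists of gold and predicted labels.
    Only labels that occur in the gold or predictions are scored, mirroring the
    report's note that absent labels are excluded from the averages."""
    scores = {}
    for lab in set(gold) | set(pred):
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    return scores


# Toy data, not from the evaluation.
gold = ["-", "-", "B", "S", "S", "-"]
pred = ["-", "B", "B", "S", "-", "-"]

scores = per_label_scores(gold, pred)
# Macro-average: unweighted mean of per-label F1 over the labels that occur.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
# Micro-average ("all" row): pool every timepoint decision; for single-label
# classification this collapses to plain accuracy, which is consistent with the
# identical precision/recall/F1 values in the `all` row above.
micro = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(scores, macro_f1, micro)
```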

sr-eval/requirements.txt

+2-2
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,3 @@
-mmif_python==1.0.14
+clams_utils
 pandas==2.0.3
-Requests==2.32.3
+Requests==2.32.3
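With the local `goldretriever.py` removed and `clams_utils` added to requirements.txt, the golds are presumably fetched through the shared utility package instead. A hedged sketch of what that call might look like; the module path and function name (`clams_utils.aapb.goldretriever.download_golds`) are assumptions based on other CLAMS evaluation code and are not confirmed by this diff.

```python
# Assumed API: clams_utils.aapb.goldretriever.download_golds(url) returns the path
# of a local directory containing the downloaded gold files. Verify against the
# installed clams_utils version before relying on this.
from clams_utils.aapb import goldretriever

GOLD_URL = ("https://github.com/clamsproject/aapb-annotations/tree/main/"
            "scene-recognition/golds")

gold_dir = goldretriever.download_golds(GOLD_URL)
print(gold_dir)  # directory containing the downloaded gold CSVs
```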
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
+,precision,recall,f1
+-,0.8874425727411945,0.8827113480578828,0.8850706376479572
+B,1.0,1.0,1.0
+S,1.0,0.8571428571428571,0.923076923076923
+O,0.0,0.0,0.0
+L,1.0,0.3333333333333333,0.5
+M,0.32142857142857145,0.75,0.45000000000000007
+G,0.390625,0.5208333333333334,0.44642857142857145
+Y,0.0,0.0,0.0
+F,0.6818181818181818,0.45918367346938777,0.548780487804878
+P,0.5296610169491526,0.5605381165919282,0.5446623093681917
+T,0.0,0.0,0.0
+I,1.0,0.7142857142857143,0.8333333333333333
+E,0.0,0.0,0.0
+C,0.5306122448979592,0.65,0.5842696629213483
+R,0.0,0.0,0.0
+average,0.4894391725223373,0.4485352250809625,0.44770812837208024
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
+,precision,recall,f1
+-,0.8874425727411945,0.8827113480578828,0.8850706376479572
+all,0.7916666666666666,0.7916666666666666,0.7916666666666666
+B,1.0,1.0,1.0
+S,1.0,0.8571428571428571,0.923076923076923
+O,0.0,0.0,0.0
+L,1.0,0.3333333333333333,0.5
+M,0.32142857142857145,0.75,0.45000000000000007
+G,0.390625,0.5208333333333334,0.44642857142857145
+Y,0.0,0.0,0.0
+F,0.6818181818181818,0.45918367346938777,0.548780487804878
+P,0.5296610169491526,0.5605381165919282,0.5446623093681917
+T,0.0,0.0,0.0
+I,1.0,0.7142857142857143,0.8333333333333333
+E,0.0,0.0,0.0
+C,0.5306122448979592,0.65,0.5842696629213483
+R,0.0,0.0,0.0
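The two CSVs above are plain pandas-readable tables whose unnamed first column holds the label. A quick way to inspect them is shown below; the paths are hypothetical (the scores directory follows the README's `scores@<BATCH-NAME>` convention), and note that the dataset file is spelled `dataset-scores.csv` in the README but `dataset_scores.csv` in the report, so check which one the script actually writes.

```python
import pandas as pd

# Hypothetical paths; substitute the actual scores directory produced by evaluate.py.
doc_scores = pd.read_csv(
    "scores@240117-aapb-collaboration-27-d/cpb-aacip-259-wh2dcb8p.csv", index_col=0)
dataset_scores = pd.read_csv(
    "scores@240117-aapb-collaboration-27-d/dataset-scores.csv", index_col=0)

print(doc_scores.loc["average"])   # per-document macro-average row
print(dataset_scores.loc["all"])   # dataset-wide micro-average row
```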
