
Commit 5d87995

Added score report and score outputs (+ directory), modified README & requirements.txt, and removed goldretriever.py, all addressing requested changes in PR
1 parent 253d4e3 commit 5d87995

6 files changed (+106 -68 lines)

sr-eval/README.md

+17-11
@@ -1,6 +1,7 @@
 # Scene Recognition Evaluation
-This involves evaluating the results of the scenes-with-text classification task.
-The goal is to have a simple way of comparing different results from SWT.
+This involves evaluating the results of the scenes-with-text classification task. While SWT returns both timepoint and
+timeframe annotations, this subdirectory is focused on timepoints.
+The goal is to have a simple way of comparing different results from SWT.
 
 # Required Input
 To run this evaluation script, you need the following:
@@ -13,24 +14,29 @@ using goldretriever.py, or your own set that exactly matches the format present
 There are three arguments when running the script: `-mmif-dir`, `-gold-dir`, and `count-subtypes`.
 The first two are directories that contain the predictions and golds, respectively. The third is a boolean value that
 determines if the evaluation takes into account subtype labels or not.
+* Our standard for naming prediction (mmif) directories is as follows:
+  * `preds@app-swt-detection<VERSION-NUMBER>@<BATCH-NAME>`.
+
 Note that only the first one is required, as `-gold-dir` defaults to the set of golds downloaded (using `goldretriever`)
 from the [aapb-annotations](https://github.com/clamsproject/aapb-annotations/tree/main/scene-recognition/golds) repo,
 and `count-subtypes` defaults to `False`.
 
 # Usage
 To run the evaluation, run the following in the `sr-eval` directory:
 ```
-python evaluate.py -mmif-dir <pred_directory> -gold-dir <gold_directory> -count-subtypes True
+python evaluate.py --mmif-dir <pred_directory> --gold-dir <gold_directory> --count-subtypes True
 ```
 
 # Output Format
-Currently, the evaluation script produces two output files: `document-scores.csv` and `dataset-scores.csv`
-* `document-scores.csv` has the label scores by document, including a macro-average of label scores.
-* `dataset-scores.csv` has the total label scores across the dataset, including micro-averaged results.
+Currently, the evaluation script produces a `{guid}.csv` file for each document in the set of predictions, and a
+`dataset-scores.csv`.
+* `{guid}.csv` has the label scores for a given document, including a macro-average of label scores.
+* `dataset-scores.csv` has the total label scores across the dataset, including a final micro-average of all labels.
 
-These contain the precision, recall, and f1 scores by label. At the moment, the scores themselves are outputted in a
-dictionary format, but this is subject to change.
+These contain the precision, recall, and f1 scores by label. In each document's file, the first row
+is the negative label `-`; `dataset-scores` additionally has the `all` label as its second row, which
+represents the final micro-average of all the labels.
 
-# Notes
-As mentioned previously, this is the first version of this evaluation script and some things are subject to change
-including output format and location.
+The output files are placed in a directory whose name is derived from the final portion (split on `@`) of the basename
+of the given prediction directory. Using our format described in [Required Input](#required-input), this would result
+in the name being `scores@<BATCH-NAME>`.
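The Output Format paragraph above derives the scores directory name by taking the basename of the prediction directory, splitting on `@`, keeping the final portion, and prefixing it with `scores@`. Below is a minimal sketch of that rule; the helper name `scores_dir_name` is illustrative only and is not taken from `evaluate.py`.

```python
from pathlib import Path


def scores_dir_name(pred_dir: str) -> str:
    """Derive the output directory name following the README convention:
    take the basename, split on '@', keep the last portion (the batch name),
    and prefix it with 'scores@'. Hypothetical helper, not the script's code."""
    batch_name = Path(pred_dir).name.split("@")[-1]
    return f"scores@{batch_name}"


# e.g. "preds@app-swt-detection5.0@240117-aapb-collaboration-27-d"
#      -> "scores@240117-aapb-collaboration-27-d"
print(scores_dir_name("preds@app-swt-detection5.0@240117-aapb-collaboration-27-d"))
```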

sr-eval/goldretriever.py

-55
This file was deleted.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
+# Scenes With Text Evaluation Report
+
+
+## Report Instance of Evaluation Information
+* 2024-07-03
+* [app-SWT-detection](https://github.com/clamsproject/app-swt-detection), version 5.0
+* [sr-eval golds](https://github.com/clamsproject/aapb-annotations/tree/main/scene-recognition/golds).
+* [preds@app-swt-detection5.0@240117-aapb-collaboration-27-d](https://github.com/clamsproject/aapb-evaluations/tree/sr-eval-final/sr-eval/preds%40app-swt-detection5.0%40240117-aapb-collaboration-27-d).
+* [sr-eval/evaluate.py](https://github.com/clamsproject/aapb-evaluations/blob/sr-eval-final/sr-eval/evaluate.py).
+* `python evaluate.py -m /path/to/repo/aapb-evaluations/sr-eval/preds@app-swt-detection5.0@240117-aapb-collaboration-27-d`
+
+## Metrics
+This dataset was evaluated using the typical `Precision, Recall, F1` scores for classification. Specifically, each
+predicted timepoint label is compared to the gold standard label annotation. Each document
+(in this case the single `cpb-aacip-259-wh2dcb8p.csv`) has a set of macro-averaged scores across all labels, and the
+`dataset_scores.csv` contains the total (micro-averaged across the entire dataset) scores per label, and a final overall
+(micro-averaged) set of scores (the `all` row).
+Note that labels which do not appear (on a per-document basis or otherwise) are not included in these averages.
+
+## Results
+`cpb-aacip-259-wh2dcb8p.csv`
+
+| label   | precision           | recall              | f1                  |
+|---------|---------------------|---------------------|---------------------|
+| -       | 0.8874425727411945  | 0.8827113480578828  | 0.8850706376479572  |
+| B       | 1.0                 | 1.0                 | 1.0                 |
+| S       | 1.0                 | 0.8571428571428571  | 0.923076923076923   |
+| O       | 0.0                 | 0.0                 | 0.0                 |
+| L       | 1.0                 | 0.3333333333333333  | 0.5                 |
+| M       | 0.32142857142857145 | 0.75                | 0.45000000000000007 |
+| G       | 0.390625            | 0.5208333333333334  | 0.44642857142857145 |
+| Y       | 0.0                 | 0.0                 | 0.0                 |
+| F       | 0.6818181818181818  | 0.45918367346938777 | 0.548780487804878   |
+| P       | 0.5296610169491526  | 0.5605381165919282  | 0.5446623093681917  |
+| T       | 0.0                 | 0.0                 | 0.0                 |
+| I       | 1.0                 | 0.7142857142857143  | 0.8333333333333333  |
+| E       | 0.0                 | 0.0                 | 0.0                 |
+| C       | 0.5306122448979592  | 0.65                | 0.5842696629213483  |
+| R       | 0.0                 | 0.0                 | 0.0                 |
+| average | 0.4894391725223373  | 0.4485352250809625  | 0.44770812837208024 |
+
+`dataset_scores.csv` has the same values for each label, as in this case there is only one document. However, the
+overall micro-averaged set of scores (from the `all` row) is as follows:
+
+| label | precision          | recall             | f1                 |
+|-------|--------------------|--------------------|--------------------|
+| all   | 0.7916666666666666 | 0.7916666666666666 | 0.7916666666666666 |
+
+
+## Limitations/Issues
+* This script seems to be somewhat inefficient in terms of runtime. This may be due to not utilizing numerical
+libraries (like numpy), and it might be worthwhile in the future to refactor the script so that those libraries are
+used.
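The Metrics section of the report above distinguishes the per-document macro-average (an unweighted mean over labels that actually occur) from the dataset-level micro-average reported in the `all` row. The sketch below illustrates that distinction on toy data; the label lists are invented for the example, and this is not the report's actual computation.

```python
def per_label_scores(gold, pred):
    """Precision/recall/F1 per label from parallel lists of gold and predicted labels.
    Only labels that occur in the gold or predictions are scored, mirroring the
    report's note that absent labels are excluded from the averages."""
    scores = {}
    for lab in set(gold) | set(pred):
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    return scores


# Toy data, not from the evaluation.
gold = ["-", "-", "B", "S", "S", "-"]
pred = ["-", "B", "B", "S", "-", "-"]

scores = per_label_scores(gold, pred)
# Macro-average: unweighted mean of per-label F1 over the labels that occur.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
# Micro-average ("all" row): pool every timepoint decision; for single-label
# classification this collapses to plain accuracy, which is consistent with the
# identical precision/recall/F1 values in the `all` row above.
micro = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(scores, macro_f1, micro)
```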

sr-eval/requirements.txt

+2-2
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,3 @@
-mmif_python==1.0.14
+clams_utils
 pandas==2.0.3
-Requests==2.32.3
+Requests==2.32.3
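With the local `goldretriever.py` removed and `clams_utils` added to requirements.txt, the golds are presumably fetched through the shared utility package instead. A hedged sketch of what that call might look like; the module path and function name (`clams_utils.aapb.goldretriever.download_golds`) are assumptions based on other CLAMS evaluation code and are not confirmed by this diff.

```python
# Assumed API: clams_utils.aapb.goldretriever.download_golds(url) returns the path
# of a local directory containing the downloaded gold files. Verify against the
# installed clams_utils version before relying on this.
from clams_utils.aapb import goldretriever

GOLD_URL = ("https://github.com/clamsproject/aapb-annotations/tree/main/"
            "scene-recognition/golds")

gold_dir = goldretriever.download_golds(GOLD_URL)
print(gold_dir)  # directory containing the downloaded gold CSVs
```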
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
+,precision,recall,f1
+-,0.8874425727411945,0.8827113480578828,0.8850706376479572
+B,1.0,1.0,1.0
+S,1.0,0.8571428571428571,0.923076923076923
+O,0.0,0.0,0.0
+L,1.0,0.3333333333333333,0.5
+M,0.32142857142857145,0.75,0.45000000000000007
+G,0.390625,0.5208333333333334,0.44642857142857145
+Y,0.0,0.0,0.0
+F,0.6818181818181818,0.45918367346938777,0.548780487804878
+P,0.5296610169491526,0.5605381165919282,0.5446623093681917
+T,0.0,0.0,0.0
+I,1.0,0.7142857142857143,0.8333333333333333
+E,0.0,0.0,0.0
+C,0.5306122448979592,0.65,0.5842696629213483
+R,0.0,0.0,0.0
+average,0.4894391725223373,0.4485352250809625,0.44770812837208024
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
+,precision,recall,f1
+-,0.8874425727411945,0.8827113480578828,0.8850706376479572
+all,0.7916666666666666,0.7916666666666666,0.7916666666666666
+B,1.0,1.0,1.0
+S,1.0,0.8571428571428571,0.923076923076923
+O,0.0,0.0,0.0
+L,1.0,0.3333333333333333,0.5
+M,0.32142857142857145,0.75,0.45000000000000007
+G,0.390625,0.5208333333333334,0.44642857142857145
+Y,0.0,0.0,0.0
+F,0.6818181818181818,0.45918367346938777,0.548780487804878
+P,0.5296610169491526,0.5605381165919282,0.5446623093681917
+T,0.0,0.0,0.0
+I,1.0,0.7142857142857143,0.8333333333333333
+E,0.0,0.0,0.0
+C,0.5306122448979592,0.65,0.5842696629213483
+R,0.0,0.0,0.0
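The two CSVs above are plain pandas-readable tables whose unnamed first column holds the label. A quick way to inspect them is shown below; the paths are hypothetical (the scores directory follows the README's `scores@<BATCH-NAME>` convention), and note that the dataset file is spelled `dataset-scores.csv` in the README but `dataset_scores.csv` in the report, so check which one the script actually writes.

```python
import pandas as pd

# Hypothetical paths; substitute the actual scores directory produced by evaluate.py.
doc_scores = pd.read_csv(
    "scores@240117-aapb-collaboration-27-d/cpb-aacip-259-wh2dcb8p.csv", index_col=0)
dataset_scores = pd.read_csv(
    "scores@240117-aapb-collaboration-27-d/dataset-scores.csv", index_col=0)

print(doc_scores.loc["average"])   # per-document macro-average row
print(dataset_scores.loc["all"])   # dataset-wide micro-average row
```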
