Planning Documentation/Templates for Future Automation of Evals #37

Open
3 tasks
jarumihooi opened this issue Dec 13, 2023 · 1 comment
jarumihooi commented Dec 13, 2023

Because

the goal is to improve the automation of future evaluation tasks (as opposed to retrofitting current evaluations to run automatically), we should brainstorm which components could be used to make running the evaluations more automatic.

This issue focuses on which formats, templates, and common practices should be adopted to allow for better automation of this process.

Done when

  • A general template for future tasks, using goldretriever and sharing the same invocation, should be created; future tasks can then follow it. Importantly, it should include some error handling, especially basic sanity checks such as verifying that the expected number of output files is produced (see the sketch after this list).
  • A template for READMEs should be created that describes, until full automation is in place, how to run the apps.
  • A template for the reports generated by each evaluation should be created. However, how similar each task's report should be is still under discussion, and it remains to be determined how much of the report can be generated automatically.
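As a starting point, here is a minimal sketch of what the shared invocation and sanity check could look like, assuming a per-task `evaluate()` function and `--preds`/`--golds`/`--out` flags; all names here are hypothetical, not an existing interface:

```python
"""Hypothetical skeleton for an evaluation task following a shared invocation.

Assumptions (not from the issue itself): argparse flags --preds/--golds/--out,
a task-specific evaluate() function, and one report per prediction MMIF.
"""
import argparse
import sys
from pathlib import Path


def evaluate(pred_files, gold_dir, out_dir):
    """Task-specific evaluation logic goes here; returns paths of written reports."""
    raise NotImplementedError


def main():
    parser = argparse.ArgumentParser(description="Run one AAPB evaluation task.")
    parser.add_argument("--preds", required=True, help="directory of prediction MMIF files")
    parser.add_argument("--golds", required=True, help="directory of gold annotations (e.g. fetched via goldretriever)")
    parser.add_argument("--out", default="results", help="directory for generated reports")
    args = parser.parse_args()

    pred_files = sorted(Path(args.preds).glob("*.mmif"))
    if not pred_files:
        sys.exit(f"Sanity check failed: no .mmif files found in {args.preds}")

    out_dir = Path(args.out)
    out_dir.mkdir(parents=True, exist_ok=True)
    reports = evaluate(pred_files, Path(args.golds), out_dir)

    # Basic sanity check: one report per prediction file.
    if len(reports) != len(pred_files):
        sys.exit(f"Sanity check failed: expected {len(pred_files)} reports, got {len(reports)}")


if __name__ == "__main__":
    main()
```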

Additional context

Questions to consider for moving towards automation:

  • How does one make the process more automatic than downloading the eval repo, placing the MMIF predictions in the correct place, and running the code?
    • Should the evaluation code be wrapped in a Docker app that also contains the environment and modules needed to run it?
    • Perhaps one could run a single app, select the task to evaluate, and let the app handle the rest? (See the dispatcher sketch after this list.)
  • Where should generated MMIFs or other system outputs be placed? Should any of that be automatic?
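To illustrate the single-app idea from the second sub-question, here is a rough dispatcher sketch; the task names and the convention that each task module exposes a `run(preds, golds)` function are assumptions for illustration only:

```python
"""Hypothetical single-entry-point dispatcher for evaluation tasks.

The task names and module layout below are illustrative assumptions,
not the actual structure of the aapb-evaluations repository.
"""
import argparse
import importlib

# Assumed mapping from task name to a module exposing run(preds, golds).
TASKS = {
    "asr": "asr_eval.evaluate",
    "ner": "ner_eval.evaluate",
    "scene-recognition": "sr_eval.evaluate",
}


def main():
    parser = argparse.ArgumentParser(description="Dispatch to one evaluation task.")
    parser.add_argument("task", choices=sorted(TASKS), help="which evaluation to run")
    parser.add_argument("--preds", required=True, help="directory of prediction MMIFs")
    parser.add_argument("--golds", required=True, help="directory of gold annotations")
    args = parser.parse_args()

    module = importlib.import_module(TASKS[args.task])
    module.run(args.preds, args.golds)  # each task module handles the rest


if __name__ == "__main__":
    main()
```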

What other to-dos and concerns should be addressed to improve this process flow?

@jarumihooi jarumihooi changed the title Updates and Planning for More Automation of Evals Planning Documentation/Templates for Future Automation of Evals Dec 13, 2023
jarumihooi (Contributor, Author) commented:

How does one make the process more automatic than downloading the eval repo, placing the MMIF predictions in the correct place, and running the code?

It seems like the evaluation could be triggered automatically when a new commit of prediction MMIFs lands in a certain place, for instance inside a task subdirectory of the aapb-evaluations repository. Such a commit could trigger an Action that runs the evaluation code.
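A minimal sketch of such a trigger as a GitHub Actions workflow, assuming a hypothetical predictions/<task>/ layout for committed MMIFs and a run.py entry point; the paths, file names, and flags below are illustrative assumptions, not the repository's actual structure:

```yaml
# Hypothetical workflow: run an evaluation when prediction MMIFs are pushed.
# The predictions/ layout and run.py entry point are assumptions, not the
# repository's actual structure.
name: run-evaluation
on:
  push:
    paths:
      - 'predictions/**/*.mmif'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: python run.py --preds predictions --golds golds --out results
```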
