Tidying uncertainties from evaluation
amyheather committed Jun 6, 2024
1 parent 8bfba4c commit 52c6950
Showing 4 changed files with 74 additions and 80 deletions.
5 changes: 3 additions & 2 deletions evaluation/artefacts.qmd
@@ -12,7 +12,8 @@ This page evaluates the extent to which the original study meets the recommendat

Of the **11** items in this audit:

* **10** were met fully (✅)
* **9** were met fully (✅)
* **1** was met partially (🟡)
* **1** was not met (❌)

| Item | Description | Met by study? | Evidence/location |
@@ -21,7 +22,7 @@ Of the **11** items in this audit:
| Open Researcher and Contributor ID (ORCID) | Is the model linked to one or more of the authors via an ORCID? | ✅ Fully | Monks linked to ORCID on Zenodo and in journal article |
| License | Does the repository have a recognised open license to control the use of code, liability and credit? | ✅ Fully | MIT license |
| Model overview | Is there an obvious file that provides an overview of the repository/model and its purpose (e.g. in README file)? | ✅ Fully | `README.md` present but doesn't overview the repository/model or its purpose, although it does link to `Capacity_Model_Template.ipynb`, which does overview the model |
| Link to published paper | Do models shared externally from journal articles contain a link to the published article? | ✅ Fully | Zenodo citations section includes link to the article |
| Link to published paper | Do models shared externally from journal articles contain a link to the published article? | 🟡 Partially | Zenodo citations section includes link to the article |
| Steps to run code | Does the readme file or similar describe the steps required to execute the simulation model? | ✅ Fully | Provided in `Capacity_Model_Template.ipynb` |
| Formal dependency management | Has a formal tool, e.g. renv, conda, or poetry been used to manage software dependencies for the simulation model? | ✅ Fully | Conda - `environment.yaml` |
| Informal dependency management | Has an informal list or description of software, or OS dependencies been provided? | ✅ Fully | Within `environment.yaml` |
8 changes: 5 additions & 3 deletions evaluation/badges.qmd
@@ -9,6 +9,8 @@ bibliography: ../quarto_site/references.bib

This page evaluates the extent to which the author-published research artefacts meet the criteria of badges related to reproducibility from various organisations and journals.

*Caveat: Please note that these criteria are based on available information about each badge online, and that we have allowed troubleshooting for execution and reproduction. We cannot guarantee that the badges below would have been awarded in practice by these journals.*

## Criteria

```{python}
@@ -23,9 +25,9 @@ criteria = {
'license': 'Includes an open license',
'relevant': '''Artefacts are relevant to and contribute to the article's results''',
'complete': 'Complete set of materials shared (as would be needed to fully reproduce article)',
'structure': 'Artefacts are well structured/organised (e.g. to the extent that reuse is facilitated, adhering to norms and standards of research community)',
'documentation_sufficient': 'Artefacts are sufficiently documented (eg. to understand how it works, to enable it to be run, including package versions)',
'documentation_careful': 'Artefacts are carefully documented (more than sufficient - e.g. to the extent that reuse is facilitated)',
'structure': 'Artefacts are well structured/organised (e.g. to the extent that reuse and repurposing is facilitated, adhering to norms and standards of research community)',
'documentation_sufficient': 'Artefacts are sufficiently documented (i.e. to understand how it works, to enable it to be run, including package versions)',
'documentation_careful': 'Artefacts are carefully documented (more than sufficient - i.e. to the extent that reuse and repurposing is facilitated (e.g. changing parameters, reusing for own purpose))',
# This criterion is kept separate from documentation_careful, as it specifically requires a README file
'documentation_readme': 'Artefacts are clearly documented and accompanied by a README file with step-by-step instructions on how to reproduce results in the manuscript',
'execute': 'Scripts can be successfully executed',
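
For context, a minimal sketch of how a dict like `criteria` could be rendered as a table on the Quarto page. The rendering code actually used in `badges.qmd` is outside this hunk, so the pandas approach below and the 'Item'/'Criterion' labels are assumptions; only keys visible in the diff above are reused.

```python
# Hypothetical sketch (not the code in badges.qmd): render the criteria dict as a
# one-column table, with the dict keys as the row index.
import pandas as pd

criteria = {
    'license': 'Includes an open license',
    'documentation_readme': ('Artefacts are clearly documented and accompanied by a '
                             'README file with step-by-step instructions on how to '
                             'reproduce results in the manuscript'),
}

criteria_table = pd.DataFrame.from_dict(criteria, orient='index', columns=['Criterion'])
criteria_table.index.name = 'Item'
print(criteria_table)
```
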
61 changes: 25 additions & 36 deletions evaluation/posts/2024_06_04/index.qmd
@@ -16,51 +16,40 @@ Evaluated study against guidelines (badges, sharing artefacts, and starting on S

### Badges

Evaluating artefacts as in <https://zenodo.org/records/3760626> (as has been copied into `original_study/`).

Felt uncertain around these criteria:

* `documentation_sufficient` - it was missing one package from the environment - but it had an environment file, package versions, and a README explaining how to set up the environment and referring to a notebook which runs the code that produces the output - hence, despite the missing package, I felt it met this criterion
* `documentation_careful` - I feel the documentation is minimal, but that it was actually still sufficient for running the model - and, as such, presume it might meet this criterion? Unless "reuse" refers to being able to use *and* change the model - in which case, there is not much guidance on how to change the parameters, only on how to run it so that it matches up with the paper - and so I don't think it meets this. I think that interpretation also makes sense, as it then distinguishes this criterion from "sufficient"?
* The criterion this relates to is from @association_for_computing_machinery_acm_artifact_2020: "The artifacts associated with the paper are of a quality that significantly exceeds minimal functionality. That is, they have all the qualities of the Artifacts Evaluated – Functional level, but, in addition, they are very carefully documented and well-structured to the extent that reuse and repurposing is facilitated. In particular, norms and standards of the research community for artifacts of this type are strictly adhered to"
* `execute` - yes, but with one change to the environment. Do they allow changes? I've selected yes, as I'm assuming minor troubleshooting is allowed, and the distinction between this and reproduction is that this is just about running the scripts, whilst reproduction is about getting sufficiently similar results. Execution is required in the ACM and IEEE criteria...
* From the @association_for_computing_machinery_acm_artifact_2020: "Included scripts and/or software used to generate the results in the associated paper can be successfully executed, and included data can be accessed and appropriately manipulated", with no mention of whether minor troubleshooting is allowed
* From the @institute_of_electrical_and_electronics_engineers_ieee_about_nodate: "runs to produce the outputs desired", with no mention of whether minor troubleshooting is allowed
* `regenerated` - as above, unclear whether modification is allowed. For simplicity, have assumed the definition we allow - that you can troubleshoot - but this might not align with the journals. Is this an issue?
* `hour` - failed this, but we weren't trying to do this within an hour. If I had been asked to do that (rather than spend time reading and thinking beforehand), I anticipate I could've run it (without going on to then add seeds etc). Hence, does it pass this?
* Psychological Science (@hardwicke_transparency_2023, @association_for_psychological_science_aps_psychological_2023) require reproduction within an hour, which I think implies some minor troubleshooting would be allowed?
* Others don't specify either way
* `documentation_readme` - I wouldn't say it explicitly meets this criterion, although the model was simple enough that the README could do the job anyway
Evaluating artefacts as in <https://zenodo.org/records/3760626> and <https://git.exeter.ac.uk/tmwm201/dialysis-service-delivery-covid19>.

**Uncertainties:**

* `execute`: "Scripts can be successfully executed" (and then `regenerated`: "Independent party regenerated results using the authors research artefacts")
* Required some minor changes to the environment in order for the scripts to run (see the environment-check sketch after this list)
* However, not certain whether the badges would allow minor troubleshooting (or major troubleshooting) in order for a script to run?
* By my criteria (allowing troubleshooting) this would be fine. The step up from this criterion is reproduction (for which I again allow troubleshooting) - so it would make sense that this level is about getting it to run, whilst the next step is about getting sufficiently similar results. May just need to add a **caveat** that this is with troubleshooting allowed (which may not be journal policy) - in the same way that we caveat that this is my interpretation of the badges from the available information, and cannot guarantee whether they would or would not have been awarded.
* `hour`: Reproduced within approximately one hour (excluding compute time)
* Took longer than an hour, but I wasn't trying to get it done in that time
* If I hadn't spent time reading and documenting and fiddling with the seeds, then I anticipate I could've run it within an hour
* However, **I'm assuming to follow our process and fail it** (for consistency with how we are working and timing)
* `documentation_readme`: "Artefacts are clearly documented and accompanied by a README file with step-by-step instructions on how to reproduce results in the manuscript"
* I wouldn't say it explicitly meets this criterion
* Although it was simple enough that it could do so anyway - it directed me to the notebook, I ran that, job done.
* **Uncertain on this one**
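
A minimal sketch of the kind of pre-run environment check implied above - assuming the conda-style `environment.yaml` named elsewhere in this evaluation; the parsing logic, package handling, and output messages are my own and not from the study's repository.

```python
# Hypothetical sketch (not from the study's repository): check that packages pinned
# in a conda-style environment file are installed at the expected versions.
from importlib.metadata import PackageNotFoundError, version

import yaml  # provided by the pyyaml package

with open("environment.yaml") as f:
    env = yaml.safe_load(f)

for dep in env.get("dependencies", []):
    # conda files mix plain strings ("pandas=1.0.3") with a nested {"pip": [...]}
    # entry; only simple string pins are handled here, and "python" itself is skipped
    if not isinstance(dep, str) or dep.split("=")[0] == "python":
        continue
    name, _, pinned = dep.partition("=")
    pinned = pinned.lstrip("=")  # tolerate pip-style "==" pins
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"MISSING: {name} (wanted {pinned or 'any version'})")
        continue
    if pinned and not installed.startswith(pinned):
        print(f"VERSION MISMATCH: {name} is {installed}, environment pins {pinned}")
```

A check like this would surface the sort of missing or mismatched package that needed minor troubleshooting here, before attempting a full reproduction run.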

### Sharing research artefacts

Evaluated against recommendations for sharing of research artefacts (STARS, and Monks and Harper best practice audit).
Evaluated again, this time against recommendations for sharing of research artefacts (STARS, and Monks and Harper best practice audit).

Uncertainties:
**Uncertainties:**

* Readme for audit - does it matter if that information is in the readme file? Can they pass if the readme links to a template which overviews it? I think so? Although Table 1 in Monks and Harper 2023 specifies the item as a readme, the description is just whether there is an obvious file overviewing the purpose
* Citation information - not provided for this study. Zenodo does provide it, but as the code materials themselves don't contain it, I feel it fails this
* Remote code repository - STARS talks about why you should use them (e.g. version control) - but here, when evaluating, someone might have done a single upload and not used the repository properly. The purpose of STARS is not to evaluate whether an upload is right or wrong, so it doesn't specify further - in this case, assuming the simplest route: if it is on something like GitHub then it meets the criterion
* Enhanced documentation - lots of suggestions of what to do for this
* Remote execution/Online coding environment - Zenodo doesn't but GitLab does. I was largely working with Zenodo as that was the archived version, but in this case, did say it met the criterion via GitLab. This shows the importance of which code is chosen when there are two versions - I didn't initially think it met the criterion until I checked the GitLab too. Have recommended both are referred to. But do they meet the criterion if changes are made to the repository after publication? After submission? For example, if we asked them to add a license.
* `README`: "Does the readme file or similar describe the steps required to execute the simulation model?"
* For this, is it sufficient that a README is provided which explains the environment and links to a template that runs the model? I'm assuming so.
* However, this decision differs from documentation_readme above, although that requires "clear" documentation
* Do they meet the criteria if changes are made to a repository **after** publication? (this is general, not specific to this test-run)
* Example: If we asked them to add a licence - do we then fail them on that? I'm presuming so?

::: {.callout-tip}
## Keeping code repository and archive in sync and up-to-date

Evaluation was made harder as there were slight differences between the archived code (Zenodo) and the production code (GitLab). You might not always want to keep them in sync, but it would be good to do so up until submission.
:::

## Reporting guidelines
### Evaluation against reporting guidelines (STRESS-DES pt. 1)

These were evaluated based ONLY on the journal article and supplementary materials (and not on the linked code repository).

STRESS uncertainties:

* Model outputs - "Specify how and when they are calculated during the model run along with how any measures of error such as confidence intervals are calculated" - I find this criterion quite hard to apply? Here, it's just the number of patients and the additional time during the model run, reported as median and extreme values. Does it need more detail than that to meet it, though? (one common way such intervals are calculated is sketched below this list)
* Experimentation aims - is it applicable? The DES model, as I understand it, just tests one scenario - a worst case scenario - so it's not really "scenario based analysis" comparing scenarios - but it does have *a* scenario and a justification for it?
* Algorithms - is this applicable in this example? Parameters are just from sampling from distributions, or proportions. Assuming it's applicable, and that describing those distributions answers this question?
* Queues - wasn't certain about this, as I don't really know what order they get allocated to units. But that's not a queue? But it kind of is - in that, at the start of the day, they are waiting to be assigned?
* Entry/exit points - have said not met as I can't find this stated, but also, I'm not sure what the answer is. They are in the model on the days they need dialysis, and then leave again? Unless they get unallocated? And then they never return if their state turns to dead?
Started work on evaluating against STRESS-DES.
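
As an illustration of the "measures of error" wording in the model outputs item above - this is only a sketch of one common approach (a t-based confidence interval across replications) and is not taken from the study's code; the function name and the example numbers are made up.

```python
# Hypothetical sketch (not the study's code): t-distribution confidence interval
# for a model output estimated across independent replications.
import numpy as np
from scipy import stats


def replication_ci(results, alpha=0.05):
    """Return the mean and (1 - alpha) confidence interval across replications."""
    results = np.asarray(results, dtype=float)
    n = len(results)
    mean = results.mean()
    half_width = stats.t.ppf(1 - alpha / 2, df=n - 1) * results.std(ddof=1) / np.sqrt(n)
    return mean, (mean - half_width, mean + half_width)


# Illustrative numbers only, e.g. an output recorded in each of five replications
print(replication_ci([12, 15, 11, 14, 13]))
```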

::: {.callout-tip}
## Including checklist