From 52c6950369cdb4a42688bb6411e1c3b172303aef Mon Sep 17 00:00:00 2001 From: amyheather Date: Thu, 6 Jun 2024 10:21:58 +0100 Subject: [PATCH] Tidying uncertainties from evaluation --- evaluation/artefacts.qmd | 5 +- evaluation/badges.qmd | 8 ++- evaluation/posts/2024_06_04/index.qmd | 61 +++++++++----------- evaluation/posts/2024_06_05/index.qmd | 80 ++++++++++++++------------- 4 files changed, 74 insertions(+), 80 deletions(-) diff --git a/evaluation/artefacts.qmd b/evaluation/artefacts.qmd index d28fdbe..76cfd83 100644 --- a/evaluation/artefacts.qmd +++ b/evaluation/artefacts.qmd @@ -12,7 +12,8 @@ This page evaluates the extent to which the original study meets the recommendat Of the **11** items in this audit: -* **10** were met fully (✅) +* **9** were met fully (✅) +* **1** was met partially (🟡) * **1** was not met (❌) | Item | Description | Met by study? | Evidence/location | @@ -21,7 +22,7 @@ Of the **11** items in this audit: | Open Researcher and Contributor ID (ORCID) | Is the model linked to one or more of the authors via an ORCID? | ✅ Fully | Monks linked to ORCID on Zenodo and in journal article | | License | Does the repository have a recognised open license to control the use of code, liabilty and credit? | ✅ Fully | MIT license | | Model overview | Is there an obvious file that provides an overview of the repository/model and it purpose (e.g. in README file)? | ✅ Fully | `README.md` present but doesn't overview repository/model or purpose, although does like to `Capacity_Model_Template.ipynb` which does overview the model | -| Link to published paper | Do models shared externally from journal articles contain a link to the published article? | ✅ Fully | Zenodo citations section includes link to the article | +| Link to published paper | Do models shared externally from journal articles contain a link to the published article? | 🟡 Partially | Zenodo citations section includes link to the article | | Steps to run code | Does the readme file or similar describe the steps required to execute the simulation model? | ✅ Fully | Provided in `Capacity_Model_Template.ipynb` | | Formal dependency management | Has a formal tool, e.g. renv, conda, or poetry been used to manage software dependencies for the simulation model? | ✅ Fully | Conda - `environment.yaml` | | Informal dependency management | Has an informal list or description of software, or OS dependencies been provided? | ✅ Fully | Within `environment.yaml` | diff --git a/evaluation/badges.qmd b/evaluation/badges.qmd index 00dfb62..7bba198 100644 --- a/evaluation/badges.qmd +++ b/evaluation/badges.qmd @@ -9,6 +9,8 @@ bibliography: ../quarto_site/references.bib This page evaluates the extent to which the author-published research artefacts meet the criteria of badges related to reproducibility from various organisations and journals. +*Caveat: Please note that these criteria are based on available information about each badge online, and that we have allowed troubleshooting for execution and reproduction. We cannot guarantee that the badges below would have been awarded in practice by these journals.* + ## Criteria ```{python} @@ -23,9 +25,9 @@ criteria = { 'license': 'Includes an open license', 'relevant': '''Arefacts are relevant to and contribute to the article's results''', 'complete': 'Complete set of materials shared (as would be needed to fully reproduce article)', - 'structure': 'Artefacts are well structured/organised (e.g. 
to the extent that reuse is facilitated, adhering to norms and standards of research community)', - 'documentation_sufficient': 'Artefacts are sufficiently documented (eg. to understand how it works, to enable it to be run, including package versions)', - 'documentation_careful': 'Artefacts are carefully documented (more than sufficient - e.g. to the extent that reuse is facilitated)', + 'structure': 'Artefacts are well structured/organised (e.g. to the extent that reuse and repurposing is facilitated, adhering to norms and standards of research community)', + 'documentation_sufficient': 'Artefacts are sufficiently documented (i.e. to understand how it works, to enable it to be run, including package versions)', + 'documentation_careful': 'Artefacts are carefully documented (more than sufficient - i.e. to the extent that reuse and repurposing is facilitated (e.g. changing parameters, reusing for own purpose))', # This criteria is kept seperate to documentation_careful, as it specifically requires a README file 'documentation_readme': 'Artefacts are clearly documented and accompanied by a README file with step-by-step instructions on how to reproduce results in the manuscript', 'execute': 'Scripts can be successfully executed', diff --git a/evaluation/posts/2024_06_04/index.qmd b/evaluation/posts/2024_06_04/index.qmd index a0f198d..d6a7fca 100644 --- a/evaluation/posts/2024_06_04/index.qmd +++ b/evaluation/posts/2024_06_04/index.qmd @@ -16,51 +16,40 @@ Evaluated study against guidelines (badges, sharing artefacts, and starting on S ### Badges -Evaluating artefacts as in (as has been copied into `original_study/`). - -Felt uncertain around these criteria: - -* `documentation_sufficient` - it was missing one package from environment - but it had an environment file, package versions, and a README explaining how to set up environment and referring to a notebook which runs the code that produces the output - hence, despite missing one, I felt it met this criteria -* `documentation_careful` - I feel the documentation is minimal, but that it was actually still sufficient for running the model - and, as such, presume it might meet this criteria? Unless "reuse" refers to being able to use and change - in which case, there is not much guidance on how to change the parameters, only on how to run it matching up with the paper? In which case, I don't think it meets these. I think that interpretation also makes sense, as it then distinguishes from "sufficient"? - * The criteria this relates to is from the @association_for_computing_machinery_acm_artifact_2020 : "The artifacts associated with the paper are of a quality that significantly exceeds minimal functionality. That is, they have all the qualities of the Artifacts Evaluated – Functional level, but, in addition, they are very carefully documented and well-structured to the extent that reuse and repurposing is facilitated. In particular, norms and standards of the research community for artifacts of this type are strictly adhered to" -* `execute` - yes, but with one change to environment. Do they allow changes? I've selected yes, as I'm assuming minor troubleshooting is allowed, and the distinction between this and reproduction is that this is just about running scripts, whilst reproduction is about getting sufficiently similar results. Execution required in ACM and IEEE criteria... 
- * From the @association_for_computing_machinery_acm_artifact_2020: "Included scripts and/or software used to generate the results in the associated paper can be successfully executed, and included data can be accessed and appropriately manipulated", with no mention of whether minor troubleshooting is allowed - * From the @institute_of_electrical_and_electronics_engineers_ieee_about_nodate: "runs to produce the outputs desired", with no mention of whether minor troubleshooting is allowed -* `regenerated` - as above, unclear whether modification is allowed. For simplicity, have assumed definition we allowed - that you can troubleshoot - but this might not align with journals. Is this an issue? -* `hour` - failed this, but we weren't trying to do this within an hour. If I had been asked to do that (rather than spend time reading and thinking beforehand), I anticipate I could've run it (without going on to then add seeds etc). Hence, does it pass this? - * Psychological Science (@hardwicke_transparency_2023, @association_for_psychological_science_aps_psychological_2023) require reproduction within an hour, which I think implies some minor troubleshooting would be allowed? - * Others don't specify either way -* `documentation_readme` - I wouldn't say it explicitly meets this criteria, although it was simple enough that it could do it anyway +Evaluating artefacts as in and . + +**Uncertainties:** + +* `execute`: "Scripts can be successfully executed" (and then `regenerated`: "Independent party regenerated results using the authors research artefacts") + * Required some minor changes to the environment in order for the scripts to run + * However, not certain whether the badges would allow minor troubleshooting (or major troubleshooting) in order for a script to run? + * By my criteria (allowing troubleshooting) this would be fine. The step up from this criteria is reproduction (for which I again allow troubleshooting) - so it would make sense that this is getting it to run, whilst the next step is getting sufficiently similar results. May just need to add a **caveat** that this is with troubleshooting allowed (which may not be journal policy) - in the same way as the caveat that this is my interpretation of the badges from available information, and I cannot guarantee whether they would or would not have been awarded. +* `hour`: Reproduced within approximately one hour (excluding compute time) + * Took longer than an hour, but I wasn't trying to get it done in that time + * If I hadn't spent time reading and documenting and fiddling with the seeds, then I anticipate I could've run it within an hour + * However, **I'm going to follow our process and fail it** (for consistency with how we are working and timing) +* `documentation_readme`: "Artefacts are clearly documented and accompanied by a README file with step-by-step instructions on how to reproduce results in the manuscript" + * I wouldn't say it explicitly meets this criteria + * Although it was simple enough that it could do it anyway - it directed me to the notebook, I ran that, job done. + * **Uncertain on this one** ### Sharing research artefacts -Evaluated against recommendations for sharing of research artefacts (STARS, and Monks and Harper best practice audit). +Evaluated again, this time against recommendations for sharing of research artefacts (STARS, and Monks and Harper best practice audit). -Uncertainities: +**Uncertainties:** -* Readme for audit - does it matter if that information is in the readme file? can they pass if readme links to template which overviews it? I think so?
Although Table 1 in Monks and Harper 2023 specifies item as Readme, the description is just whether there is an obvious file overviewing purpose -* Citation information - not provided for this study, although Zenodo does provide it, but as code materials themselves don't contain it, feel it fails this -* Remote code repository - STARS talks about why you should use them (e.g. version control) - but here, evaluating, someone might have done a single upload and not used properly - but purpose of STARS is not to evaluate if something is right or wrong upload so doesn't specify further - in this case, assuming simplest route of if it is on something like GitHub then it meets criteria -* Enhanced documentation - lots of suggestions of what to do for this -* Remote execution/Online coding environment - Zenodo doesn't but GitLab does. I was largely working with Zenodo as that was the archived version, but in this case, did say met criteria via GitLab. Importance of which code is chosen when there are two versions. I didn't initially think it met criteria until checked the GitLab too. Have recommended both are referred to. But do they meet criteria if changes are made to repository after publication? After submission? For example, if we asked them to add a license. +* `README`: "Does the readme file or similar describe the steps required to execute the simulation model?" + * For this, is it sufficient that a README is provided which explains the environment and links to the template that runs the model? I'm assuming so. + * However, this decision differs from documentation_readme above, although that requires "clear" documentation +* Do they meet the criteria if changes are made to a repository **after** publication? (this is general, not specific to this test-run) + * Example: If we asked them to add a licence - do we then fail them on that? I'm presuming so? ::: {.callout-tip} ## Keeping code repository and archive in sync and up-to-date Evaluation was made harder as there were slight differences between the archived code (Zenodo) and the production code (GitLab). You might not always want to keep them in sync. But it would be good to keep them in sync up until submission. ::: -## Reporting guidelines +### Evaluation against reporting guidelines (STRESS-DES pt. 1) These were evaluated based ONLY on the journal article and supplementary materials (and not on the linked code repository). -STRESS uncertanties: -* Model outputs - "Specify how and when they are calculated during the model run along with how any measures of error such as confidence intervals are calculated" - I find this criteria quite hard to apply? Here, it's just N patients and additional time during model, median and extreme. Does it need more detail than that to meet it though? -* Experiementation aims - is it applicable? The DES model, as I understand, just tests one scenario - a worst case scenario - so it's not really "scenario based analysis" comparing scenarios - but it does have A scenario and justification for it? -* Algorithms - is this applicable in this example? Parameters are just from sampling from distributions, or proportions. Assuming its applicable, and that distribing those distributions answers this question? -* Queues - wasn't certain about this, as I don't really know what order they get allocated to units. but that's not a queue? but it kind of is - in that, at the start of the day, they are waiting to be assigned?
-* Entry/exit points - have said not met as can't find this stated, but also, not sure what the answer is. They are in model on days they need dialysis, and then leave again? unless get unallocated? and then they never return if their state turns to dead? +Started work on evaluating against STRESS-DES. ::: {.callout-tip} ## Including checklist diff --git a/evaluation/posts/2024_06_05/index.qmd b/evaluation/posts/2024_06_05/index.qmd index 20e0c71..5e64e5e 100644 --- a/evaluation/posts/2024_06_05/index.qmd +++ b/evaluation/posts/2024_06_05/index.qmd @@ -13,31 +13,44 @@ Evaluation against reporting guidelines - finished STRESS-DES, and did for ISPOR ## Work log -### Evaluation against reporting guidelines +### Evaluation against reporting guidelines (pt. 2) -Uncertainties for STRESS-DES: +Finished STRESS-DES. -* Software or programming language - "Where frameworks and libraries have been used provide all details including version numbers." - not in the report, but is in the linked code. Passed them anyway, as wouldn't think you would put this information in the report? -* Model execution - "State the event processing mechanism used e.g. three phase, event, activity, process interaction." - not sure what this is... - * `Simulation - The Practice of Model Development and Use` (Stewart Robinson): - * There are a few ways to model progress of time - one is time-slicing method, and other is discrete-event simulation (three phase method). "A number of machisms have been proposed for carrying out discrete-event simulation, among them are the **event-based, activity-based, process-based and three-phase approaches**. For a detailed discussion on these see Pidd (1998)" - * **Three phase** - Events are classified as B (bound or booked) or C (conditional). B events are state changes scheduled at a point in time (often relates to arrivals or finishing an activity). C events are state changes that depend on cidions of a model (e.g. needing to wait until operator is free, often relat to start of an activity) - * [Intro to DES](https://www.cs.cmu.edu/~music/cmsip/readings/intro-discrete-event-sim.html): - * **Activity scanning** - "Basic building block is the activity. Model program's code segments consist of sequences of activities (operations) waiting to be executed. An activity sequence's conditions must be satisfied before it is executed. Simulation executive moves from event to event executing those activity sequences whose conditions are satisfied." - * **Event scheduling** - "Basic building block is the event. Model program's code segments consist of event routines waiting to be executed. Event routine associated with each event type -- performs required operations for that type. Simulation executive moves from event to event executing the corresponding event routines" - * **Process interaction** - "Basic building block is the process. System consists of a set of interacting processes. Model program's code for each process includes the operations that it carries out throughout its lifetime. Event sequence for system consists of merging of event sequences for all processes. Future event list consists of a sequence of event nodes (or notices). Each event node indicates the event time and process to which it belongs. 
Simulation executive carries out the following tasks: (a) placement of processes at particular points of time in the list (b) removal of processes from the event list (c) activation of the process corresponding to the next event node from the event list (d) rescheduling of processes in the event list. Typically, a process object can be in one of several states: (a) active -- process is currently executing There is only one such process in a system. (b) ready -- process is in event list, waiting for activation at a certain time (c) idle (or blocked) -- process is not in event list, but eligible to be be reactivated by some other entity (e.g., waiting for a passive resource) (d) terminated -- process has completed its sequence of actions, is not in event list, and cannot be reactivated" - * **However, still not really sure what this would be?** +**Uncertainties:** -Uncertainties from ISPOR SDM - +* `5.1 Software or programming language`: "Where frameworks and libraries have been used provide all details including version numbers." + * This part of criteria is **not** in the report, but is in the linked code. + * Passed them anyway, as other parts of this criteria (OS, version, build DES software, Python) were provided - and also, wouldn't think you would put this extra level of information in the report? +* `5.3 Model execution`: "State the event processing mechanism used e.g. three phase, event, activity, process interaction." + * **Feeling quite unclear on this**. Did some research onto it but that hasn't really cleared it up... + * Based answer on Tom's, but don't feel confident guessing at this for future ones + * Any **recommended resources** to explain this? -* 15 Is the model generalizability issue discussed? - have I been too harsh on this one? +::: {.callout-note appearance="minimal" collapse=true} -::: {.callout-tip} -## STRESS +## Event processing mechanism + +`Simulation - The Practice of Model Development and Use` (Stewart Robinson): + +* There are a few ways to model progress of time - one is time-slicing method, and other is discrete-event simulation (three phase method). "A number of machisms have been proposed for carrying out discrete-event simulation, among them are the **event-based, activity-based, process-based and three-phase approaches**. For a detailed discussion on these see Pidd (1998)" +* **Three phase** - Events are classified as B (bound or booked) or C (conditional). B events are state changes scheduled at a point in time (often relates to arrivals or finishing an activity). C events are state changes that depend on cidions of a model (e.g. needing to wait until operator is free, often relat to start of an activity) + +[Intro to DES](https://www.cs.cmu.edu/~music/cmsip/readings/intro-discrete-event-sim.html): + +* **Activity scanning** - "Basic building block is the activity. Model program's code segments consist of sequences of activities (operations) waiting to be executed. An activity sequence's conditions must be satisfied before it is executed. Simulation executive moves from event to event executing those activity sequences whose conditions are satisfied." +* **Event scheduling** - "Basic building block is the event. Model program's code segments consist of event routines waiting to be executed. Event routine associated with each event type -- performs required operations for that type. Simulation executive moves from event to event executing the corresponding event routines" +* **Process interaction** - "Basic building block is the process. 
System consists of a set of interacting processes. Model program's code for each process includes the operations that it carries out throughout its lifetime. Event sequence for system consists of merging of event sequences for all processes. Future event list consists of a sequence of event nodes (or notices). Each event node indicates the event time and process to which it belongs. Simulation executive carries out the following tasks: (a) placement of processes at particular points of time in the list (b) removal of processes from the event list (c) activation of the process corresponding to the next event node from the event list (d) rescheduling of processes in the event list. Typically, a process object can be in one of several states: (a) active -- process is currently executing There is only one such process in a system. (b) ready -- process is in event list, waiting for activation at a certain time (c) idle (or blocked) -- process is not in event list, but eligible to be be reactivated by some other entity (e.g., waiting for a passive resource) (d) terminated -- process has completed its sequence of actions, is not in event list, and cannot be reactivated" -Event processing mechanisms are unfamiliar, and a little hard to google. ::: +Then completed the guidelines adapted from ISPOR-SDM. + +**Uncertainties:** + +* `15 Is the model generalizability issue discussed?` + * Have I been too harsh on this one? Or do we want it to be explicitly addressed? + ### Comparison of my STRESS-DES against the original study Tom shared the completed STRESS-DES checklits from the original study, and I compared against my evaluation (bearing in mind that theirs relates to the DES model and the Mone-Carlo vehicle routing model, whilst mine just relates to the former). Amendments: @@ -55,12 +68,6 @@ Tom shared the completed STRESS-DES checklits from the original study, and I com -::: {.callout-tip} -## STRESS - -When reporting multiple models in a paper, the STRESS checklist can easily/accidentally just be applied to one of the models in some instances, and it can make it harder to distinguish what relates to which model. As such, might suggest creating seperate checklists for each model. -::: - ### Comparing best practice audit results with Monks and Harper Compared my decisions for the best practice audit against those made in Monks and Harper. 
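To make the event processing mechanisms quoted in the collapsible note above (event scheduling, activity scanning, process interaction, three-phase) a little more concrete, here is a minimal sketch of a toy event-scheduling executive. It is illustrative only - it is not taken from the original study or its repository, and the arrival/end-of-session handlers, distributions and numbers are all hypothetical.

```python
# Illustrative sketch only - not from the original study or its repository.
# A toy "event scheduling" executive, to make the STRESS-DES 5.3 terminology
# in the note above more concrete. All names and numbers are hypothetical.
import heapq
import random


def run(until=100, seed=42):
    rng = random.Random(seed)
    future_event_list = []  # entries: (time, sequence_number, name, handler)
    seq = 0
    in_session = 0  # toy state variable: number of patients currently in a session

    def schedule(time, name, handler):
        nonlocal seq
        heapq.heappush(future_event_list, (time, seq, name, handler))
        seq += 1

    def arrival(now):
        nonlocal in_session
        in_session += 1
        # In three-phase terms these are "B" (bound/booked) events: both are
        # scheduled for a known future time at the moment they are created.
        schedule(now + rng.expovariate(1 / 20), "arrival", arrival)
        schedule(now + rng.expovariate(1 / 15), "end_session", end_session)

    def end_session(now):
        nonlocal in_session
        in_session -= 1

    schedule(0.0, "arrival", arrival)

    # The executive repeatedly jumps to the next event time and runs the
    # matching event routine. A three-phase executive would add a "C" phase
    # here, re-scanning conditional events (e.g. only start a session if a
    # station is free) after each time jump; a process-interaction approach
    # (e.g. SimPy) would instead describe each patient's lifetime as a process.
    while future_event_list:
        now, _, name, handler = heapq.heappop(future_event_list)
        if now > until:
            break
        handler(now)
        print(f"t={now:6.1f}  {name:12s}  in session: {in_session}")


run()
```

The part that answers STRESS-DES 5.3 is the loop at the end: event scheduling pulls the next event routine off a future event list; three-phase would follow each time jump with a scan of conditional (C) events; process interaction would express each entity's lifetime as a resumable process.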
@@ -77,22 +84,17 @@ review = pd.read_csv('bp_audit_clean.csv') review[review['key'] == 'R4ZLYWPP'].T ``` -Same result: - -* DOI -* ORCID -* License -* Model overview -* Steps to run code -* Formal dependency management -* Local execution -* Remote execution - -Different result: - -* Link to published paper (I said yes as its within Zenodo, but they said no) **ask tom whether he think that is ok, or whether he thinks it needs to be within the files themselves?** -* Informal dependency management (they failed it as it was formal, but I will keep as passed, as the intention is to understand whether they had dependency management or not) -* Code testing (they said yes, I said no) **ask tom what they felt was evidence of testing?** +Different results: + +* `Link to published paper` + * I said yes as it's within Zenodo, but they said no + * I have amended my answer - on reflection, I agree that this should be part of the artefacts themselves and not just metadata on Zenodo, since the artefacts are what you download - but have kept this as "partially met" + * **Get a second opinion on this** +* Informal dependency management + * They failed it as it was formal, but I will keep it as passed, as the intention is to understand whether they had dependency management or not, so no changes to make here. +* Code testing + * They said yes but I said no + * **Ask Tom what they felt was evidence of testing?** + +(These differing decisions are summarised in the sketch below.) ## Suggested changes for protocol/template
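As a quick reference for the best practice audit comparison above, the differing decisions could be summarised in a small table. This is a minimal sketch rather than code from the evaluation repository, and the wording of each decision is paraphrased from the notes in this post rather than taken from the audit data files.

```python
# Minimal sketch (not from the evaluation repository): summarise the best
# practice audit items where my decisions differed from Monks and Harper.
# The values paraphrase the discussion above, not the recorded audit data.
import pandas as pd

differences = pd.DataFrame({
    "item": [
        "Link to published paper",
        "Informal dependency management",
        "Code testing",
    ],
    "my_decision": [
        "Partially met (amended on reflection)",
        "Met (environment.yaml doubles as a dependency list)",
        "Not met",
    ],
    "monks_and_harper": [
        "Not met",
        "Not met (dependency management was formal)",
        "Met",
    ],
    "follow_up": [
        "Get a second opinion",
        "No change needed",
        "Ask Tom what the evidence of testing was",
    ],
})

print(differences.to_string(index=False))
```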