Conversation

Contributor
@tristan-f-r commented Jul 30, 2025

Adds documentation to the HIV dataset, and attaches a fetch.py

Collaborator
@agitter left a comment


I left some initial comments that mostly pertain to my different expectations for what can go in the dataset readmes. I am finding it hard to review the new script-based pipeline with respect to the original notebooks. I'm not sure if it is a pure port or a complete rewrite.

@@ -0,0 +1,15 @@
# HIV dataset

Collaborator

I would find it helpful to have more of the context from https://github.com/Reed-CompBio/spras-benchmarking/blob/0293ae4dc0be59502fac06b42cfd9796a4b4413e/hiv-benchmarking/README.md. What is the overall goal of this benchmarking dataset?

Now that we know more about the types of datasets in the SPRAS benchmark, we can categorize it. This one uses omic data as input and curated data as a gold standard, but we determined the curated data to be a poor fit, so it lacks a good gold standard.

See `Snakefile` for the way that all of the IO files are connected.

1. `fetch.py` - This grabs the score files from https://doi.org/10.1371/journal.ppat.1011492 - see `fetch.py` for more info.
1. `prepare.py` - This cleans up the prize files in `raw`, specifically by removing duplicates.
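As a rough illustration of the deduplication step, a minimal sketch (the two-column node/prize file layout and the keep-the-largest-prize policy here are assumptions for illustration, not the actual `prepare.py`):

```python
import csv

def dedupe_prizes(raw_path, out_path):
    """Collapse duplicate node rows, keeping the largest prize per node.

    Assumes a two-column tab-separated file: node ID, prize score.
    """
    best = {}
    with open(raw_path, newline="") as f:
        for node, prize in csv.reader(f, delimiter="\t"):
            prize = float(prize)
            # Keep only the highest-scoring row for each node.
            if node not in best or prize > best[node]:
                best[node] = prize
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for node, prize in sorted(best.items()):
            writer.writerow([node, prize])
    return best
```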
Collaborator

The original readme was more detailed about these processing steps. For instance, one of the steps removed isoform identifiers, which I don't see mentioned here. My goal is to be able to visit the readme for a benchmarking dataset and be able to write the methods section of the manuscript from that information without re-reading all the code. We'll get bogged down during writing if we have to read all the source to remember how we processed each dataset.

Contributor Author
@tristan-f-r Aug 8, 2025

This motivation makes sense - I was very confused about who the audience for a dataset-wide README is. My main worry is that, like the other READMEs present in this repository, we would end up with outdated documentation that doesn't actually match what's happening during processing, so I do want to ask a follow-up question (to determine how much we should worry about maintaining the README):

Would future users of SPRAS-benchmarking also be writing down the methodology described in the README as well, or is it specifically for the first benchmarking paper?

Contributor Author
@tristan-f-r Aug 10, 2025

(My question was answered in another comment)

@@ -0,0 +1,70 @@
from Bio.KEGG.KGML.KGML_parser import read
Collaborator

Were these moved or moved and edited?

Contributor Author

Moved and lightly edited.

@@ -1,3 +1,8 @@
"""
This code is almost fully copied from https://www.uniprot.org/help/id_mapping_prog,
Collaborator

The original code intentionally avoided calling the UniProt API and used a downloaded mapping file instead. That is stable. The API will change over time and change our output. Is that desirable?
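For reference, building a stable ID map from a downloaded UniProt `idmapping.dat`-style file (three tab-separated columns: UniProt accession, ID type, external ID) might look like the sketch below; the `id_type` value and file path are illustrative, not taken from the original notebook.

```python
import csv

def load_uniprot_mapping(idmapping_path, id_type="KEGG"):
    """Build an external-ID -> UniProt accession map from a downloaded
    idmapping.dat-style file (accession, ID type, external ID per row).

    Because the file is committed rather than fetched live, the mapping
    never silently drifts as UniProt updates.
    """
    mapping = {}
    with open(idmapping_path, newline="") as f:
        for accession, db, external_id in csv.reader(f, delimiter="\t"):
            if db == id_type:
                mapping[external_id] = accession
    return mapping
```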

Contributor Author
@tristan-f-r Aug 8, 2025

I would believe so - is the UniProt API notably unstable?

Once we start uploading data to OSDF, we can store previous artifacts of our downloaded files.

Collaborator
@agitter Aug 9, 2025

It's not the API, it is UniProt itself that is unstable. Information about proteins changes over time, so these mappings change over time. A consequence is that on one date we may go from one external file to N prizes and at a later date end up with that same external file and M prizes, because different external IDs can be mapped to valid UniProt IDs.

Contributor Author
@tristan-f-r Aug 10, 2025

This makes sense, though I'm also hesitant about not having a way to know when information changes. Some APIs we may use will be versioned (e.g. in #39), but it seems that the UniProt API really isn't, so I also agree with committing the file.

Though, it could be worthwhile to set up a system that still queries these APIs, checks the file we currently have stored, and warns us if the files are outdated.
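A hedged sketch of that warn-on-drift idea (the checksum file and its location are hypothetical): fetch the upstream file, hash it, and compare against a checksum recorded when the committed copy was made.

```python
import hashlib
from pathlib import Path

def check_for_drift(fetched_bytes: bytes, checksum_path: str) -> bool:
    """Return True if the freshly fetched file matches the recorded
    sha256 checksum; warn and return False if upstream has changed."""
    fetched_digest = hashlib.sha256(fetched_bytes).hexdigest()
    recorded = Path(checksum_path).read_text().strip()
    if fetched_digest != recorded:
        # The committed copy no longer matches upstream; flag it for review.
        print(
            f"WARNING: upstream file changed (sha256 {fetched_digest[:12]}... "
            f"!= recorded {recorded[:12]}...); committed copy may be outdated"
        )
        return False
    return True
```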

Contributor Author
@tristan-f-r Aug 23, 2025

Storing these auto-updating artifacts into OSDF seems like the best way to handle this.

Contributor Author

I'm not sure how to get OSDF access, but I can absolutely implement something like above (more formally described in #44) for UniProt.

@@ -0,0 +1,10 @@
# raw
Collaborator

I'm missing where the rest of the input files come from. Having it in fetch.py does not tell me where the inputs for this dataset come from and what they are.

Contributor Author

I'm a little confused about this comment (though some clarification on a paper link was missing, which I've just committed) - do you want docs duplicated from fetch.py into raw/README.md? The inline and top-level comments in fetch.py describe more than the original hiv README did, including the origin of ko03250.xml.

Collaborator

We can call it duplication, but yes. I want the readme(s) to tell the story of the dataset. There will be multiple scripts needed to execute the benchmark, and I see two different goals:

  1. the code with good comments
  2. documenting the dataset globally, so that we (and potentially external readers of a manuscript) can easily translate from a dataset directory to an overall understanding of how all the pieces work together and where external information comes from, and gather that information into a manuscript.

We could go through the exercise of writing one of the paragraphs for the benchmarking paper manuscript to see what information we need to expose.

Contributor

This is the current level of detail I was going for in the benchmarking report:

Structure this section so each dataset explains what the data is and the biological context, where all the data came from, the sources/targets/prizes (including the biological roles of the inputs?), the IDs (explain why the translation is needed and why the chosen ID was chosen), the gold standard, the interactome, and the preprocessing for everything.

@tristan-f-r tristan-f-r mentioned this pull request Aug 11, 2025