Conversation

Contributor
@tristan-f-r commented Jul 30, 2025

Adds documentation to the HIV dataset, and attaches a fetch.py

Collaborator
@agitter left a comment


I left some initial comments that mostly pertain to my different expectations for what can go in the dataset readmes. I am finding it hard to review the new script-based pipeline with respect to the original notebooks. I'm not sure if it is a pure port or a complete rewrite.

@@ -0,0 +1,15 @@
# HIV dataset

Collaborator

I would find it helpful to have more of the context from https://github.com/Reed-CompBio/spras-benchmarking/blob/0293ae4dc0be59502fac06b42cfd9796a4b4413e/hiv-benchmarking/README.md. What is the overall goal of this benchmarking dataset?

Now that we know more about the types of datasets in the SPRAS benchmark, we can categorize it. This one uses omic data as input and curated data as a gold standard, but we determined the curated data to be a poor fit, so it lacks a good gold standard.

See `Snakefile` for the way that all of the IO files are connected.

1. `fetch.py` - This grabs the score files from https://doi.org/10.1371/journal.ppat.1011492 - see `fetch.py` for more info.
1. `prepare.py` - This cleans up the prize files in `raw`, specifically by removing duplicates.
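As a rough illustration of the deduplication step, a minimal sketch (the two-column node/prize file layout and the keep-the-largest-prize policy here are assumptions for illustration, not the actual `prepare.py`):

```python
import csv

def dedupe_prizes(raw_path, out_path):
    """Collapse duplicate node rows, keeping the largest prize per node.

    Assumes a two-column tab-separated file: node ID, prize score.
    """
    best = {}
    with open(raw_path, newline="") as f:
        for node, prize in csv.reader(f, delimiter="\t"):
            prize = float(prize)
            # Keep only the highest-scoring row for each node.
            if node not in best or prize > best[node]:
                best[node] = prize
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for node, prize in sorted(best.items()):
            writer.writerow([node, prize])
    return best
```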
Collaborator

The original readme was more detailed about these processing steps. For instance, one of the steps removed isoform identifiers, which I don't see mentioned here. My goal is to be able to visit the readme for a benchmarking dataset and be able to write the methods section of the manuscript from that information without re-reading all the code. We'll get bogged down during writing if we have to read all the source to remember how we processed each dataset.

Contributor Author
@tristan-f-r Aug 8, 2025

This motivation makes sense - I was very confused about who the audience for a dataset-wide README is. My main worry is that, like the other READMEs present in this repository, we would end up with outdated documentation that doesn't actually match what's happening during processing, so I do want to ask a follow-up question (to determine how much we should worry about maintaining the README):

Would future users of SPRAS-benchmarking also be writing down the methodology described in the README as well, or is it specifically for the first benchmarking paper?

Contributor Author
@tristan-f-r Aug 10, 2025

(My question was answered in another comment)

@@ -0,0 +1,70 @@
from Bio.KEGG.KGML.KGML_parser import read
Collaborator

Were these moved or moved and edited?

Contributor Author

Moved and lightly edited.

@@ -1,3 +1,8 @@
"""
This code is almost fully copied from https://www.uniprot.org/help/id_mapping_prog,
Collaborator

The original code intentionally avoided calling the UniProt API and used a downloaded mapping file instead. That is stable. The API will change over time and change our output. Is that desirable?
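For reference, building a stable ID map from a downloaded UniProt `idmapping.dat`-style file (three tab-separated columns: UniProt accession, ID type, external ID) might look like the sketch below; the `id_type` value and file path are illustrative, not taken from the original notebook.

```python
import csv

def load_uniprot_mapping(idmapping_path, id_type="KEGG"):
    """Build an external-ID -> UniProt accession map from a downloaded
    idmapping.dat-style file (accession, ID type, external ID per row).

    Because the file is committed rather than fetched live, the mapping
    never silently drifts as UniProt updates.
    """
    mapping = {}
    with open(idmapping_path, newline="") as f:
        for accession, db, external_id in csv.reader(f, delimiter="\t"):
            if db == id_type:
                mapping[external_id] = accession
    return mapping
```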

Contributor Author
@tristan-f-r Aug 8, 2025

I would believe so - is the UniProt API notably unstable?

Once we start uploading data to OSDF, we can store previous artifacts of our downloaded files.

Collaborator
@agitter Aug 9, 2025

It's not the API, it is UniProt itself that is unstable. Information about proteins changes over time, so these mappings change over time. A consequence is that on one date we may go from one external file to N prizes and at a later date end up with that same external file and M prizes, because different external IDs can be mapped to valid UniProt IDs.

Contributor Author
@tristan-f-r Aug 10, 2025

This makes sense, though I'm also hesitant about not having a way to know when information changes. Some APIs we may use will be versioned (e.g. in #39), but it seems that the UniProt API really isn't, so I also agree with committing the file.

Though, it could be worthwhile to set up a system that still queries these APIs, checks the file we currently have stored, and warns us if the files are outdated.
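A hedged sketch of that warn-on-drift idea (the checksum file and its location are hypothetical): fetch the upstream file, hash it, and compare against a checksum recorded when the committed copy was made.

```python
import hashlib
from pathlib import Path

def check_for_drift(fetched_bytes: bytes, checksum_path: str) -> bool:
    """Return True if the freshly fetched file matches the recorded
    sha256 checksum; warn and return False if upstream has changed."""
    fetched_digest = hashlib.sha256(fetched_bytes).hexdigest()
    recorded = Path(checksum_path).read_text().strip()
    if fetched_digest != recorded:
        # The committed copy no longer matches upstream; flag it for review.
        print(
            f"WARNING: upstream file changed (sha256 {fetched_digest[:12]}... "
            f"!= recorded {recorded[:12]}...); committed copy may be outdated"
        )
        return False
    return True
```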

Contributor Author
@tristan-f-r Aug 23, 2025

Storing these auto-updating artifacts into OSDF seems like the best way to handle this.

Contributor Author

I'm not sure how to get OSDF access, but I can absolutely implement something like above (more formally described in #44) for UniProt.

@@ -0,0 +1,10 @@
# raw
Collaborator

I'm missing where the rest of the input files come from. Having it in fetch.py does not tell me where the inputs for this dataset come from and what they are.

Contributor Author

I'm a little confused about this comment (though some clarification on a paper link was missing, which I've just committed) - do you want docs duplicated from fetch.py into raw/README.md? The inline and top-level comments in fetch.py describe more than the original hiv README did, including the origin of ko03250.xml.

Collaborator

We can call it duplication, but yes. I want the readme(s) to tell the story of the dataset. There will be multiple scripts needed to execute the benchmark, and I see two different goals:

  1. the code with good comments
  2. documenting the dataset globally, so that we (and potentially external readers of a manuscript) can easily translate from a dataset directory to an overall understanding of how all the pieces work together and where external information comes from, and gather that information into a manuscript.

We could go through the exercise of writing one of the paragraphs for the benchmarking paper manuscript to see what information we need to expose.

Contributor

This is the current level of detail I was going for in the benchmarking report:

Structure this section so each dataset explains what the data is and the biological context, where all the data came from, the sources/targets/prizes (including the biological roles of the inputs?), the IDs (explain why the translation is needed and why the chosen ID was chosen), the gold standard, the interactome, and the preprocessing for everything.

@tristan-f-r tristan-f-r mentioned this pull request Aug 11, 2025