docs: hiv #42
Conversation
I left some initial comments that mostly pertain to my different expectations for what can go in the dataset readmes. I am finding it hard to review the new script-based pipeline with respect to the original notebooks. I'm not sure if it is a pure port or a complete rewrite.
@@ -0,0 +1,15 @@
# HIV dataset
I would find it helpful to have more of the context from https://github.com/Reed-CompBio/spras-benchmarking/blob/0293ae4dc0be59502fac06b42cfd9796a4b4413e/hiv-benchmarking/README.md. What is the overall goal of this benchmarking dataset?
Now that we know more about the types of datasets in the SPRAS benchmark, we can categorize this one: it uses omic data as input and curated data as a gold standard, but we determined the curated data to be a poor fit, so it lacks a good gold standard.
See `Snakefile` for the way that all of the IO files are connected.

1. `fetch.py` - This grabs the score files from https://doi.org/10.1371/journal.ppat.1011492 - see `fetch.py` for more info.
1. `prepare.py` - This cleans up the prize files in `raw`; specifically to remove duplicates.
The original readme was more detailed about these processing steps. For instance, one of the steps removed isoform identifiers, which I don't see mentioned here. My goal is to be able to visit the readme for a benchmarking dataset and be able to write the methods section of the manuscript from that information without re-reading all the code. We'll get bogged down during writing if we have to read all the source to remember how we processed each dataset.
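To make the requested level of detail concrete, here is a minimal sketch of the kind of cleanup a `prepare.py` step might perform, assuming a tab-separated prize file with `Node` and `Prize` columns; the paths, column names, isoform-suffix convention, and duplicate-resolution rule are illustrative assumptions, not taken from the actual script:

```python
import pandas as pd

# Hypothetical prize table with a UniProt ID column and a prize column;
# the real files in `raw/` may use different names and separators.
prizes = pd.read_csv("raw/prizes.tsv", sep="\t")

# Strip isoform suffixes (e.g. P12345-2 -> P12345); the original notebooks
# reportedly removed isoform identifiers, though the exact rule is assumed here.
prizes["Node"] = prizes["Node"].str.replace(r"-\d+$", "", regex=True)

# Collapse duplicate nodes, keeping the largest prize for each
# (the tie-breaking rule is an assumption).
prizes = prizes.sort_values("Prize", ascending=False).drop_duplicates(subset="Node")

prizes.to_csv("prepared/prizes.tsv", sep="\t", index=False)
```

Writing the steps out at roughly this granularity in the README would be enough to reconstruct the methods section without rereading the code.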
This motivation makes sense - I was very confused about who the audience for a dataset-wide README is. My main worry is that, like the other READMEs present in this repository, we would end up with outdated documentation that doesn't actually match what's happening during processing, so I do want to ask a follow-up question (to determine how much we should worry about maintaining the README):
Would future users of SPRAS-benchmarking also write down their methodology in the README, or is it specifically for the first benchmarking paper?
(My question was answered in another comment)
@@ -0,0 +1,70 @@
from Bio.KEGG.KGML.KGML_parser import read
Were these moved or moved and edited?
Moved and lightly edited.
@@ -1,3 +1,8 @@
"""
This code is almost fully copied from https://www.uniprot.org/help/id_mapping_prog,
The original code intentionally avoided calling the UniProt API and used a downloaded mapping file instead. That is stable. The API will change over time and change our output. Is that desirable?
I would believe so - is the UniProt API notably unstable?
Once we start uploading data to OSDF, we can store previous artifacts of our downloaded files.
It's not the API, it is UniProt itself that is unstable. Information about proteins changes over time, so these mappings change over time. A consequence is that on some date we may go from one external file to N prizes, and at a later date end up with that same external file and M prizes, because a different set of external IDs can be mapped to valid UniProt IDs.
This makes sense, though I'm also hesitant about not having a way to know when information changes. Some APIs we may use will be versioned (e.g. in #39), but it seems that the UniProt API really isn't, so I also agree with committing the file.
It could still be worthwhile, though, to set up a system that queries these APIs, checks the file we currently have stored, and warns us if it is outdated.
Storing these auto-updating artifacts into OSDF seems like the best way to handle this.
I'm not sure how to get OSDF access, but I can absolutely implement something like the above (more formally described in #44) for UniProt.
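For illustration, a minimal sketch of what such a staleness check could look like, following the job-based ID mapping endpoints documented at https://www.uniprot.org/help/id_mapping_prog; the `Gene_Name` source database, the stored-mapping format, and the warn-only behavior are assumptions, not the design in #44:

```python
import time
import warnings

import requests

API = "https://rest.uniprot.org/idmapping"


def remap(ids: list[str], from_db: str = "Gene_Name", to_db: str = "UniProtKB") -> dict[str, str]:
    """Submit a small ID mapping job and return {source_id: primary accession}."""
    job = requests.post(f"{API}/run", data={"from": from_db, "to": to_db, "ids": ",".join(ids)}).json()["jobId"]
    while requests.get(f"{API}/status/{job}").json().get("jobStatus") in ("NEW", "RUNNING"):
        time.sleep(1)
    results = requests.get(f"{API}/results/{job}", params={"size": 500}).json()["results"]
    # If an ID maps to several accessions, this keeps only the last one reported.
    return {r["from"]: r["to"]["primaryAccession"] for r in results}


def warn_if_outdated(stored: dict[str, str]) -> None:
    """Warn (rather than fail) when the live API disagrees with the committed mapping file."""
    fresh = remap(sorted(stored))
    changed = sorted(k for k in stored if fresh.get(k) != stored[k])
    if changed:
        warnings.warn(f"{len(changed)} UniProt mappings differ from the committed file, e.g. {changed[:5]}")
```

This could run as a scheduled or pre-release check, so the committed file stays pinned while we still hear about upstream changes.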
@@ -0,0 +1,10 @@
# raw
I'm missing where the rest of the input files come from. Having it in fetch.py does not tell me where the inputs for this dataset come from and what they are.
I'm a little confused about this comment (though a clarifying paper link was missing, which I've just committed) - do you want the docs from `fetch.py` duplicated into `raw/README.md`? The inline and top-level comments in `fetch.py` describe more than the original HIV README did, including the origin of `ko03250.xml`.
We can call it duplication, but yes. I want the readme(s) to tell the story of the dataset. There will be multiple scripts needed to execute the benchmark, and I see two different goals:
- the code with good comments
- documenting the dataset globally so that we (and potentially external readers of a manuscript) can easily go from a dataset directory to an overall understanding of how all the pieces work together and where external information comes from, and then gather that information into a manuscript.
We could go through the exercise of writing one of the paragraphs for the benchmarking paper manuscript to see what information we need to expose.
This is the current level of detail I was going for in the benchmarking report:
Structure this section so each dataset explains what the data is and the biological context, where all the data came from, the sources/targets/prizes (include the biological roles of the inputs?), the IDs (explain why the translation is needed and why that ID type was chosen), the gold standard, the interactome, and the preprocessing for everything.
Co-authored-by: Anthony Gitter <[email protected]>
Adds documentation to the HIV dataset, and attaches a `fetch.py`.