dataset: DISEASES #39
base: main
One quick general thing I noticed is that the macOS |
We can also drop dataPre.ipynb once we transfer the comments over. I've gone in and committed some pathlib changes to make this correct regardless of CWD, and added some comments where I'm confident.
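For anyone following along, the CWD-independent pattern mentioned above can be sketched roughly like this. The helper name `resolve_from` and the `raw` folder are hypothetical and are not taken from the PR; the idea is only that paths are anchored to the script's own location rather than to wherever the script is launched from.

```python
from pathlib import Path


def resolve_from(anchor_file: str, *parts: str) -> Path:
    """Return an absolute path built relative to the directory of `anchor_file`.

    `resolve_from` is a hypothetical helper name used for illustration only.
    """
    # Resolve the anchor first so the result does not depend on the CWD.
    return Path(anchor_file).resolve().parent.joinpath(*parts)


# In a script, pass __file__ as the anchor, for example:
#   raw_dir = resolve_from(__file__, "raw")
```

With an absolute anchor such as `__file__` in a normally launched script, the returned path stays the same no matter which directory the command is run from.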
text_mining.columns = ["geneID", "geneName", "diseaseID", "diseaseName", "zScore", "confidenceScore", "sourceUrl"]
knowledge.columns = ["geneID", "geneName", "diseaseID", "diseaseName", "sourceDB", "evidenceType", "confidenceScore"]

# The DISEASES data is in the ENSP namespace, but we want to work in ENSG.
[why?]
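To make the question concrete, a namespace conversion of the kind that code comment describes might look like the sketch below. This is not the PR's implementation: the function name `map_ensp_to_ensg`, the lookup table, and the identifiers are all placeholders, and where the real ENSP-to-ENSG mapping comes from is not stated in this thread.

```python
def map_ensp_to_ensg(rows, ensp_to_ensg):
    """Rewrite each row's geneID from an ENSP ID to an ENSG ID.

    Rows whose protein ID has no entry in the lookup table are dropped.
    """
    mapped = []
    for row in rows:
        ensg = ensp_to_ensg.get(row["geneID"])
        if ensg is None:
            continue  # unmapped protein ID; a real pipeline may want to log these
        mapped.append({**row, "geneID": ensg})
    return mapped


# Placeholder identifiers for illustration only, not real Ensembl records.
example_rows = [
    {"geneID": "ENSP_A", "diseaseID": "DOID:1"},
    {"geneID": "ENSP_B", "diseaseID": "DOID:2"},
]
example_lookup = {"ENSP_A": "ENSG_A"}
converted = map_ensp_to_ensg(example_rows, example_lookup)
```

Dropping unmapped rows silently is a design choice of this sketch; counting or logging them would make any data loss from the conversion visible.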
# GS_group = GS_ids.groupby('diseaseName')
# GS_dict = {k:v for k,v in GS_group}
# GS_count = {x:len(GS_dict[x]) for x in GS_dict.keys()}
# print('count quantiles: ',stats.quantiles(GS_count.values()))
# print('score quantiles: ',stats.quantiles(GS_ids['confidenceScore']))
Do we need this for anything? This is a leftover from the Jupyter notebook, and we can just leave it in git history instead.
Is there any information on how to use the scripts and the data? Something similar to Grace's Cell line ReadMe would be helpful.
There isn't. I don't believe this even has a Snakefile attached, either. I wasn't able to go through the same documentation process as in #42 with this PR, so this is something to ask @AMINOexe for. Also, my above review is stale: while I didn't ask about general code workings, there is a stable |
Is there a way to have these downloaded locally instead of requesting them from the site directly? (I'm not sure how stable the requests are, or whether this is the only way to do this.)
@agitter mentioned this earlier. I forgot to talk to Justin about using OSDF for this, since I'm worried that this repository's clone size is becoming too big.
I mentioned an idea to @tristan-f-r that we may want to experiment with having a local cache of these datasets somewhere (e.g. Google Drive?) while we work on the full plan of how to use OSDF. That would keep the large files out of GitHub and under our control. Before migrating to OSDF, we'll need to establish 1) what storage to use and 2) how to organize that storage (paths, versioning, etc.).
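As a rough illustration of the local-cache idea, here is a minimal download-once sketch using only the standard library. The function name `fetch_cached`, the cache layout, and the file naming are assumptions for discussion; none of this reflects a decision about OSDF or any other storage backend.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_cached(url: str, cache_dir: Path, filename: str) -> Path:
    """Download `url` to `cache_dir/filename` unless it is already cached.

    `fetch_cached` is a hypothetical helper, not part of this repository.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if target.exists():
        return target  # cache hit: no network request is made

    # Write to a temporary name first so an interrupted download
    # never leaves a partial file under the final name.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target
```

This sketch does no versioning or checksum verification, so a stale or corrupted cached file would be reused as-is; that ties into the open question above about how the storage should be organized.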
added DISEASE db benchmarking