@AMINOexe
Contributor

@AMINOexe AMINOexe commented Jul 29, 2025

Added DISEASES database benchmarking.

@agitter
Collaborator

agitter commented Jul 29, 2025

One quick general thing I noticed is that the macOS .DS_Store files are included. We should add a .gitignore pattern to exclude those from the repo.

@tristan-f-r tristan-f-r added the dataset Mutating datasets in any way. label Jul 29, 2025
@tristan-f-r tristan-f-r changed the title Diseases dataset dataset: diseases Jul 30, 2025
@tristan-f-r tristan-f-r changed the title dataset: diseases dataset: DISEASES Jul 30, 2025
Contributor

@tristan-f-r tristan-f-r left a comment

We can also drop dataPre.ipynb once we transfer the comments over. I've gone in and committed some pathlib changes to make this correct regardless of CWD, and added some comments where I'm confident.
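
As a rough illustration of the kind of pathlib change described above, paths can be resolved relative to the script file rather than the current working directory. This is only a sketch: the helper name `resolve_data_path` and the `raw/example.tsv` layout are hypothetical and are not taken from this PR.

```python
from pathlib import Path


def resolve_data_path(script_file: str, *parts: str) -> Path:
    """Build a path relative to the directory that contains script_file.

    The script location is made absolute first, so the result does not
    change when the process is started from a different directory.
    """
    script_dir = Path(script_file).resolve().parent
    return script_dir.joinpath(*parts)
```

Inside a script, this would typically be called as `resolve_data_path(__file__, "raw", "example.tsv")`, so the data file is found no matter where the script is invoked from.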

text_mining.columns = ["geneID", "geneName", "diseaseID", "diseaseName", "zScore", "confidenceScore", "sourceUrl"]
knowledge.columns = ["geneID", "geneName", "diseaseID", "diseaseName", "sourceDB", "evidenceType", "confidenceScore"]

# The DISEASES data is in the ENSP namespace, but we want to work in ENSG.
Contributor

@tristan-f-r tristan-f-r Jul 30, 2025

[why?]
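
For context, the namespace conversion that the code comment mentions amounts to a lookup from protein IDs (ENSP) to gene IDs (ENSG). Below is a minimal sketch in plain Python, assuming a hypothetical `ensp_to_ensg` lookup table. The PR itself may do this differently (for example with a DataFrame merge), and the identifiers shown are placeholders, not real Ensembl IDs.

```python
def map_to_ensg(rows, ensp_to_ensg):
    """Replace each row's ENSP geneID with its ENSG ID; drop unmapped rows."""
    mapped = []
    for row in rows:
        ensg = ensp_to_ensg.get(row["geneID"])
        if ensg is None:
            continue  # no gene known for this protein ID in the lookup
        mapped.append({**row, "geneID": ensg})
    return mapped


# Placeholder identifiers for illustration only.
lookup = {"ENSP_A": "ENSG_1", "ENSP_B": "ENSG_1"}
rows = [
    {"geneID": "ENSP_A", "diseaseName": "example disease"},
    {"geneID": "ENSP_C", "diseaseName": "example disease"},
]
print(map_to_ensg(rows, lookup))
```

Several proteins can map to the same gene, so rows may need de-duplicating after the conversion; the sketch leaves that decision open.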

Comment on lines +46 to +50
# GS_group = GS_ids.groupby('diseaseName')
# GS_dict = {k:v for k,v in GS_group}
# GS_count = {x:len(GS_dict[x]) for x in GS_dict.keys()}
# print('count quantiles: ',stats.quantiles(GS_count.values()))
# print('score quantiles: ',stats.quantiles(GS_ids['confidenceScore']))
Contributor

@tristan-f-r tristan-f-r Jul 30, 2025

Do we need this for anything? This is a leftover of the jupyter notebook, and we can just leave it in git history instead.

@tristan-f-r tristan-f-r mentioned this pull request Jul 30, 2025
@tristan-f-r tristan-f-r mentioned this pull request Aug 10, 2025
@ntalluri
Contributor

Is there any information on how to use the scripts and the data? Something similar to Grace's Cell line ReadMe would be helpful.

@tristan-f-r
Contributor

tristan-f-r commented Aug 11, 2025

There isn't. I don't believe this even has a Snakefile attached, either. I wasn't able to go through the same documentation process as in #42 with this PR, so this is something to ask @AMINOexe for.

Also, my above review is stale: while I didn't ask about general code workings, there is a stable fetch.py attached now.

Contributor

Is there a way to have these downloaded locally instead of requesting them from the site directly? (I'm not sure how stable the requests are, or whether this is the only way to do it.)

Contributor

@agitter mentioned this earlier. I forgot to talk to Justin about using OSDF for this, since I'm worried that this repository's clone size is becoming too big.

Collaborator

I mentioned an idea to @tristan-f-r that we may want to experiment with having a local cache of these datasets somewhere (e.g. Google Drive?) while we work on the full plan of how to use OSDF. That would keep the large files out of GitHub and under our control. Before migrating to OSDF, we'll need to establish 1) what storage to use and 2) how to organize that storage (paths, versioning, etc.).
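
One possible shape for the local-cache idea discussed in this thread is a download helper that only hits the remote site when the file is not already cached. This is a sketch under assumptions, not the PR's `fetch.py`: the function name `fetch_cached`, its parameters, and the optional checksum check are all hypothetical, and it does not address where a shared cache would live.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path
from typing import Optional


def fetch_cached(url: str, cache_path: Path, sha256: Optional[str] = None) -> Path:
    """Return cache_path, downloading from url only if the file is missing.

    If sha256 is given, the cached file must match it or ValueError is raised.
    """
    if not cache_path.exists():
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        tmp_path = cache_path.with_name(cache_path.name + ".part")
        with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
            shutil.copyfileobj(response, out)
        # Rename last, so only a complete download gets the final name.
        tmp_path.replace(cache_path)

    if sha256 is not None:
        digest = hashlib.sha256(cache_path.read_bytes()).hexdigest()
        if digest != sha256:
            raise ValueError(f"checksum mismatch for {cache_path}")
    return cache_path
```

Pinning a checksum per file would also give a simple form of versioning: if the upstream file changes, the mismatch is reported instead of silently altering the benchmark inputs.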
