dataset: DISEASES #39

AMINOexe · 2025-07-29T04:24:39Z

Added DISEASES database benchmarking.

agitter · 2025-07-29T13:23:08Z

One quick general thing I noticed is that the macOS .DS_Store files are included. We should add a .gitignore pattern to exclude those from the repo.
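A minimal sketch of that change, assuming the ignore file sits at the repository root. The helper name `ensure_ignored` is hypothetical and not part of this PR. Copies of `.DS_Store` that are already committed would still need to be untracked separately (for example with `git rm --cached`).

```python
from pathlib import Path


def ensure_ignored(gitignore: Path, pattern: str = ".DS_Store") -> bool:
    """Append `pattern` to the ignore file unless it is already listed.

    Returns True if the file was modified, False if nothing changed.
    """
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    if pattern in (line.strip() for line in existing):
        return False
    # Keep the existing entries and add the new pattern on its own line.
    existing.append(pattern)
    gitignore.write_text("\n".join(existing) + "\n")
    return True
```

A bare `.DS_Store` entry (no slash) matches the file in every directory of the repository.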

tristan-f-r

We can also drop dataPre.ipynb once we transfer the comments over. I've committed some pathlib changes so that paths resolve correctly regardless of CWD, and added comments where I'm confident.
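For readers unfamiliar with the pattern, here is a rough sketch of CWD-independent path handling. It assumes each script lives in `<dataset>/scripts/`; the helper name `dataset_path` and the `raw` directory are hypothetical and may not match what was actually committed.

```python
from pathlib import Path


def dataset_path(script_file: str, *parts: str) -> Path:
    """Resolve `parts` relative to the dataset directory.

    Assumes the script lives in <dataset>/scripts/, so the dataset
    directory is the script's grandparent.
    """
    dataset_dir = Path(script_file).resolve().parent.parent
    return dataset_dir.joinpath(*parts)


# Inside a script, pass __file__ so the result does not depend on the CWD:
# raw_dir = dataset_path(__file__, "raw")
```

Anchoring on the script's own location means the same script behaves identically whether it is launched from the repository root or from its own directory.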

tristan-f-r · 2025-07-30T05:46:17Z

datasets/diseases/scripts/gold_standard.py

+    text_mining.columns = ["geneID", "geneName", "diseaseID", "diseaseName", "zScore", "confidenceScore", "sourceUrl"]
+    knowledge.columns = ["geneID", "geneName", "diseaseID", "diseaseName", "sourceDB", "evidenceType", "confidenceScore"]
+
+    # The DISEASES data is in the ENSP namespace, but we want to work in ENSG.
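The namespace conversion mentioned in that comment can be sketched as a lookup against a protein-to-gene table. This is an illustration only: the function name `to_gene_namespace`, the choice to drop unmapped rows, and the placeholder IDs are assumptions, and the source of the mapping is not shown in this excerpt.

```python
def to_gene_namespace(rows, protein_to_gene):
    """Replace each row's ENSP `geneID` with the mapped ENSG ID.

    Rows whose protein ID has no entry in the mapping are dropped
    (an assumption; the real script may handle them differently).
    """
    converted = []
    for row in rows:
        gene_id = protein_to_gene.get(row["geneID"])
        if gene_id is None:
            continue
        # Copy the row so the caller's data is left untouched.
        converted.append({**row, "geneID": gene_id})
    return converted


# Placeholder identifiers, not real Ensembl IDs.
mapping = {"ENSP_A": "ENSG_A", "ENSP_B": "ENSG_B"}
rows = [
    {"geneID": "ENSP_A", "diseaseName": "example disease"},
    {"geneID": "ENSP_X", "diseaseName": "example disease"},
]
converted = to_gene_namespace(rows, mapping)
```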


tristan-f-r · 2025-07-30T05:46:41Z

datasets/diseases/scripts/gold_standard.py

+    # GS_group = GS_ids.groupby('diseaseName')
+    # GS_dict = {k:v for k,v in GS_group}
+    # GS_count = {x:len(GS_dict[x]) for x in GS_dict.keys()}
+    # print('count quantiles: ',stats.quantiles(GS_count.values()))
+    # print('score quantiles: ',stats.quantiles(GS_ids['confidenceScore']))


Do we need this for anything? It's a leftover from the Jupyter notebook, and we can leave it in the git history instead.

datasets/diseases/scripts/inputs.py

ntalluri · 2025-08-11T18:17:07Z

Is there any information on how to use the scripts and the data? Something similar to Grace's Cell line ReadMe would be helpful.

tristan-f-r · 2025-08-11T20:15:48Z

There isn't. I don't believe this even has a Snakefile attached, either. I wasn't able to go through the same documentation process as in #42 with this PR, so this is something to ask @AMINOexe for.

Also, my above review is stale: while I didn't ask about general code workings, there is a stable fetch.py attached now.

ntalluri · 2025-10-24T18:53:12Z

datasets/diseases/scripts/fetch.py

Is there a way to store these files locally instead of requesting them from the site directly? (I'm not sure how stable the requests are, or whether this is the only way to do this.)

tristan-f-r · 2025-10-24

@agitter mentioned this earlier. I forgot to talk to Justin about using OSDF for this, since I'm worried that this repository's clone size is becoming too big.

agitter · 2025-10-24

I mentioned an idea to @tristan-f-r that we may want to experiment with keeping a local cache of these datasets somewhere (e.g. Google Drive?) while we work on the full plan for using OSDF. That would keep the large files out of GitHub and under our control. Before migrating to OSDF, we'll need to establish 1) what storage to use and 2) how to organize that storage (paths, versioning, etc.).
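As a starting point for that experiment, a cache-first fetch could look roughly like the sketch below. It does not use OSDF or Google Drive; it only shows a versioned local layout (`<cache_dir>/<version>/<filename>`). The function name `fetch_cached`, its parameters, and that layout are all assumptions rather than anything in `fetch.py`.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_cached(url, cache_dir, version, download=None) -> Path:
    """Return a local copy of `url`, downloading only on a cache miss.

    Files are stored as <cache_dir>/<version>/<filename> (hypothetical layout).
    `download(url, dest)` can be swapped out, e.g. for tests or another backend.
    """
    if download is None:
        download = lambda u, dest: urlretrieve(u, str(dest))
    filename = url.rstrip("/").rsplit("/", 1)[-1]
    target = Path(cache_dir) / version / filename
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name so an interrupted transfer
    # is never mistaken for a complete cached file.
    tmp = target.with_name(target.name + ".part")
    download(url, tmp)
    tmp.replace(target)
    return target
```

Keying the directory on a version string keeps older snapshots available after the upstream files change, which is one way to address the versioning question raised above.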

AMINOexe added 4 commits July 15, 2025 11:52

added Diseases data and processing

15c1ba8

added diseases vizualization

5945aa2

generated gold standard files

8b69ac7

refactored disease scripts

3c01555

tristan-f-r added the "dataset" label (Mutating datasets in any way.) Jul 29, 2025

tristan-f-r and others added 7 commits July 29, 2025 10:07

Merge branch 'main' into diseases_dataset

c196a58

fix: rm DS store

3028762

updated diseases config

96bf124

updated diseases config

9039d6e

style: fmt

bea5d95

style: fmt

31a4c27

fix: drop dsstore

0923d1d

tristan-f-r changed the title ~~Diseases dataset~~ dataset: diseases Jul 30, 2025

tristan-f-r added 2 commits July 29, 2025 18:32

Merge branch 'main' into diseases_dataset

ba4c568

refactor: begin moving aroun

a36f3ea

tristan-f-r changed the title ~~dataset: diseases~~ dataset: DISEASES Jul 30, 2025

chore: some cleanup

bf56e7b

tristan-f-r requested changes Jul 30, 2025


tristan-f-r mentioned this pull request Jul 30, 2025

dataset: DepMap #41

Open

tristan-f-r added 2 commits July 30, 2025 10:07

feat: tiga / DO fetching

5e7bf8e

fix: use provided cwd path for string

3db1d02

tristan-f-r mentioned this pull request Aug 10, 2025

docs: hiv #42

Open

ntalluri reviewed Oct 24, 2025



5 participants
