gene retrieval via API #7

Vincent-Ustach · 2025-02-01T21:22:23Z

Copy code from protein retrieval example notebook into importable methods.

Wrap methods in fastapi app

retrieval can also be performed from new script that uses the same methods

add loguru and argparse retrieval script

quantize model before move to GPU

remove drugbank data

docstrings

move protein embeddings loading to startup update logger messages

retrieval script

rcalef

Hi Vinnie, thanks so much for helping us out by sharing your changes! This seems like a great way for users to interact with the pre-trained model with reduced overhead for ongoing experimentation.

I left a couple of small comments and questions, but only one other high-level ask: would it be possible to add a README.md to the procyon/app folder containing some basic instructions? I think this could essentially be exactly the same text as the main docstring in procyon/app/main.py, just in a more easily discoverable place.

Otherwise, looks great! I think you may need to rebase on top of some of our more recent changes (let me know if you have any questions about any merge conflicts), then it should be good to squash merge into main!

rcalef · 2025-02-10T19:11:26Z

procyon/inference/retrieval_utils.py

+def startup_retrieval(
+    inference_bool: bool = True,
+) -> Tuple[
+    Union[UnifiedProCyon, None],


I'd been using e.g. Optional[UnifiedProCyon], so googled what's preferred, and it appears the current best practice for a value that could be None is to express it as e.g. UnifiedProCyon | None

rcalef · 2025-02-10T19:25:28Z

procyon/inference/retrieval_utils.py

+
+    logger.info("Now quantizing the model to a smaller precision")
+    model.bfloat16()  # Quantize the model to a smaller precision
+    logger.info("Done quantizing the model to a smaller precision")


Possibly personal opinion, but the logging seems quite verbose. It might be nice to log some of these messages at the debug level instead and add a command-line option to turn them on. I haven't used loguru before, but it looks like some combination of logger.add and logger.remove should do it.

rcalef · 2025-02-10T19:27:47Z

procyon/inference/retrieval_utils.py

+                retrieval=True,
+                aaseq_type="protein",
+            )
+        # The script can run up to here without a GPU, but the following line requires a GPU


Let me know if running this on the GPU is a headache. If so, we can change get_proteins_from_embedding to operate on a specified device; I don't think it has to run on the GPU.

rcalef · 2025-02-10T19:30:41Z

scripts/protein_retrieval_disease_pheno.py

+        task_desc_infile (Path): The path to the file containing the task description.
+        disease_desc_infile (Path): The path to the file containing the disease description.
+        instruction_source_dataset (str): Dataset source for instructions - either "disgenet" or "omim"
+        inference_bool (bool): OPTIONAL; choose this if you do not intend to do inference


Just to understand, what's the use case for inference_bool being False here? I may just be missing something, but not really seeing what benefit it has other than perhaps testing that the CLI works
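
For context, the commit messages ("store false for inference_bool", "inference_bool default True", "task_desc_infile pathlib path") suggest an argparse setup along these lines. This is a guess at the shape, not the script's actual code: the flag name `--no_inference` and the helper `build_parser` are hypothetical.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical sketch of the retrieval CLI")
    parser.add_argument("--task_desc_infile", type=Path, required=True)
    parser.add_argument(
        "--no_inference",  # hypothetical flag name
        dest="inference_bool",
        action="store_false",  # value is True unless the flag is passed
        help="skip model loading and inference, e.g. to smoke-test the CLI",
    )
    return parser


args = build_parser().parse_args(["--task_desc_infile", "task.txt"])
print(args.inference_bool)  # True, because --no_inference was not passed
```

Under this reading, setting `inference_bool` to False only exercises argument parsing and file handling without loading the model.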

rcalef · 2025-02-10T19:31:49Z

procyon/app/main.py

+if __name__ == "__main__":
+    """
+    This API endpoint will allow users to perform protein retrieval for a given disease description using the 
+        pre-trained ProCyon model Procyon-Full.


Suggested change

pre-trained ProCyon model Procyon-Full.

pre-trained ProCyon model ProCyon-Full.

Vincent-Ustach added 30 commits January 16, 2025 16:08

initial commit

e7476df

add loguru and argparse retrieval script

store false for inference_bool

a4eb270

require task_desc_infile

3749b33

task_desc_infile pathlib path

a719456

inference_bool default True

a149c13

inference_bool typo

0dc40c3

import DataArgs

6d30f24

set model device data_args to None

9df5b98

Merge pull request #2 from mims-harvard/main

7a7ed70

quantize model before move to GPU

read disease desc

dda1bca

Merge branch 'refs/heads/main' into retrieval_script

43a3218

reduce precision of model before loading to device

b18d3d6

copy retrieval script

c0e4881

rename files

a589308

if no model load no create_input_retrieval

824b3c2

remove unused imports

1205024

if no model no create_input_retrieval

54943dd

remove unused imports

3a92807

remove drugbank data

black formatting

694b729

refactor with startup and do methods for later api

48f8fb9

docstrings

4844320

bug in calling startup_retrieval

6a7b4f1

fastapi app

4f8368e

add more required env vars

4e5bb42

move utils to inference/retrieval_utils.py

643cc6f

fix imports for retrieval_utils

f4299b1

typo

397d631

update comment

01c887f

instruction_source_dataset passed as argument

1b3b262

docstrings

update docstring for app

f4fa800

Vincent-Ustach added 19 commits January 19, 2025 07:17

bug w repeated model.to_device() commands

7c6619d

return top k

58ff2a6

by default return all records

f30352f

fillna

7a9edd7

remove disease from input description. remove unused imports in app.

491d950

remove disease from input description. remove unused imports in app.

6067989

delete drug script

3c60c9a

update docstrings

00de80f

move protein embeddings loading to startup update logger messages

remove exception catch

6947b3b

move args in script

6bbd355

move args in call to do_retrieval

f9e0b39

move args in app call to do_retrieval

187de6f

try again w args in script

00c7929

default of omim

dcdafc3

declare all_protein_embeddings as global in retrieve_proteins

3fcf39a

remove exceptions from highest level of app

a7aeb6b

duplicate loguru removed from pyproject.toml

7aa2698

black formatting

cdb23fc

Merge pull request #1 from GeneDx/retrieval_script

7e3631e

retrieval script

rcalef approved these changes Feb 10, 2025
