Commit f237018

Merge pull request #6 from omics-datascience/string
String
2 parents 0a1c7ec + 679a46a commit f237018

9 files changed: +418 -111 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -28,3 +28,6 @@ borrar.py
2828
.venv/
2929
databases/gene_info/.Rhistory
3030
databases/pharmGKB/CREATED_*
31+
databases/string/protein.links.full.txt.gz
32+
databases/gene_ontology/logs.json
33+
databases/string/protein.info.txt.gz

DEPLOYING.md

Lines changed: 6 additions & 4 deletions
@@ -66,7 +66,7 @@ To import all databases in MongoDB:
6666
Alternatively (but **not recommended** due to high computational demands) you can run a separate ETL process to download from source, process and import the databases into MongoDB.
6767
6868
1. Install the necessary requirements:
69-
- [R languaje](https://www.r-project.org/). Version 4.1.2 or later (Only necessary if you want to update the Gene information database from Ensembl)
69+
- [R language](https://www.r-project.org/). Version 4.1.2 or later (Only necessary if you want to update the Gene information database from Ensembl)
7070
- Some python packages. They can be installed using:
7171
`pip install -r config/genomic_db_conf/requirements.txt`
7272
2. The ETL process is programmed in a single bash script for each database. Edit in the bash file of the database that you want to update the **user** and **password** parameters, using the same values that you set in the `docker-compose.yml` file. Bash files can be found in the *'databases'* folder, within the corresponding directories for each database:
@@ -76,7 +76,8 @@ Alternatively (but **not recommended** due to high computational demands) you ca
7676
- For Gene information ([Ensembl genomic data](https://www.ensembl.org/biomart/martview/), [RefSeq gene summaries](https://www.ncbi.nlm.nih.gov/refseq/), and [CiVIC gene descriptions](https://civicdb.org/welcome)) use "databases/gene_info" directory and the *geneinfo2mongodb.sh* file.
7777
- For Oncokb cancer genes and drug information, it is necessary to download some datasets from their [official site](https://www.oncokb.org/actionableGenes) (**registration required**). You need to download the _Therapeutic, Diagnostic, and Prognostic_ dataset from [Actionable Genes page](https://www.oncokb.org/actionableGenes) by clicking the _Association button_. Place it within the directory "databases/oncokb" with the name "oncokb_biomarker_drug_associations.tsv". Then, download the dataset from the [Cancer Genes](https://www.oncokb.org/cancerGenes) page by clicking the _Cancer Gene List_ button. Place it within the same directory as above, with the name "cancerGeneList.tsv". Finally, execute the `oncokb2mongodb.sh` script to load both datasets into MongoDB.
7878
- For cancer related drugs ([Pharmacogenomics Knowledge Base (PharmGKB) ](https://www.pharmgkb.org/)) use "databases\pharmGKB" directory and the *pharmgkb2mongodb.sh* file.
79-
- For Gene ontology ([Gene Ontology (GO)](http://geneontology.org/use/)) "databases\gene_ontology" directory and the *go2mongodb.sh* file. **NOTE:** This import needs the "Gene nomenclature" databases (2) already imported to properly process the gene ontology databases
79+
- For Gene ontology ([Gene Ontology (GO)](http://geneontology.org/)) use "databases\gene_ontology" directory and the *go2mongodb.sh* file. **NOTE:** This import needs the "Gene nomenclature" databases (2) already imported to properly process the gene ontology databases
80+
- For the predicted functional associations network (String), you need to download some datasets manually from their [official site](https://string-db.org/cgi/download). Make sure that the **selected organism is Homo sapiens** (the file sizes should be in Mb). From "INTERACTION DATA", download "protein network data (full network, incl. distinction: direct vs. interologs)" and rename it to "protein.links.full.txt.gz". Then, from "ACCESSORY DATA", download "list of STRING proteins incl. their display names and descriptions" and rename it to "protein.aliases.txt.gz". Place the 2 files in the "databases\string" directory and use the *string2mongodb.sh* file.
8081
3. Run bash files.
8182
`./<file.sh>`
8283
where file.sh can be *cpdb2mongodb.sh*, *hgnc2mongodb.sh*, *gtex2mongodb.sh*, *go2mongodb.sh*, *pharmgkb2mongodb.sh*, or *ensembl_gene2mongodb.sh*, as appropriate.
@@ -110,10 +111,11 @@ Where *\<service\>* could be `nginx`, `web` or `mongo`.
110111

111112
## Update genomic databases
112113
If new versions are released for the genomic databases included in BioAPI, you can update them by following the instructions below:
113-
- For the "Metabolic pathways (ConsensusPathDB)", "Gene nomenclature (HUGO Gene Nomenclature Committee)", "Gene ontology (GO)", "Cancer related drugs (PharmGKB)","Gene information (from Ensembl and CiVIC)" and "Cancer and Accionable genes (OncoKB)" databases, it is not necessary to make any modifications to any script. This is because the datasets are automatically downloaded in their most up-to-date versions when the bash file for each database is executed as described in the **Manually import the different databases** section of this file.
114+
- For the "Metabolic pathways (ConsensusPathDB)", "Gene nomenclature (HUGO Gene Nomenclature Committee)", "Gene ontology (GO)", "Cancer related drugs (PharmGKB)","Gene information (from Ensembl and CiVIC)" and "Cancer and Actionable genes (OncoKB)" databases, it is not necessary to make any modifications to any script. This is because the datasets are automatically downloaded in their most up-to-date versions when the bash file for each database is executed as described in the **Manually import the different databases** section of this file.
114115
**Important notes**:
115116
- For OncoKB the download is not automatic since it requires registration, but the steps to download them manually are explained in the same section mentioned above.
116-
- For RefSeq gene summaries, the R package [GeneSummary](https://bioconductor.org/packages/release/data/annotation/html/GeneSummary.html) is used. The update of the database will depend on the version that the package includes.
117+
- For RefSeq gene summaries, the R package [GeneSummary](https://bioconductor.org/packages/release/data/annotation/html/GeneSummary.html) is used. The update of the database will depend on the version that the package includes.
118+
- For String the download is not automatic, but the steps to download the files manually are explained in the same section mentioned above.
117119
- If you need to update the "Gene expression (Genotype-Tissue Expression)" database, you should also follow the procedures in the section named above, but first you should edit the bash file as follows:
118120
1. Modify the **gtex2mongodb.sh** file. Edit the variables *"expression_url"* and *"annotation_url"*.
119121
1. In the *expession_url* variable, set the url corresponding to the GTEx "RNA-Seq Data" compressed file (gz compression). This file should contain the Gene TPMs values (Remember that Gene expression on the GTEx Portal are shown in Transcripts Per Million or TPMs).

README.md

Lines changed: 88 additions & 38 deletions
@@ -463,25 +463,25 @@ as significant. Must be a float. Not recommended to set it higher than 0.05.
463463
- Code: 200
464464
- Content:
465465
The response you get is a list. Each element of the list is a GO term that fulfills the conditions of the query. GO terms can contain name, definition, relations to other terms, etc.
466-
- `go_id`: Unique identifier.
467-
- `name`: human-readable term name.
468-
- `ontology_type`: Denotes which of the three sub-ontologies (cellular component, biological process or molecular function) the term belongs to.
469-
- `definition`: A textual description of what the term represents, plus reference(s) to the source of the information.
466+
- `<go_id>`: Unique identifier.
467+
- `<name>`: human-readable term name.
468+
- `<ontology_type>`: Denotes which of the three sub-ontologies (cellular component, biological process or molecular function) the term belongs to.
469+
- `<definition>`: A textual description of what the term represents, plus reference(s) to the source of the information.
470470
- relations to other terms: Each go term can be related to many other terms wit a [variety of relations](http://geneontology.org/docs/ontology-relations/).
471-
- `synonyms`: Alternative words or phrases closely related in meaning to the term name, with indication of the relationship between the name and synonym given by the synonym scope.
472-
- `subset`: Indicates that the term belongs to a designated subset of terms.
473-
- `relations_to_genes`: list of elements of type Json. Each element corresponds to a to a gene and how it's related to the term.
474-
- `gene`: name of the gene.
475-
- `relation_type`: the type of relation between the gene and the GO term. When `filter_type` is enrichment, extra relation will be gather from g:Profiler database. These relations will be shown as "relation obtained from gProfiler".
476-
- `evidence`: evidence code to indicate how the annotation to a particular term is supported.
477-
- `enrichment_metrics`: .
478-
- `p_value`: Hypergeometric p-value after correction for multiple testing.
479-
- `intersection_size`: The number of genes in the query that are annotated to the corresponding term.
480-
- `effective_domain_size`: The total number of genes "in the universe " which is used as one of the four parameters for the hypergeometric probability function of statistical significance.
481-
- `query_size`: The number of genes that were included in the query.
482-
- `term_size`: The number of genes that are annotated to the term.
483-
- `precision`: The proportion of genes in the input list that are annotated to the function. Defined as intersection_size/query_size.
484-
- `recall`: The proportion of functionally annotated genes that the query recovers. Defined as intersection_size/term_size.
471+
- `<synonyms>`: Alternative words or phrases closely related in meaning to the term name, with indication of the relationship between the name and synonym given by the synonym scope.
472+
- `<subset>`: Indicates that the term belongs to a designated subset of terms.
473+
- `<relations_to_genes>`: list of elements of type Json. Each element corresponds to a to a gene and how it's related to the term.
474+
- `<gene>`: name of the gene.
475+
- `<relation_type>`: the type of relation between the gene and the GO term. When `filter_type` is enrichment, extra relation will be gather from g:Profiler database. These relations will be shown as "relation obtained from gProfiler".
476+
- `<evidence>`: evidence code to indicate how the annotation to a particular term is supported.
477+
- `<enrichment_metrics>`: .
478+
- `<p_value>`: Hypergeometric p-value after correction for multiple testing.
479+
- `<intersection_size>`: The number of genes in the query that are annotated to the corresponding term.
480+
- `<effective_domain_size>`: The total number of genes "in the universe " which is used as one of the four parameters for the hypergeometric probability function of statistical significance.
481+
- `<query_size>`: The number of genes that were included in the query.
482+
- `<term_size>`: The number of genes that are annotated to the term.
483+
- `<precision>`: The proportion of genes in the input list that are annotated to the function. Defined as intersection_size/query_size.
484+
- `<recall>`: The proportion of functionally annotated genes that the query recovers. Defined as intersection_size/term_size.
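As a minimal sketch of the two ratios defined above, the following Python helper computes `precision` and `recall` from the size fields. The function name `enrichment_ratios` is hypothetical and not part of BioAPI; only the formulas come from the field descriptions.

```python
def enrichment_ratios(intersection_size: int, query_size: int, term_size: int) -> dict:
    """Compute precision and recall as described for `enrichment_metrics`.

    Hypothetical helper for illustration only, not part of BioAPI.
    """
    if query_size <= 0 or term_size <= 0:
        raise ValueError("query_size and term_size must be positive")
    return {
        # Share of the query genes that are annotated to the term.
        "precision": intersection_size / query_size,
        # Share of the term's annotated genes that the query recovers.
        "recall": intersection_size / term_size,
    }
```

For example, a query of 20 genes of which 5 are annotated to a term with 50 annotated genes gives a precision of 5/20 and a recall of 5/50.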
485485
- Example:
486486
- URL: http://localhost:8000/genes-to-terms
487487
- body:
@@ -528,20 +528,20 @@ Gets the list of related terms to a term.
528528
- URL: /related-terms
529529
- Method: POST
530530
- Params: A body in Json format with the following content
531-
- `term_id`: the term if of the term you want to search
532-
- `relations`: filters the non-hierarchical relations between terms. By default it's ["part_of","regulates","has_part"]. It should always be a list
533-
- `ontology_type`: filters the ontology type of the terms in the response. By default it's ["biological_process", "molecular_function", "cellular_component"]It should always be a list containing any permutation of the default relations
534-
- `general_depth`: the search depth with the non-hierarchical relations
535-
- `hierarchical_depth_to_children`: the search depth with the hierarchical relations in the direction of the children
531+
- `term_id`: The term ID of the term you want to search
532+
- `relations`: Filters the non-hierarchical relations between terms. By default it's ["part_of","regulates","has_part"]. It should always be a list
533+
- `ontology_type`: Filters the ontology type of the terms in the response. By default it's ["biological_process", "molecular_function", "cellular_component"]It should always be a list containing any permutation of the default relations
534+
- `general_depth`: The search depth for the non-hierarchical relations
535+
- `hierarchical_depth_to_children`: The search depth for the hierarchical relations in the direction of the children
536536
- `to_root`: 0 for false 1 fot true. If true get all the terms in the hierarchical relations in the direction of the root
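The parameters above could be assembled into a request body along these lines. This is a sketch, not BioAPI code: the helper name `build_related_terms_body` is hypothetical, and it assumes that any key left out of the body falls back to the server-side default described above.

```python
ONTOLOGY_TYPES = ["biological_process", "molecular_function", "cellular_component"]


def build_related_terms_body(term_id, relations=None, ontology_type=None,
                             general_depth=None,
                             hierarchical_depth_to_children=None,
                             to_root=None):
    """Build a JSON-serializable body for POST /related-terms.

    Hypothetical helper; optional keys are omitted when not given, on the
    assumption that the server then applies its documented defaults.
    """
    body = {"term_id": term_id}
    if relations is not None:
        if not isinstance(relations, list):
            raise TypeError("relations should always be a list")
        body["relations"] = relations
    if ontology_type is not None:
        if not isinstance(ontology_type, list):
            raise TypeError("ontology_type should always be a list")
        unknown = set(ontology_type) - set(ONTOLOGY_TYPES)
        if unknown:
            raise ValueError(f"unknown ontology types: {sorted(unknown)}")
        body["ontology_type"] = ontology_type
    if general_depth is not None:
        body["general_depth"] = general_depth
    if hierarchical_depth_to_children is not None:
        body["hierarchical_depth_to_children"] = hierarchical_depth_to_children
    if to_root is not None:
        # The documentation encodes the flag as 0 (false) or 1 (true).
        body["to_root"] = 1 if to_root else 0
    return body
```

A call such as `build_related_terms_body("GO:0000001", relations=["part_of"], to_root=True)` uses a placeholder term ID and keeps only the keys that were explicitly set.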
537537
- Success Response:
538538
- Code: 200
539539
- Content: The response you get is a list of GO terms related to the searched term that fulfills the conditions of the query. Each term has:
540-
- `go_id`: id of the GO term
541-
- `name`: name of the GO term
542-
- `ontology_type`: the ontology that the GO term belongs to
543-
- `relations`: dictionary of relations
544-
- `relation type`: list of terms related by that relation type to the term
540+
- `<go_id>`: ID of the GO term
541+
- `<name>`: Name of the GO term
542+
- `<ontology_type>`: The ontology that the GO term belongs to
543+
- `<relations>`: Dictionary of relations
544+
- `<relation type>`: List of terms related by that relation type to the term
545545
- Example:
546546
- URL: http://localhost:8000/related-terms
547547
- body:
@@ -574,7 +574,7 @@ Gets the list of related terms to a term.
574574

575575
### Cancer related drugs (PharmGKB)
576576

577-
Gets the list of related drugs to a list of genes.
577+
Gets a list of related drugs to a list of genes.
578578

579579
- URL: /drugs-pharm-gkb
580580
- Method: POST
@@ -583,14 +583,14 @@ Gets the list of related drugs to a list of genes.
583583
- Success Response:
584584
- Code: 200
585585
- Content: The response you get is a list of genes containing the related drug information
586-
- `pharmGKB_id`: Identifier assigned to this drug label by PharmGKB
587-
- `name`: Name assigned to the label by PharmGKB
588-
- `source`: The source that originally authored the label (e.g. FDA, EMA)
589-
- `biomarker_flag`: "On" if drug in this label appears on the FDA Biomarker list; "Off (Formerly On)" if the label was on the FDA Biomarker list at one time; "Off (Never On)" if the label was never listed on the FDA Biomarker list (to PharmGKB's knowledge)
590-
- `Testing Level`: PGx testing level as annotated by PharmGKB based on definitions at https://www.pharmgkb.org/page/drugLabelLegend
591-
- `Chemicals`: Related chemicals
592-
- `Genes`: List of related genes
593-
- `Variants-Haplotypes`: Related variants and/or haplotypes
586+
- `<pharmGKB_id>`: Identifier assigned to this drug label by PharmGKB
587+
- `<name>`: Name assigned to the label by PharmGKB
588+
- `<source>`: The source that originally authored the label (e.g. FDA, EMA)
589+
- `<biomarker_flag>`: "On" if drug in this label appears on the FDA Biomarker list; "Off (Formerly On)" if the label was on the FDA Biomarker list at one time; "Off (Never On)" if the label was never listed on the FDA Biomarker list (to PharmGKB's knowledge)
590+
- `<Testing Level>`: PGx testing level as annotated by PharmGKB based on definitions at https://www.pharmgkb.org/page/drugLabelLegend
591+
- `<Chemicals>`: Related chemicals
592+
- `<Genes>`: List of related genes
593+
- `<Variants-Haplotypes>`: Related variants and/or haplotypes
594594
- Example:
595595
- URL: http://localhost:8000/drugs-pharm-gkb
596596
- body:
@@ -613,7 +613,57 @@ Gets the list of related drugs to a list of genes.
613613
}
614614
]
615615
}
616-
```
616+
```
617+
618+
### Predicted functional associations network (String)
619+
620+
Gets a list of genes and relations related to a gene.
621+
- URL: /string-relations
622+
- Method: POST
623+
- Params: A body in Json format with the following content
624+
- `gene_id`: target gene
625+
- `min_combined_score`: the minimum combined score allowed in the relations. Possible scores range from 1 to 1000
626+
- Success Response:
627+
- Code: 200
628+
- Content: The response you get is a list of relations containing the targeted gene
629+
- `<gene_1>`: Gene 1 in the bidirectional relationship
630+
- `<gene_2>`: Gene 2 in the bidirectional relationship
631+
- `<neighborhood>`: Optional. Values range from 1 to 1000
632+
- `<neighborhood_transferred>`: Optional. Values range from 1 to 1000
633+
- `<fusion>`: Optional. Values range from 1 to 1000
634+
- `<cooccurence>`: Optional. Values range from 1 to 1000
635+
- `<homology>`: Optional. Values range from 1 to 1000
636+
- `<coexpression>`: Optional. Values range from 1 to 1000
637+
- `<coexpression_transferred>`: Optional. Values range from 1 to 1000
638+
- `<experiments>`: Optional. Values range from 1 to 1000
639+
- `<experiments_transferred>`: Optional. Values range from 1 to 1000
640+
- `<database>`: Optional. Values range from 1 to 1000
641+
- `<database_transferred>`: Optional. Values range from 1 to 1000
642+
- `<textmining>`: Optional. Values range from 1 to 1000
643+
- `<textmining_transferred>`: Optional. Values range from 1 to 1000
644+
- `<combined_score>`: Values range from 1 to 1000
645+
646+
- Example:
647+
- URL: http://localhost:8000/string-relations
648+
- body:
649+
`{ "gene_id" : "MX2", "min_combined_score": 996 }`
650+
- Response:
651+
```json
652+
[
653+
{
654+
"coexpression": 558,
655+
"coexpression_transferred": 825,
656+
"combined_score": 997,
657+
"database": 900,
658+
"experiments_transferred": 149,
659+
"gene_1": "OASL",
660+
"gene_2": "MX2",
661+
"textmining": 652,
662+
"textmining_transferred": 257
663+
}
664+
]
665+
```
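A client for this endpoint might prepare the request body and post-process the response roughly as follows. This is only a sketch: the helper names `build_string_request` and `partners_by_score` are hypothetical, and the code relies solely on the `gene_1`, `gene_2` and `combined_score` fields described above.

```python
def build_string_request(gene_id: str, min_combined_score: int) -> dict:
    """Build the JSON body for POST /string-relations (hypothetical helper)."""
    # The documentation states that scores go from 1 to 1000.
    if not 1 <= min_combined_score <= 1000:
        raise ValueError("min_combined_score must be between 1 and 1000")
    return {"gene_id": gene_id, "min_combined_score": min_combined_score}


def partners_by_score(relations: list, gene_id: str) -> list:
    """Return (partner_gene, combined_score) pairs, highest score first.

    Relations are bidirectional, so the target gene may appear as either
    `gene_1` or `gene_2`; the other side is reported as the partner.
    """
    partners = []
    for relation in relations:
        if relation["gene_1"] == gene_id:
            partner = relation["gene_2"]
        elif relation["gene_2"] == gene_id:
            partner = relation["gene_1"]
        else:
            continue  # Skip relations that do not involve the target gene.
        partners.append((partner, relation["combined_score"]))
    return sorted(partners, key=lambda pair: pair[1], reverse=True)
```

The body returned by `build_string_request("MX2", 996)` can be sent with any HTTP client, and the decoded JSON list can be passed to `partners_by_score` to list the partner genes of the target.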
666+
617667
## Error Responses
618668

619669
The possible error codes are 400, 404 and 500. The content of each of them is a Json with a unique key called "error" where its value is a description of the problem that produces the error. For example:
