omics-datascience
diff --git a/.gitignore b/.gitignore
Lines changed: 3 additions & 0 deletions
diff --git a/DEPLOYING.md b/DEPLOYING.md
Lines changed: 6 additions & 4 deletions
diff --git a/README.md b/README.md
Lines changed: 88 additions & 38 deletions
@@ -28,3 +28,6 @@ borrar.py
 .venv/
 databases/gene_info/.Rhistory
 databases/pharmGKB/CREATED_*
+databases/string/protein.links.full.txt.gz
+databases/gene_ontology/logs.json
+databases/string/protein.info.txt.gz
@@ -66,7 +66,7 @@ To import all databases in MongoDB:
 Alternatively (but **not recommended** due to high computational demands) you can run a separate ETL process to download from source, process and import the databases into MongoDB.
 
 1. Install the necessary requirements:  
-    - [R languaje](https://www.r-project.org/). Version 4.1.2 or later (Only necessary if you want to update the Gene information database from Ensembl)
+    - [R language](https://www.r-project.org/). Version 4.1.2 or later (Only necessary if you want to update the Gene information database from Ensembl)
     - Some python packages. They can be installed using:  
         `pip install -r config/genomic_db_conf/requirements.txt`  
 2. The ETL process is programmed in a single bash script for each database. Edit in the bash file of the database that you want to update the **user** and **password** parameters, using the same values that you set in the `docker-compose.yml` file. Bash files can be found in the *'databases'* folder, within the corresponding directories for each database:  
@@ -76,7 +76,8 @@ Alternatively (but **not recommended** due to high computational demands) you ca
     - For Gene information ([Ensembl genomic data](https://www.ensembl.org/biomart/martview/), [RefSeq gene summaries](https://www.ncbi.nlm.nih.gov/refseq/), and [CiVIC gene descriptions](https://civicdb.org/welcome)) use "databases/gene_info" directory and the *geneinfo2mongodb.sh* file.  
     - For Oncokb cancer genes and drug information, it is necessary to download some datasets from their [official site](https://www.oncokb.org/actionableGenes) (**registration required**). You need to download the _Therapeutic, Diagnostic, and Prognostic_ dataset from [Actionable Genes page](https://www.oncokb.org/actionableGenes) by clicking the _Association button_. Place it within the directory "databases/oncokb" with the name "oncokb_biomarker_drug_associations.tsv". Then, download the dataset from the [Cancer Genes](https://www.oncokb.org/cancerGenes) page by clicking the _Cancer Gene List_ button. Place it within the same directory as above, with the name "cancerGeneList.tsv". Finally, execute the `oncokb2mongodb.sh` script to load both datasets into MongoDB.
     - For cancer related drugs ([Pharmacogenomics Knowledge Base (PharmGKB) ](https://www.pharmgkb.org/))  use "databases\pharmGKB" directory and the *pharmgkb2mongodb.sh* file.
-	- For Gene ontology ([Gene Ontology (GO)](http://geneontology.org/use/)) "databases\gene_ontology" directory and the *go2mongodb.sh* file. **NOTE:** This import needs the "Gene nomenclature" databases (2) already imported to properly process the gene ontology databases
+	- For Gene ontology ([Gene Ontology (GO)](http://geneontology.org/)) use "databases\gene_ontology" directory and the *go2mongodb.sh* file. **NOTE:** This import needs the "Gene nomenclature" databases (2) already imported to properly process the gene ontology databases
+    - For the predicted functional associations network (String), it is necessary to download some datasets from their [official site](https://string-db.org/cgi/download). Make sure that the **selected organism is Homo sapiens** (the file sizes should then be in the MB range). From "INTERACTION DATA", download "protein network data (full network, incl. distinction: direct vs. interologs)" and rename it to "protein.links.full.txt.gz". Then, from "ACCESSORY DATA", download "list of STRING proteins incl. their display names and descriptions" and rename it to "protein.info.txt.gz". Place the 2 files in the "databases\string" directory and use the *string2mongodb.sh* file.
 3. Run bash files.  
     `./<file.sh>`  
     where file.sh can be *cpdb2mongodb.sh*, *hgnc2mongodb.sh*, *gtex2mongodb.sh*, *go2mongodb.sh*, *pharmgkb2mongodb.sh*, or *ensembl_gene2mongodb.sh*, as appropriate.  
@@ -110,10 +111,11 @@ Where  *\<service\>* could be `nginx`, `web` or `mongo`.
 
 ## Update genomic databases
 If new versions are released for the genomic databases included in BioAPI, you can update them by following the instructions below:  
-- For the "Metabolic pathways (ConsensusPathDB)", "Gene nomenclature (HUGO Gene Nomenclature Committee)", "Gene ontology (GO)", "Cancer related drugs (PharmGKB)","Gene information (from Ensembl and CiVIC)" and "Cancer and Accionable genes (OncoKB)" databases, it is not necessary to make any modifications to any script. This is because the datasets are automatically downloaded in their most up-to-date versions when the bash file for each database is executed as described in the **Manually import the different databases** section of this file.  
+- For the "Metabolic pathways (ConsensusPathDB)", "Gene nomenclature (HUGO Gene Nomenclature Committee)", "Gene ontology (GO)", "Cancer related drugs (PharmGKB)","Gene information (from Ensembl and CiVIC)" and "Cancer and Actionable genes (OncoKB)" databases, it is not necessary to make any modifications to any script. This is because the datasets are automatically downloaded in their most up-to-date versions when the bash file for each database is executed as described in the **Manually import the different databases** section of this file.  
 **Important notes**: 
   - For OncoKB the download is not automatic since it requires registration, but the steps to download them manually are explained in the same section mentioned above.  
-  - For RefSeq gene summaries, the R package [GeneSummary](https://bioconductor.org/packages/release/data/annotation/html/GeneSummary.html) is used. The update of the database will depend on the version that the package includes.   
+  - For RefSeq gene summaries, the R package [GeneSummary](https://bioconductor.org/packages/release/data/annotation/html/GeneSummary.html) is used. The update of the database will depend on the version that the package includes. 
+  - For String, the download is not automatic, but the steps to download the files manually are explained in the same section mentioned above.
 - If you need to update the "Gene expression (Genotype-Tissue Expression)" database, you should also follow the procedures in the section named above, but first you should edit the bash file as follows:  
     1. Modify the **gtex2mongodb.sh** file. Edit the variables *"expression_url"* and *"annotation_url"*.
     1. In the *expression_url* variable, set the URL corresponding to the GTEx "RNA-Seq Data" compressed file (gz compression). This file should contain the Gene TPMs values (remember that gene expression on the GTEx Portal is shown in Transcripts Per Million, or TPMs).
 
@@ -463,25 +463,25 @@ as significant. Must be a float. Not recommended to set it higher than 0.05.
     - Code: 200
     - Content:
 		The response you get is a list. Each element of the list is a GO term that fulfills the conditions of the query. GO terms can contain name, definition, relations to other terms, etc.
-        - `go_id`: Unique identifier. 
-        - `name`: human-readable term name. 
-        - `ontology_type`: Denotes which of the three sub-ontologies (cellular component, biological process or molecular function) the term belongs to. 
-        - `definition`: A textual description of what the term represents, plus reference(s) to the source of the information. 
+        - `<go_id>`: Unique identifier. 
+        - `<name>`: human-readable term name. 
+        - `<ontology_type>`: Denotes which of the three sub-ontologies (cellular component, biological process or molecular function) the term belongs to. 
+        - `<definition>`: A textual description of what the term represents, plus reference(s) to the source of the information. 
         - relations to other terms: Each GO term can be related to many other terms with a [variety of relations](http://geneontology.org/docs/ontology-relations/). 
-        - `synonyms`: Alternative words or phrases closely related in meaning to the term name, with indication of the relationship between the name and synonym given by the synonym scope. 
-        - `subset`: Indicates that the term belongs to a designated subset of terms. 
-        - `relations_to_genes`: list of elements of type Json. Each element corresponds to a to a gene and how it's related to the term.  
-            - `gene`: name of the gene.
-            - `relation_type`: the type of relation between the gene and the GO term. When `filter_type` is enrichment, extra relation will be gather from g:Profiler database. These relations will be shown as "relation obtained from gProfiler".
-            - `evidence`: evidence code to indicate how the annotation to a particular term is supported.
-        - `enrichment_metrics`: .  
-            - `p_value`: Hypergeometric p-value after correction for multiple testing. 
-            - `intersection_size`: The number of genes in the query that are annotated to the corresponding term.
-            - `effective_domain_size`: The total number of genes "in the universe " which is used as one of the four parameters for the hypergeometric probability function of statistical significance.  
-            - `query_size`: The number of genes that were included in the query.   
-            - `term_size`: The number of genes that are annotated to the term.    
-            - `precision`: The proportion of genes in the input list that are annotated to the function. Defined as intersection_size/query_size. 
-            - `recall`: The proportion of functionally annotated genes that the query recovers. Defined as intersection_size/term_size.
+        - `<synonyms>`: Alternative words or phrases closely related in meaning to the term name, with indication of the relationship between the name and synonym given by the synonym scope. 
+        - `<subset>`: Indicates that the term belongs to a designated subset of terms. 
+        - `<relations_to_genes>`: List of elements of type Json. Each element corresponds to a gene and how it's related to the term.  
+            - `<gene>`: name of the gene.
+            - `<relation_type>`: the type of relation between the gene and the GO term. When `filter_type` is enrichment, extra relations will be gathered from the g:Profiler database. These relations will be shown as "relation obtained from gProfiler".
+            - `<evidence>`: evidence code to indicate how the annotation to a particular term is supported.
+        - `<enrichment_metrics>`: Metrics describing the enrichment of the term:  
+            - `<p_value>`: Hypergeometric p-value after correction for multiple testing. 
+            - `<intersection_size>`: The number of genes in the query that are annotated to the corresponding term.
+            - `<effective_domain_size>`: The total number of genes "in the universe", which is used as one of the four parameters for the hypergeometric probability function of statistical significance.  
+            - `<query_size>`: The number of genes that were included in the query.   
+            - `<term_size>`: The number of genes that are annotated to the term.    
+            - `<precision>`: The proportion of genes in the input list that are annotated to the function. Defined as intersection_size/query_size. 
+            - `<recall>`: The proportion of functionally annotated genes that the query recovers. Defined as intersection_size/term_size.
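
The `precision` and `recall` definitions above amount to two divisions over the reported sizes. A minimal Python sketch follows; the helper name `enrichment_ratios` and the counts used are hypothetical and are not taken from any BioAPI response.

```python
def enrichment_ratios(intersection_size, query_size, term_size):
    """Recompute precision and recall from the sizes in enrichment_metrics.

    precision = intersection_size / query_size
    recall    = intersection_size / term_size
    """
    if query_size <= 0 or term_size <= 0:
        raise ValueError("query_size and term_size must be positive")
    precision = intersection_size / query_size
    recall = intersection_size / term_size
    return precision, recall


if __name__ == "__main__":
    # Made-up counts: 5 of the 20 query genes are annotated to a term
    # that has 50 annotated genes in total.
    print(enrichment_ratios(intersection_size=5, query_size=20, term_size=50))
    # prints (0.25, 0.1)
```

A check like this can be handy when comparing the values returned in `enrichment_metrics` against the raw sizes in the same response.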
     - Example:
         - URL: http://localhost:8000/genes-to-terms
         - body: 
@@ -528,20 +528,20 @@ Gets the list of related terms to a term.
 - URL: /related-terms
 - Method: POST
 - Params: A body in Json format with the following content
-	-  `term_id`: the term if of the term you want to search
-	-  `relations`: filters the non-hierarchical relations between terms. By default it's ["part_of","regulates","has_part"]. It should always be a list 
-	- `ontology_type`: filters the ontology type of the terms in the response. By default it's ["biological_process", "molecular_function", "cellular_component"]It should always be a list containing any permutation of the default relations
-	-  `general_depth`: the search depth with the non-hierarchical relations
-	-  `hierarchical_depth_to_children`: the search depth with the hierarchical relations in the direction of the children
+	-  `term_id`: The term ID of the term you want to search
+	-  `relations`: Filters the non-hierarchical relations between terms. By default it's ["part_of","regulates","has_part"]. It should always be a list 
+	- `ontology_type`: Filters the ontology type of the terms in the response. By default it's ["biological_process", "molecular_function", "cellular_component"]. It should always be a list containing any combination of the default values
+	-  `general_depth`: The search depth for the non-hierarchical relations
+	-  `hierarchical_depth_to_children`: The search depth for the hierarchical relations in the direction of the children
 	-  `to_root`: 0 for false, 1 for true. If true, gets all the terms in the hierarchical relations in the direction of the root
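
The parameters listed above can be assembled into a request body as sketched below. This is only an illustration: the function name `build_related_terms_payload`, the client-side checks, and the placeholder `"TERM_ID"` are assumptions, and parameters left as `None` are simply omitted so that the server-side defaults apply.

```python
import json

# Ontology types named in the parameter description above.
ONTOLOGY_TYPES = {"biological_process", "molecular_function", "cellular_component"}


def build_related_terms_payload(term_id, relations=None, ontology_type=None,
                                general_depth=None,
                                hierarchical_depth_to_children=None,
                                to_root=None):
    """Build a JSON-serializable body for POST /related-terms (hypothetical helper)."""
    payload = {"term_id": term_id}
    if relations is not None:
        if not isinstance(relations, list):
            raise TypeError("relations must be a list")
        payload["relations"] = relations
    if ontology_type is not None:
        if not isinstance(ontology_type, list):
            raise TypeError("ontology_type must be a list")
        unknown = set(ontology_type) - ONTOLOGY_TYPES
        if unknown:
            raise ValueError(f"unknown ontology types: {sorted(unknown)}")
        payload["ontology_type"] = ontology_type
    if general_depth is not None:
        payload["general_depth"] = general_depth
    if hierarchical_depth_to_children is not None:
        payload["hierarchical_depth_to_children"] = hierarchical_depth_to_children
    if to_root is not None:
        if to_root not in (0, 1):
            raise ValueError("to_root must be 0 (false) or 1 (true)")
        payload["to_root"] = to_root
    return payload


if __name__ == "__main__":
    # "TERM_ID" is a placeholder, not a real GO identifier.
    body = build_related_terms_payload("TERM_ID", relations=["part_of"], to_root=1)
    print(json.dumps(body))
```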
 - Success Response:
     - Code: 200
     - Content: The response you get is a list of GO terms related to the searched term that fulfills the conditions of the query. Each term has:
-		- `go_id`: id of the GO term
-		- `name`: name of the GO term
-        - `ontology_type`: the ontology that the GO term belongs to
-		- `relations`: dictionary of relations 
-            - `relation type`: list of terms related by that relation type to the term
+		- `<go_id>`: ID of the GO term
+		- `<name>`: Name of the GO term
+        - `<ontology_type>`: The ontology that the GO term belongs to
+		- `<relations>`: Dictionary of relations 
+            - `<relation type>`: List of terms related by that relation type to the term
 	- Example:
         - URL: http://localhost:8000/related-terms
          - body: 
@@ -574,7 +574,7 @@ Gets the list of related terms to a term.
 
 ### Cancer related drugs (PharmGKB)
 
-Gets the list of related drugs to a list of genes.
+Gets a list of drugs related to a list of genes.
 
 - URL: /drugs-pharm-gkb
 - Method: POST
@@ -583,14 +583,14 @@ Gets the list of related drugs to a list of genes.
 - Success Response:
     - Code: 200
     - Content: The response you get is a list of genes containing the related drug information
-		- `pharmGKB_id`: Identifier assigned to this drug label by PharmGKB
-		- `name`: Name assigned to the label by PharmGKB
-		- `source`: The source that originally authored the label (e.g. FDA, EMA)
-		- `biomarker_flag`: "On" if drug in this label appears on the FDA Biomarker list; "Off (Formerly On)" if the label was on the FDA Biomarker list at one time; "Off (Never On)" if the label was never listed on the FDA Biomarker list (to PharmGKB's knowledge)
-		- `Testing Level`:  PGx testing level as annotated by PharmGKB based on definitions at https://www.pharmgkb.org/page/drugLabelLegend
-		- `Chemicals`: Related chemicals
-		- `Genes`: List of related genes
-		- `Variants-Haplotypes`: Related variants and/or haplotypes
+		- `<pharmGKB_id>`: Identifier assigned to this drug label by PharmGKB
+		- `<name>`: Name assigned to the label by PharmGKB
+		- `<source>`: The source that originally authored the label (e.g. FDA, EMA)
+		- `<biomarker_flag>`: "On" if drug in this label appears on the FDA Biomarker list; "Off (Formerly On)" if the label was on the FDA Biomarker list at one time; "Off (Never On)" if the label was never listed on the FDA Biomarker list (to PharmGKB's knowledge)
+		- `<Testing Level>`:  PGx testing level as annotated by PharmGKB based on definitions at https://www.pharmgkb.org/page/drugLabelLegend
+		- `<Chemicals>`: Related chemicals
+		- `<Genes>`: List of related genes
+		- `<Variants-Haplotypes>`: Related variants and/or haplotypes
 	- Example:
         - URL: http://localhost:8000/drugs-pharm-gkb
          - body: 
@@ -613,7 +613,57 @@ Gets the list of related drugs to a list of genes.
 			}
 		    ]
 		    }
-	```  
+	``` 
+
+### Predicted functional associations network (String)
+
+Gets a list of genes and relations related to a gene.
+- URL: /string-relations
+- Method: POST
+- Params: A body in Json format with the following content
+	-  `gene_id`: target gene
+    -  `min_combined_score`: the minimum combined score allowed in the relations. Possible scores range from 1 to 1000
+- Success Response:
+    - Code: 200
+    - Content: The response you get is a list of relations containing the targeted gene
+		- `<gene_1>`: Gene 1 in the bidirectional relationship
+		- `<gene_2>`: Gene 2 in the bidirectional relationship
+        - `<neighborhood>`: Optional. Values range from 1 to 1000
+        - `<neighborhood_transferred>`: Optional. Values range from 1 to 1000
+        - `<fusion>`: Optional. Values range from 1 to 1000
+        - `<cooccurence>`: Optional. Values range from 1 to 1000
+        - `<homology>`: Optional. Values range from 1 to 1000
+        - `<coexpression>`: Optional. Values range from 1 to 1000
+        - `<coexpression_transferred>`: Optional. Values range from 1 to 1000
+        - `<experiments>`: Optional. Values range from 1 to 1000
+        - `<experiments_transferred>`: Optional. Values range from 1 to 1000
+        - `<database>`: Optional. Values range from 1 to 1000
+        - `<database_transferred>`: Optional. Values range from 1 to 1000
+        - `<textmining>`: Optional. Values range from 1 to 1000
+        - `<textmining_transferred>`: Optional. Values range from 1 to 1000
+        - `<combined_score>`: Values range from 1 to 1000
+
+    - Example:
+        - URL: http://localhost:8000/string-relations
+         - body: 
+            `{  "gene_id" : "MX2", "min_combined_score": 996  }`
+        - Response:
+	```json
+        [
+        {
+            "coexpression": 558,
+            "coexpression_transferred": 825,
+            "combined_score": 997,
+            "database": 900,
+            "experiments_transferred": 149,
+            "gene_1": "OASL",
+            "gene_2": "MX2",
+            "textmining": 652,
+            "textmining_transferred": 257
+        }
+        ]
+	``` 
+
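
The `/string-relations` request and response described above could be used from Python roughly as follows. This is a sketch under assumptions: the helper names (`build_string_payload`, `partners`, `fetch_string_relations`) are hypothetical, the base URL is the local one used in the examples, and the range check on `min_combined_score` is a client-side convenience rather than documented server behaviour.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # assumption: local deployment, as in the examples


def build_string_payload(gene_id, min_combined_score):
    """Build the JSON body for POST /string-relations."""
    if not 1 <= min_combined_score <= 1000:
        raise ValueError("min_combined_score must be between 1 and 1000")
    return {"gene_id": gene_id, "min_combined_score": min_combined_score}


def partners(relations, gene_id):
    """Return (partner gene, combined_score) pairs, highest score first.

    Relations are bidirectional, so the queried gene may appear
    as either gene_1 or gene_2.
    """
    pairs = []
    for rel in relations:
        other = rel["gene_2"] if rel["gene_1"] == gene_id else rel["gene_1"]
        pairs.append((other, rel["combined_score"]))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def fetch_string_relations(gene_id, min_combined_score, base_url=BASE_URL):
    """Send the request; needs a running BioAPI instance to succeed."""
    body = json.dumps(build_string_payload(gene_id, min_combined_score)).encode("utf-8")
    req = request.Request(
        f"{base_url}/string-relations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Offline demonstration using a trimmed copy of the example response above.
    sample = [{"gene_1": "OASL", "gene_2": "MX2", "combined_score": 997}]
    print(partners(sample, "MX2"))  # prints [('OASL', 997)]
```

Keeping the payload construction and the response handling separate from the network call makes both easy to check without a running server.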
 ## Error Responses
 
 The possible error codes are 400, 404 and 500. The content of each of them is a Json with a unique key called "error" where its value is a description of the problem that produces the error. For example: