Requirements: having successfully run the steps database creation AND insert species and taxonomy.
Goal: insert the genes used in Bgee.
Details
This script will insert species gene information based on already inserted species (in the table species). It takes between 15 and 60 min per species.
It uses the Ensembl API.
It will fill the gene table with geneId, geneName and geneDescription (+ geneBioTypeId + speciesId).
It will also fill the tables geneBioType, geneOntologyTerm + geneOntologyTermAltId (and provides a file for obsolete GO terms), geneNameSynonym, geneToGeneOntologyTerm, and geneXRef. The insertion in geneXRef makes use of the dataSource table.
It will also insert OncoMX XRefs in the geneXRef table. This step has to be run separately, using the command: make ../../generated_file/insert_oncoMX_XRefs
Important note regarding bonobo genes: for bonobo, we use the same genome as chimpanzee (there is no bonobo genome in Ensembl, and it is debatable whether bonobo and chimpanzee represent the same species). An SQL query at the end of the Makefile duplicates all chimpanzee genes while assigning them new IDs. It also duplicates the corresponding entries in geneNameSynonym, geneToTerm and geneToGeneOntologyTerm (with the appropriate IDs). As a result, if you add or modify any field in the tables gene, geneNameSynonym, geneToTerm, or geneToGeneOntologyTerm, you might need to modify the query used in this Makefile.
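The duplication described in the note above can be illustrated with a minimal sketch. It is not the query from the Makefile: it runs against an in-memory SQLite database, the auto-incremented bgeeGeneId column and the layout of geneNameSynonym are assumptions, the gene rows are toy data, and the species IDs (9598 for chimpanzee, 9597 for bonobo) are NCBI taxon IDs assumed to serve as speciesId.

```python
import sqlite3

# Assumption: speciesId holds NCBI taxon IDs (chimpanzee 9598, bonobo 9597).
CHIMP_ID, BONOBO_ID = 9598, 9597

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    bgeeGeneId INTEGER PRIMARY KEY AUTOINCREMENT,  -- assumed internal ID
    geneId TEXT NOT NULL,
    geneName TEXT,
    geneDescription TEXT,
    geneBioTypeId INTEGER,
    speciesId INTEGER NOT NULL
);
CREATE TABLE geneNameSynonym (                      -- assumed layout
    bgeeGeneId INTEGER NOT NULL,
    geneNameSynonym TEXT NOT NULL
);
""")

# Toy chimpanzee genes and synonyms (not real Ensembl records).
conn.executemany(
    "INSERT INTO gene (geneId, geneName, geneDescription, geneBioTypeId, speciesId)"
    " VALUES (?, ?, ?, ?, ?)",
    [("TOY_GENE_A", "geneA", "toy gene A", 1, CHIMP_ID),
     ("TOY_GENE_B", "geneB", "toy gene B", 1, CHIMP_ID)],
)
conn.executemany(
    "INSERT INTO geneNameSynonym (bgeeGeneId, geneNameSynonym) VALUES (?, ?)",
    [(1, "synA1"), (1, "synA2"), (2, "synB1")],
)


def duplicate_genes(conn, source_species, target_species):
    """Copy all genes of source_species to target_species with new internal IDs."""
    # New bgeeGeneId values are assigned automatically by the primary key.
    conn.execute(
        """
        INSERT INTO gene (geneId, geneName, geneDescription, geneBioTypeId, speciesId)
        SELECT geneId, geneName, geneDescription, geneBioTypeId, ?
        FROM gene WHERE speciesId = ?
        """,
        (target_species, source_species),
    )
    # Re-attach each synonym to the copied gene that shares the same geneId.
    conn.execute(
        """
        INSERT INTO geneNameSynonym (bgeeGeneId, geneNameSynonym)
        SELECT copy.bgeeGeneId, syn.geneNameSynonym
        FROM geneNameSynonym AS syn
        JOIN gene AS src ON src.bgeeGeneId = syn.bgeeGeneId
        JOIN gene AS copy ON copy.geneId = src.geneId
        WHERE src.speciesId = ? AND copy.speciesId = ?
        """,
        (source_species, target_species),
    )
    conn.commit()


duplicate_genes(conn, CHIMP_ID, BONOBO_ID)
bonobo_gene_count = conn.execute(
    "SELECT COUNT(*) FROM gene WHERE speciesId = ?", (BONOBO_ID,)
).fetchone()[0]
bonobo_synonym_count = conn.execute(
    "SELECT COUNT(*) FROM geneNameSynonym AS syn"
    " JOIN gene AS g ON g.bgeeGeneId = syn.bgeeGeneId WHERE g.speciesId = ?",
    (BONOBO_ID,),
).fetchone()[0]
```

The sketch also shows why schema changes matter here: every column listed in the INSERT ... SELECT statements has to be kept in sync with the table definitions, otherwise the copied rows silently lose the new field.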
Data generation
If it is the first time you execute this step in this pipeline run:
make clean
Run Makefile:
make
Data verification
Before Bgee 13, miRBase Xrefs were not provided by Ensembl for zebrafish. Check this at the end of the pipeline with:
SELECT x.geneId FROM gene g, geneXRef x WHERE g.geneId = x.geneId AND g.geneBioTypeId = (SELECT geneBioTypeId FROM geneBioType WHERE geneBioTypeName = 'miRNA') AND x.dataSourceId = (SELECT dataSourceId FROM dataSource WHERE dataSourceName = 'ZFIN');
Before Bgee 13, Ensembl did not provide XenBase Xref mapping. Check whether it is now available with:
SELECT x.geneId FROM gene g, geneXRef x WHERE g.geneId = x.geneId AND x.dataSourceId = (SELECT dataSourceId FROM dataSource WHERE dataSourceName = 'XenBase');
Error handling
Ensembl does NOT provide gene information the same way for each species, especially for non-vertebrate species (Drosophila melanogaster and C. elegans) and/or model organisms that have a non-Ensembl reference database (e.g. zebrafish or Xenopus). Consequently, the Xrefs used in Bgee (linked in the dataSource table) are not available the same way in Ensembl. You may need to add extra aliases in the insert_genes.pl script to catch all the Xrefs you need, and you may have to do so for each new species as well as for each new Ensembl release. E.g. not all Xrefs are available in the zfin source; some are available in zfin_id.
Update the gene table with the count of genes in the database that share an identical Ensembl gene ID. In Bgee, for some species with no genome available, we use the genome of a closely related species (e.g. the chimpanzee genome for analyzing bonobo data), so the same Ensembl gene ID can be mapped to several species:
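The alias idea can be sketched as a small lookup that folds alternative Ensembl source names into the one expected by the dataSource table. This is an illustration only, not code from insert_genes.pl (which is a Perl script): the names XREF_ALIASES, normalize_xref_source and group_xrefs are hypothetical, the case-insensitive matching is an assumption, and the Xref IDs are toy values.

```python
# Hypothetical alias table: alternative Ensembl source name -> canonical name.
# Only the zfin_id -> zfin pair comes from the text above.
XREF_ALIASES = {
    "zfin_id": "zfin",
}


def normalize_xref_source(ensembl_db_name):
    """Return the canonical source name for an Ensembl external DB name."""
    key = ensembl_db_name.strip().lower()  # assumption: names compared case-insensitively
    return XREF_ALIASES.get(key, key)


def group_xrefs(xrefs):
    """Group (source name, xref ID) pairs by canonical source name."""
    grouped = {}
    for db_name, xref_id in xrefs:
        grouped.setdefault(normalize_xref_source(db_name), set()).add(xref_id)
    return grouped


# Toy Xrefs as they might be reported under two different source names.
toy_xrefs = [
    ("zfin", "TOY-XREF-1"),
    ("zfin_id", "TOY-XREF-2"),
    ("zfin_id", "TOY-XREF-1"),
]
grouped_xrefs = group_xrefs(toy_xrefs)
```

With the alias in place, both toy Xrefs end up under zfin; without it, those reported under zfin_id would be missed when matching against the dataSource table.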
make sameIdGeneCount
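What this rule computes can be sketched as follows. This is not the statement run by the Makefile: the column name sameIdGeneCount is hypothetical, the rows are toy data, and the sketch uses an in-memory SQLite database. Other database engines may need a different formulation of the same update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    bgeeGeneId INTEGER PRIMARY KEY,
    geneId TEXT NOT NULL,
    speciesId INTEGER NOT NULL,
    sameIdGeneCount INTEGER NOT NULL DEFAULT 1  -- hypothetical column name
);
-- Toy rows: TOY_GENE_A is shared by two species, TOY_GENE_B is unique.
INSERT INTO gene (geneId, speciesId) VALUES
    ('TOY_GENE_A', 9598),
    ('TOY_GENE_A', 9597),
    ('TOY_GENE_B', 9606);
""")

# For each row, count how many rows in the table carry the same geneId.
conn.execute("""
UPDATE gene
SET sameIdGeneCount = (
    SELECT COUNT(*) FROM gene AS other WHERE other.geneId = gene.geneId
)
""")
conn.commit()

same_id_counts = conn.execute(
    "SELECT geneId, speciesId, sameIdGeneCount FROM gene ORDER BY bgeeGeneId"
).fetchall()
```

A count above 1 flags a gene ID that cannot identify a single gene on its own and needs the species to be disambiguated.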
Insert OncoMX XRefs based on a file provided by OncoMX and a mapping to bgeeGeneId based on UniProt IDs. Make sure that the URL of the file is up to date before running the rule:
make ../../generated_file/insert_oncoMX_XRefs
Gene Homology
Requirements: having contacted Adrian Altenhoff one month in advance to request an update of the OMA HOGs based on our list of species (this is different from the data available from their download page), and having successfully run the step insert genes.
Goal: insert the OMA Hierarchical Orthologous Groups, and miRBase families.
Details
NEED TO THINK ABOUT HOW TO HANDLE MIRBASE FAMILIES, OUTSIDE OF THE TABLES DESIGNED FOR OMA HOG. OR FIND A WAY TO INTEGRATE THEM INTO THE OMA HOG TABLES?