- Pathannotator annotates proteins with KEGG and Flybase pathways. It does this through the use of KofamScan, KEGG API, OrthoFinder and Flybase.
- KofamScan is a gene functional annotation tool based on KEGG Orthology and hidden Markov models (HMMs). It is provided by the KEGG (Kyoto Encyclopedia of Genes and Genomes) project. The online version is available here: https://www.genome.jp/tools/kofamkoala/
- This pipeline pulls annotations directly from the KEGG API when possible. When that isn't possible, the pipeline runs KofamScan to identify homologous KEGG objects (KO). The pathways annotated to these KEGG objects can then be transferred to the corresponding proteins in your species of interest.
- If specified, the pipeline will also provide annotations to Flybase pathways. To do this, the pipeline uses OrthoFinder (https://github.com/davidemms/OrthoFinder) to identify homologous *Drosophila melanogaster* proteins for your input proteins. Flybase metabolic pathway and signaling pathway annotations are then transferred to your input proteins from these homologs.
Pathannotator is provided as a Docker container for use on the command line.
The Dockerfile and scripts are available on GitHub
The KOfam 'profiles' and 'ko_list' databases are required to run the pipeline. If you don't already have these databases the pipeline will pull them during the first run. If you want to download them beforehand they are available from the KEGG website.
Using wget:
wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz
wget https://www.genome.jp/ftp/db/kofam/ko_list.gz
TIP: KEGG updates their annotations approximately once a month. If your profiles and ko_list files are older than this then your annotations will not be up-to-date. Just delete them and the new versions will be downloaded with your first annotation run.
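The monthly refresh described in this tip can be checked with a small helper before a run. This is only a sketch: the function name `stale_kofam_files` and the 30-day default are illustrative choices, it assumes GNU `find`, and it assumes the database entries sit directly in one directory with names starting with `profiles` or `ko_list`.

```shell
# Hypothetical helper (not part of Pathannotator): list KOfam database
# entries that were last modified more than a given number of days ago.
# Assumption: the entries are named profiles* and ko_list* and sit
# directly inside the directory passed as the first argument.
stale_kofam_files() {
    local db_dir="$1"
    local max_age_days="${2:-30}"
    # -mtime +N matches entries modified more than N days ago
    find "$db_dir" -maxdepth 1 \
        \( -name 'profiles*' -o -name 'ko_list*' \) \
        -mtime "+${max_age_days}"
}
```

Calling `stale_kofam_files /path/to/kofam/databases` prints nothing when the files are recent. Any path it prints is a candidate for deletion so that the next run downloads a fresh copy.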
If your species of interest has been annotated by the KEGG project you can provide this tool with the corresponding KEGG species code to pull those annotations directly. If your species of interest is not listed you should choose a closely related species and use that code. KEGG species codes can be found here: https://www.genome.jp/brite/br08611
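As an alternative to browsing the web page, species codes can be searched from the command line. The sketch below assumes the tab-separated organism listing served by the KEGG REST API at https://rest.kegg.jp/list/organism (T number, species code, organism name, lineage). The function name `find_kegg_code` is hypothetical.

```shell
# Hypothetical helper: search a KEGG organism listing (read from stdin)
# for organisms whose name contains the query, ignoring case.
# Assumed input columns (tab-separated):
#   T number, species code, organism name, lineage
find_kegg_code() {
    local query="$1"
    awk -F'\t' -v q="$query" \
        'index(tolower($3), tolower(q)) { print $2 "\t" $3 }'
}

# Example (needs network access):
#   curl -s https://rest.kegg.jp/list/organism | find_kegg_code "tribolium"
```

Each matching line shows the species code followed by the organism name. If nothing matches, search for a closely related genus instead.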
On the command line the following help statement can be displayed with 'help'.
Help and Usage:
There are 4 positional arguments.

1: KEGG species code (NA or related species code if species not in KEGG; 'help' to see this help and usage statement)
   KEGG species codes can be found here: https://www.genome.jp/brite/br08611
2: input file (protein FASTA without header lines)
3: output directory (must be an existing directory; the file path should be relative to, and inside of, your working directory)
4: 'FB' for flybase annotations, 'NA' for none

KofamScan is used under an MIT License:

Copyright (c) 2019 Takuya Aramaki

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Flybase annotation is carried out using OrthoFinder (https://github.com/davidemms/OrthoFinder).
The amount of time it takes to run this tool will vary greatly depending on several factors:
1. the number of sequences in your protein FASTA input file
2. whether your species has a KEGG species code
3. whether your input IDs are NCBI RefSeq protein IDs
4. whether you request Flybase annotation in addition to KEGG
5. how many CPUs you have available to run the analysis
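Two of these factors, the number of sequences and whether the IDs are NCBI RefSeq protein IDs, can be estimated from the input file before starting a run. The helpers below are a sketch. The function names are hypothetical, and the accession pattern is an approximation of common RefSeq protein prefixes, not the check that Pathannotator itself performs.

```shell
# Count sequences: every FASTA record starts with a '>' line.
count_sequences() {
    grep -c '^>' "$1"
}

# Count records whose ID looks like an NCBI RefSeq protein accession,
# e.g. NP_001034488.1 or XP_001813251.1.
# Assumption: this prefix list is illustrative and may be incomplete.
count_refseq_ids() {
    grep -cE '^>(NP|XP|YP|WP|AP)_[0-9]+\.[0-9]+' "$1"
}
```

If the two counts match, every record carries a RefSeq-style identifier. A lower second count suggests that some or all IDs come from another source.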
Example runs for various circumstances described above. These examples were run on the SciNet Atlas HPC system using Apptainer.
Number of input sequences | KEGG species code | NCBI RefSeq protein | Include Flybase annotation | Number of CPUs | Time to run (minutes) |
20,571 | related species | yes | yes | 2 | 1047 |
20,571 | related species | yes | no | 2 | 533 |
20,571 | related species | yes | yes | 12 | 192 |
20,571 | related species | yes | no | 12 | 86 |
20,571 | related species | yes | yes | 48 | 95 |
20,571 | related species | yes | no | 48 | 25 |
Number of input sequences | KEGG species code | NCBI RefSeq protein | Include Flybase annotation | Number of CPUs | Time to run (minutes) |
22,272 | same species | yes | yes | 2 | 428 |
22,272 | same species | yes | no | 2 | < 1 |
22,272 | same species | yes | yes | 12 | 96 |
22,272 | same species | yes | no | 12 | < 1 |
22,272 | same species | yes | yes | 48 | 62 |
22,272 | same species | yes | no | 48 | < 1 |
Number of input sequences | KEGG species code | NCBI RefSeq protein | Include Flybase annotation | Number of CPUs | Time to run (minutes) |
18,330 | related species | no | yes | 2 | > 720 (over 12 hours) |
18,330 | related species | no | no | 2 | 199 |
18,330 | related species | no | yes | 12 | 142 |
18,330 | related species | no | no | 12 | 32 |
18,330 | related species | no | yes | 48 | 48 |
18,330 | related species | no | no | 48 | 10 |
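To read the tables as scaling figures, divide the run time at 2 CPUs by the run time at 48 CPUs. The helper below is a sketch of that arithmetic, and the name `speedup` is illustrative. The values are taken from the first table (20,571 sequences, related species code).

```shell
# Speedup = slower run time / faster run time, rounded to one decimal.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

speedup 1047 95   # with Flybase, 2 vs 48 CPUs: prints 11.0
speedup 533 25    # KEGG only, 2 vs 48 CPUs: prints 21.3
```

A 24-fold increase in CPUs gave roughly an 11-fold speedup with Flybase annotation and a 21-fold speedup without it in these example runs. The Flybase step therefore appears to benefit less from additional CPUs.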
Pathannotator is provided as a Docker container.
A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.
There are two major containerization technologies: Docker and Apptainer (Singularity).
Docker containers can be run with either technology.
About Docker
- Docker must be installed on the computer you wish to use for your analysis.
- To run Docker you must have ‘root’ (admin) permissions (or use sudo).
- Docker will run all containers as ‘root’. This makes Docker incompatible with HPC systems (see Apptainer/Singularity below).
- Docker can be run on your local computer, a server, a cloud virtual machine etc.
- For more information on installing Docker on other systems: Installing Docker.
The Pathannotator tool is available as a Docker container on Docker Hub: Pathannotator container
The container can be pulled with this command:
docker pull agbase/pathannotator:3.0
Remember
You must have root permissions or use sudo, like so:
sudo docker pull agbase/pathannotator:3.0
sudo docker run --rm agbase/pathannotator:3.0 help
TIP:
The /workdir directory is built into this container and should be used to mount your working directory.
The /data directory is built into this container and should be used to mount the KofamScan database files.
sudo docker run \
--rm \
-v /path/to/your/input/files:/workdir \
-v /path/to/kofam/databases/:/data \
agbase/pathannotator:3.0 \
tca \
GCF_031307605.1_icTriCast1.1_protein.faa \
out_dir \
FB
sudo docker run: tells docker to run
--rm: removes the container when the analysis has finished. The image will remain for future use.
-v /path/to/your/input/files:/workdir: mounts the working directory on the host machine to '/workdir' inside the container
-v /path/to/kofam/databases/:/data: mounts the directory with the Kofam database files (or where you want them to be stored) on the host machine to '/data' inside the container
agbase/pathannotator:3.0: the name of the Docker image to use
Tip
All the options supplied after the image name are Pathannotator options
tca: KEGG species code for Tribolium castaneum. KEGG species codes can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code, choose a closely related species or 'NA'.
GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines).
out_dir: Directory where you want the pipeline outputs to go. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.
FB: FB indicates that we want to get Flybase pathways annotations in addition to KEGG annotations.
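Because the output directory must exist before the pipeline starts, it can be convenient to wrap the call in a small function that creates it first. The wrapper below is a sketch and not part of Pathannotator: the function name `run_pathannotator` and the `DRY_RUN` switch are hypothetical, and it assumes absolute host paths for both mounts.

```shell
# Hypothetical wrapper around the documented 'docker run' call.
# Arguments: working dir, KOfam database dir, KEGG species code,
#            input FASTA, output dir (relative to working dir), FB or NA.
# Set DRY_RUN=1 to print the command instead of running it.
run_pathannotator() {
    local workdir="$1" dbdir="$2" species="$3" fasta="$4" outdir="$5" flybase="${6:-NA}"
    # The pipeline requires the output directory to exist already.
    mkdir -p "$workdir/$outdir" "$dbdir" || return 1
    # Rebuild the positional parameters as the full command line.
    set -- docker run --rm \
        -v "$workdir:/workdir" \
        -v "$dbdir:/data" \
        agbase/pathannotator:3.0 \
        "$species" "$fasta" "$outdir" "$flybase"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `DRY_RUN=1 run_pathannotator /path/to/your/input/files /path/to/kofam/databases tca GCF_031307605.1_icTriCast1.1_protein.faa out_dir FB` prints the command for inspection. If Docker needs root permissions on your system, the `docker` word in the wrapper would need a `sudo` prefix.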
Reference Understanding results.
About Apptainer
- does not require ‘root’ permissions
- runs all containers as the user that is logged into the host machine
- HPC systems are likely to have Apptainer installed and are unlikely to object if asked to install it (no guarantees).
- can be run on any machine where it is installed
- more information about installing Apptainer
- This tool was tested using Apptainer 1.3.1
HPC Job Schedulers
Although Apptainer can be installed on any computer this documentation assumes it will be run on an HPC system. The tool was tested on a Slurm system and the job submission scripts below reflect that. Submission scripts will need to be modified for use with other job scheduler systems.
The Pathannotator tool is available as a Docker container on Docker Hub: Pathannotator container
Example Slurm script:
#!/bin/bash
#SBATCH --job-name=pathannot
#SBATCH --ntasks=8
#SBATCH --time=2:00:00
#SBATCH --partition=ceres
#SBATCH --account=nal_genomics
module load apptainer
cd /location/where/you/want/to/save/image/file
apptainer pull docker://agbase/pathannotator:3.0
Tip
The /workdir directory is built into this container and should be used to mount your local working directory.
The /data directory is built into this container and should be used to mount the KOfam database files.
#!/bin/bash
#SBATCH --job-name=pathannot
#SBATCH --ntasks=48
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --partition=ceres
#SBATCH --account=nal_genomics
module load apptainer
cd /directory/you/want/to/work/in
apptainer run \
-B /directory/you/want/to/work/in:/workdir \
-B /directory/with/kofam/database/files:/data \
/path/with/image/file/pathannotator_3.0.sif \
tca \
GCF_031307605.1_icTriCast1.1_protein.faa \
out_dir \
FB
apptainer run: tells Apptainer to run
-B /directory/you/want/to/work/in:/workdir: mounts the working directory on the host machine to '/workdir' in the container
-B /directory/with/kofam/database/files:/data: mounts the directory with the Kofam database files (or where you want them to be stored) on the host machine to '/data' in the container
/path/with/image/file/pathannotator_3.0.sif: the name of the Apptainer image to use
Tip
All the options supplied after the image name are Pathannotator options
tca: KEGG species code for Tribolium castaneum. KEGG species codes can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code, choose a closely related species or 'NA'.
GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines)
out_dir: Directory where you want the outputs of the pipeline to be stored. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.
FB: FB indicates that you want Flybase pathways annotations in addition to KEGG annotations
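On a scheduler, a missing input file or output directory is only discovered once the job starts, so a check before submission can save queue time. The function below is a sketch with a hypothetical name. It assumes the input FASTA and the output directory are both given relative to the working directory that will be mounted at /workdir.

```shell
# Hypothetical pre-submission check (not part of Pathannotator).
# Arguments: working directory, input FASTA, output directory.
# Returns non-zero and prints a message if a requirement is not met.
check_run_inputs() {
    local workdir="$1" fasta="$2" outdir="$3"
    case "$outdir" in
        /*) echo "output directory must be relative to the working directory: $outdir" >&2
            return 1 ;;
    esac
    [ -f "$workdir/$fasta" ] || { echo "input FASTA not found: $workdir/$fasta" >&2; return 1; }
    [ -d "$workdir/$outdir" ] || { echo "output directory does not exist: $workdir/$outdir" >&2; return 1; }
}
```

A typical use is `check_run_inputs /directory/you/want/to/work/in GCF_031307605.1_icTriCast1.1_protein.faa out_dir && sbatch your_script.sh`, where `your_script.sh` stands for your own submission script.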
Reference Understanding results.
The output files you can expect will differ depending on the circumstances of your run. If you are using the KEGG code for your species of interest and your FASTA protein identifiers are NCBI protein IDs then your annotations will be pulled directly from the KEGG API. In other circumstances (detailed below) KofamScan will be run to identify homologs and transfer annotations to your species of interest. Under all circumstances you may specify whether or not you want to receive Flybase pathways annotations as well.
tca_KEGG_species.tsv: These are KEGG's annotations of the NCBI RefSeq proteins to the species-specific KEGG pathways. The filename will begin with the KEGG species code. The pathway identifiers will begin with the KEGG species code. Note that for species-specific pathways, KEGG internally filters associations between the KO (KEGG Orthology) accession and the reference pathway.
Input_protein_ID | KEGG_KO | KEGG_tca_pathway | KEGG_tca_pathway_name |
XP_001813251.1 | K01540 | tca04820 | Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle) |
XP_001812480.1 | K02268 | tca00190 | Oxidative phosphorylation - Tribolium castaneum (red flour beetle) |
XP_008195997.1 | K04676 | tca04350 | TGF-beta signaling pathway - Tribolium castaneum (red flour beetle) |
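A quick summary of this file is the number of distinct input proteins annotated to each pathway. The sketch below assumes the file is tab-separated with a single header row and that the columns appear in the order listed for tca_KEGG_species.tsv. The function name `proteins_per_pathway` is hypothetical.

```shell
# Count distinct input proteins (column 1) per pathway (column 3).
# Assumptions: tab-separated input, one header row, which is skipped.
proteins_per_pathway() {
    awk -F'\t' '
        NR > 1 && !seen[$3 FS $1]++ { n[$3]++ }
        END { for (p in n) print p "\t" n[p] }
    ' "$1" | sort
}
```

Running `proteins_per_pathway out_dir/tca_KEGG_species.tsv` prints one pathway identifier and its protein count per line, sorted by identifier.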
tca_KEGG_ref.tsv: These are KEGG's annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'. You should expect more pathway annotations per protein than for the species-specific pathway.
Input_protein_ID | KEGG_KO | KEGG_ref_pathway | KEGG_ref_pathway_name |
XP_015835225.1 | K26207 | map04024 | cAMP signaling pathway |
XP_015835225.1 | K26207 | map04261 | Adrenergic signaling in cardiomyocytes |
XP_001813251.1 | K01540 | map04022 | cGMP-PKG signaling pathway |
tca_acc_pathways.tsv: This file contains the aggregation of all pathway annotations for each input identifier.
Input_protein_ID | pathway |
NP_001034488.1 | KEGG:map04013,KEGG:tca04013,Flybase:FBgg0000956,Flybase:FBgg0000950 |
NP_001034489.1 | KEGG:map04391,KEGG:tca04391 |
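Many downstream tools expect one annotation per row, not a comma-separated list. The sketch below expands the aggregated file into that long format. It assumes tab-separated columns and a single header row, and the function name `explode_pathways` is hypothetical.

```shell
# Expand the comma-separated pathway list (column 2) into one
# "protein <TAB> pathway" line per annotation.
# Assumptions: tab-separated input, one header row, which is skipped.
explode_pathways() {
    awk -F'\t' 'NR > 1 {
        n = split($2, p, ",")
        for (i = 1; i <= n; i++) print $1 "\t" p[i]
    }' "$1"
}
```

Piping the result through `grep 'Flybase:'` or `grep 'KEGG:'` then selects annotations from a single source.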
tca_pathways_acc.tsv: This file contains the aggregation of all input identifiers annotated to each of the pathways.
pathway | Input_protein_ID |
Flybase:FBgg0000881 | XP_008196394.1,XP_008194025.1,XP_001807060.1,XP_015839080.1 |
KEGG:tca03273 | XP_008192998.2,XP_009105448.1,XP_008196990.1 |
OrthoFinder_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.
Input_protein_ID | Flybase_protein_ID | Flybase_pathway_ID | Flybase_pathway_name |
NP_001034540.1 | FBpp0077451 | FBgg0001085 | BMP Signaling Pathway Core Components |
NP_001034503.2 | FBpp0084690 | FBgg0000904 | Insulin-like Receptor Signaling Pathway Core Components |
NP_001034492.1 | FBpp0078442 | FBgg0002045 | CHITIN BIOSYNTHESIS |
dme_flybase.tsv: This is an alternative to 'OrthoFinder_flybase.tsv' if you are annotating Drosophila melanogaster.
Input_protein_ID | KEGG_KO | Flybase_pathway_ID | Flybase_pathway_name |
NP_001034490.1 | K04491 | FBgg0000890 | Wnt-TCF Signaling Pathway Core Components |
NP_001034491.1 | K00698 | FBgg0002045 | CHITIN BIOSYNTHESIS |
NP_001034491.1 | K00698 | FBgg0002045 | CHITIN BIOSYNTHESIS |
kofam_result_full.txt: This is the full output from KofamScan. According to KEGG: "K number assignments with scores above the predefined thresholds for individual KOs are more reliable than other proposed assignments. Such high score assignments are highlighted with asterisks '*' in the output." Pathways annotations have not yet been identified.
# gene name | KO | thrshld | score | E-value | KO definition |
NP_001034280.2 | K10180 | 417.47 | 374.4 | 1.2e-113 | T-box protein 6 |
NP_001034280.2 | K10177 | 886.07 | 309.5 | 7.2e-94 | T-box protein 3 |
NP_001034280.2 | K10176 | 750.77 | 300.4 | 4.6e-91 | T-box protein 2 |
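To keep only the assignments that KEGG describes as more reliable, the asterisk-marked lines can be filtered out of the full result. This sketch assumes KofamScan's standard detailed layout, in which the asterisk is a separate first field ahead of the gene name and header lines start with '#'. The function name `significant_hits` is hypothetical.

```shell
# Print gene name and KO for hits marked with '*' in the first field.
# Assumptions: whitespace- or tab-separated fields with the asterisk
# as its own leading field; lines without it are skipped.
significant_hits() {
    awk '$1 == "*" { print $2 "\t" $3 }' "$1"
}
```

Running `significant_hits out_dir/kofam_result_full.txt` gives a two-column list of proteins and their above-threshold KOs.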
tca_KEGG_species.tsv: These are annotations to the species-specific KEGG pathway. The pathway identifiers will begin with the KEGG species code.
Input_protein_ID | KEGG_KO | KEGG_tca_pathway | KEGG_tca_pathway_name |
XP_001813251.1 | K01540 | tca04820 | Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle) |
XP_001812480.1 | K02268 | tca00190 | Oxidative phosphorylation - Tribolium castaneum (red flour beetle) |
XP_008195997.1 | K04676 | tca04350 | TGF-beta signaling pathway - Tribolium castaneum (red flour beetle) |
tca_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.
Input_protein_ID | KEGG_KO | KEGG_ref_pathway | KEGG_ref_pathway_name |
XP_015835225.1 | K26207 | map04024 | cAMP signaling pathway |
XP_015835225.1 | K26207 | map04261 | Adrenergic signaling in cardiomyocytes |
XP_001813251.1 | K01540 | map04022 | cGMP-PKG signaling pathway |
tca_acc_pathways.tsv: This file contains the aggregation of all pathway annotations for each input identifier.
tca_pathways_acc.tsv: This file contains the aggregation of all input identifiers annotated to each of the pathways.
pathway | Input_protein_ID |
Flybase:FBgg0000918 | XP_044254039.1,XP_044272825.1 |
KEGG:tca00780 | XP_044253000.1,XP_044261349.1,XP_044272235.1 |
OrthoFinder_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.
Input_protein_ID | Flybase_protein_ID | Flybase_pathway_ID | Flybase_pathway_name |
NP_001034540.1 | FBpp0077451 | FBgg0001085 | BMP Signaling Pathway Core Components |
NP_001034503.2 | FBpp0084690 | FBgg0000904 | Insulin-like Receptor Signaling Pathway Core Components |
NP_001034492.1 | FBpp0078442 | FBgg0002045 | CHITIN BIOSYNTHESIS |
Expected output files:
If you did not specify a KEGG species code (used 'NA') then no species-specific annotations file will be generated.
kofam_result_full.txt: This is the full output from KofamScan. According to KEGG: "K number assignments with scores above the predefined thresholds for individual KOs are more reliable than other proposed assignments. Such high score assignments are highlighted with asterisks '*' in the output." Pathways annotations have not yet been identified.
# gene name | KO | thrshld | score | E-value | KO definition |
NP_001034280.2 | K10180 | 417.47 | 374.4 | 1.2e-113 | T-box protein 6 |
NP_001034280.2 | K10177 | 886.07 | 309.5 | 7.2e-94 | T-box protein 3 |
NP_001034280.2 | K10176 | 750.77 | 300.4 | 4.6e-91 | T-box protein 2 |
NA_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.
Input_protein_ID | KEGG_KO | KEGG_ref_pathway | KEGG_ref_pathway_name |
NP_001034489.1 | K16672 | map04391 | Hippo signaling pathway - fly |
NP_001034490.1 | K04491 | map04310 | Wnt signaling pathway |
NP_001034490.1 | K04491 | map04390 | Hippo signaling pathway |
tca_acc_pathways.tsv: This file contains the aggregation of all pathway annotations for each input identifier.
Input_protein_ID | pathway |
NP_001034488.1 | KEGG:map04013,KEGG:tca04013,Flybase:FBgg0000956,Flybase:FBgg0000950 |
NP_001034489.1 | KEGG:map04391,KEGG:tca04391 |
tca_pathways_acc.tsv: This file contains the aggregation of all input identifiers annotated to each of the pathways.
pathway | Input_protein_ID |
Flybase:FBgg0000881 | XP_008196394.1,XP_008194025.1,XP_001807060.1,XP_015839080.1 |
KEGG:tca03273 | XP_008192998.2,XP_009105448.1,XP_008196990.1 |
OrthoFinder_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.
Input_protein_ID | Flybase_protein_ID | Flybase_pathway_ID | Flybase_pathway_name |
NP_001034540.1 | FBpp0077451 | FBgg0001085 | BMP Signaling Pathway Core Components |
NP_001034503.2 | FBpp0084690 | FBgg0000904 | Insulin-like Receptor Signaling Pathway Core Components |
NP_001034492.1 | FBpp0078442 | FBgg0002045 | CHITIN BIOSYNTHESIS |
NP_001034491.1 | FBpp0290640 | FBgg0002045 | CHITIN BIOSYNTHESIS |