- Pathannotator annotates proteins with KEGG and Flybase pathways. It does this through the use of KofamScan, KEGG API and Flybase.
- KofamScan is a gene functional annotation tool based on KEGG Orthology and hidden Markov model (HMM). It is provided by the KEGG (Kyoto Encyclopedia of Genes and Genomes) project. The online version is available here: https://www.genome.jp/tools/kofamkoala/ .
- This pipeline pulls annotations directly from the KEGG API when possible. When that isn't possible, the pipeline runs KofamScan to identify homologous KEGG Orthology (KO) objects. The pathways annotated to these KEGG objects can then be transferred to the corresponding proteins in your species of interest.
- If specified, the pipeline will also provide annotations to Flybase pathways. To do this the pipeline uses phmmer to identify homologous Drosophila melanogaster proteins for your input proteins. Flybase metabolic pathway and signaling annotations are then transferred to your input proteins from these homologs.
Pathannotator is provided as a Docker container for use on the command line.
The Dockerfile and scripts are available on GitHub
The KOfam 'profiles' and 'ko_list' databases are required to run the pipeline. If you don't already have these databases the pipeline will pull them during the first run. If you want to download them beforehand they are available from the KEGG website.
Using wget:
wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz
wget https://www.genome.jp/ftp/db/kofam/ko_list.gz
TIP: KEGG updates their annotations approximately once a month. If your profiles and ko_list files are older than this then your annotations will not be up-to-date. Just delete them and the new versions will be downloaded with your first annotation run.
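The tip above can be turned into a quick check before a run. The sketch below is only an illustration: the function name `kofam_db_is_stale`, the 30-day default, and the assumption that the unpacked list is a file named `ko_list` inside your database directory are not taken from the pipeline itself.

```shell
# Hypothetical helper (not part of Pathannotator).
# Succeeds (exit 0) when DB_DIR/ko_list is missing or was last modified
# more than MAX_AGE_DAYS days ago; fails (exit 1) otherwise.
kofam_db_is_stale() {
    db_dir=$1
    max_age_days=${2:-30}   # assumption: roughly one KEGG update cycle
    [ -e "$db_dir/ko_list" ] || return 0
    # find prints the path only when the file is older than the cutoff
    [ -n "$(find "$db_dir/ko_list" -mtime +"$max_age_days" -print)" ]
}

# Example use (the path is a placeholder):
# if kofam_db_is_stale /path/to/kofam/databases 30; then
#     echo "KOfam databases look old; delete them so the next run downloads new ones"
# fi
```

The function only reports; deleting the files is left to you so that nothing is removed by accident.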
If your species of interest has been annotated by the KEGG project you can provide this tool with the corresponding KEGG species code to pull those annotations directly. If your species of interest is not listed you should choose a closely related species and use that code. KEGG species codes can be found here: https://www.genome.jp/brite/br08611
On the command line the following help statement can be displayed with 'help'.
Help and Usage:
There are 4 positional arguments.
1: KEGG species code (NA or related species code if species not in KEGG; 'help' to see this help and usage statement)
KEGG species codes can be found here: https://www.genome.jp/brite/br08611
2: input file (protein FASTA without header lines)
3: output directory (must be an existing directory; the file path should be relative to, and inside of, your working directory)
4: 'FB' for flybase annotations, 'NA' for none
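The four positional arguments can be sanity-checked before a long run is started. This is a minimal sketch, assuming you call it from your working directory; the function name `check_pathannotator_args` and its messages are hypothetical, and the container may perform its own, different checks.

```shell
# Hypothetical pre-flight check (not part of Pathannotator).
# Run it from your working directory with the same four arguments
# you intend to pass to the container.
check_pathannotator_args() {
    if [ "$#" -ne 4 ]; then
        echo "expected 4 arguments, got $#" >&2
        return 1
    fi
    species=$1; fasta=$2; outdir=$3; flybase=$4
    [ -n "$species" ] || { echo "species code is empty (use NA if none)" >&2; return 1; }
    [ -f "$fasta" ] || { echo "input FASTA not found: $fasta" >&2; return 1; }
    # the output directory must be relative to, and inside of, the working directory
    case $outdir in
        /*|..|../*|*/..|*/../*)
            echo "output directory must be a relative path inside the working directory" >&2
            return 1 ;;
    esac
    [ -d "$outdir" ] || { echo "output directory does not exist: $outdir" >&2; return 1; }
    case $flybase in
        FB|NA) ;;
        *) echo "fourth argument must be FB or NA" >&2; return 1 ;;
    esac
}

# Example use:
# check_pathannotator_args tca GCF_031307605.1_icTriCast1.1_protein.faa out_dir FB
```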
KofamScan is used under an MIT License:
Copyright (c) 2019 Takuya Aramaki
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
The amount of time it takes to run this tool will vary greatly depending on several factors:
1. the number of sequences in your protein FASTA input file
2. whether your species has a KEGG species code
3. whether your input IDs are NCBI RefSeq protein IDs
4. whether you request Flybase annotation in addition to KEGG
5. how many CPUs you have available to run the analysis
Please consider your options carefully as they can impact run times significantly.
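Two of these factors, the number of sequences and whether the IDs look like NCBI RefSeq protein accessions, can be estimated from the FASTA file itself. The sketch below is an approximation: `fasta_id_summary` is a hypothetical name, and the prefix pattern (NP_, XP_, YP_, WP_, AP_) is an assumption about what counts as a RefSeq protein ID, not the test the pipeline applies.

```shell
# Hypothetical helper (not part of Pathannotator).
# Prints "<total sequences> <sequences whose ID looks like a RefSeq protein accession>".
fasta_id_summary() {
    awk '
        /^>/ {
            total++
            id = substr($1, 2)   # first word after ">"
            if (id ~ /^(NP|XP|YP|WP|AP)_[0-9]+(\.[0-9]+)?$/) refseq++
        }
        END { printf "%d %d\n", total, refseq }
    ' "$1"
}

# Example use:
# fasta_id_summary GCF_031307605.1_icTriCast1.1_protein.faa
```

If the second number is much smaller than the first, expect the longer run times shown for non-RefSeq input.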
Example runs for various circumstances described above. These examples were run on the SciNet Ceres HPC system using Apptainer.
Number of input sequences | KEGG species code | NCBI RefSeq protein | Include Flybase annotation | Number of CPUs available | Time to run (minutes) |
20,571 | related species | yes | yes | 2 | > 1,663 |
20,571 | related species | yes | no | 2 | 944 |
20,571 | related species | yes | yes | 12 | 477 |
20,571 | related species | yes | no | 12 | 234 |
20,571 | related species | yes | yes | 48 | 380 |
20,571 | related species | yes | no | 48 | 32 |
20,571 | related species | yes | yes | 96 | 344 |
20,571 | related species | yes | no | 96 | 24 |
Number of input sequences | KEGG species code | NCBI RefSeq protein | Include Flybase annotation | Number of CPUs available | Time to run (minutes) |
22,272 | same species | yes | yes | 2 | 1,174 |
22,272 | same species | yes | no | 2 | < 1 |
22,272 | same species | yes | yes | 12 | 409 |
22,272 | same species | yes | no | 12 | < 1 |
22,272 | same species | yes | yes | 48 | 365 |
22,272 | same species | yes | no | 48 | < 1 |
22,272 | same species | yes | yes | 96 | 358 |
22,272 | same species | yes | no | 96 | < 1 |
Number of input sequences | KEGG species code | NCBI RefSeq protein | Include Flybase annotation | Number of CPUs available | Time to run (minutes) |
18,330 | related species | no | yes | 2 | 974 |
18,330 | related species | no | no | 2 | 494 |
18,330 | related species | no | yes | 12 | 273 |
18,330 | related species | no | no | 12 | 206 |
18,330 | related species | no | yes | 48 | 220 |
18,330 | related species | no | no | 48 | 47 |
18,330 | related species | no | yes | 96 | 179 |
18,330 | related species | no | no | 96 | 18 |
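To compare CPU counts, it can help to express the run times above as a speed-up over the 2-CPU run. The sketch below does this arithmetic for the first table's runs without Flybase annotation (944, 234, 32 and 24 minutes); the function name `cpu_speedup` is hypothetical, and the figures describe only those example runs.

```shell
# Hypothetical helper: reads "CPUs minutes" pairs on stdin (first line is the
# baseline) and prints CPUs, minutes and speed-up relative to the baseline.
cpu_speedup() {
    awk 'NR == 1 { base = $2 }
         { printf "%d\t%d\t%.1fx\n", $1, $2, base / $2 }'
}

# Run times copied from the first table (related species, RefSeq IDs, no Flybase):
# printf '%s\n' '2 944' '12 234' '48 32' '96 24' | cpu_speedup
```

For those four runs the speed-ups come out at 1.0x, 4.0x, 29.5x and 39.3x, so in that example the step from 48 to 96 CPUs adds comparatively little.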
Pathannotator is provided as a Docker container.
A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.
There are two major containerization technologies: Docker and Apptainer (Singularity).
Docker containers can be run with either technology.
About Docker
- Docker must be installed on the computer you wish to use for your analysis.
- To run Docker you must have ‘root’ (admin) permissions (or use sudo).
- Docker will run all containers as ‘root’. This makes Docker incompatible with HPC systems (see Apptainer/Singularity below).
- Docker can be run on your local computer, a server, a cloud virtual machine etc.
- For more information on installing Docker on other systems: Installing Docker.
The Pathannotator tool is available as a Docker container on Docker Hub: Pathannotator container
The container can be pulled with this command:
docker pull agbase/pathannotator:1.0
Remember
You must have root permissions or use sudo, like so:
sudo docker pull agbase/pathannotator:1.0
sudo docker run --rm agbase/pathannotator:1.0 help
TIP:
The /workdir directory is built into this container and should be used to mount your working directory.
The /data directory is built into this container and should be used to mount the KofamScan database files.
sudo docker run \
--rm \
-v /path/to/your/input/files:/workdir \
-v /path/to/kofam/databases/:/data \
agbase/pathannotator:1.0 \
tca \
GCF_031307605.1_icTriCast1.1_protein.faa \
out_dir \
FB
sudo docker run: tells docker to run
--rm: removes the container when the analysis has finished. The image will remain for future use.
-v /path/to/your/input/files:/workdir: mounts the working directory on the host machine to '/workdir' inside the container
-v /path/to/kofam/databases/:/data: mounts the directory with the Kofam database files (or where you want them to be stored) on the host machine to '/data' inside the container
agbase/pathannotator:1.0: the name of the Docker image to use
Tip
All the options supplied after the image name are Pathannotator options
tca: KEGG species code for Tribolium castaneum. KEGG species codes can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code, choose a closely related species.
GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines).
out_dir: Directory where you want the pipeline outputs to go. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.
FB: FB indicates that we want to get Flybase pathways annotations in addition to KEGG annotations.
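Because the output directory must exist before the run, a small wrapper can create it and show the full command for review first. This is a sketch under stated assumptions: `pathannotator_docker_cmd` is a hypothetical name, it only prints the command rather than running it, and it does not quote paths that contain spaces.

```shell
# Hypothetical wrapper (not part of Pathannotator): creates the output
# directory inside the working directory, then prints the docker command
# so it can be reviewed before it is run.
pathannotator_docker_cmd() {
    workdir=$1; dbdir=$2; species=$3; fasta=$4; outdir=$5; flybase=$6
    mkdir -p "$workdir/$outdir" || return 1
    printf 'sudo docker run --rm -v %s:/workdir -v %s:/data agbase/pathannotator:1.0 %s %s %s %s\n' \
        "$workdir" "$dbdir" "$species" "$fasta" "$outdir" "$flybase"
}

# Example use (paths are placeholders):
# pathannotator_docker_cmd /path/to/your/input/files /path/to/kofam/databases \
#     tca GCF_031307605.1_icTriCast1.1_protein.faa out_dir FB
```

Printing the command keeps the wrapper safe to try on a machine without Docker; copy the printed line into your shell once it looks right.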
See Understanding results for a description of the output files.
About Apptainer
- does not require ‘root’ permissions
- runs all containers as the user that is logged into the host machine
- HPC systems are likely to have Apptainer installed and are unlikely to object if asked to install it (no guarantees).
- can be run on any machine where it is installed
- more information about installing Apptainer
- This tool was tested using Apptainer 1.3.1
HPC Job Schedulers
Although Apptainer can be installed on any computer this documentation assumes it will be run on an HPC system. The tool was tested on a Slurm system and the job submission scripts below reflect that. Submission scripts will need to be modified for use with other job scheduler systems.
The Pathannotator tool is available as a Docker container on Docker Hub: Pathannotator container
Example Slurm script:
#!/bin/bash
#SBATCH --job-name=pathannot
#SBATCH --ntasks=8
#SBATCH --time=2:00:00
#SBATCH --partition=short
#SBATCH --account=nal_genomics
module load apptainer
cd /location/where/you/want/to/save/image/file
apptainer pull docker://agbase/pathannotator:1.0
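The image only needs to be pulled once. As a hedged sketch, the helper below skips the pull when the .sif file is already present; `ensure_pathannotator_sif` is a hypothetical name, and the file name `pathannotator_1.0.sif` is the one used in the run example in this document.

```shell
# Hypothetical helper (not part of Pathannotator): pull the image only when
# the .sif file is not already present in IMAGE_DIR.
ensure_pathannotator_sif() {
    image_dir=$1
    sif="$image_dir/pathannotator_1.0.sif"
    if [ -f "$sif" ]; then
        echo "using existing image: $sif"
    else
        # apptainer pull accepts an explicit output file before the URI
        apptainer pull "$sif" docker://agbase/pathannotator:1.0
    fi
}

# Example use inside a job script, after "module load apptainer":
# ensure_pathannotator_sif /location/where/you/want/to/save/image/file
```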
Tip
The /workdir directory is built into this container and should be used to mount your local working directory.
The /data directory is built into this container and should be used to mount the KOfam database files.
#!/bin/bash
#SBATCH --job-name=pathannot
#SBATCH --ntasks=8
#SBATCH --time=2:00:00
#SBATCH --partition=short
#SBATCH --account=nal_genomics
module load apptainer
cd /directory/you/want/to/work/in
apptainer run \
-B /directory/you/want/to/work/in:/workdir \
-B /directory/with/kofam/database/files:/data \
/path/with/image/file/pathannotator_1.0.sif \
tca \
GCF_031307605.1_icTriCast1.1_protein.faa \
out_dir \
FB
apptainer run: tells Apptainer to run
-B /directory/you/want/to/work/in:/workdir: mounts the working directory on the host machine to '/workdir' in the container
-B /directory/with/kofam/database/files:/data: mounts the directory with the KOfam database files (or where you want them to be stored) on the host machine to '/data' in the container
/path/with/image/file/pathannotator_1.0.sif: the name of the Apptainer image to use
Tip
All the options supplied after the image name are Pathannotator options
tca: KEGG species code for Tribolium castaneum. KEGG species codes can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code, choose a closely related species.
GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines)
out_dir: Directory where you want the outputs of the pipeline to be stored. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.
FB: FB indicates that you want Flybase pathways annotations in addition to KEGG annotations
See Understanding results for a description of the output files.
The output files you can expect will differ depending on the circumstances of your run. If you are using the KEGG code for your species of interest and your FASTA protein identifiers are NCBI protein IDs then your annotations will be pulled directly from the KEGG API. In other circumstances (detailed below) KofamScan will be run to identify homologs and transfer annotations to your species of interest. Under all circumstances you may specify whether or not you want to receive Flybase pathways annotations as well.
tca_KEGG_species.tsv: These are KEGG's annotations of the NCBI RefSeq proteins to the species-specific KEGG pathways. The filename will begin with the KEGG species code. The pathway identifiers will also begin with the KEGG species code. Note that for species-specific pathways, KEGG internally filters associations between the KO (KEGG Orthology) accession and the reference pathway.
KEGG_genes_ID | Input_protein_ID | KEGG_KO | KEGG_tca_pathway | KEGG_tca_pathway_name |
100141520 | XP_001813251 | K01540 | tca04820 | Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle) |
100141523 | XP_001812480 | K02268 | tca00190 | Oxidative phosphorylation - Tribolium castaneum (red flour beetle) |
100141526 | XP_008195997 | K04676 | tca04350 | TGF-beta signaling pathway - Tribolium castaneum (red flour beetle) |
tca_KEGG_ref.tsv: These are KEGG's annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'. You should expect more pathway annotations per protein than for the species-specific pathway.
KEGG_genes_ID | Input_protein_ID | KEGG_KO | KEGG_ref_pathway | KEGG_ref_pathway_name |
100141516 | XP_015835225 | K26207 | map04024 | cAMP signaling pathway |
100141516 | XP_015835225 | K26207 | map04261 | Adrenergic signaling in cardiomyocytes |
100141520 | XP_001813251 | K01540 | map04022 | cGMP-PKG signaling pathway |
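A common first step with this file is to count how many input proteins were assigned to each reference pathway. The sketch below assumes the file is tab-separated with one header row and the five columns shown above; the function name `proteins_per_pathway` is hypothetical.

```shell
# Hypothetical helper: number of distinct input proteins per KEGG reference
# pathway. Assumes a tab-separated file with one header row and the columns
# KEGG_genes_ID, Input_protein_ID, KEGG_KO, KEGG_ref_pathway, KEGG_ref_pathway_name.
proteins_per_pathway() {
    awk -F '\t' '
        # count each (pathway, protein) pair once, skipping the header row
        NR > 1 && !seen[$4 FS $2]++ { count[$4]++; name[$4] = $5 }
        END { for (p in count) printf "%s\t%s\t%d\n", p, name[p], count[p] }
    ' "$1" | sort
}

# Example use:
# proteins_per_pathway out_dir/tca_KEGG_ref.tsv
```

The output has one line per pathway: pathway ID, pathway name and protein count, sorted by pathway ID.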
HMM_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.
KEGG_genes_ID | Input_protein_ID | Flybase_protein_ID | Flybase_pathway_ID | Flybase_pathway_name |
CG9885 | NP_001034540.1 | FBpp0077451 | FBgg0001085 | BMP Signaling Pathway Core Components |
CG10002 | NP_001034503.2 | FBpp0084690 | FBgg0000904 | Insulin-like Receptor Signaling Pathway Core Components |
CG2666 | NP_001034492.1 | FBpp0078442 | FBgg0002045 | CHITIN BIOSYNTHESIS |
dme_flybase.tsv: This is an alternative to 'HMM_flybase.tsv' if you used the 'FB' option for Flybase pathways annotations AND your species code was 'dme' (Drosophila melanogaster).
kofam_result_full.txt: This is the full output from KofamScan. According to KEGG: "K number assignments with scores above the predefined thresholds for individual KOs are more reliable than other proposed assignments. Such high score assignments are highlighted with asterisks '*' in the output." Pathways annotations have not yet been identified.
# gene name | KO | thrshld | score | E-value | KO definition |
NP_001034280.2 | K10180 | 417.47 | 374.4 | 1.2e-113 | T-box protein 6 |
NP_001034280.2 | K10177 | 886.07 | 309.5 | 7.2e-94 | T-box protein 3 |
NP_001034280.2 | K10176 | 750.77 | 300.4 | 4.6e-91 | T-box protein 2 |
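Since only asterisk-marked assignments score above their KO threshold, it can be useful to pull those rows out of the full KofamScan output. The sketch below assumes the asterisk is the first whitespace-separated field on a line, followed by the gene name and the KO; that layout is an assumption about the file and should be checked against your own output. The function name `significant_ko_hits` is hypothetical.

```shell
# Hypothetical helper: print "gene<TAB>KO" for rows flagged with a leading "*".
# Header lines (starting with "#") and unflagged rows are skipped.
significant_ko_hits() {
    awk '$1 == "*" { print $2 "\t" $3 }' "$1"
}

# Example use:
# significant_ko_hits out_dir/kofam_result_full.txt
```

None of the three example rows above would be printed, because each score is below its threshold.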
tca_KEGG_species.tsv: These are annotations to the species-specific KEGG pathways. The pathway identifiers will begin with the KEGG species code.
KEGG_genes_ID | Input_protein_ID | KEGG_KO | KEGG_tca_pathway | KEGG_tca_pathway_name |
100141520 | XP_001813251 | K01540 | tca04820 | Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle) |
100141523 | XP_001812480 | K02268 | tca00190 | Oxidative phosphorylation - Tribolium castaneum (red flour beetle) |
100141526 | XP_008195997 | K04676 | tca04350 | TGF-beta signaling pathway - Tribolium castaneum (red flour beetle) |
tca_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.
KEGG_genes_ID | Input_protein_ID | KEGG_KO | KEGG_ref_pathway | KEGG_ref_pathway_name |
100141516 | XP_015835225 | K26207 | map04024 | cAMP signaling pathway |
100141516 | XP_015835225 | K26207 | map04261 | Adrenergic signaling in cardiomyocytes |
100141520 | XP_001813251 | K01540 | map04022 | cGMP-PKG signaling pathway |
HMM_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.
KEGG_genes_ID | Input_protein_ID | Flybase_protein_ID | Flybase_pathway_ID | Flybase_pathway_name |
CG9885 | NP_001034540.1 | FBpp0077451 | FBgg0001085 | BMP Signaling Pathway Core Components |
CG10002 | NP_001034503.2 | FBpp0084690 | FBgg0000904 | Insulin-like Receptor Signaling Pathway Core Components |
CG2666 | NP_001034492.1 | FBpp0078442 | FBgg0002045 | CHITIN BIOSYNTHESIS |
dme_flybase.tsv: This is an alternative to 'HMM_flybase.tsv' if you used the 'FB' option for Flybase pathways annotations AND your species code was 'dme' (Drosophila melanogaster).
Expected output files:
If you did not specify a KEGG species code (used 'NA') then no species-specific annotations file will be generated.
kofam_result_full.txt: This is the full output from KofamScan. According to KEGG: "K number assignments with scores above the predefined thresholds for individual KOs are more reliable than other proposed assignments. Such high score assignments are highlighted with asterisks '*' in the output." Pathways annotations have not yet been identified.
# gene name | KO | thrshld | score | E-value | KO definition |
NP_001034280.2 | K10180 | 417.47 | 374.4 | 1.2e-113 | T-box protein 6 |
NP_001034280.2 | K10177 | 886.07 | 309.5 | 7.2e-94 | T-box protein 3 |
NP_001034280.2 | K10176 | 750.77 | 300.4 | 4.6e-91 | T-box protein 2 |
NA_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.
KEGG_genes_ID | Input_protein_ID | KEGG_KO | KEGG_ref_pathway | KEGG_ref_pathway_name |
NA | NP_001034489 | K16672 | map04391 | Hippo signaling pathway - fly |
NA | NP_001034490 | K04491 | map04310 | Wnt signaling pathway |
NA | NP_001034490 | K04491 | map04390 | Hippo signaling pathway |
HMM_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.
KEGG_genes_ID | Input_protein_ID | Flybase_protein_ID | Flybase_pathway_ID | Flybase_pathway_name |
CG9885 | NP_001034540.1 | FBpp0077451 | FBgg0001085 | BMP Signaling Pathway Core Components |
CG10002 | NP_001034503.2 | FBpp0084690 | FBgg0000904 | Insulin-like Receptor Signaling Pathway Core Components |
CG2666 | NP_001034492.1 | FBpp0078442 | FBgg0002045 | CHITIN BIOSYNTHESIS |
CG2666 | NP_001034491.1 | FBpp0290640 | FBgg0002045 | CHITIN BIOSYNTHESIS |
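In the examples above, the Input_protein_ID values in HMM_flybase.tsv carry a version suffix (for example NP_001034540.1), while those in the KEGG tables do not (for example NP_001034489). If you want to compare or combine the two outputs, one option is to strip the suffix first. The sketch below assumes a tab-separated file with one header row; the function name `strip_id_versions` is hypothetical, and whether your own files differ in this way should be checked.

```shell
# Hypothetical helper: remove a trailing ".<digits>" version from the second
# column (Input_protein_ID) of a tab-separated file, leaving the header as is.
strip_id_versions() {
    awk -F '\t' -v OFS='\t' '
        NR == 1 { print; next }
        { sub(/\.[0-9]+$/, "", $2); print }
    ' "$1"
}

# Example use:
# strip_id_versions out_dir/HMM_flybase.tsv > HMM_flybase.unversioned.tsv
```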