These are the scripts used to pull KEGG annotations using KofamScan and the KEGG API.

AgBase/pathannotator

Intro

  • Pathannotator annotates proteins with KEGG and Flybase pathways. It does this through the use of KofamScan, KEGG API and Flybase.
  • KofamScan is a gene functional annotation tool based on KEGG Orthology and hidden Markov model (HMM). It is provided by the KEGG (Kyoto Encyclopedia of Genes and Genomes) project. The online version is available here: https://www.genome.jp/tools/kofamkoala/ .
  • This pipeline pulls annotations directly from the KEGG API when possible. When that isn't possible, the pipeline runs KofamScan to identify homologous KEGG Orthology (KO) entries. The pathways annotated to these KOs can then be transferred to the corresponding proteins in your species of interest.
  • If specified, the pipeline will also provide annotations to Flybase pathways. To do this the pipeline uses phmmer to identify homologous Drosophila melanogaster proteins for your input proteins. Flybase metabolic pathway and signaling annotations are then transferred to your input proteins from these homologs.

Where to Find Pathannotator

Pathannotator is provided as a Docker container for use on the command line.

The Dockerfile and scripts are available on GitHub

Getting the KOfam Databases

The KOfam 'profiles' and 'ko_list' databases are required to run the pipeline. If you don't already have these databases the pipeline will pull them during the first run. If you want to download them beforehand they are available from the KEGG website.

Using wget:

wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz

wget https://www.genome.jp/ftp/db/kofam/ko_list.gz

TIP: KEGG updates their annotations approximately once a month. If your profiles and ko_list files are older than this then your annotations will not be up-to-date. Just delete them and the new versions will be downloaded with your first annotation run.
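The age check described in the tip can be scripted before a run. The following is a minimal sketch, not part of Pathannotator: the function name `kofam_db_is_stale`, the 30-day default cutoff, and the assumption that the database directory holds entries whose names start with `ko_list` or `profiles` are all illustrative.

```shell
#!/bin/bash
# Hypothetical helper (not part of Pathannotator).
# Succeeds (exit 0) if any ko_list* or profiles* entry directly inside DB_DIR
# was last modified more than MAX_AGE_DAYS days ago.
kofam_db_is_stale() {
    local db_dir="$1"
    local max_age_days="${2:-30}"   # assumed cutoff, roughly one KEGG update cycle
    local old
    old=$(find "$db_dir" -maxdepth 1 \
            \( -name 'ko_list*' -o -name 'profiles*' \) \
            -mtime +"$max_age_days" -print 2>/dev/null | head -n 1)
    [ -n "$old" ]
}

# Example use (path is a placeholder):
#   if kofam_db_is_stale /path/to/kofam/databases; then
#       echo "KOfam files look older than a month; consider deleting them before the next run"
#   fi
```

Note that a directory's modification time only changes when entries are added or removed, so the check is most reliable for the `ko_list` file and the downloaded archives.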

If your species of interest has been annotated by the KEGG project you can provide this tool with the corresponding KEGG species code to pull those annotations directly. If your species of interest is not listed you should choose a closely related species and use that code. KEGG species codes can be found here: https://www.genome.jp/brite/br08611
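Whether a species code exists can also be checked from the command line. The sketch below assumes the KEGG REST organism listing (`https://rest.kegg.jp/list/organism`) is reachable and that it is tab-separated with the species code in the second column; the function name `kegg_code_listed` is hypothetical and is not part of Pathannotator.

```shell
#!/bin/bash
# Hypothetical helper (not part of Pathannotator).
# Reads a KEGG organism listing on stdin and succeeds if the second
# tab-separated column exactly matches the given species code.
kegg_code_listed() {
    awk -F'\t' -v code="$1" '$2 == code { found = 1 } END { exit !found }'
}

# Example use (requires network access):
#   curl -s https://rest.kegg.jp/list/organism | kegg_code_listed tca \
#       && echo "tca is a KEGG species code"
```

If the code is not listed, pick a closely related species from the same listing, as described above.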

Help and Usage Statement

On the command line the following help statement can be displayed with 'help'.

Help and Usage:
    There are 4 positional arguments.
    1: KEGG species code (NA or related species code if species not in KEGG; 'help' to see this help and usage statement)
       KEGG species codes can be found here: https://www.genome.jp/brite/br08611
    2: input file (protein FASTA without header lines)
    3: output directory (must be an existing directory; the file path should be  relative to, and inside of, your working directory)
    4: 'FB' for flybase annotations, 'NA' for none

    KofamScan is used under an MIT License:

    Copyright (c) 2019 Takuya Aramaki

    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE."

Benchmarking

The amount of time it takes to run this tool will vary greatly depending on several factors:

  1. the number of sequences in your protein FASTA input file
  2. whether your species has a KEGG species code
  3. whether your input IDs are NCBI RefSeq protein IDs
  4. whether you request Flybase annotation in addition to KEGG
  5. how many CPUs you have available to run the analysis

Please consider your options carefully as they can impact run times significantly.

The tables below show example runs covering the circumstances described above. These examples were run on the SciNet Ceres HPC system using Apptainer.

| Number of input sequences | KEGG species code | NCBI RefSeq protein IDs | Include Flybase annotation | Number of CPUs available | Time to run (minutes) |
|---|---|---|---|---|---|
| 20,571 | related species | yes | yes | 2 | > 1,663 |
| 20,571 | related species | yes | no | 2 | 944 |
| 20,571 | related species | yes | yes | 12 | 477 |
| 20,571 | related species | yes | no | 12 | 234 |
| 20,571 | related species | yes | yes | 48 | 380 |
| 20,571 | related species | yes | no | 48 | 32 |
| 20,571 | related species | yes | yes | 96 | 344 |
| 20,571 | related species | yes | no | 96 | 24 |

| Number of input sequences | KEGG species code | NCBI RefSeq protein IDs | Include Flybase annotation | Number of CPUs available | Time to run (minutes) |
|---|---|---|---|---|---|
| 22,272 | same species | yes | yes | 2 | 1,174 |
| 22,272 | same species | yes | no | 2 | < 1 |
| 22,272 | same species | yes | yes | 12 | 409 |
| 22,272 | same species | yes | no | 12 | < 1 |
| 22,272 | same species | yes | yes | 48 | 365 |
| 22,272 | same species | yes | no | 48 | < 1 |
| 22,272 | same species | yes | yes | 96 | 358 |
| 22,272 | same species | yes | no | 96 | < 1 |

| Number of input sequences | KEGG species code | NCBI RefSeq protein IDs | Include Flybase annotation | Number of CPUs available | Time to run (minutes) |
|---|---|---|---|---|---|
| 18,330 | related species | no | yes | 2 | 974 |
| 18,330 | related species | no | no | 2 | 494 |
| 18,330 | related species | no | yes | 12 | 273 |
| 18,330 | related species | no | no | 12 | 206 |
| 18,330 | related species | no | yes | 48 | 220 |
| 18,330 | related species | no | no | 48 | 47 |
| 18,330 | related species | no | yes | 96 | 179 |
| 18,330 | related species | no | no | 96 | 18 |
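To compare how much extra CPUs helped in these runs, divide the run time at the lower CPU count by the run time at the higher one. The sketch below does this arithmetic for two rows of the first benchmark (20,571 sequences, related species, RefSeq IDs), going from 12 to 96 CPUs; the `speedup` function is illustrative only.

```shell
#!/bin/bash
# Speedup = run time at the baseline CPU count / run time at the larger CPU count.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

speedup 234 24    # without Flybase, 12 vs 96 CPUs: prints 9.75
speedup 477 344   # with Flybase, 12 vs 96 CPUs: prints 1.39
```

In these particular runs, the KEGG-only analysis became almost ten times faster with eight times the CPUs, while the run that included Flybase annotation gained much less. These are single measurements on one system, so treat them as a rough guide rather than a prediction.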

Pathannotator on the Command Line

Container Technologies

Pathannotator is provided as a Docker container.

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

There are two major containerization technologies: Docker and Apptainer (Singularity).

Docker containers can be run with either technology.

Running Pathannotator using Docker

About Docker

  • Docker must be installed on the computer you wish to use for your analysis.
  • To run Docker you must have ‘root’ (admin) permissions (or use sudo).
  • Docker will run all containers as ‘root’. This makes Docker incompatible with HPC systems (see Apptainer/Singularity below).
  • Docker can be run on your local computer, a server, a cloud virtual machine etc.
  • For more information on installing Docker on other systems: Installing Docker.

Getting the Pathannotator container

The Pathannotator tool is available as a Docker container on Docker Hub: Pathannotator container

The container can be pulled with this command:

docker pull agbase/pathannotator:1.0

Remember

You must have root permissions or use sudo, like so:

sudo docker pull agbase/pathannotator:1.0

Getting the Help and Usage Statement
sudo docker run --rm agbase/pathannotator:1.0 help

TIP:

The /workdir directory is built into this container and should be used to mount your working directory.

The /data directory is built into this container and should be used to mount the KofamScan database files.

Example Command
sudo docker run \
--rm \
-v /path/to/your/input/files:/workdir \
-v /path/to/kofam/databases/:/data \
agbase/pathannotator:1.0 \
tca \
GCF_031307605.1_icTriCast1.1_protein.faa \
out_dir \
FB
Command Explained

sudo docker run: tells docker to run

--rm: removes the container when the analysis has finished. The image will remain for future use.

-v /path/to/your/input/files:/workdir: mounts the working directory on the host machine to '/workdir' inside the container

-v /path/to/kofam/databases/:/data: mounts the directory with the Kofam database files (or where you want them to be stored) on the host machine to '/data' inside the container

agbase/pathannotator:1.0: the name of the Docker image to use

Tip

All the options supplied after the image name are Pathannotator options

tca: KEGG species code for Tribolium castaneum. Can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code, choose a closely related species.

GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines).

out_dir: Directory where you want the pipeline outputs to go. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.

FB: FB indicates that we want to get Flybase pathways annotations in addition to KEGG annotations.
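Because the output directory must already exist and the last argument must be 'FB' or 'NA', it can save time to check the arguments before starting a long run. The following is a minimal sketch, not part of Pathannotator: the function name `check_pathannotator_args` is hypothetical, and it assumes the input FASTA file sits inside the host directory that will be mounted to '/workdir'.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of Pathannotator) mirroring the four
# positional arguments. WORKDIR is the host directory mounted to /workdir.
check_pathannotator_args() {
    local workdir="$1" code="$2" fasta="$3" outdir="$4" fb="$5"
    [ -n "$code" ] || { echo "missing KEGG species code (use NA if none)" >&2; return 1; }
    # Assumption: the input file lives inside the working directory.
    [ -f "$workdir/$fasta" ] || { echo "input file not found: $workdir/$fasta" >&2; return 1; }
    # The output directory must exist before the pipeline runs.
    [ -d "$workdir/$outdir" ] || { echo "output directory does not exist: $workdir/$outdir" >&2; return 1; }
    case "$fb" in
        FB|NA) ;;
        *) echo "fourth argument must be FB or NA" >&2; return 1 ;;
    esac
}

# Example use (paths are placeholders):
#   check_pathannotator_args /path/to/your/input/files tca \
#       GCF_031307605.1_icTriCast1.1_protein.faa out_dir FB
```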

For a description of the output files, see Understanding Your Results.

Running Pathannotator using Apptainer (formerly Singularity)

About Apptainer

  • does not require ‘root’ permissions
  • runs all containers as the user that is logged into the host machine
  • HPC systems are likely to have Apptainer installed and are unlikely to object if asked to install it (no guarantees).
  • can be run on any machine where it is installed
  • more information about installing Apptainer
  • This tool was tested using Apptainer 1.3.1

HPC Job Schedulers

Although Apptainer can be installed on any computer this documentation assumes it will be run on an HPC system. The tool was tested on a Slurm system and the job submission scripts below reflect that. Submission scripts will need to be modified for use with other job scheduler systems.

Getting the Pathannotator container

The Pathannotator tool is available as a Docker container on Docker Hub: Pathannotator container

Example Slurm script:

#!/bin/bash
#SBATCH --job-name=pathannot
#SBATCH --ntasks=8
#SBATCH --time=2:00:00
#SBATCH --partition=short
#SBATCH --account=nal_genomics

module load apptainer

cd /location/where/you/want/to/save/image/file

apptainer pull docker://agbase/pathannotator:1.0

Running Pathannotator with Data

Tip

The /workdir directory is built into this container and should be used to mount your local working directory.

The /data directory is built into this container and should be used to mount the KOfam database files.

Example Slurm Script
#!/bin/bash
#SBATCH --job-name=pathannot
#SBATCH --ntasks=8
#SBATCH --time=2:00:00
#SBATCH --partition=short
#SBATCH --account=nal_genomics

module load apptainer

cd /directory/you/want/to/work/in

apptainer run \
-B /directory/you/want/to/work/in:/workdir \
-B /directory/with/kofam/database/files:/data \
/path/with/image/file/pathannotator_1.0.sif \
tca \
GCF_031307605.1_icTriCast1.1_protein.faa \
out_dir \
FB
Command Explained

apptainer run: tells Apptainer to run

-B /directory/you/want/to/work/in:/workdir: mounts the working directory on the host machine to '/workdir' in the container

-B /directory/with/kofam/database/files:/data: mounts the directory with the Kofam database files (or where you want them stored) on the host machine to '/data' in the container

/path/with/image/file/pathannotator_1.0.sif: the name of the Apptainer image to use

Tip

All the options supplied after the image name are Pathannotator options

tca: KEGG species code for Tribolium castaneum. Can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code, choose a closely related species.

GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines)

out_dir: Directory where you want the outputs of the pipeline to be stored. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.

FB: FB indicates that you want Flybase pathways annotations in addition to KEGG annotations

For a description of the output files, see Understanding Your Results.

Understanding Your Results

The output files you can expect will differ depending on the circumstances of your run. If you are using the KEGG code for your species of interest and your FASTA protein identifiers are NCBI protein IDs then your annotations will be pulled directly from the KEGG API. In other circumstances (detailed below) KofamScan will be run to identify homologs and transfer annotations to your species of interest. Under all circumstances you may specify whether or not you want to receive Flybase pathways annotations as well.

Same-species KEGG code and NCBI RefSeq protein IDs

Expected output files:
  • tca_KEGG_species.tsv: These are KEGG's annotations of the NCBI RefSeq proteins to the species-specific KEGG pathways. The filename will begin with the KEGG species code. The pathway identifiers will also begin with the KEGG species code. Note that for species-specific pathways, KEGG internally filters associations between the KO (KEGG Orthology) accession and the reference pathway.

    | KEGG_genes_ID | Input_protein_ID | KEGG_KO | KEGG_tca_pathway | KEGG_tca_pathway_name |
    |---|---|---|---|---|
    | 100141520 | XP_001813251 | K01540 | tca04820 | Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle) |
    | 100141523 | XP_001812480 | K02268 | tca00190 | Oxidative phosphorylation - Tribolium castaneum (red flour beetle) |
    | 100141526 | XP_008195997 | K04676 | tca04350 | TGF-beta signaling pathway - Tribolium castaneum (red flour beetle) |

  • tca_KEGG_ref.tsv: These are KEGG's annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'. You should expect more pathway annotations per protein than for the species-specific pathway.

    | KEGG_genes_ID | Input_protein_ID | KEGG_KO | KEGG_ref_pathway | KEGG_ref_pathway_name |
    |---|---|---|---|---|
    | 100141516 | XP_015835225 | K26207 | map04024 | cAMP signaling pathway |
    | 100141516 | XP_015835225 | K26207 | map04261 | Adrenergic signaling in cardiomyocytes |
    | 100141520 | XP_001813251 | K01540 | map04022 | cGMP-PKG signaling pathway |

  • HMM_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.

    | KEGG_genes_ID | Input_protein_ID | Flybase_protein_ID | Flybase_pathway_ID | Flybase_pathway_name |
    |---|---|---|---|---|
    | CG9885 | NP_001034540.1 | FBpp0077451 | FBgg0001085 | BMP Signaling Pathway Core Components |
    | CG10002 | NP_001034503.2 | FBpp0084690 | FBgg0000904 | Insulin-like Receptor Signaling Pathway Core Components |
    | CG2666 | NP_001034492.1 | FBpp0078442 | FBgg0002045 | CHITIN BIOSYNTHESIS |

  • dme_flybase.tsv: This is an alternative to 'HMM_flybase.tsv' if you used the 'FB' option for Flybase pathways annotations AND your species code was 'dme' (Drosophila melanogaster).
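To see how many pathways each protein received in tca_KEGG_species.tsv versus tca_KEGG_ref.tsv, the files can be summarized on the command line. The sketch below assumes each file is tab-separated with a single header row, the input protein ID in column 2 and the pathway ID in column 4, as in the excerpts above; the function name `pathways_per_protein` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of Pathannotator).
# Prints "protein<TAB>count" with the number of distinct pathway IDs per
# input protein. Assumes a tab-separated file with one header row,
# Input_protein_ID in column 2 and the pathway ID in column 4.
pathways_per_protein() {
    awk -F'\t' '
        NR > 1 && !seen[$2 FS $4]++ { n[$2]++ }
        END { for (p in n) print p "\t" n[p] }
    ' "$1" | LC_ALL=C sort
}

# Example use:
#   pathways_per_protein tca_KEGG_species.tsv
#   pathways_per_protein tca_KEGG_ref.tsv
```

Running it on both files and comparing the counts for the same protein shows where the reference pathways add annotations beyond the species-specific ones.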

KEGG code for a related species

Expected output files:
  • kofam_result_full.txt: This is the full output from KofamScan. According to KEGG: "K number assignments with scores above the predefined thresholds for individual KOs are more reliable than other proposed assignments. Such high score assignments are highlighted with asterisks '*' in the output." Pathways annotations have not yet been identified.

    | # gene name | KO | thrshld | score | E-value | KO definition |
    |---|---|---|---|---|---|
    | NP_001034280.2 | K10180 | 417.47 | 374.4 | 1.2e-113 | T-box protein 6 |
    | NP_001034280.2 | K10177 | 886.07 | 309.5 | 7.2e-94 | T-box protein 3 |
    | NP_001034280.2 | K10176 | 750.77 | 300.4 | 4.6e-91 | T-box protein 2 |

  • tca_KEGG_species.tsv: These are annotations to the species-specific KEGG pathways. The pathway identifiers will begin with the KEGG species code.

    | KEGG_genes_ID | Input_protein_ID | KEGG_KO | KEGG_tca_pathway | KEGG_tca_pathway_name |
    |---|---|---|---|---|
    | 100141520 | XP_001813251 | K01540 | tca04820 | Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle) |
    | 100141523 | XP_001812480 | K02268 | tca00190 | Oxidative phosphorylation - Tribolium castaneum (red flour beetle) |
    | 100141526 | XP_008195997 | K04676 | tca04350 | TGF-beta signaling pathway - Tribolium castaneum (red flour beetle) |

  • tca_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.

    | KEGG_genes_ID | Input_protein_ID | KEGG_KO | KEGG_ref_pathway | KEGG_ref_pathway_name |
    |---|---|---|---|---|
    | 100141516 | XP_015835225 | K26207 | map04024 | cAMP signaling pathway |
    | 100141516 | XP_015835225 | K26207 | map04261 | Adrenergic signaling in cardiomyocytes |
    | 100141520 | XP_001813251 | K01540 | map04022 | cGMP-PKG signaling pathway |

  • HMM_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.

    | KEGG_genes_ID | Input_protein_ID | Flybase_protein_ID | Flybase_pathway_ID | Flybase_pathway_name |
    |---|---|---|---|---|
    | CG9885 | NP_001034540.1 | FBpp0077451 | FBgg0001085 | BMP Signaling Pathway Core Components |
    | CG10002 | NP_001034503.2 | FBpp0084690 | FBgg0000904 | Insulin-like Receptor Signaling Pathway Core Components |
    | CG2666 | NP_001034492.1 | FBpp0078442 | FBgg0002045 | CHITIN BIOSYNTHESIS |

  • dme_flybase.tsv: This is an alternative to 'HMM_flybase.tsv' if you used the 'FB' option for Flybase pathways annotations AND your species code was 'dme' (Drosophila melanogaster).
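The more reliable assignments in kofam_result_full.txt are those whose score meets or exceeds the KO-specific threshold. The sketch below filters for them by comparing the two numbers directly. It assumes tab-separated columns in the order shown in the excerpt above (gene name, KO, thrshld, score, E-value, KO definition); the actual file may instead be space-aligned or carry a leading '*' column, in which case the column numbers need adjusting. The function name `kofam_above_threshold` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of Pathannotator or KofamScan).
# Prints rows whose score (column 4) is at least the threshold (column 3).
# Comment lines starting with '#' and rows without a numeric threshold are skipped.
kofam_above_threshold() {
    awk -F'\t' '
        $1 !~ /^#/ && $3 ~ /^[0-9.]+$/ && ($4 + 0) >= ($3 + 0)
    ' "$1"
}

# Example use:
#   kofam_above_threshold kofam_result_full.txt > kofam_result_significant.tsv
```

With the three example rows shown for kofam_result_full.txt, every score is below its threshold, so this filter would print nothing for that protein.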

'NA' as KEGG code

Expected output files:

If you did not specify a KEGG species code (used 'NA') then no species-specific annotations file will be generated.

  • kofam_result_full.txt: This is the full output from KofamScan. According to KEGG: "K number assignments with scores above the predefined thresholds for individual KOs are more reliable than other proposed assignments. Such high score assignments are highlighted with asterisks '*' in the output." Pathways annotations have not yet been identified.

    | # gene name | KO | thrshld | score | E-value | KO definition |
    |---|---|---|---|---|---|
    | NP_001034280.2 | K10180 | 417.47 | 374.4 | 1.2e-113 | T-box protein 6 |
    | NP_001034280.2 | K10177 | 886.07 | 309.5 | 7.2e-94 | T-box protein 3 |
    | NP_001034280.2 | K10176 | 750.77 | 300.4 | 4.6e-91 | T-box protein 2 |

  • NA_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.

    | KEGG_genes_ID | Input_protein_ID | KEGG_KO | KEGG_ref_pathway | KEGG_ref_pathway_name |
    |---|---|---|---|---|
    | NA | NP_001034489 | K16672 | map04391 | Hippo signaling pathway - fly |
    | NA | NP_001034490 | K04491 | map04310 | Wnt signaling pathway |
    | NA | NP_001034490 | K04491 | map04390 | Hippo signaling pathway |

  • HMM_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.

    | KEGG_genes_ID | Input_protein_ID | Flybase_protein_ID | Flybase_pathway_ID | Flybase_pathway_name |
    |---|---|---|---|---|
    | CG9885 | NP_001034540.1 | FBpp0077451 | FBgg0001085 | BMP Signaling Pathway Core Components |
    | CG10002 | NP_001034503.2 | FBpp0084690 | FBgg0000904 | Insulin-like Receptor Signaling Pathway Core Components |
    | CG2666 | NP_001034492.1 | FBpp0078442 | FBgg0002045 | CHITIN BIOSYNTHESIS |
    | CG2666 | NP_001034491.1 | FBpp0290640 | FBgg0002045 | CHITIN BIOSYNTHESIS |
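Several input proteins can map to the same Flybase pathway, as with the two CHITIN BIOSYNTHESIS rows in HMM_flybase.tsv above. A quick way to summarize this is to count distinct input proteins per pathway. The sketch below assumes HMM_flybase.tsv is tab-separated with one header row and the column order shown above; the function name `proteins_per_flybase_pathway` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of Pathannotator).
# Prints "pathway_ID<TAB>pathway_name<TAB>count", where count is the number of
# distinct input proteins (column 2) annotated to each Flybase pathway (column 4).
proteins_per_flybase_pathway() {
    awk -F'\t' '
        NR > 1 && !seen[$4 FS $2]++ { n[$4]++; name[$4] = $5 }
        END { for (p in n) print p "\t" name[p] "\t" n[p] }
    ' "$1" | LC_ALL=C sort
}

# Example use:
#   proteins_per_flybase_pathway HMM_flybase.tsv
```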

Contact us
