AgBase/pathannotator

Intro

Where to Find Pathannotator

Pathannotator is provided as a Docker container for use on the command line.

The Dockerfile and scripts are available on GitHub.

Getting the KOfam Databases

The KOfam 'profiles' and 'ko_list' databases are required to run the pipeline. If you don't already have these databases the pipeline will pull them during the first run. If you want to download them beforehand they are available from the KEGG website.

Using wget:

wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz

wget https://www.genome.jp/ftp/db/kofam/ko_list.gz

TIP: KEGG updates their annotations approximately once a month. If your profiles and ko_list files are older than this then your annotations will not be up-to-date. Just delete them and the new versions will be downloaded with your first annotation run.
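
The age check described in the tip can be sketched as a small shell function. The function name `kofam_db_is_stale`, the 30-day default, and the assumption that the database files sit at the top level of one directory with names starting with `profiles` or `ko_list` are illustrative and not taken from the pipeline itself.

```shell
#!/bin/bash
# Hypothetical helper (not part of Pathannotator): succeeds (exit 0) if any
# top-level entry named profiles* or ko_list* in DB_DIR was last modified
# more than MAX_DAYS days ago. File names and the default are assumptions.
kofam_db_is_stale() {
    local db_dir="$1"
    local max_days="${2:-30}"
    local old
    old=$(find "$db_dir" -mindepth 1 -maxdepth 1 \
        \( -name 'profiles*' -o -name 'ko_list*' \) \
        -mtime +"$max_days" -print 2>/dev/null | head -n 1)
    # Non-empty result means at least one stale entry was found.
    [ -n "$old" ]
}

# Example use (commented out so nothing runs when this file is sourced):
# if kofam_db_is_stale /path/to/kofam/databases; then
#     echo "KOfam files look older than a month; consider deleting them."
# fi
```

The function only reports staleness; deleting the files is left to you so that nothing is removed by accident.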

If your species of interest has been annotated by the KEGG project you can provide this tool with the corresponding KEGG species code to pull those annotations directly. If your species of interest is not listed you should choose a closely related species and use that code. KEGG species codes can be found here: https://www.genome.jp/brite/br08611

Help and Usage Statement

Passing 'help' as the first argument displays the following help statement on the command line.

Help and Usage:
    There are 4 positional arguments.
    1: KEGG species code (NA or related species code if species not in KEGG; 'help' to see this help and usage statement)
       KEGG species codes can be found here: https://www.genome.jp/brite/br08611
    2: input file (protein FASTA without header lines)
    3: output directory (must be an existing directory; the file path should be  relative to, and inside of, your working directory)
    4: 'FB' for flybase annotations, 'NA' for none

    KofamScan is used under an MIT License:

    Copyright (c) 2019 Takuya Aramaki

    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.

    Flybase annotation is carried out using OrthoFinder <https://github.com/davidemms/OrthoFinder?>`_.
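
A pre-flight check of the four positional arguments can catch simple mistakes before a long run starts. The sketch below is an assumption-laden illustration: `check_pathannotator_args` is a hypothetical name, and it only checks the argument count, that the first two arguments are non-empty, and that the fourth is 'FB' or 'NA'. It does not contact KEGG or inspect the FASTA file.

```shell
#!/bin/bash
# Hypothetical pre-flight check mirroring the four positional arguments
# from the help statement. Prints a message to stderr and returns non-zero
# on the first problem it finds.
check_pathannotator_args() {
    if [ "$#" -ne 4 ]; then
        echo "expected 4 arguments, got $#" >&2
        return 1
    fi
    local species="$1"
    local fasta="$2"
    local flybase="$4"
    if [ -z "$species" ]; then
        echo "species code must not be empty (use NA if none applies)" >&2
        return 1
    fi
    if [ -z "$fasta" ]; then
        echo "input file must not be empty" >&2
        return 1
    fi
    case "$flybase" in
        FB|NA) ;;
        *)
            echo "fourth argument must be FB or NA, got: $flybase" >&2
            return 1
            ;;
    esac
    return 0
}
```

For example, `check_pathannotator_args tca GCF_031307605.1_icTriCast1.1_protein.faa out_dir FB` succeeds, while omitting the fourth argument fails.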

Benchmarking

The amount of time it takes to run this tool will vary greatly depending on several factors:

  1. the number of sequences in your protein FASTA input file
  2. whether your species has a KEGG species code
  3. whether your input IDs are NCBI RefSeq protein IDs
  4. whether you request Flybase annotation in addition to KEGG
  5. how many CPUs you have available to run the analysis

The tables below show example run times for the circumstances described above. These examples were run on the SciNet Atlas HPC system using Apptainer.

Number of input sequences  KEGG species code  NCBI RefSeq protein IDs  Include Flybase annotation  Number of CPUs  Time to run (minutes)
20,571                     related species    yes                      yes                         2               1047
20,571                     related species    yes                      no                          2               533
20,571                     related species    yes                      yes                         12              192
20,571                     related species    yes                      no                          12              86
20,571                     related species    yes                      yes                         48              95
20,571                     related species    yes                      no                          48              25

Number of input sequences  KEGG species code  NCBI RefSeq protein IDs  Include Flybase annotation  Number of CPUs  Time to run (minutes)
22,272                     same species       yes                      yes                         2               428
22,272                     same species       yes                      no                          2               < 1
22,272                     same species       yes                      yes                         12              96
22,272                     same species       yes                      no                          12              < 1
22,272                     same species       yes                      yes                         48              62
22,272                     same species       yes                      no                          48              < 1

Number of input sequences  KEGG species code  NCBI RefSeq protein IDs  Include Flybase annotation  Number of CPUs  Time to run (minutes)
18,330                     related species    no                       yes                         2               > 12 hrs
18,330                     related species    no                       no                          2               199
18,330                     related species    no                       yes                         12              142
18,330                     related species    no                       no                          12              32
18,330                     related species    no                       yes                         48              48
18,330                     related species    no                       no                          48              10
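
To compare rows of the benchmark it can help to express them as a speedup (ratio of run times) and a parallel efficiency (speedup divided by the ratio of CPU counts). The two helper functions below are an illustrative sketch; their names are hypothetical and they simply do the arithmetic with awk.

```shell
#!/bin/bash
# speedup T_BASE T_NEW
# Ratio of two run times, printed with two decimals.
speedup() {
    awk -v t_base="$1" -v t_new="$2" \
        'BEGIN { printf "%.2f\n", t_base / t_new }'
}

# parallel_efficiency T_BASE CPUS_BASE T_NEW CPUS_NEW
# Speedup divided by the factor by which the CPU count grew.
parallel_efficiency() {
    awk -v t_base="$1" -v c_base="$2" -v t_new="$3" -v c_new="$4" \
        'BEGIN { printf "%.2f\n", (t_base / t_new) / (c_new / c_base) }'
}
```

For the 20,571-sequence runs, `speedup 1047 95` prints 11.02 (with Flybase annotation) and `speedup 533 25` prints 21.32 (without), so going from 2 to 48 CPUs scales noticeably better when Flybase annotation is not requested. The corresponding efficiencies are `parallel_efficiency 1047 2 95 48` (0.46) and `parallel_efficiency 533 2 25 48` (0.89).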

Pathannotator on the Command Line

Container Technologies

Pathannotator is provided as a Docker container.

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

There are two major containerization technologies: Docker and Apptainer (Singularity).

Docker containers can be run with either technology.

Running Pathannotator using Docker

About Docker

  • Docker must be installed on the computer you wish to use for your analysis.
  • To run Docker you must have ‘root’ (admin) permissions (or use sudo).
  • Docker will run all containers as ‘root’. This makes Docker incompatible with HPC systems (see Apptainer/Singularity below).
  • Docker can be run on your local computer, a server, a cloud virtual machine etc.
  • For more information on installing Docker on other systems: Installing Docker.

Getting the Pathannotator container

The Pathannotator tool is available as a Docker container on Docker Hub: Pathannotator container

The container can be pulled with this command:

docker pull agbase/pathannotator:3.0

Remember

You must have root permissions or use sudo, like so:

sudo docker pull agbase/pathannotator:3.0

Getting the Help and Usage Statement

sudo docker run --rm agbase/pathannotator:3.0 help

TIP:

The /workdir directory is built into this container and should be used to mount your working directory.

The /data directory is built into this container and should be used to mount the KofamScan database files.

Example Command

sudo docker run \
--rm \
-v /path/to/your/input/files:/workdir \
-v /path/to/kofam/databases/:/data \
agbase/pathannotator:3.0 \
tca \
GCF_031307605.1_icTriCast1.1_protein.faa \
out_dir \
FB

Command Explained

sudo docker run: tells Docker to create and start a container from the image

--rm: removes the container when the analysis has finished. The image will remain for future use.

-v /path/to/your/input/files:/workdir: mounts the working directory on the host machine to '/workdir' inside the container

-v /path/to/kofam/databases/:/data: mounts the directory with the KOfam database files (or where you want them to be stored) on the host machine to '/data' inside the container

agbase/pathannotator:3.0: the name of the Docker image to use

Tip

All the options supplied after the image name are Pathannotator options

tca: KEGG species code for Tribolium castaneum. Can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code choose a closely related species or 'NA'.

GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines).

out_dir: Directory where you want the pipeline outputs to go. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.

FB: FB indicates that we want to get Flybase pathways annotations in addition to KEGG annotations.
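
Because the output directory must already exist and must sit inside the working directory, it can be created before the run. The helper below is a hypothetical sketch (the name `prepare_out_dir` is not part of Pathannotator); it refuses absolute paths and paths containing '..' so the directory stays inside the directory that is mounted at '/workdir'.

```shell
#!/bin/bash
# Hypothetical helper: create OUT_DIR inside WORK_DIR before the run.
# Rejects empty values, absolute paths, and any path with a '..' component.
prepare_out_dir() {
    local work_dir="$1"
    local out_dir="$2"
    case "$out_dir" in
        ""|/*|..|../*|*/..|*/../*)
            echo "out_dir must be a relative path inside the working directory" >&2
            return 1
            ;;
    esac
    mkdir -p "$work_dir/$out_dir"
}
```

With the example command above, `prepare_out_dir /path/to/your/input/files out_dir` would be called once before `docker run`.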

For details on the output files, see Understanding Your Results.

Running Pathannotator using Apptainer (formerly Singularity)

About Apptainer

  • does not require ‘root’ permissions
  • runs all containers as the user that is logged into the host machine
  • HPC systems are likely to have Apptainer installed; if yours does not, the administrators are unlikely to object to installing it (no guarantees).
  • can be run on any machine where it is installed
  • more information about installing Apptainer
  • This tool was tested using Apptainer 1.3.1

HPC Job Schedulers

Although Apptainer can be installed on any computer this documentation assumes it will be run on an HPC system. The tool was tested on a Slurm system and the job submission scripts below reflect that. Submission scripts will need to be modified for use with other job scheduler systems.

Getting the Pathannotator container

The Pathannotator tool is available as a Docker container on Docker Hub: Pathannotator container

Example Slurm script:

#!/bin/bash
#SBATCH --job-name=pathannot
#SBATCH --ntasks=8
#SBATCH --time=2:00:00
#SBATCH --partition=ceres
#SBATCH --account=nal_genomics

module load apptainer

cd /location/where/you/want/to/save/image/file

apptainer pull docker://agbase/pathannotator:3.0
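
The pull step writes an image file into the current directory; the run script later in this document refers to it as 'pathannotator_3.0.sif'. The sketch below derives that file name from the docker:// URI, assuming Apptainer's default naming of image name, underscore, tag, and '.sif'. The function name `sif_name_for` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the file name that `apptainer pull` is assumed
# to write by default for a docker:// URI (name_tag.sif).
sif_name_for() {
    local uri="$1"
    # Keep only the last path component, e.g. pathannotator:3.0
    local name="${uri##*/}"
    # With no explicit tag, assume the 'latest' tag is used.
    case "$name" in
        *:*) ;;
        *) name="$name:latest" ;;
    esac
    echo "${name%%:*}_${name#*:}.sif"
}
```

For example, `sif_name_for docker://agbase/pathannotator:3.0` prints pathannotator_3.0.sif, which is the file name to use in the run script.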

Running Pathannotator with Data

Tip

The /workdir directory is built into this container and should be used to mount your local working directory.

The /data directory is built into this container and should be used to mount the KOfam database files.

Example Slurm Script

#!/bin/bash
#SBATCH --job-name=pathannot
#SBATCH --ntasks=48
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --partition=ceres
#SBATCH --account=nal_genomics

module load apptainer

cd /directory/you/want/to/work/in

apptainer run \
-B /directory/you/want/to/work/in:/workdir \
-B /directory/with/kofam/database/files:/data \
/path/with/image/file/pathannotator_3.0.sif \
tca \
GCF_031307605.1_icTriCast1.1_protein.faa \
out_dir \
FB

Command Explained

apptainer run: tells Apptainer to run the container image

-B /directory/you/want/to/work/in:/workdir: mounts the working directory on the host machine to '/workdir' in the container

-B /directory/with/kofam/database/files:/data: mounts the directory with the KOfam database files (or where you want them stored) on the host machine to '/data' in the container

/path/with/image/file/pathannotator_3.0.sif: the path to the Apptainer image file to use

Tip

All the options supplied after the image name are Pathannotator options

tca: KEGG species code for Tribolium castaneum. Can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code choose a closely related species or 'NA'.

GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines)

out_dir: Directory where you want the outputs of the pipeline to be stored. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.

FB: FB indicates that you want Flybase pathways annotations in addition to KEGG annotations

For details on the output files, see Understanding Your Results.

Understanding Your Results

The output files you can expect will differ depending on the circumstances of your run. If you are using the KEGG code for your species of interest and your FASTA protein identifiers are NCBI protein IDs then your annotations will be pulled directly from the KEGG API. In other circumstances (detailed below) KofamScan will be run to identify homologs and transfer annotations to your species of interest. Under all circumstances you may specify whether or not you want to receive Flybase pathways annotations as well.

Same-species KEGG code and NCBI RefSeq protein IDs

Expected output files:
  • tca_KEGG_species.tsv: These are KEGG's annotations of the NCBI-RefSeq proteins to the species-specific KEGG pathways. The filename will begin with the KEGG species code. The pathway identifiers will begin with the KEGG species code. Note that for species-specific pathways, KEGG internally filters associations between the KO (KEGG Orthology) accession and the reference pathway.

    Input_protein_ID  KEGG_KO  KEGG_tca_pathway  KEGG_tca_pathway_name
    XP_001813251.1    K01540   tca04820          Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle)
    XP_001812480.1    K02268   tca00190          Oxidative phosphorylation - Tribolium castaneum (red flour beetle)
    XP_008195997.1    K04676   tca04350          TGF-beta signaling pathway - Tribolium castaneum (red flour beetle)

  • tca_KEGG_ref.tsv: These are KEGG's annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'. You should expect more pathway annotations per protein than for the species-specific pathways.

    Input_protein_ID  KEGG_KO  KEGG_ref_pathway  KEGG_ref_pathway_name
    XP_015835225.1    K26207   map04024          cAMP signaling pathway
    XP_015835225.1    K26207   map04261          Adrenergic signaling in cardiomyocytes
    XP_001813251.1    K01540   map04022          cGMP-PKG signaling pathway

  • tca_acc_pathways.tsv: This file contains the aggregation of all pathway annotations for each input identifier.

    Input_protein_ID  pathway
    NP_001034488.1    KEGG:map04013,KEGG:tca04013,Flybase:FBgg0000956,Flybase:FBgg0000950
    NP_001034489.1    KEGG:map04391,KEGG:tca04391

  • tca_pathways_acc.tsv: This file contains the aggregation of all input identifiers annotated to each of the pathways.

    pathway              Input_protein_ID
    Flybase:FBgg0000881  XP_008196394.1,XP_008194025.1,XP_001807060.1,XP_015839080.1
    KEGG:tca03273        XP_008192998.2,XP_009105448.1,XP_008196990.1

  • OrthoFinder_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.

    Input_protein_ID  Flybase_protein_ID  Flybase_pathway_ID  Flybase_pathway_name
    NP_001034540.1    FBpp0077451         FBgg0001085         BMP Signaling Pathway Core Components
    NP_001034503.2    FBpp0084690         FBgg0000904         Insulin-like Receptor Signaling Pathway Core Components
    NP_001034492.1    FBpp0078442         FBgg0002045         CHITIN BIOSYNTHESIS

  • dme_flybase.tsv: This is an alternative to 'OrthoFinder_flybase.tsv' if you are annotating Drosophila melanogaster.

    Input_protein_ID  KEGG_KO  Flybase_pathway_ID  Flybase_pathway_name
    NP_001034490.1    K04491   FBgg0000890         Wnt-TCF Signaling Pathway Core Components
    NP_001034491.1    K00698   FBgg0002045         CHITIN BIOSYNTHESIS
    NP_001034491.1    K00698   FBgg0002045         CHITIN BIOSYNTHESIS
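
For downstream tools that expect one annotation per line, the aggregated '*_acc_pathways.tsv' file can be expanded into long format. The sketch below assumes the file is tab-delimited with a single header row and a comma-separated second column, as the example rows above suggest; the function name `expand_acc_pathways` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: expand a *_acc_pathways.tsv file
# (ID <tab> comma-separated pathways) into one "ID <tab> pathway" line
# per annotation. The first line is assumed to be a header and is skipped.
expand_acc_pathways() {
    awk -F'\t' 'NR > 1 && NF >= 2 {
        n = split($2, paths, ",")
        for (i = 1; i <= n; i++) print $1 "\t" paths[i]
    }' "$1"
}
```

Calling `expand_acc_pathways out_dir/tca_acc_pathways.tsv` would turn the NP_001034489.1 row shown above into two lines, one for KEGG:map04391 and one for KEGG:tca04391.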

KEGG code for a related species

Expected output files:
  • kofam_result_full.txt: This is the full output from KofamScan. According to KEGG: "K number assignments with scores above the predefined thresholds for individual KOs are more reliable than other proposed assignments. Such high score assignments are highlighted with asterisks '*' in the output." This file does not yet contain pathway annotations; those are assigned in the files listed below.

    # gene name     KO      thrshld  score  E-value   KO definition
    NP_001034280.2  K10180  417.47   374.4  1.2e-113  T-box protein 6
    NP_001034280.2  K10177  886.07   309.5  7.2e-94   T-box protein 3
    NP_001034280.2  K10176  750.77   300.4  4.6e-91   T-box protein 2

  • tca_KEGG_species.tsv: These are annotations to the species-specific KEGG pathways. The pathway identifiers will begin with the KEGG species code.

    Input_protein_ID  KEGG_KO  KEGG_tca_pathway  KEGG_tca_pathway_name
    XP_001813251.1    K01540   tca04820          Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle)
    XP_001812480.1    K02268   tca00190          Oxidative phosphorylation - Tribolium castaneum (red flour beetle)
    XP_008195997.1    K04676   tca04350          TGF-beta signaling pathway - Tribolium castaneum (red flour beetle)

  • tca_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.

    Input_protein_ID  KEGG_KO  KEGG_ref_pathway  KEGG_ref_pathway_name
    XP_015835225.1    K26207   map04024          cAMP signaling pathway
    XP_015835225.1    K26207   map04261          Adrenergic signaling in cardiomyocytes
    XP_001813251.1    K01540   map04022          cGMP-PKG signaling pathway

  • tca_acc_pathways.tsv: This file contains the aggregation of all pathway annotations for each input identifier.

  • tca_pathways_acc.tsv: This file contains the aggregation of all input identifiers annotated to each of the pathways.

    pathway              Input_protein_ID
    Flybase:FBgg0000918  XP_044254039.1,XP_044272825.1
    KEGG:tca00780        XP_044253000.1,XP_044261349.1,XP_044272235.1

  • OrthoFinder_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.

    Input_protein_ID  Flybase_protein_ID  Flybase_pathway_ID  Flybase_pathway_name
    NP_001034540.1    FBpp0077451         FBgg0001085         BMP Signaling Pathway Core Components
    NP_001034503.2    FBpp0084690         FBgg0000904         Insulin-like Receptor Signaling Pathway Core Components
    NP_001034492.1    FBpp0078442         FBgg0002045         CHITIN BIOSYNTHESIS
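
To keep only the KofamScan assignments that KEGG marks as reliable, the asterisk-flagged lines of 'kofam_result_full.txt' can be extracted. This is a sketch under an assumption: it treats the asterisk as the first whitespace-separated field on a line, followed by the gene name and the KO, which matches KofamScan's detailed output as described in the KEGG quote above but has not been checked against the exact file this pipeline writes. The function name `significant_kofam_hits` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "gene <tab> KO" for every line whose first
# whitespace-separated field is '*'. Header and comment lines start with
# '#', so they never match. The column positions are an assumption.
significant_kofam_hits() {
    awk '$1 == "*" { print $2 "\t" $3 }' "$1"
}
```

Lines without an asterisk, such as the three example rows above whose scores are below their thresholds, would be left out.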

'NA' as KEGG code

Expected output files:

If you did not specify a KEGG species code (used 'NA') then no species-specific annotations file will be generated.

  • kofam_result_full.txt: This is the full output from KofamScan. According to KEGG: "K number assignments with scores above the predefined thresholds for individual KOs are more reliable than other proposed assignments. Such high score assignments are highlighted with asterisks '*' in the output." This file does not yet contain pathway annotations; those are assigned in the files listed below.

    # gene name     KO      thrshld  score  E-value   KO definition
    NP_001034280.2  K10180  417.47   374.4  1.2e-113  T-box protein 6
    NP_001034280.2  K10177  886.07   309.5  7.2e-94   T-box protein 3
    NP_001034280.2  K10176  750.77   300.4  4.6e-91   T-box protein 2

  • NA_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.

    Input_protein_ID  KEGG_KO  KEGG_ref_pathway  KEGG_ref_pathway_name
    NP_001034489.1    K16672   map04391          Hippo signaling pathway - fly
    NP_001034490.1    K04491   map04310          Wnt signaling pathway
    NP_001034490.1    K04491   map04390          Hippo signaling pathway

  • tca_acc_pathways.tsv: This file contains the aggregation of all pathway annotations for each input identifier.

    Input_protein_ID  pathway
    NP_001034488.1    KEGG:map04013,KEGG:tca04013,Flybase:FBgg0000956,Flybase:FBgg0000950
    NP_001034489.1    KEGG:map04391,KEGG:tca04391

  • tca_pathways_acc.tsv: This file contains the aggregation of all input identifiers annotated to each of the pathways.

    pathway              Input_protein_ID
    Flybase:FBgg0000881  XP_008196394.1,XP_008194025.1,XP_001807060.1,XP_015839080.1
    KEGG:tca03273        XP_008192998.2,XP_009105448.1,XP_008196990.1

  • OrthoFinder_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.

    Input_protein_ID  Flybase_protein_ID  Flybase_pathway_ID  Flybase_pathway_name
    NP_001034540.1    FBpp0077451         FBgg0001085         BMP Signaling Pathway Core Components
    NP_001034503.2    FBpp0084690         FBgg0000904         Insulin-like Receptor Signaling Pathway Core Components
    NP_001034492.1    FBpp0078442         FBgg0002045         CHITIN BIOSYNTHESIS
    NP_001034491.1    FBpp0290640         FBgg0002045         CHITIN BIOSYNTHESIS
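
A quick way to see which pathways collected the most proteins is to count the identifiers per row of the '*_pathways_acc.tsv' file. The sketch below assumes a tab-delimited file with one header row and a comma-separated second column, as in the example rows above; `count_ids_per_pathway` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: for each pathway in a *_pathways_acc.tsv file, print
# "pathway <tab> number of input IDs", sorted with the largest count first.
# The first line is assumed to be a header and is skipped.
count_ids_per_pathway() {
    awk -F'\t' 'NR > 1 && NF >= 2 {
        n = split($2, ids, ",")
        print $1 "\t" n
    }' "$1" | sort -t "$(printf '\t')" -k2,2nr
}
```

Applied to the two example rows above, Flybase:FBgg0000881 would be listed first with 4 identifiers, followed by KEGG:tca03273 with 3.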

Contact us
