GitHub - AgBase/pathannotator: these are the scripts used to pull KEGG annotations using kofamscan and the KEGG API

Intro

Pathannotator annotates proteins with KEGG and Flybase pathways. It does this through the use of KofamScan, KEGG API, OrthoFinder and Flybase.
KofamScan is a gene function annotation tool based on KEGG Orthology and hidden Markov models (HMMs). It is provided by the KEGG (Kyoto Encyclopedia of Genes and Genomes) project. The online version is available here: https://www.genome.jp/tools/kofamkoala/ .
This pipeline pulls annotations directly from the KEGG API when possible. When that isn't possible, the pipeline runs KofamScan to identify homologous KEGG Orthology (KO) entries. The pathways annotated to these KO entries can then be transferred to the corresponding proteins in your species of interest.
If specified, the pipeline will also provide annotations to Flybase pathways. To do this, the pipeline uses OrthoFinder (https://github.com/davidemms/OrthoFinder) to identify homologous *Drosophila melanogaster* proteins for your input proteins. Flybase metabolic pathway and signaling pathway annotations are then transferred to your input proteins from these homologs.

Where to Find Pathannotator

Pathannotator is provided as a Docker container for use on the command line.

Docker Hub

The Dockerfile and scripts are available on GitHub

Getting the KOfam Databases

The KOfam 'profiles' and 'ko_list' databases are required to run the pipeline. If you don't already have these databases the pipeline will pull them during the first run. If you want to download them beforehand they are available from the KEGG website.

Using wget:

wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz

wget https://www.genome.jp/ftp/db/kofam/ko_list.gz

TIP: KEGG updates their annotations approximately once a month. If your profiles and ko_list files are older than this then your annotations will not be up-to-date. Just delete them and the new versions will be downloaded with your first annotation run.
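
One way to act on this tip is to check the age of a database file before a run. The sketch below is only an illustration: the helper name `is_older_than`, the 30-day cutoff and the example file location are assumptions, not part of Pathannotator. It relies on `touch`-style modification times, so a file that was copied or re-extracted recently will look new even if its contents are old.

```shell
#!/bin/bash
# Report whether a file is older than a given number of days.
# Usage: is_older_than FILE DAYS  (returns 0 when FILE is older, 1 otherwise)
is_older_than() {
    local file="$1" days="$2"
    # find prints the path only when the file was last modified more than DAYS days ago
    [ -n "$(find "$file" -maxdepth 0 -mtime +"$days" 2>/dev/null)" ]
}

# Hypothetical example: this path is a placeholder, adjust it to your own setup.
db_file="/path/to/kofam/databases/ko_list"
if [ -e "$db_file" ] && is_older_than "$db_file" 30; then
    echo "$db_file is more than 30 days old; consider deleting it before the next run"
fi
```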

If your species of interest has been annotated by the KEGG project you can provide this tool with the corresponding KEGG species code to pull those annotations directly. If your species of interest is not listed you should choose a closely related species and use that code. KEGG species codes can be found here: https://www.genome.jp/brite/br08611
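
As an alternative to browsing the web page, the KEGG REST API offers an organism listing that can be searched on the command line. The sketch below assumes that listing is tab-separated with the columns T number, organism code, organism name and lineage; `find_kegg_code` is a hypothetical helper and is not part of Pathannotator.

```shell
#!/bin/bash
# Filter a KEGG organism listing (read from standard input) by a name fragment.
# Assumed input layout, tab-separated: T number, organism code, organism name, lineage.
find_kegg_code() {
    # Case-insensitive substring match on the organism name; print code and name.
    awk -F'\t' -v q="$1" 'index(tolower($3), tolower(q)) { print $2 "\t" $3 }'
}

# Example (needs network access):
#   curl -s https://rest.kegg.jp/list/organism | find_kegg_code "Tribolium"
```

If the search returns nothing for your species, try the genus name of a close relative and use that code instead.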

Help and Usage Statement

On the command line, the following help statement can be displayed by passing 'help' as the first argument.

Help and Usage:
    There are 4 positional arguments.
    1: KEGG species code (NA or related species code if species not in KEGG; 'help' to see this help and usage statement)
       KEGG species codes can be found here: https://www.genome.jp/brite/br08611
    2: input file (protein FASTA without header lines)
    3: output directory (must be an existing directory; the file path should be  relative to, and inside of, your working directory)
    4: 'FB' for flybase annotations, 'NA' for none

    KofamScan is used under an MIT License:

    Copyright (c) 2019 Takuya Aramaki

    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.

    Flybase annotation is carried out using OrthoFinder <https://github.com/davidemms/OrthoFinder>.
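
Because a run can take hours, it may be worth checking the four positional arguments before launching the container. The following is a minimal sketch, meant to be run from your working directory; `check_args` is a hypothetical helper that only mirrors the constraints stated in the usage statement above and is not part of Pathannotator.

```shell
#!/bin/bash
# Pre-flight check of the four positional arguments described in the usage statement.
check_args() {
    local code="$1" fasta="$2" outdir="$3" flybase="$4"
    [ "$#" -eq 4 ] || { echo "expected 4 arguments, got $#"; return 1; }
    [ -n "$code" ] || { echo "KEGG species code is empty (use NA if none applies)"; return 1; }
    [ -f "$fasta" ] || { echo "input file not found: $fasta"; return 1; }
    case "$outdir" in
        /*) echo "output directory must be a relative path: $outdir"; return 1 ;;
    esac
    [ -d "$outdir" ] || { echo "output directory does not exist: $outdir"; return 1; }
    case "$flybase" in
        FB|NA) ;;
        *) echo "fourth argument must be FB or NA, got: $flybase"; return 1 ;;
    esac
    echo "arguments look consistent with the usage statement"
}

# Example call, using the values from the example commands in this README:
#   check_args tca GCF_031307605.1_icTriCast1.1_protein.faa out_dir FB
```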

Benchmarking

The amount of time it takes to run this tool will vary greatly depending on several factors:

1. the number of sequences in your protein FASTA input file
2. whether your species has a KEGG species code
3. whether your input IDs are NCBI RefSeq protein IDs
4. whether you request Flybase annotation in addition to KEGG
5. how many CPUs you have available to run the analysis

The tables below show example runs for the circumstances described above. These examples were run on the SciNet Atlas HPC system using Apptainer.

Number of input sequences	KEGG species code	NCBI RefSeq protein	Include Flybase annotation	Number of CPUs	Time to run (minutes)
20,571	related species	yes	yes	2	1047
20,571	related species	yes	no	2	533
20,571	related species	yes	yes	12	192
20,571	related species	yes	no	12	86
20,571	related species	yes	yes	48	95
20,571	related species	yes	no	48	25

Number of input sequences	KEGG species code	NCBI RefSeq protein	Include Flybase annotation	Number of CPUs	Time to run (minutes)
22,272	same species	yes	yes	2	428
22,272	same species	yes	no	2	< 1
22,272	same species	yes	yes	12	96
22,272	same species	yes	no	12	< 1
22,272	same species	yes	yes	48	62
22,272	same species	yes	no	48	< 1

Number of input sequences	KEGG species code	NCBI RefSeq protein	Include Flybase annotation	Number of CPUs	Time to run (minutes)
18,330	related species	no	yes	2	> 720 (over 12 hours)
18,330	related species	no	no	2	199
18,330	related species	no	yes	12	142
18,330	related species	no	no	12	32
18,330	related species	no	yes	48	48
18,330	related species	no	no	48	10
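
To compare these timings, it can help to express them as a speedup relative to the 2-CPU run. The short sketch below does that arithmetic for two rows of the first table; the helper name `speedup` is made up for this example.

```shell
#!/bin/bash
# Speedup relative to a baseline run: baseline time divided by the faster run's time.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1f\n", base / t }'
}

# Values from the first table (related species code, RefSeq protein IDs).
speedup 1047 95   # with Flybase annotation, 2 CPUs vs 48 CPUs: prints 11.0
speedup 533 25    # without Flybase annotation, 2 CPUs vs 48 CPUs: prints 21.3
```

In these two examples, 24 times as many CPUs shortened the run by a factor of roughly 11 and 21, so the gain from adding CPUs is real but less than proportional.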

Pathannotator on the Command Line

Container Technologies

Pathannotator is provided as a Docker container.

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

There are two major containerization technologies: Docker and Apptainer (Singularity).

Docker containers can be run with either technology.

Running Pathannotator using Docker

About Docker

Docker must be installed on the computer you wish to use for your analysis.
To run Docker you must have ‘root’ (admin) permissions (or use sudo).
Docker will run all containers as ‘root’. This makes Docker incompatible with HPC systems (see Apptainer/Singularity below).
Docker can be run on your local computer, a server, a cloud virtual machine etc.
For more information on installing Docker on other systems: Installing Docker.

Getting the Pathannotator container

The Pathannotator tool is available as a Docker container on Docker Hub: Pathannotator container

The container can be pulled with this command:

docker pull agbase/pathannotator:3.0

Remember

You must have root permissions or use sudo, like so:

sudo docker pull agbase/pathannotator:3.0

Getting the Help and Usage Statement

sudo docker run --rm agbase/pathannotator:3.0 help

TIP:

The /workdir directory is built into this container and should be used to mount your working directory.

The /data directory is built into this container and should be used to mount the KofamScan database files.

Example Command

sudo docker run \
--rm \
-v /path/to/your/input/files:/workdir \
-v /path/to/kofam/databases/:/data \
agbase/pathannotator:3.0 \
tca \
GCF_031307605.1_icTriCast1.1_protein.faa \
out_dir \
FB

Command Explained

sudo docker run: tells docker to run

--rm: removes the container when the analysis has finished. The image will remain for future use.

-v /path/to/your/input/files:/workdir: mounts the working directory on the host machine to '/workdir' inside the container

-v /path/to/kofam/databases/:/data: mounts the directory with the Kofam database files (or where you want them to be stored) on the host machine to '/data' inside the container

agbase/pathannotator:3.0: the name of the Docker image to use

Tip

All the options supplied after the image name are Pathannotator options

tca: KEGG species code for Tribolium castaneum. Can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code, choose a closely related species or 'NA'.

GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines).

out_dir: Directory where you want the pipeline outputs to go. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.

FB: FB indicates that we want to get Flybase pathways annotations in addition to KEGG annotations.
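
If you run the pipeline repeatedly, it may be convenient to assemble the command from variables and print it for inspection before executing it. The sketch below only prints the command and does not call Docker; `build_cmd` is a hypothetical helper, and the paths are the same placeholders used in the example command above.

```shell
#!/bin/bash
# Assemble the docker command from variables and print it for inspection.
workdir="/path/to/your/input/files"   # placeholder: host working directory
dbdir="/path/to/kofam/databases"      # placeholder: host directory for the Kofam databases
image="agbase/pathannotator:3.0"

build_cmd() {
    # $1 KEGG species code, $2 FASTA file, $3 output directory, $4 FB or NA
    printf '%s ' sudo docker run --rm \
        -v "$workdir:/workdir" \
        -v "$dbdir:/data" \
        "$image" "$1" "$2" "$3"
    printf '%s\n' "$4"
}

build_cmd tca GCF_031307605.1_icTriCast1.1_protein.faa out_dir FB
```

Printing first makes it easy to spot a wrong mount path or a missing argument before a long run starts.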

See Understanding Your Results for a description of the output files.

Running Pathannotator using Apptainer (formerly Singularity)

About Apptainer

does not require ‘root’ permissions
runs all containers as the user that is logged into the host machine
HPC systems are likely to have Apptainer installed and are unlikely to object if asked to install it (no guarantees).
can be run on any machine where it is installed
more information about installing Apptainer
This tool was tested using Apptainer 1.3.1

HPC Job Schedulers

Although Apptainer can be installed on any computer this documentation assumes it will be run on an HPC system. The tool was tested on a Slurm system and the job submission scripts below reflect that. Submission scripts will need to be modified for use with other job scheduler systems.

Getting the Pathannotator container

The Pathannotator tool is available as a Docker container on Docker Hub: Pathannotator container

Example Slurm script:

#!/bin/bash
#SBATCH --job-name=pathannot
#SBATCH --ntasks=8
#SBATCH --time=2:00:00
#SBATCH --partition=ceres
#SBATCH --account=nal_genomics

module load apptainer

cd /location/where/you/want/to/save/image/file

apptainer pull docker://agbase/pathannotator:3.0

Running Pathannotator with Data

Tip

The /workdir directory is built into this container and should be used to mount your local working directory.

The /data directory is built into this container and should be used to mount the KOfam database files.

Example Slurm Script

#!/bin/bash
#SBATCH --job-name=pathannot
#SBATCH --ntasks=48
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --partition=ceres
#SBATCH --account=nal_genomics

module load apptainer

cd /directory/you/want/to/work/in

apptainer run \
-B /directory/you/want/to/work/in:/workdir \
-B /directory/with/kofam/database/files:/data \
/path/with/image/file/pathannotator_3.0.sif \
tca \
GCF_031307605.1_icTriCast1.1_protein.faa \
out_dir \
FB

Command Explained

apptainer run: tells Apptainer to run

-B /directory/you/want/to/work/in:/workdir: mounts the working directory on the host machine to '/workdir' in the container

-B /directory/with/kofam/database/files:/data: mounts the directory with the Kofam database files (or where you want them stored) on the host machine to '/data' in the container

/path/with/image/file/pathannotator_3.0.sif: the name of the Apptainer image to use

Tip

All the options supplied after the image name are Pathannotator options

tca: KEGG species code for Tribolium castaneum. Can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code, choose a closely related species or 'NA'.

GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines)

out_dir: Directory where you want the outputs of the pipeline to be stored. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.

FB: FB indicates that you want Flybase pathways annotations in addition to KEGG annotations

See Understanding Your Results for a description of the output files.

Understanding Your Results

The output files you can expect will differ depending on the circumstances of your run. If you are using the KEGG code for your species of interest and your FASTA protein identifiers are NCBI protein IDs then your annotations will be pulled directly from the KEGG API. In other circumstances (detailed below) KofamScan will be run to identify homologs and transfer annotations to your species of interest. Under all circumstances you may specify whether or not you want to receive Flybase pathways annotations as well.

Same-species KEGG code and NCBI RefSeq protein IDs

Expected output files:

tca_KEGG_species.tsv: These are KEGG's annotations of the NCBI RefSeq proteins to the species-specific KEGG pathways. The filename will begin with the KEGG species code. The pathway identifiers will begin with the KEGG species code. Note that for species-specific pathways, KEGG internally filters associations between the KO (KEGG Orthology) accession and the reference pathway.

Input_protein_ID	KEGG_KO	KEGG_tca_pathway	KEGG_tca_pathway_name
XP_001813251.1	K01540	tca04820	Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle)
XP_001812480.1	K02268	tca00190	Oxidative phosphorylation - Tribolium castaneum (red flour beetle)
XP_008195997.1	K04676	tca04350	TGF-beta signaling pathway - Tribolium castaneum (red flour beetle)

tca_KEGG_ref.tsv: These are KEGG's annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'. You should expect more pathway annotations per protein than for the species-specific pathway.

Input_protein_ID	KEGG_KO	KEGG_ref_pathway	KEGG_ref_pathway_name
XP_015835225.1	K26207	map04024	cAMP signaling pathway
XP_015835225.1	K26207	map04261	Adrenergic signaling in cardiomyocytes
XP_001813251.1	K01540	map04022	cGMP-PKG signaling pathway
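
Since a protein can appear on several rows of this file, a quick per-protein tally shows how many reference pathways each one received. The sketch below assumes a tab-separated file with a single header line, as in the sample rows above; `count_pathways_per_protein` is a hypothetical helper name.

```shell
#!/bin/bash
# Count how many pathway rows each protein has in a tab-separated annotation table.
count_pathways_per_protein() {
    # Skip the header line, then tally rows by the first column (Input_protein_ID).
    awk -F'\t' 'NR > 1 { n[$1]++ } END { for (id in n) printf "%s\t%d\n", id, n[id] }' "$1" |
        LC_ALL=C sort
}

# Sample rows copied from the table above.
sample=$(mktemp)
{
    printf 'Input_protein_ID\tKEGG_KO\tKEGG_ref_pathway\tKEGG_ref_pathway_name\n'
    printf 'XP_015835225.1\tK26207\tmap04024\tcAMP signaling pathway\n'
    printf 'XP_015835225.1\tK26207\tmap04261\tAdrenergic signaling in cardiomyocytes\n'
    printf 'XP_001813251.1\tK01540\tmap04022\tcGMP-PKG signaling pathway\n'
} > "$sample"

count_pathways_per_protein "$sample"
rm -f "$sample"
```

For the sample rows this reports one pathway for XP_001813251.1 and two for XP_015835225.1.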

tca_acc_pathways.tsv: This file contains the aggregation of all pathway annotations for each input identifier.

Input_protein_ID	pathway
NP_001034488.1	KEGG:map04013,KEGG:tca04013,Flybase:FBgg0000956,Flybase:FBgg0000950
NP_001034489.1	KEGG:map04391,KEGG:tca04391

tca_pathways_acc.tsv: This file contains the aggregation of all input identifiers annotated to each of the pathways.

pathway	Input_protein_ID
Flybase:FBgg0000881	XP_008196394.1,XP_008194025.1,XP_001807060.1,XP_015839080.1
KEGG:tca03273	XP_008192998.2,XP_009105448.1,XP_008196990.1
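
The comma-separated identifier lists in this file can be summarized as a count of proteins per pathway. This is a minimal sketch, assuming a tab-separated file with one header line and comma-separated identifiers in the second column, as in the sample rows above; `count_proteins_per_pathway` is a hypothetical helper name.

```shell
#!/bin/bash
# Print each pathway followed by the number of comma-separated protein IDs on its row.
count_proteins_per_pathway() {
    # split() returns the number of pieces, which is the number of protein IDs.
    awk -F'\t' 'NR > 1 { print $1 "\t" split($2, ids, ",") }' "$1"
}

# Sample rows copied from the table above.
sample=$(mktemp)
{
    printf 'pathway\tInput_protein_ID\n'
    printf 'Flybase:FBgg0000881\tXP_008196394.1,XP_008194025.1,XP_001807060.1,XP_015839080.1\n'
    printf 'KEGG:tca03273\tXP_008192998.2,XP_009105448.1,XP_008196990.1\n'
} > "$sample"

count_proteins_per_pathway "$sample"
rm -f "$sample"
```

For the sample rows this reports 4 proteins for Flybase:FBgg0000881 and 3 for KEGG:tca03273.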

OrthoFinder_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.

Input_protein_ID	Flybase_protein_ID	Flybase_pathway_ID	Flybase_pathway_name
NP_001034540.1	FBpp0077451	FBgg0001085	BMP Signaling Pathway Core Components
NP_001034503.2	FBpp0084690	FBgg0000904	Insulin-like Receptor Signaling Pathway Core Components
NP_001034492.1	FBpp0078442	FBgg0002045	CHITIN BIOSYNTHESIS

dme_flybase.tsv: This is an alternative to 'OrthoFinder_flybase.tsv' if you are annotating Drosophila melanogaster.

Input_protein_ID	KEGG_KO	Flybase_pathway_ID	Flybase_pathway_name
NP_001034490.1	K04491	FBgg0000890	Wnt-TCF Signaling Pathway Core Components
NP_001034491.1	K00698	FBgg0002045	CHITIN BIOSYNTHESIS
NP_001034491.1	K00698	FBgg0002045	CHITIN BIOSYNTHESIS
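
The sample rows above contain the same line twice. If repeated rows are unwanted in a downstream analysis, they can be dropped while keeping the original row order. This is a sketch of one common approach, not something the pipeline does for you; `dedupe_rows` is a hypothetical helper name.

```shell
#!/bin/bash
# Print each distinct line once, in order of first appearance (the header stays first).
dedupe_rows() {
    awk '!seen[$0]++' "$1"
}

# Sample rows copied from the table above.
sample=$(mktemp)
{
    printf 'Input_protein_ID\tKEGG_KO\tFlybase_pathway_ID\tFlybase_pathway_name\n'
    printf 'NP_001034490.1\tK04491\tFBgg0000890\tWnt-TCF Signaling Pathway Core Components\n'
    printf 'NP_001034491.1\tK00698\tFBgg0002045\tCHITIN BIOSYNTHESIS\n'
    printf 'NP_001034491.1\tK00698\tFBgg0002045\tCHITIN BIOSYNTHESIS\n'
} > "$sample"

dedupe_rows "$sample"
rm -f "$sample"
```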

KEGG code for a related species

Expected output files:

kofam_result_full.txt: This is the full output from KofamScan. According to KEGG: "K number assignments with scores above the predefined thresholds for individual KOs are more reliable than other proposed assignments. Such high score assignments are highlighted with asterisks '*' in the output." Pathways annotations have not yet been identified.

# gene name	KO	thrshld	score	E-value	KO definition
NP_001034280.2	K10180	417.47	374.4	1.2e-113	T-box protein 6
NP_001034280.2	K10177	886.07	309.5	7.2e-94	T-box protein 3
NP_001034280.2	K10176	750.77	300.4	4.6e-91	T-box protein 2
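
A protein often matches several KOs, as in the sample rows above. For a quick overview you can keep only the highest-scoring KO per protein. The sketch below is a simple heuristic and not how Pathannotator itself selects KOs. It assumes a tab-separated file with exactly the six columns shown above and a header line starting with '#'; the real file may carry an extra leading column for the asterisk marker, in which case the column numbers need to be shifted. `best_hit_per_gene` is a hypothetical helper name.

```shell
#!/bin/bash
# Keep the highest-scoring KO for each gene (columns: gene, KO, threshold, score, ...).
best_hit_per_gene() {
    awk -F'\t' '
        /^#/ { next }   # skip the header line
        !($1 in best) || $4 + 0 > best[$1] {
            best[$1] = $4 + 0; ko[$1] = $2; score[$1] = $4
        }
        END { for (g in best) printf "%s\t%s\t%s\n", g, ko[g], score[g] }
    ' "$1"
}

# Sample rows copied from the table above.
sample=$(mktemp)
{
    printf '# gene name\tKO\tthrshld\tscore\tE-value\tKO definition\n'
    printf 'NP_001034280.2\tK10180\t417.47\t374.4\t1.2e-113\tT-box protein 6\n'
    printf 'NP_001034280.2\tK10177\t886.07\t309.5\t7.2e-94\tT-box protein 3\n'
    printf 'NP_001034280.2\tK10176\t750.77\t300.4\t4.6e-91\tT-box protein 2\n'
} > "$sample"

best_hit_per_gene "$sample"
rm -f "$sample"
```

For the sample rows this keeps K10180 (score 374.4). Note that all three sample scores are below their thresholds, so none of them would be an asterisk-marked assignment.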

tca_KEGG_species.tsv: These are annotations to the species-specific KEGG pathway. The pathway identifiers will begin with the KEGG species code.

Input_protein_ID	KEGG_KO	KEGG_tca_pathway	KEGG_tca_pathway_name
XP_001813251.1	K01540	tca04820	Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle)
XP_001812480.1	K02268	tca00190	Oxidative phosphorylation - Tribolium castaneum (red flour beetle)
XP_008195997.1	K04676	tca04350	TGF-beta signaling pathway - Tribolium castaneum (red flour beetle)

tca_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.

Input_protein_ID	KEGG_KO	KEGG_ref_pathway	KEGG_ref_pathway_name
XP_015835225.1	K26207	map04024	cAMP signaling pathway
XP_015835225.1	K26207	map04261	Adrenergic signaling in cardiomyocytes
XP_001813251.1	K01540	map04022	cGMP-PKG signaling pathway
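
Species-specific and reference pathway identifiers share the same five-digit number and differ only in their prefix (for example tca04013 and map04013 in the aggregated files). When comparing the two files, it can help to convert one form to the other. A minimal sketch, with `to_ref_id` as a hypothetical helper name:

```shell
#!/bin/bash
# Convert a species-specific pathway ID to its reference form by swapping the prefix.
# $1 pathway ID, $2 KEGG species code. IDs without that prefix are returned unchanged.
to_ref_id() {
    local id="$1" code="$2"
    case "$id" in
        "$code"*) echo "map${id#"$code"}" ;;
        *) echo "$id" ;;
    esac
}

to_ref_id tca04820 tca   # prints map04820
to_ref_id map04022 tca   # prints map04022 (already a reference ID)
```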

tca_acc_pathways.tsv: This file contains the aggregation of all pathway annotations for each input identifier.

tca_pathways_acc.tsv: This file contains the aggregation of all input identifiers annotated to each of the pathways.

pathway	Input_protein_ID
Flybase:FBgg0000918	XP044254039.1,XP_044272825.1
KEGG:tca00780	XP_044253000.1,XP_044261349.1,XP_044272235.1

OrthoFinder_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.

Input_protein_ID	Flybase_protein_ID	Flybase_pathway_ID	Flybase_pathway_name
NP_001034540.1	FBpp0077451	FBgg0001085	BMP Signaling Pathway Core Components
NP_001034503.2	FBpp0084690	FBgg0000904	Insulin-like Receptor Signaling Pathway Core Components
NP_001034492.1	FBpp0078442	FBgg0002045	CHITIN BIOSYNTHESIS

'NA' as KEGG code

Expected output files:

If you did not specify a KEGG species code (used 'NA') then no species-specific annotations file will be generated.

kofam_result_full.txt: This is the full output from KofamScan. According to KEGG: "K number assignments with scores above the predefined thresholds for individual KOs are more reliable than other proposed assignments. Such high score assignments are highlighted with asterisks '*' in the output." Pathways annotations have not yet been identified.

# gene name	KO	thrshld	score	E-value	KO definition
NP_001034280.2	K10180	417.47	374.4	1.2e-113	T-box protein 6
NP_001034280.2	K10177	886.07	309.5	7.2e-94	T-box protein 3
NP_001034280.2	K10176	750.77	300.4	4.6e-91	T-box protein 2

NA_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.

Input_protein_ID	KEGG_KO	KEGG_ref_pathway	KEGG_ref_pathway_name
NP_001034489.1	K16672	map04391	Hippo signaling pathway - fly
NP_001034490.1	K04491	map04310	Wnt signaling pathway
NP_001034490.1	K04491	map04390	Hippo signaling pathway

tca_acc_pathways.tsv: This file contains the aggregation of all pathway annotations for each input identifier.

Input_protein_ID	pathway
NP_001034488.1	KEGG:map04013,KEGG:tca04013,Flybase:FBgg0000956,Flybase:FBgg0000950
NP_001034489.1	KEGG:map04391,KEGG:tca04391

tca_pathways_acc.tsv: This file contains the aggregation of all input identifiers annotated to each of the pathways.

pathway	Input_protein_ID
Flybase:FBgg0000881	XP_008196394.1,XP_008194025.1,XP_001807060.1,XP_015839080.1
KEGG:tca03273	XP_008192998.2,XP_009105448.1,XP_008196990.1

OrthoFinder_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.

Input_protein_ID	Flybase_protein_ID	Flybase_pathway_ID	Flybase_pathway_name
NP_001034540.1	FBpp0077451	FBgg0001085	BMP Signaling Pathway Core Components
NP_001034503.2	FBpp0084690	FBgg0000904	Insulin-like Receptor Signaling Pathway Core Components
NP_001034492.1	FBpp0078442	FBgg0002045	CHITIN BIOSYNTHESIS
NP_001034491.1	FBpp0290640	FBgg0002045	CHITIN BIOSYNTHESIS

Contact us
