HD_Oppiella/annotation at master · AsexGenomeEvol/HD_Oppiella

Name	Name	Last commit message	Last commit date
parent directory ..
config_files	config_files
.DS_Store	.DS_Store
README.md	README.md

Pipeline description for protein-coding genes annotation.

Required files
Structural annotation
Functional annotation
EMBL format

1) Required files

1.1) Final genome

<sp>_b1v03 have been used.

1.2) BUSCO

BUSCO has been run with the database arthropoda_odb9.

source Software/busco/3.0.2b.sh

run_BUSCO.py --long -i *.fasta -o output -l arthropoda_odb9 -m geno -c 10

1.3) Transcript evidence

For the transcriptome: STAR for mapping and Trinity genome guided for the assembly.

Trimmomatic was used for data filtering.

java -jar trimmomatic.jar PE -threads 1 -phred33 *.raw.pair1.fastq *.raw.pair2.fastq \
trim/180/*.trim.pair1.fastq trim/180/*.trim.single1.fastq trim/180/*.trim.pair2.fastq trim/180/*.trim.single2.fastq \
ILLUMINACLIP:/scratch/beegfs/monthly/ptranvan/Software/trimmomatic/0.36/adapters/all-PE_rna.fa:2:30:12:1:true LEADING:3 TRAILING:3 MAXINFO:40:0.4 MINLEN:80

Map the reads with STAR (single end mode).

module add UHTS/Aligner/STAR/2.5.3a;

# Index creation
 
mkdir index
STAR --runMode genomeGenerate --genomeDir index --genomeFastaFiles genome.fasta --runThreadN 15

# Mapping step

mkdir 2pass_basic

STAR --genomeDir ../index/ --readFilesIn <*.fastq.gz ...> --runThreadN 10 --outSAMtype BAM SortedByCoordinate --twopassMode Basic --readFilesCommand zcat --limitBAMsortRAM 8051268437 --outFileNamePrefix 2pass_basic/out.

Assembly with Trinity (genome guided mode).

module add UHTS/Assembler/trinityrnaseq/2.5.1;

# out.Aligned.sortedByCoord.out.bam output by STAR.

Trinity --genome_guided_bam out.Aligned.sortedByCoord.out.bam --genome_guided_max_intron 100000 --max_memory 200G --CPU 35 --SS_lib_type RF --jaccard_clip

Transcriptome unfiltered: *Tv01.fasta.

Filter the transcriptome.

a) Map all samples against transcriptome and compute the TPM value per sample using Kallisto.

module add UHTS/Analysis/kallisto/0.43.1;

kallisto quant -i <index_transcriptome> -t 10 --single -l 200 -s 20 --rf-stranded --bias -o <output> <reads>

b) Keep contigs that have at least 1 TPM in any sample.

module add R/3.4.2;

/software/UHTS/Assembler/trinityrnaseq/2.5.1/util/abundance_estimates_to_matrix.pl --est_method kallisto --out_prefix merge --name_sample_by_basedir <*abundance.tsv ...> 

/software/UHTS/Assembler/trinityrnaseq/2.5.1/util/misc/count_matrix_features_given_MIN_TPM_threshold.pl merge.TPM.not_cross_norm | tee merge.TPM.not_cross_norm.counts_by_min_TPM

/software/UHTS/Assembler/trinityrnaseq/2.5.1/util/filter_low_expr_transcripts.pl --matrix merge.TPM.not_cross_norm --transcripts *Tv01.fasta --min_expr_any 1 > *Tv02.fasta

Transcriptome filtered: *Tv02.fasta.

1.4) Protein evidence

UniProtKB/Swiss-Prot (release 2018_01) + BUSCO arthropoda_odb9.

cat uniprot_sprot.fasta arthropoda_odb9/ancestral_variants > uniprot_sprot_arthropoda.fasta

2) Structural annotation

MAKER (v. 2.31.8) has been used.

Set the environment (Only for Vital-IT):

export PATH=/scratch/beegfs/monthly/ptranvan/Software/mpich-3.2/mpi_install/bin:/scratch/beegfs/monthly/ptranvan/Software/maker_7_local/bin:/software/SequenceAnalysis/GenePrediction/snoscan/0.9.1/bin:/software/SequenceAnalysis/GenePrediction/tRNAscan-SE/2.0.0/bin:/software/SequenceAnalysis/GenePrediction/augustus/3.2.3/scripts:/software/SequenceAnalysis/GenePrediction/augustus/3.2.3/bin:/scratch/beegfs/monthly/ptranvan/Software/RepeatMasker/4.0.7_local:/software/Blast/ncbi-blast/2.7.1+/bin:/software/SequenceAnalysis/Repeat/trf/4.07b/bin:/software/SequenceAnalysis/HMM-Profile/hmmer/3.1b2/bin:/software/SequenceAnalysis/Repeat/RMBlast/2.6.0+/bin:/software/UHTS/Aligner/phrap/0.990329/bin:/software/SequenceAnalysis/SequenceAlignment/exonerate/2.4.0/bin:/software/SequenceAnalysis/GenePrediction/snap/2013.11.29/bin:/home/ptranvan/qiime/bin/:/home/ptranvan/qiime/bin/:/mnt/common/lsf/9.1/linux2.6-glibc2.3-x86_64/etc:/mnt/common/lsf/9.1/linux2.6-glibc2.3-x86_64/bin:/software/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/scratch/beegfs/monthly/ptranvan/Software/assemblathon/1.0:/home/ptranvan/bin:/scratch/beegfs/monthly/ptranvan/Software/assemblathon/1.0

export AUGUSTUS_CONFIG_PATH=/scratch/beegfs/monthly/ptranvan/Software/busco/3.0.2b/augustus_config_7

2.1) Iteration 1

Create and modify control files. See config_files for more details.

mkdir maker
cd maker

maker -CTL

Parameters to change:

In maker_bopts.ctl:

depth_blastn=10
depth_blastx=10
depth_tblastx=10

In maker_opts.ctl:

genome=../*b1v03.fasta
est=../evidence/*_Tv02.fasta
protein=../evidence/uniprot_sprot_arthropoda.fasta

augustus_species=BUSCO_*
est2genome=1
protein2genome=1

max_dna_len=300000
alt_splice=1
split_hit=100000
correct_est_fusion=1

Run MAKER.

mpiexec -n 20 maker

Once finished:

gff3_merge -d *.maker.output/*_master_datastore_index.log

mv maker_opts.ctl maker_opts_round1.ctl
mv *.all.gff *.all.round1.gff
cp maker_opts_round1.ctl maker_opts.ctl

2.2) Iteration 2

Train MAKER result from iteration 1 with SNAP.

mkdir -p snap/round1
cd snap/round1

maker2zff ../../maker/*.all.round1.gff 

fathom genome.ann genome.dna -categorize 1000
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna

hmm-assembler.pl * . > *.maker_round1_snap.hmm

Train MAKER result from iteration 1 with Augustus.

autoAug.pl --genome=../snap/round1/export.dna --species=AUGUSTUS_*_round1 --trainingset=../snap/round1/export.aa --singleCPU -v --useexisting > output.log

Modify control files.

Parameters to change:

In maker_opts.ctl:

snaphmm=../snap/round1/*.maker_round1_snap.hmm
augustus_species=AUGUSTUS_*_round1
est2genome=0
protein2genome=0

Run MAKER.

mpiexec -n 20 maker

Once finished:

gff3_merge -d *.maker.output/*_master_datastore_index.log

mv maker_opts.ctl maker_opts_round2.ctl
mv *.all.gff *.all.round2.gff
cp maker_opts_round2.ctl maker_opts.ctl

3) Functional annotation

Create GFF file.

mkdir functional
cd functional

awk '/\tmaker\t/' ../*.all.round2.gff > *.max.gff

Create protein and transcript fasta files.

fasta_merge -d ../*.maker.output/*_master_datastore_index.log

cp *.all.maker.proteins.fasta *.max.proteins.fasta
cp *.all.maker.transcripts.fasta *.max.transcripts.fasta

Assign short id to each protein coding genes.

maker_map_ids --prefix ON_ --justify 5 *.max.gff > *.max.map	#example for O. nova.
    
map_gff_ids *.max.map *.max.gff
map_fasta_ids *.max.map *.max.proteins.fasta
map_fasta_ids *.max.map *.max.transcripts.fasta

Assign gene functions and GO-terms with Blast2GO.

Default parameters against both the NCBI non-redundant arthropods protein database and the Drosophila melanogaster drosoph database.

4) EMBL format

The genome and the annotation have been converted to EMBL format in order to be deposited in the European Nucleotide Archive (ENA) using AGAT and EMBLmyGFF3.

Fixing mRNA duplicates and flag genes with short intron.

agat_sp_fix_features_locations_duplicated.pl --gff * --out *

agat_sp_flag_short_introns.pl --gff * --out *

Convert to EMBL format.

EMBLmyGFF3 *.gff *.fasta --topology linear --molecule_type 'genomic DNA' --transl_table 1  --species '*' --locus_tag * --project_id * --de * -o *

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

annotation

annotation

README.md

Pipeline description for protein-coding genes annotation.

Table of contents

1) Required files

1.1) Final genome

1.2) BUSCO

1.3) Transcript evidence

1.4) Protein evidence

2) Structural annotation

2.1) Iteration 1

2.2) Iteration 2

3) Functional annotation

4) EMBL format

Files

annotation

Directory actions

More options

Directory actions

More options

Latest commit

History

annotation

Folders and files

parent directory

README.md

Pipeline description for protein-coding genes annotation.

Table of contents

1) Required files

1.1) Final genome

1.2) BUSCO

1.3) Transcript evidence

1.4) Protein evidence

2) Structural annotation

2.1) Iteration 1

2.2) Iteration 2

3) Functional annotation

4) EMBL format