<sp>_b1v03
have been used.
BUSCO has been run with the database arthropoda_odb9
.
source Software/busco/3.0.2b.sh
run_BUSCO.py --long -i *.fasta -o output -l arthropoda_odb9 -m geno -c 10
For the transcriptome: STAR for mapping and Trinity genome guided for the assembly.
- Trimmomatic was used for data filtering.
java -jar trimmomatic.jar PE -threads 1 -phred33 *.raw.pair1.fastq *.raw.pair2.fastq \
trim/180/*.trim.pair1.fastq trim/180/*.trim.single1.fastq trim/180/*.trim.pair2.fastq trim/180/*.trim.single2.fastq \
ILLUMINACLIP:/scratch/beegfs/monthly/ptranvan/Software/trimmomatic/0.36/adapters/all-PE_rna.fa:2:30:12:1:true LEADING:3 TRAILING:3 MAXINFO:40:0.4 MINLEN:80
- Map the reads with STAR (single end mode).
module add UHTS/Aligner/STAR/2.5.3a;
# Index creation
mkdir index
STAR --runMode genomeGenerate --genomeDir index --genomeFastaFiles genome.fasta --runThreadN 15
# Mapping step
mkdir 2pass_basic
STAR --genomeDir ../index/ --readFilesIn <*.fastq.gz ...> --runThreadN 10 --outSAMtype BAM SortedByCoordinate --twopassMode Basic --readFilesCommand zcat --limitBAMsortRAM 8051268437 --outFileNamePrefix 2pass_basic/out.
- Assembly with Trinity (genome guided mode).
module add UHTS/Assembler/trinityrnaseq/2.5.1;
# out.Aligned.sortedByCoord.out.bam output by STAR.
Trinity --genome_guided_bam out.Aligned.sortedByCoord.out.bam --genome_guided_max_intron 100000 --max_memory 200G --CPU 35 --SS_lib_type RF --jaccard_clip
Transcriptome unfiltered: *Tv01.fasta
.
- Filter the transcriptome.
a) Map all samples against transcriptome and compute the TPM value per sample using Kallisto.
module add UHTS/Analysis/kallisto/0.43.1;
kallisto quant -i <index_transcriptome> -t 10 --single -l 200 -s 20 --rf-stranded --bias -o <output> <reads>
b) Keep contigs that have at least 1 TPM in any sample.
module add R/3.4.2;
/software/UHTS/Assembler/trinityrnaseq/2.5.1/util/abundance_estimates_to_matrix.pl --est_method kallisto --out_prefix merge --name_sample_by_basedir <*abundance.tsv ...>
/software/UHTS/Assembler/trinityrnaseq/2.5.1/util/misc/count_matrix_features_given_MIN_TPM_threshold.pl merge.TPM.not_cross_norm | tee merge.TPM.not_cross_norm.counts_by_min_TPM
/software/UHTS/Assembler/trinityrnaseq/2.5.1/util/filter_low_expr_transcripts.pl --matrix merge.TPM.not_cross_norm --transcripts *Tv01.fasta --min_expr_any 1 > *Tv02.fasta
Transcriptome filtered: *Tv02.fasta
.
- UniProtKB/Swiss-Prot (release 2018_01) + BUSCO
arthropoda_odb9
.
cat uniprot_sprot.fasta arthropoda_odb9/ancestral_variants > uniprot_sprot_arthropoda.fasta
MAKER (v. 2.31.8) has been used.
Set the environment (Only for Vital-IT):
export PATH=/scratch/beegfs/monthly/ptranvan/Software/mpich-3.2/mpi_install/bin:/scratch/beegfs/monthly/ptranvan/Software/maker_7_local/bin:/software/SequenceAnalysis/GenePrediction/snoscan/0.9.1/bin:/software/SequenceAnalysis/GenePrediction/tRNAscan-SE/2.0.0/bin:/software/SequenceAnalysis/GenePrediction/augustus/3.2.3/scripts:/software/SequenceAnalysis/GenePrediction/augustus/3.2.3/bin:/scratch/beegfs/monthly/ptranvan/Software/RepeatMasker/4.0.7_local:/software/Blast/ncbi-blast/2.7.1+/bin:/software/SequenceAnalysis/Repeat/trf/4.07b/bin:/software/SequenceAnalysis/HMM-Profile/hmmer/3.1b2/bin:/software/SequenceAnalysis/Repeat/RMBlast/2.6.0+/bin:/software/UHTS/Aligner/phrap/0.990329/bin:/software/SequenceAnalysis/SequenceAlignment/exonerate/2.4.0/bin:/software/SequenceAnalysis/GenePrediction/snap/2013.11.29/bin:/home/ptranvan/qiime/bin/:/home/ptranvan/qiime/bin/:/mnt/common/lsf/9.1/linux2.6-glibc2.3-x86_64/etc:/mnt/common/lsf/9.1/linux2.6-glibc2.3-x86_64/bin:/software/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/scratch/beegfs/monthly/ptranvan/Software/assemblathon/1.0:/home/ptranvan/bin:/scratch/beegfs/monthly/ptranvan/Software/assemblathon/1.0
export AUGUSTUS_CONFIG_PATH=/scratch/beegfs/monthly/ptranvan/Software/busco/3.0.2b/augustus_config_7
- Create and modify control files. See config_files for more details.
mkdir maker
cd maker
maker -CTL
Parameters to change:
- In
maker_bopts.ctl
:
depth_blastn=10
depth_blastx=10
depth_tblastx=10
- In
maker_opts.ctl
:
genome=../*b1v03.fasta
est=../evidence/*_Tv02.fasta
protein=../evidence/uniprot_sprot_arthropoda.fasta
augustus_species=BUSCO_*
est2genome=1
protein2genome=1
max_dna_len=300000
alt_splice=1
split_hit=100000
correct_est_fusion=1
- Run MAKER.
mpiexec -n 20 maker
Once finished:
gff3_merge -d *.maker.output/*_master_datastore_index.log
mv maker_opts.ctl maker_opts_round1.ctl
mv *.all.gff *.all.round1.gff
cp maker_opts_round1.ctl maker_opts.ctl
- Train MAKER result from iteration 1 with SNAP.
mkdir -p snap/round1
cd snap/round1
maker2zff ../../maker/*.all.round1.gff
fathom genome.ann genome.dna -categorize 1000
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl * . > *.maker_round1_snap.hmm
- Train MAKER result from iteration 1 with Augustus.
autoAug.pl --genome=../snap/round1/export.dna --species=AUGUSTUS_*_round1 --trainingset=../snap/round1/export.aa --singleCPU -v --useexisting > output.log
- Modify control files.
Parameters to change:
- In
maker_opts.ctl
:
snaphmm=../snap/round1/*.maker_round1_snap.hmm
augustus_species=AUGUSTUS_*_round1
est2genome=0
protein2genome=0
- Run MAKER.
mpiexec -n 20 maker
Once finished:
gff3_merge -d *.maker.output/*_master_datastore_index.log
mv maker_opts.ctl maker_opts_round2.ctl
mv *.all.gff *.all.round2.gff
cp maker_opts_round2.ctl maker_opts.ctl
- Create GFF file.
mkdir functional
cd functional
awk '/\tmaker\t/' ../*.all.round2.gff > *.max.gff
- Create protein and transcript fasta files.
fasta_merge -d ../*.maker.output/*_master_datastore_index.log
cp *.all.maker.proteins.fasta *.max.proteins.fasta
cp *.all.maker.transcripts.fasta *.max.transcripts.fasta
- Assign short id to each protein coding genes.
maker_map_ids --prefix ON_ --justify 5 *.max.gff > *.max.map #example for O. nova.
map_gff_ids *.max.map *.max.gff
map_fasta_ids *.max.map *.max.proteins.fasta
map_fasta_ids *.max.map *.max.transcripts.fasta
- Assign gene functions and GO-terms with Blast2GO.
Default parameters against both the NCBI non-redundant arthropods protein
database and the Drosophila melanogaster drosoph
database.
The genome and the annotation have been converted to EMBL format in order to be deposited in the European Nucleotide Archive (ENA) using AGAT and EMBLmyGFF3.
- Fixing mRNA duplicates and flag genes with short intron.
agat_sp_fix_features_locations_duplicated.pl --gff * --out *
agat_sp_flag_short_introns.pl --gff * --out *
- Convert to EMBL format.
EMBLmyGFF3 *.gff *.fasta --topology linear --molecule_type 'genomic DNA' --transl_table 1 --species '*' --locus_tag * --project_id * --de * -o *