Trimmomatic was used for paired end filtering.
java -jar trimmomatic.jar PE -threads 1 -phred33 *.raw.180.pair1.fastq *.raw.180.pair2.fastq \
trim/180/*.trim.180.pair1.fastq trim/180/*.trim.180.single1.fastq trim/180/*.trim.180.pair2.fastq trim/180/*.trim.180.single2.fastq \
ILLUMINACLIP:trimmomatic/0.36/adapters/all-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:3:20 MINLEN:100
Nxtrim and Trimmomatic were used for mate pair filtering.
module add UHTS/Quality_control/NxTrim/0.4.1
nxtrim -1 *.pair1.fastq -2 *.pair2.fastq -O name --separate --preserve-mp --minlength 40
Categories MP and UNKNOWN were concatenated (as suggested by the authors) then Trimmomatic was used for further fitlering.
java -jar trimmomatic.jar PE -threads 1 -phred33 *.nxtrim.3000.pair1.fastq *.nxtrim.3000.pair2.fastq \
*.nxtrim.trim.3000.pair1.fastq *.nxtrim.trim.3000.single1.fastq *.nxtrim.trim.3000.pair2.fastq *.nxtrim.trim.3000.single2.fastq \
ILLUMINACLIP:trimmomatic/0.36/adapters/all-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:60
PCR whole genome amplification produce unveven coverage. BBnorm was used to reduce that coverage bias as most of assemblers assume uniform coverage.
- Reformat the overlapped library to interleaved format and merge the overlapped reads:
module add UHTS/Aligner/BBMap/36.59
reformat.sh in1=*.trim.180.pair1.fastq in2=*.trim.180.pair2.fastq out=*.trim.180.pair12.fastq 2> *.reformat.log
bbmerge.sh in=*.trim.180.pair12.fastq out=*.trim.merged.180.pair12.fastq minoverlap=15 mismatches=0 ecct strict -Xmx90g threads=35 2> *.bbmerge.log
- Reverse complement the S2:
module add UHTS/Analysis/fastx_toolkit/0.0.13.2;
fastx_reverse_complement -i *.trim.180.single2.fastq -o *.trim.rc.180.single2.fastq -Q33
- Concatenate the merged PE and the single reads:
cat *.trim.merged.180.pair12.fastq *.trim.180.single1.fastq *.trim.rc.180.single2.fastq > *.trim.180.all.fastq
- Normalized the concatenate file (composed of single reads):
bbnorm.sh in=*.trim.180.all.fastq out=*.trim.norm.180.all.fastq target=65 min=3 -Xmx150g threads=20 2> bbnorm.log
- Assembly with SPAdes:
export PATH=$PATH:/scratch/beegfs/monthly/ptranvan/Software/SPAdes-3.10.1-Linux/bin
spades.py -m 400 -t 27 --careful -k 21,33,55,77,99,111,127 -o * --s1 *.trim.norm.180.all.fastq
The other libraries (where insert_size>2*len(reads)) were used for scaffolding and gap-closing.
- Scaffolding with SSPACE:
export PERL5LIB=/home/ptranvan/perl5/lib/perl5
module add UHTS/Aligner/bwa/0.7.13
perl /scratch/beegfs/monthly/ptranvan/Software/sspace/3.0/SSPACE_Standard_v3.0.pl -l *.sspace.txt -s */contigs.fasta -b * -T 25
- Gap-closing with Gapcloser:
/scratch/beegfs/monthly/ptranvan/Software/GapCloser/1.12-r6/GapCloser -a ../*.final.scaffolds.fasta -b *.gapclose.txt -o *v01.fasta -l 125 -t 25 >> sm2.gapclose.log
BlobTools was used for contamination checking.
- Map reads back to the genome:
module add UHTS/Aligner/bwa/0.7.15
bwa mem -M db_bwa/* *.trim.350.pair1.fastq.gz *.trim.350.pair2.fastq.gz -t 15 2> bwa/*.trim.350.pair12.bwa.output.log | samtools view -bS - > bwa/*.trim.350.pair12.bam
- Compute coverage for each scaffold:
cat ../*.sam > merge.sam
module add UHTS/Analysis/BBMap/37.82
module add UHTS/Analysis/samtools/1.4
pileup.sh in=merge.sam out=merge.stats.txt hist=histogram.txt
# Reformat the output to <Name_scaffold>\t<Coverage>
awk {'printf ("%s\t%s\n", $1, $2)'} merge.stats.txt | awk '{if(NR>1)print}' > merge.stats_parse.txt
- Blastn against nt:
module add Blast/ncbi-blast/2.7.1+;
export BLASTDB=/archive/dee/schwander/ptranvan/Database/taxdb:$BLASTDB
blastn -query *v01.fasta -db /archive/dee/schwander/ptranvan/Database/nt/nt \
-outfmt '6 qseqid staxids bitscore evalue std sscinames sskingdoms stitle' -max_target_seqs 10 \
-max_hsps 1 -evalue 1e-25 -num_threads 15 -out *.vs.nt.max10.1e25.blastn.out
- Run BlobTools:
source /scratch/beegfs/monthly/ptranvan/Software/blobtools/1.0.sh
blobtools create -i *v01.fasta -t *.vs.nt.max10.1e25.blastn.out \
--nodes /scratch/beegfs/monthly/ptranvan/Software/blobtools/1.0/data/nodes.dmp \
--names /scratch/beegfs/monthly/ptranvan/Software/blobtools/1.0/data/names.dmp \
-c merge.stats_parse.txt -x bestsumorder -o *
blobtools blobplot -i *.blobDB.json --sort count --hist count -x bestsumorder
blobtools view -i *.blobDB.json --hits --rank all -x bestsumorder
- Scaffolds without hits to metazoans were filtered out.
python contamination_filtration.py -s contamination_identification -i1 *.blobDB.table.txt
module add UHTS/Analysis/BBMap/37.82
filterbyname.sh in=*v01.fasta names=contaminant_scaffolds.txt out=*.filtered.fasta include=f -Xmx20g
- Sort by descreasing size and rename the scaffolds header.
seqkit sort --by-length --reverse *.filtered.fasta -o - | rename_headers.awk -v sp=* version=b1v02 - > *v02.fasta
- Scaffolds > 500 bp were kept:
prinseq-lite/0.20.4/prinseq-lite.pl -fasta *.fasta -min_len 501 -out_good *.501 -out_bad *.bad 2> prinseq.stat
- Sort by descreasing size and rename the scaffolds header.
seqkit sort --by-length --reverse *.501 -o - | rename_headers.awk -v sp=* version=b1v03 - > *v03.fasta
➡️ v03 is the final assembly.
module add SequenceAnalysis/CEGMA/2.5
cegma -g *v03.fasta -T 10 -o *v03 2> cegma.output_log.txt
source /scratch/beegfs/monthly/ptranvan/Software/busco/3.0.2b.sh
run_BUSCO.py --long -i *v03.fasta -o output -l /scratch/beegfs/monthly/ptranvan/Software/busco/3.0/arthropoda_odb9 -m geno -c 10