HD_Oppiella/assembly at master · AsexGenomeEvol/HD_Oppiella

Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
contamination_filtration.py	contamination_filtration.py

Pipeline description for genome assembly.

QC pre-processing on raw data
- Paired end
- Mate pair
Assembly
- Contig
- Scaffolding and Gap-closing
Filtration
- Contamination removal
- Size filtration
Evaluation

1) QC pre-processing on raw data

1.1) Paired end

Trimmomatic was used for paired end filtering.

java -jar trimmomatic.jar PE -threads 1 -phred33 *.raw.180.pair1.fastq *.raw.180.pair2.fastq \
trim/180/*.trim.180.pair1.fastq trim/180/*.trim.180.single1.fastq trim/180/*.trim.180.pair2.fastq trim/180/*.trim.180.single2.fastq \
ILLUMINACLIP:trimmomatic/0.36/adapters/all-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:3:20 MINLEN:100

1.2) Mate pair

Nxtrim and Trimmomatic were used for mate pair filtering.

module add UHTS/Quality_control/NxTrim/0.4.1

nxtrim -1 *.pair1.fastq -2 *.pair2.fastq -O name --separate --preserve-mp --minlength 40

Categories MP and UNKNOWN were concatenated (as suggested by the authors) then Trimmomatic was used for further fitlering.

java -jar trimmomatic.jar PE -threads 1 -phred33 *.nxtrim.3000.pair1.fastq *.nxtrim.3000.pair2.fastq \
*.nxtrim.trim.3000.pair1.fastq *.nxtrim.trim.3000.single1.fastq *.nxtrim.trim.3000.pair2.fastq *.nxtrim.trim.3000.single2.fastq \
ILLUMINACLIP:trimmomatic/0.36/adapters/all-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:60

2) Assembly

2.1) Contig

PCR whole genome amplification produce unveven coverage. BBnorm was used to reduce that coverage bias as most of assemblers assume uniform coverage.

Reformat the overlapped library to interleaved format and merge the overlapped reads:

module add UHTS/Aligner/BBMap/36.59

reformat.sh in1=*.trim.180.pair1.fastq in2=*.trim.180.pair2.fastq out=*.trim.180.pair12.fastq 2> *.reformat.log

bbmerge.sh in=*.trim.180.pair12.fastq out=*.trim.merged.180.pair12.fastq minoverlap=15 mismatches=0 ecct strict -Xmx90g threads=35 2> *.bbmerge.log

Reverse complement the S2:

module add UHTS/Analysis/fastx_toolkit/0.0.13.2;
 
fastx_reverse_complement -i *.trim.180.single2.fastq -o *.trim.rc.180.single2.fastq -Q33

Concatenate the merged PE and the single reads:

cat *.trim.merged.180.pair12.fastq *.trim.180.single1.fastq *.trim.rc.180.single2.fastq > *.trim.180.all.fastq

Normalized the concatenate file (composed of single reads):

bbnorm.sh in=*.trim.180.all.fastq out=*.trim.norm.180.all.fastq target=65 min=3 -Xmx150g threads=20 2> bbnorm.log

Assembly with SPAdes:

export PATH=$PATH:/scratch/beegfs/monthly/ptranvan/Software/SPAdes-3.10.1-Linux/bin
 
spades.py -m 400 -t 27 --careful -k 21,33,55,77,99,111,127 -o * --s1 *.trim.norm.180.all.fastq

2.2) Scaffolding and Gap-closing

The other libraries (where insert_size>2*len(reads)) were used for scaffolding and gap-closing.

Scaffolding with SSPACE:

export PERL5LIB=/home/ptranvan/perl5/lib/perl5
module add UHTS/Aligner/bwa/0.7.13
 
perl /scratch/beegfs/monthly/ptranvan/Software/sspace/3.0/SSPACE_Standard_v3.0.pl -l *.sspace.txt -s */contigs.fasta -b * -T 25

Gap-closing with Gapcloser:

/scratch/beegfs/monthly/ptranvan/Software/GapCloser/1.12-r6/GapCloser -a ../*.final.scaffolds.fasta -b *.gapclose.txt -o *v01.fasta -l 125 -t 25 >> sm2.gapclose.log

3) Filtration

3.1) Contamination removal

BlobTools was used for contamination checking.

Map reads back to the genome:

module add UHTS/Aligner/bwa/0.7.15

bwa mem -M db_bwa/* *.trim.350.pair1.fastq.gz *.trim.350.pair2.fastq.gz -t 15 2> bwa/*.trim.350.pair12.bwa.output.log | samtools view -bS - > bwa/*.trim.350.pair12.bam

Compute coverage for each scaffold:

cat ../*.sam > merge.sam

module add UHTS/Analysis/BBMap/37.82
module add UHTS/Analysis/samtools/1.4

pileup.sh in=merge.sam out=merge.stats.txt hist=histogram.txt

# Reformat the output to <Name_scaffold>\t<Coverage>

awk {'printf ("%s\t%s\n", $1, $2)'} merge.stats.txt | awk '{if(NR>1)print}' > merge.stats_parse.txt

Blastn against nt:

module add Blast/ncbi-blast/2.7.1+;
 
export BLASTDB=/archive/dee/schwander/ptranvan/Database/taxdb:$BLASTDB

blastn -query *v01.fasta -db /archive/dee/schwander/ptranvan/Database/nt/nt \
-outfmt '6 qseqid staxids bitscore evalue std sscinames sskingdoms stitle' -max_target_seqs 10 \
-max_hsps 1 -evalue 1e-25 -num_threads 15 -out *.vs.nt.max10.1e25.blastn.out

Run BlobTools:

source /scratch/beegfs/monthly/ptranvan/Software/blobtools/1.0.sh

blobtools create -i *v01.fasta -t *.vs.nt.max10.1e25.blastn.out \
--nodes /scratch/beegfs/monthly/ptranvan/Software/blobtools/1.0/data/nodes.dmp \
--names /scratch/beegfs/monthly/ptranvan/Software/blobtools/1.0/data/names.dmp \
-c merge.stats_parse.txt -x bestsumorder -o *

blobtools blobplot -i *.blobDB.json --sort count --hist count -x bestsumorder

blobtools view -i *.blobDB.json --hits --rank all -x bestsumorder

Scaffolds without hits to metazoans were filtered out.

python contamination_filtration.py -s contamination_identification -i1 *.blobDB.table.txt

module add UHTS/Analysis/BBMap/37.82

filterbyname.sh in=*v01.fasta names=contaminant_scaffolds.txt out=*.filtered.fasta include=f -Xmx20g

Sort by descreasing size and rename the scaffolds header.

seqkit sort --by-length --reverse *.filtered.fasta -o - | rename_headers.awk -v sp=* version=b1v02 - > *v02.fasta

3.2) Size filtration

Scaffolds > 500 bp were kept:

prinseq-lite/0.20.4/prinseq-lite.pl -fasta *.fasta -min_len 501 -out_good *.501 -out_bad *.bad 2> prinseq.stat

Sort by descreasing size and rename the scaffolds header.

seqkit sort --by-length --reverse *.501 -o - | rename_headers.awk -v sp=* version=b1v03 - > *v03.fasta

➡️ v03 is the final assembly.

4) Evaluation

CEGMA:

module add SequenceAnalysis/CEGMA/2.5

cegma -g *v03.fasta -T 10 -o *v03 2> cegma.output_log.txt

BUSCO:

source /scratch/beegfs/monthly/ptranvan/Software/busco/3.0.2b.sh

run_BUSCO.py --long -i *v03.fasta -o output -l /scratch/beegfs/monthly/ptranvan/Software/busco/3.0/arthropoda_odb9 -m geno -c 10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assembly

assembly

README.md

Pipeline description for genome assembly.

Table of contents

1) QC pre-processing on raw data

1.1) Paired end

1.2) Mate pair

2) Assembly

2.1) Contig

2.2) Scaffolding and Gap-closing

3) Filtration

3.1) Contamination removal

3.2) Size filtration

4) Evaluation

Files

assembly

Directory actions

More options

Directory actions

More options

Latest commit

History

assembly

Folders and files

parent directory

README.md

Pipeline description for genome assembly.

Table of contents

1) QC pre-processing on raw data

1.1) Paired end

1.2) Mate pair

2) Assembly

2.1) Contig

2.2) Scaffolding and Gap-closing

3) Filtration

3.1) Contamination removal

3.2) Size filtration

4) Evaluation