General pipeline

This file describes how to run mSWEEP and mGEMS on a custom reference database following some best practices we have established. This assumes that you already have a set of reference assemblies to use. For more info on obtaining the reference assemblies, see steps under Usage -> Reference data).

Grouping reference sequences

Use PopPUNK to cluster the reference sequences

sed 's/^.*\///g' reference_assembly_paths.txt | sed 's/[.].*$//g' > reference_assembly_names.txt
paste reference_assembly_names.txt reference_assembly_paths.txt > poppunk_input_list.tsv

poppunk --create-db --r-files poppunk_input_list.tsv --output poppunk_fit

poppunk --fit-model bgmm --K 4 --ref-db poppunk_fit

Match the poppunk clusters with the assembly paths

echo -e "id\tcluster\tassembly" > ref_info.tsv
join -1 1 -2 1 <(sed '1d' poppunk_sketch/poppunk_sketch_clusters.csv | tr ',' '\t' | sort) <(sort poppunk_input_list.tsv) | tr ' ' '\t' >> ref_info.tsv

Check the poppunk documentation for information on how to create a good clustering.

Building a pseudoalignment index

Extract the reference assembly paths

cut -f3 ref_info.tsv | sed '1d' > ref_paths.txt

Index the reference sequences for pseudoalignment with Themisto:

themisto build -k 31 -i ref_paths.txt -o themisto_index --temp-dir tmp

Pseudoaligning reads

Error correct paired-end reads with fastp

fastp --in1 raw_fwd.fastq.gz --in2 raw_rev.fastq.gz --out1 corr_fwd.fastq.gz --out2 corr_rev.fastq.gz -c --thread 2

Pseudoalign the reads and compress the alignment file with alignment-writer

ntargets=$(wc -l ref_paths.txt | cut -f1 -d' ')
nreads=$((`gunzip -c reads_fwd.fastq.gz | wc -l` / 4 ))
themisto pseudoalign -q corr_fwd.fastq.gz -i themisto_index --temp-dir tmp -t 1 | alignment-writer -n $ntargets -r $nreads > fwd.aln
themisto pseudoalign -q corr_rev.fastq.gz -i themisto_index --temp-dir tmp -t 1 | alignment-writer -n $ntargets -r $nreads > rev.aln

Estimate abundances

Extract the cluster indicators

cut -f2 ref_info.tsv | sed '1d' > ref_clusters.txt

Estimate abundances with mSWEEP and bin reads for groups with abundance higher than 0.01 with

mSWEEP --themisto-1 fwd.aln --themisto-2 rev.aln -i ref_clusters.txt -t 2 --bin-reads --min-abundance 0.01 -o res

Extract binned reads

Extract the reads with mGEMS

bins=$(ls *.bin | tr '[[:space:]]' ',' | sed 's/,$//g')
mGEMS extract --bins $bins -r 215_1.fastq.gz,215_2.fastq.gz -o ./

Setup demix_check for quality checking

Setup the reference for coreutils_demix_check with

mkdir demix_ref
cd demix_ref
setup_reference.sh --ref_info ../ref_info.tsv --threads 2
cd ../

Check all reads that were binned with

for f in *.bin; do
	cluster=$(echo $f | sed 's/[.]bin$//g')
	fwd=$cluster"_1.fastq.gz"
	rev=$cluster"_2.fastq.gz"
	check_reads.sh --abundances res_abundances.txt --cluster $cluster --reference demix_ref --fwd $fwd --rev $rev --threads 2
	mv clu_score.tsv $cluster"_clu_score.tsv"
done

groups that receive a score of 1 or 2 are considered to pass the quality check (1 is better than 2) and can be used in assembly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pipeline.md

pipeline.md

General pipeline

Grouping reference sequences

Building a pseudoalignment index

Pseudoaligning reads

Estimate abundances

Extract binned reads

Setup demix_check for quality checking

Files

pipeline.md

Latest commit

History

pipeline.md

File metadata and controls

General pipeline

Grouping reference sequences

Building a pseudoalignment index

Pseudoaligning reads

Estimate abundances

Extract binned reads

Setup demix_check for quality checking