Methods from Paper: "Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses"
To compare the performance of different variant callers,we used the following tools:
-
LoFreq (parameter: -q 20 -Q 20 -m 20):
- LoFreq separates true LFV from erroneous variant calls by using Phred-scores as probability error in a Poisson-binomial distribution. The consensus sequence of each sample was used as its own reference to call LFV, in order to avoid calling high-frequency non-reference bases due to an inadequate choice of a single reference sequence for all samples used by LoFreq to call variants, i.e. nucleotides at low allelic frequency differing from the consensus at a specific genomic position [659]" from here.
- setting the strand bias threshold for reporting a variant to the maximum allowed value by using the option “--sb-thresh 2147483647” to allow strand-biased variants to be retained, to account for the non-random distribution of reads due to the design of the amplification panel."
-
VarScan2 (--min-avg-qual 20 --p-value 0.01)
-
FreeBayes (--p 1 -m 20 -q 20 -f 0.01 --min-coverage 10)
-
BCFtools (--p 0.01 —ploidy 1 -mv -Ob)
-
GATK HaplotypeCaller (—min-base-quality-score 20 -ploidy 1)
-
Ground truth:
- Differences between genomes detected by MUMmer were considered as positive variants. Based on this standard, precision, recall and F1-score were computed to evaluate those callers. The pairwise genome differences of 30 E. coli or 30 HIV genomes were determined by MUMmer as well. To evaluate the performance of different callers for SNP and InDel prediction, the command vcfeval in RTG-tools [60] was used to generate recall-precision curves based on the Phred scaled ‘QUAL’ score field (—squash-ploidy -f QUAL —sample ALT)."
- Description : https://github.com/imgag/umiVar2
- Parameters:
umiVar.py
-tbam {input.bam}
-b {params.target}
-r {params.ref}
-o Sample_{wildcards.s}/umivar2
-ac 4
-ns -1
-sb 0
-mp 30
-bq 20
-kt
- Usage
usage: umiVar - variant calling with unique molecular barcodes
[-h] -tbam TBAM [-nbam NBAM] -r REF [-b BED] [-m MONITORING]
[-o OUT_FOLDER] [-p PARAM] [-mq MQ] [-bq BQ] [-d DIST] [-ac AC]
[-af AF] [-ns NUM_SITES] [-sb {0,1}] [-t TEMP_DIR] [-kt]
[-crb CUSTOM_RSCRIPT_BINARY]
optional arguments:
-h, --help show this help message and exit
-tbam TBAM, --tbam TBAM
Tumor bam file
-nbam NBAM, --nbam NBAM
Normal bam file
-r REF, --ref REF Reference genome - fasta
-b BED, --bed BED Bed file of the targeted regions. O-based
-m MONITORING, --monitoring MONITORING
VCF file with genomic positions for monitoring or
sample-IDing.
-o OUT_FOLDER, --out_folder OUT_FOLDER
Output folder. Will be created if not existing
-p PARAM, --param PARAM
Beta-binomial parameters table
-mq MQ, --mq MQ Minimum mapping qualitydefault=30
-bq BQ, --bq BQ Minimum base quality default=20
-d DIST, --dist DIST Minimum distance between variants
-ac AC, --ac AC Minimum number of reads supporting a variant
-af AF, --af AF Minimum fraction of reads supporting a variant
-ns NUM_SITES, --num_sites NUM_SITES
Number of sites to be analysed
-sb {0,1}, --strand_bias {0,1}
Fisher strand bias filter. Default [0]
-t TEMP_DIR, --temp_dir TEMP_DIR
Temporary directory
-kt, --keep_temp Don't delete temporary directory
- Description: call variants from BAM file . LoFreq (suggested paper parameter: -q 20 -Q 20 -m 30)
- Parameters:add only q=20 and Q=20 C=4 -m default?
lofreq call --call-indels
-f /ref/MN908947.3.fasta
-o Sample_21014a009_01/lofreq/21014a009_01_lofreq.tsv
Sample_21014a009_01/dedup/21014a009_01_bamclipoverlap_sorted.bam
- Usage:
Options:
- Reference:
-f | --ref FILE Indexed reference fasta file (gzip supported) [null]
- Output:
-o | --out FILE Vcf output file [- = stdout]
- Base-call quality:
-q | --min-bq INT Skip any base with baseQ smaller than INT [6]
-Q | --min-alt-bq INT Skip alternate bases with baseQ smaller than INT [6]
- Mapping quality:
-m | --min-mq INT Skip reads with mapping quality smaller than INT [0]
- P-values:
-a | --sig P-Value cutoff / significance level [0.010000]
-b | --bonf Bonferroni factor. 'dynamic' (increase per actually performed test) or INT ['dynamic']
- Indels:1
--call-indels Enable indel calls (note: preprocess your file to include indel alignment qualities!)
--only-indels Only call indels; no SNVs
- Misc.:
-C | --min-cov INT Test only positions having at least this coverage [1]
(note: without --no-default-filter default filters (incl. coverage) kick in after predictions are done)
-d | --max-depth INT Cap coverage at this depth [1000000]
- Description: https://varscan.sourceforge.net/
- Parameters: to be consistent: -min-reads2 4 --min-reads2 20
VarScan2 (—min-avg-qual 20 —P-value 0.01)
samtools mpileup -aa -A -d 0 -B -Q 0 --reference {params.ref} {input.bam} | varscan pileup2snp --variants - > {output}
paper suggestion: samtools mpileup -aa -A -d 0 -B -Q 0 --reference {params.ref} {input.bam} | varscan pileup2snp --min-reads2 4 --min-coverage 4 --min-avg-qual 20 --p-value 0.01 --variants - > {output}
# --min-var-freq Minimum variant allele frequency threshold [0.01]
# --p-value Default p-value threshold for calling variants [99e-02]
- Usage
varscan pileup2snp -h
USAGE: java -jar VarScan.jar pileup2cns [pileup file] OPTIONS
pileup file - The SAMtools pileup file
OPTIONS:
--min-coverage Minimum read depth at a position to make a call [8]
--min-reads2 Minimum supporting reads at a position to call variants [2]
--min-avg-qual Minimum base quality at a position to count a read [15]
--min-var-freq Minimum variant allele frequency threshold [0.01]
--min-freq-for-hom Minimum frequency to call homozygote [0.75]
--p-value Default p-value threshold for calling variants [99e-02]
--variants Report only variant (SNP/indel) positions [0]
This command calls SNPs from a pileup file based on user-defined parameters:
USAGE: java -jar VarScan.jar pileup2snp [pileup file] OPTIONS
pileup file - The SAMtools pileup file
OPTIONS:
--min-coverage Minimum read depth at a position to make a call [10]
--min-reads2 Minimum supporting reads at a position to call variants [2]
--min-avg-qual Minimum base quality at a position to count a read [15]
--min-var-freq Minimum variant allele frequency threshold [0.01]
--p-value Default p-value threshold for calling variants [99e-02]
OUTPUT
Tab-delimited SNP calls with the following columns:
Chrom chromosome name
Position position (1-based)
Ref reference allele at this position
Var variant allele at this position
Reads1 reads supporting reference allele
Reads2 reads supporting variant allele
VarFreq frequency of variant allele by read count
Strands1 strands on which reference allele was observed
Strands2 strands on which variant allele was observed
Qual1 average base quality of reference-supporting read bases
Qual2 average base quality of variant-supporting read bases
Pvalue Significance of variant read count vs. expected baseline error
- ivar variants
- Description:
There are two parameters that can be set for variant calling using iVar
- minimum quality(Default: 20) Minimum quality is the minimum quality for a base to be used in frequency calculations at a given position.
- minimum frequency(Default: 0.03). Minimum frequency is the minimum frequency required for a SNV or indel to be reported.
- Description:
There are two parameters that can be set for variant calling using iVar
"For Samtools mpileup, the only non-default parameters used are do not discard anomalous read pairs (-A), disable per-base alignment quality (-B), skip bases with base quality smaller than 0 (-Q 0), and especially the maximum per-file coverage -d 1000000. The coverage of the pool amplicons can vary more than 100x in tiled amplicon approaches. Therefore, the value of the -d parameter should be at least 5-10x greater than the Avg. Coverage (Unassembled). For consensus calling by default the minimum quality score threshold (-q 20), minimum coverage to call consensus (-m 10), and minimum frequency threshold (-t 0.7) parameter values are applied. For stricter consensus calling, e.g., a called base must make up at least 90% presence at a position, the latter parameter must be changed to -t 0.9. https://www.ridom.de/u/SARS-CoV-2_Analysis_Quick_Start.html
- Parameters:
samtools mpileup -aa -A -d 0 -B -Q 0 --reference {params.ref} ../../{input.bam} | ivar variants -p {wildcards.s}_ivar -q 20 -t 0 -r {ref} -m 4
- Usage
Usage: samtools mpileup -aa -A -d 0 -B -Q 0 --reference [<reference-fasta] <input.bam> | ivar variants -p <prefix> [-q <min-quality>] [-t <min-frequency-threshold>] [-m <minimum depth>] [-r <reference-fasta>] [-g GFF file]
Input Options Description
-q Minimum quality score threshold to count base (Default: 20)
-t Minimum frequency threshold(0 - 1) to call variants (Default: 0.03)
-m Minimum read depth to call variants (Default: 0)
-r Reference file used for alignment. This is used to translate the nucleotide sequences and identify intra host single nucleotide variants
-g A GFF file in the GFF3 format can be supplied to specify coordinates of open reading frames (ORFs). In absence of GFF file, amino acid translation will not be done.
Output Options Description
-p (Required) Prefix for the output tsv variant file
- ivar consensus
- Description:
- Parameters:
samtools mpileup -aa -A -d 0 -Q 0 {input} | \
ivar consensus -t 0.9 -m 10 \
-p Sample_{wildcards.s}/consensus/{wildcards.s}_consensus_ivar
- Usage:
ivar consensus
Usage: samtools mpileup -A -d 300000 -Q 0 -F 0 <input.bam> | ivar consensus -p <prefix>
Note : samtools mpileup output must be piped into ivar consensus
Input Options Description
-q Minimum quality score threshold to count base (Default: 20)
-t Minimum frequency threshold(0 - 1) to call consensus. (Default: 0)
Frequently used thresholds | Description
---------------------------|------------
0 | Majority or most common base
0.2 | Bases that make up atleast 20% of the depth at a position
0.5 | Strict or bases that make up atleast 50% of the depth at a position
0.9 | Strict or bases that make up atleast 90% of the depth at a position
1 | Identical or bases that make up 100% of the depth at a position. Will have highest ambiguities
-m Minimum depth to call consensus(Default: 1)
-k If '-k' flag is added, regions with depth less than minimum depth will not be added to the consensus sequence. Using '-k' will override any option specified using -n
-n (N/-) Character to print in regions with less than minimum coverage(Default: -)