diff --git a/bcftools-man.html b/bcftools-man.html
index f17e00be8..dca2ffbd5 100644
--- a/bcftools-man.html
+++ b/bcftools-man.html
@@ -50,7 +50,7 @@

DESCRIPTION

VERSION

-

This manual page was last updated 2023-05-30 09:18 BST and refers to bcftools git version 1.17-50-ga8249495+.

+

This manual page was last updated 2024-04-29 08:11 BST and refers to bcftools git version 1.20-6-g5977f1f3+.

@@ -426,9 +426,12 @@

Common Options

Use multithreading with INT worker threads. The option is currently used only for the compression of the output stream, only when --output-type is b or z. Default: 0.

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output files. Can be used only for compressed BCF and VCF output.

+

Automatically index the output files. FMT is optional and can be +one of "tbi" or "csi" depending on output file format. Defaults to +CSI unless specified otherwise. Can be used only for compressed +BCF and VCF output.

@@ -487,7 +490,7 @@

bcftools annotate [OPTIONS] FILE

Comma-separated list of columns or tags to carry over from the annotation file (see also -a, --annotations). If the annotation file is not a VCF/BCF, list describes the columns of the annotation file and must include CHROM, -POS (or, alternatively, FROM and TO), and optionally REF and ALT. Unused +POS (or, alternatively, FROM,TO or BEG,END), and optionally REF and ALT. Unused columns which should be ignored can be indicated by "-".  
 
@@ -511,16 +514,50 @@

bcftools annotate [OPTIONS] FILE

To append to existing values (rather than replacing or leaving untouched), use "=TAG" (instead of "TAG" or "+TAG"). To replace only existing values without modifying missing annotations, use "-TAG". +As a special case of this, if position needs to be replaced, mark the column with the new coordinate as "-POS". +(Note that in previous releases this used to be "~POS", now deprecated.) + 

To match the record also by ID or INFO/END, in addition to REF and ALT, use "~ID" or "~INFO/END". -If position needs to be replaced, mark the column with the new position as "~POS". +Note that this works only for ID and POS, for other fields see the description of -i below.  
 
If the annotation file is not a VCF/BCF, all new annotations must be defined via -h, --header-lines.  
 
-See also the -l, --merge-logic option.

+See also the -l, --merge-logic option. + 

+Summary of -c, --columns:

+ + +
+
+
    CHROM,POS,TAG       .. match by chromosome and position, transfer annotation from TAG
+    CHROM,POS,-,TAG     .. same as above, but ignore the third column of the annotation file
+    CHROM,BEG,END,TAG   .. match by region (BEG,END are synonymous to FROM,TO)
+    CHROM,POS,REF,ALT   .. match by CHROM, POS, REF and ALT
+
+    DST_TAG:=SRC_TAG    .. transfer the SRC_TAG using the new name DST_TAG
+    INFO                .. transfer all INFO annotations
+    ^INFO/TAG           .. transfer all INFO annotations except "TAG"
+
+    TAG       .. add or overwrite existing target value if source is not "." and skip otherwise
+    +TAG      .. add or overwrite existing target value only if it is "."
+    .TAG      .. add or overwrite existing target value even if source is "."
+    .+TAG     .. add new but never overwrite existing tag, regardless of its value; can transfer "." if target does not exist
+    -TAG      .. overwrite existing value, never add new if target does not exist
+    =TAG      .. do not overwrite but append value to existing tags
+
+    ~FIELD    .. use this column to match lines with -i/-e expression (see the description of -i below)
+    ~ID       .. in addition to CHROM,POS,REF,ALT match also by ID
+    ~INFO/END .. in addition to CHROM,POS,REF,ALT match also by INFO/END
+
+
+
+
-C, --columns-file file

Read the list of columns from a file (normally given via the -c, --columns option). @@ -532,7 +569,7 @@

bcftools annotate [OPTIONS] FILE

-e, --exclude EXPRESSION

exclude sites for which EXPRESSION is true. For valid expressions see -EXPRESSIONS.

+EXPRESSIONS and the extension described in -i, --include below.

--force
@@ -573,8 +610,27 @@

bcftools annotate [OPTIONS] FILE

-i, --include EXPRESSION

include only sites for which EXPRESSION is true. For valid expressions see -EXPRESSIONS.

+EXPRESSIONS. + 

+Additionally, the command bcftools annotate supports expressions updated from the annotation +file dynamically for each record:

+
+
+
+
+
    # The field 'STR' from the -a file is required to match INFO/TAG in VCF. In the first example
+    # the alleles REF,ALT must match, in the second example they are ignored. The option -k is required
+    # to output also records that are not annotated. The third example shows the same concept with
+    # a numerical expression.
+    bcftools annotate -a annots.tsv.gz -c CHROM,POS,REF,ALT,SCORE,~STR -i'TAG={STR}' -k input.vcf
+    bcftools annotate -a annots.tsv.gz -c CHROM,POS,-,-,SCORE,~STR     -i'TAG={STR}' -k input.vcf
+    bcftools annotate -a annots.tsv.gz -c CHROM,POS,-,-,SCORE,~INT     -i'TAG>{INT}' -k input.vcf
+
+
+
+
-k, --keep-sites

keep sites which do not pass -i and -e expressions instead of discarding them

@@ -681,9 +737,10 @@

bcftools annotate [OPTIONS] FILE

"^INFO/FOO,INFO/BAR" (and similarly for FORMAT and FILTER). "INFO" can be abbreviated to "INF" and "FORMAT" to "FMT".

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -720,7 +777,7 @@

bcftools annotate [OPTIONS] FILE

# that INFO/END is already present in the VCF header. bcftools annotate -a annots.tab.gz -c CHROM,POS,~ID,REF,ALT,INFO/END input.vcf - # For more examples see http://samtools.github.io/bcftools/howtos/annotate.html + # For (many) more examples see http://samtools.github.io/bcftools/howtos/annotate.html @@ -814,9 +871,10 @@

File format options:

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -830,6 +888,10 @@

Input/output options:

output all alternate alleles present in the alignments even if they do not appear in any of the genotypes

+
-*, --keep-unseen-allele
+
+

keep the unobserved allele <*> or <NON_REF>, useful mainly for gVCF output

+
-f, --format-fields list

comma-separated list of FORMAT fields to output for each sample. Currently @@ -866,7 +928,7 @@

Input/output options:

-G, --group-samples FILE|-
-

by default, all samples are assumed to come from a single population. This option allows to group samples +

by default, all samples are assumed to come from a single population. This option groups samples into populations and applies the HWE assumption within but not across the populations. FILE is a tab-delimited text file with sample names in the first column and group names in the second column. If - is given instead, no HWE assumption is made at all and single-sample calling is performed. (Note that @@ -1182,9 +1244,10 @@

bcftools concat [OPTIONS] FILE1 FILE2

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -1306,6 +1369,11 @@

bcftools consensus [OPTIONS] FILE

write output to a file

+
--regions-overlap 0|1|2
+
+

how to treat VCF variants overlapping the target region in the fasta file: +see Common Options

+
-s, --samples LIST

apply variants of the listed samples. See also the option -I, --iupac-codes

@@ -1401,9 +1469,10 @@

VCF input options:

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -1740,6 +1809,10 @@

bcftools csq [OPTIONS] FILE

if more are required, see the --ncsq option.

+

Note that the program annotates only records with a functional consequence and +intergenic regions will pass through unchanged.

+
+

The program requires on input a VCF/BCF file, the reference genome in fasta format (--fasta-ref) and genomic features in the GFF3 format downloadable from the Ensembl website (--gff-annot), and outputs an annotated VCF/BCF @@ -1789,7 +1862,7 @@

bcftools csq [OPTIONS] FILE

--force
-

run even if some sanity checks fail. Currently the option allows to skip +

run even if some sanity checks fail. Currently the option enables skipping transcripts in malformatted GFFs with incorrect phase

-g, --gff-annot FILE
@@ -1946,9 +2019,10 @@

bcftools csq [OPTIONS] FILE

and VCF, such as "chrX" vs "X". The chromosome names in the output VCF will match that of the input VCF. The default is to attempt the automatic translation.

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -2141,7 +2215,7 @@

bcftools filter [OPTIONS] FILE

-s, --soft-filter STRING|+

annotate FILTER column with STRING or, with +, a unique filter name generated -by the program ("Filter%d").

+by the program ("Filter%d"). Applies to records that do not meet filter expression.

-S, --set-GTs .|0
@@ -2163,9 +2237,10 @@

bcftools filter [OPTIONS] FILE

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -2178,6 +2253,11 @@

bcftools gtcheck [OPTIONS] [-g ge is checked against the samples in the -g file. Without the -g option, multi-sample cross-check of samples in query.vcf.gz is performed.

+
+

Note that the interpretation of the discordance score depends on the options provided (specifically -e and +-u) and on the available annotations (FORMAT/PL vs FORMAT/GT). +The discordance score can be interpreted as the number of mismatching genotypes if only GT-vs-GT matching is performed.

+
--distinctive-sites NUM[,MEM[,DIR]]
@@ -2191,16 +2271,29 @@

bcftools gtcheck [OPTIONS] [-g ge

Stop after first record to estimate required time.

-
-e, --error-probability INT
+
-e, --exclude [qry|gt]:'EXPRESSION'
+
+

Exclude sites from query file (qry:) or genotype file (gt:) for which EXPRESSION is true. +For valid expressions see EXPRESSIONS.

+
+
-E, --error-probability INT

Interpret genotypes and genotype likelihoods probabilistically. The value of INT represents genotype quality when GT tag is used (e.g. Q=30 represents one error in 1,000 genotypes and Q=40 one error in 10,000 genotypes) and is ignored when PL tag is used (in that case an arbitrary -non-zero integer can be provided). See also the -u, --use option below. If set to 0, -the discordance equals to the number of mismatching genotypes when GT vs GT is compared. -Note that the values with and without -e are not comparable, only values generated -with -e 0 correspond to mismatching genotypes. -If performance is an issue, set to 0 for faster run but less accurate results.

+non-zero integer can be provided). + 

+If -E is set to 0, the discordance score can be interpreted as the number of mismatching genotypes, +but only in the GT-vs-GT matching mode. See the -u, --use option below for additional notes and caveats. + 

+If performance is an issue, set -E 0 for faster run times but less accurate results. + 

+Note that in previous versions of bcftools (1.18 and earlier), this option was the lowercase -e. It +was changed to make room for the filtering option -e, --exclude, keeping it consistent with other +commands.
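The quality values quoted for -E follow the usual Phred scale, P = 10^(-Q/10). As a minimal sketch of that arithmetic (the helper name phred_to_prob is hypothetical and is not part of bcftools; awk is assumed to be available):

```shell
# Convert a Phred-scaled quality Q to an error probability, P = 10^(-Q/10).
# phred_to_prob is an illustrative helper, not a bcftools command.
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", exp(-q / 10 * log(10)) }'
}

phred_to_prob 30   # Q=30: one error in 1,000 genotypes
phred_to_prob 40   # Q=40: one error in 10,000 genotypes
```

This only illustrates how the INT value maps to an error rate; it says nothing about how gtcheck combines the value with GT or PL internally.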

-g, --genotypes FILE
@@ -2210,6 +2303,11 @@

bcftools gtcheck [OPTIONS] [-g ge

Homozygous genotypes only, useful with low coverage data (requires -g, --genotypes)

+
-i, --include [qry|gt]:'EXPRESSION'
+
+

Include sites from query file (qry:) or genotype file (gt:) for which EXPRESSION is true. +For valid expressions see EXPRESSIONS.

+
--n-matches INT

Print only top INT matches for each sample, 0 for unlimited. Use negative value @@ -2221,6 +2319,14 @@

bcftools gtcheck [OPTIONS] [-g ge

Disable calculation of HWE probability to reduce memory requirements with comparisons between very large number of sample pairs.

+
-o, --output FILE
+
+

Write to FILE rather than to standard output, where it is written by default.

+
+
-O, --output-type t|z
+
+

Write a plain (t) or compressed (z) text tab-delimited output.

+
-p, --pairs LIST

A comma-separated list of sample pairs to compare. When the -g option is given, the first @@ -2274,8 +2380,13 @@

bcftools gtcheck [OPTIONS] [-g ge
-u, --use TAG1[,TAG2]

specifies which tag to use in the query file (TAG1) and the -g (TAG2) file. -By default, the PL tag is used in the query file and GT in the -g file when -available.

+By default, the PL tag is used in the query file and, when available, the GT tags in the +-g file. + 

+Note that when the requested tag is not available, the program will attempt to use +the other tag. The output includes the number of sites that were matched by the four +possible modes (for example GT-vs-GT or GT-vs-PL).

@@ -2284,10 +2395,10 @@

bcftools gtcheck [OPTIONS] [-g ge
-
   # Check discordance of all samples from B against all sample in A
+
   # Check discordance of all samples from B against all samples in A
    bcftools gtcheck -g A.bcf B.bcf
 
-   # Limit comparisons to the fiven list of samples
+   # Limit comparisons to the given list of samples
    bcftools gtcheck -s gt:a1,a2,a3 -s qry:b1,b2 -g A.bcf B.bcf
 
    # Compare only two pairs a1,b1 and a1,b2
@@ -2322,6 +2433,13 @@ 

Options:

Also display the first INT variant records. By default, no variant records are displayed.

+
-s, --samples INT
+
+

Display the first INT variant records including the last #CHROM header line with samples. +Running with -s 0 alone outputs the #CHROM header line only. Note that +the list of samples, with each sample per line, can be obtained with bcftools query using +the option -l, --list-samples.

+
@@ -2430,6 +2548,10 @@

bcftools isec [OPTIONS] A.vcf.gz B.vcf.gzinclude only sites for which EXPRESSION is true. See discussion of -e, --exclude above.

+
-f, --file-list FILE
+
+

Read file names from FILE, one file name per line.

+
-n, --nfiles [+-=]INT|~BITMAP

output positions present in this many (=), this many or more (+), this @@ -2474,12 +2596,14 @@

bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz
-w, --write LIST
-

list of input files to output given as 1-based indices. With -p and no +

comma-separated list of input files to output given as 1-based indices. With -p and no -w, all files are written.

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file. This is done automatically with the -p option.

+

Automatically index the output file. FMT is optional and defaults +to tbi for vcf.gz and csi for bcf. This is done automatically +with the -p option if the output format is compressed.

@@ -2550,6 +2674,10 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz<
+
--force-no-index
+
+

synonymous to --no-index

+
--force-samples

if the merged files contain duplicate samples names, proceed anyway. @@ -2557,6 +2685,10 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz< as it appeared on the command line to the conflicting sample name (see 2:S3 in the above example).

+
--force-single
+
+

run even if only one file is given on input

+
--print-header

print only merged header and exit

@@ -2605,16 +2737,18 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz<

Sites with many alternate alleles can require extremely large storage space which can exceed the 2GB size limit representable by BCF. This is caused by Number=G tags (such as FORMAT/PL) which store a value for each combination of reference -and alternate alleles. The -L, --local-alleles option allows to replace such tags +and alternate alleles. The -L, --local-alleles option allows replacement of such tags with a localized tag (FORMAT/LPL) which only includes a subset of alternate alleles relevant for that sample. A new FORMAT/LAA tag is added which lists 1-based indices of the alternate alleles relevant (local) for the current sample. The number INT gives the maximum number of alternate alleles that can be included in the PL tag. The default value is 0 which disables the feature and outputs values for all alternate alleles.

-
-m, --merge snps|indels|both|snp-ins-del|all|none|id
+
-m, --merge snps|indels|both|snp-ins-del|all|none|id[,*]
-

The option controls what types of multiallelic records can be created:

+

The option controls what types of multiallelic records can be created. If a single asterisk +* is appended, the unobserved allele <*> or <NON_REF> will be removed at variant sites; +if two asterisks ** are appended, the unobserved allele will be removed at all sites.

@@ -2624,6 +2758,8 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz< -m snps .. allow multiallelic SNP records -m indels .. allow multiallelic indel records -m both .. both SNP and indel records can be multiallelic +-m both,* .. same as above but remove <*> (or <NON_REF>) from variant sites +-m both,** .. same as above but remove <*> (or <NON_REF>) at all sites -m all .. SNP records can be merged with indel records -m snp-ins-del .. allow multiallelic SNVs, insertions, deletions, but don't mix them -m id .. merge by ID @@ -2637,13 +2773,13 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz< alleles, vector fields pertaining to unobserved alleles are set to missing (.) by default. The METHOD is one of . (the default, use missing values), NUMBER (use a constant value, e.g. 0), max (the maximum value observed for other alleles in the sample). When --gvcf option is set, -the rule -M PL:max,AD:0 is implied. This can be overriden with providing -M - or -M PL:.,AD:.. +the rule -M PL:max,AD:0 is implied. This can be overridden by providing -M - or -M PL:.,AD:.. Note that if the unobserved allele is explicitly present as <*> or <NON_REF>, then its corresponding value will be used regardless of -M settings.

--no-index
-

the option allows to merge files without indexing them first. In order for this +

the option allows files to be merged without indexing them first. In order for this option to work, the user must ensure that the input files have chromosomes in the same order and consistent with the order of sequences in the VCF header.

@@ -2675,9 +2811,10 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz<

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -2817,7 +2954,23 @@

Input options

A new EXPERIMENTAL indel calling model which aims to address some known deficiencies of the current indel calling algorithm. Specifically, it uses diploid reference consensus sequence. Note that in the current version it has the potential to increase sensitivity -but at the cost of decreased specificity

+but at the cost of decreased specificity. +Only works with short-read sequencing technologies.

+ +
--indels-cns
+
+

Another EXPERIMENTAL indel calling method, predating indels-2.0 in +PR form, but merged more recently. It also uses a diploid +reference consensus, but with added parameters and heuristics to +optimise for a variety of sequencing platforms. This is usually +faster and more accurate than the default caller and --indels-2.0, +but has not been tested on non-diploid samples and samples without +approximately even allele frequency.

+
+
--no-indels-cns
+
+

May be used to turn off --indels-cns mode when using one of the +newer profiles that has this enabled by default.

-q, -min-MQ INT
@@ -2991,9 +3144,10 @@

Output options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -3004,15 +3158,70 @@

Options for SNP/I
-X, --config STR
-

Specify a platform specific configuration profile. The profile -should be one of 1.12, illumina, ont or pacbio-ccs. -Settings applied are as follows:

+

Specify a platform specific configuration profile. Specifying the +profile as "list" will list the available profile names and the +parameters they change. There are profiles named after a release, +which should be used if you wish to ensure forward compatibility +of results. The non-versioned names (e.g. "illumina") will always +point to the most recent set of parameters for that instrument type. +The current values are:

-
1.12           -Q13 -h100 -m1
-illumina       [ default values ]
-ont	           -B -Q5 --max-BQ 30 -I
-pacbio-ccs     -D -Q5 --max-BQ 50 -F0.1 -o25 -e1 -M99999
+
1.12            -Q13 -h100 -m1
+
+
+
+
+
bgi
+bgi-1.20        --indels-cns -B --indel-size 80 -F0.1 --indel-bias 0.9
+                --seqq-offset 120
+
+
+
+
+
illumina-1.18   [ default values ]
+
+
+
+
+
illumina
+illumina-1.20   --indels-cns --seqq-offset 125
+
+
+
+
+
ont             -B -Q5 --max-BQ 30 -I
+
+
+
+
+
ont-sup
+ont-sup-1.20    --indels-cns -B -Q1 --max-BQ 35 --delta-BQ 99 -F0.2
+                -o15 -e1 -h110 --del-bias 0.4 --indel-bias 0.7
+                --poly-mqual --seqq-offset 130 --indel-size 80
+
+
+
+
+
pacbio-ccs-1.18 -D -Q5 --max-BQ 50 -F0.1 -o25 -e1 -M99999
+
+
+
+
+
pacbio-ccs
+pacbio-ccs-1.20  --indels-cns -B -Q5 --max-BQ 50 -F0.1 -o25 -e1 -h300
+                 --delta-BQ 10 --del-bias 0.4 --poly-mqual
+                 --indel-bias 0.9 --seqq-offset 118 --indel-size 80
+                 --score-vs-ref 0.7
+
+
+
+
+
ultima
+ultima-1.20      --indels-cns -B -Q1 --max-BQ 30 --delta-BQ 10 -F0.15
+                 -o20 -e10 -h250 --del-bias 0.3 --indel-bias 0.7
+                 --poly-mqual --seqq-offset 140 --score-vs-ref 0.3
+                 --indel-size 80
@@ -3058,12 +3267,32 @@

Options for SNP/I 0.75) while higher depth samples or where you favour recall rates over precision may work better with a higher value such as 2.0.

+
--del-bias FLOAT
+
+

Skews the likelihood of deletions over insertions. Defaults to an +even distribution value of 1.0. Lower values imply a higher rate +of false positive deletions (meaning candidate deletions are less +likely to be real).

+
--indel-size INT

Indel window size to use when assessing the quality of candidate indels. Note that although the window size approximately corresponds to the maximum indel size considered, it is not an exact threshold [110]

+
--seqq-offset INT
+
+

Tunes the importance of indel sequence quality per depth. The +final "seqQ" quality used is "offset - 5*MIN(depth,20)". [120]
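The seqQ formula quoted above can be checked by hand. A minimal sketch in shell integer arithmetic follows; the function name seqq is hypothetical and only restates "offset - 5*MIN(depth,20)", it is not how mpileup is invoked:

```shell
# seqQ = offset - 5*MIN(depth,20), as stated in the --seqq-offset description.
# seqq is an illustrative helper, not a bcftools command.
seqq() {
    seqq_offset=$1
    seqq_depth=$2
    # Cap the depth at 20 before applying the per-read penalty of 5.
    seqq_capped=$(( seqq_depth < 20 ? seqq_depth : 20 ))
    echo $(( seqq_offset - 5 * seqq_capped ))
}

seqq 120 10   # default offset, depth 10
seqq 120 50   # depth above 20 is capped at 20
```

Because the depth is capped at 20, the offset alone determines the value for any depth of 20 or more, which is why the profiles above differ mainly in their --seqq-offset setting.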

+
+
--poly-mqual
+
+

Use the lowest quality value within a homopolymer run, instead of +the quality immediately adjacent to the indel. This may be +important for unclocked instruments, particularly ones with a flow +chemistry where runs of bases of identical type are incorporated +together.

+
-I, --skip-indels

Do not perform INDEL calling

@@ -3157,14 +3386,14 @@

bcftools norm [OPTIONS] file.vcf.gz

100 CC C,GG 1/2 # After: - # bcftools norm -a . + # bcftools norm -a --atom-overlaps . 100 C G ./1 100 CC C 1/. 101 C G ./1 # After: - # bcftools norm -a '*' - # bcftools norm -a \* + # bcftools norm -a --atom-overlaps '*' + # bcftools norm -a --atom-overlaps \* 100 C G,* 2/1 100 CC C,* 1/2 101 C G,* 2/1 @@ -3205,6 +3434,12 @@

bcftools norm [OPTIONS] file.vcf.gz

try to proceed with -m- even if malformed tags with incorrect number of fields are encountered, discarding such tags. (Experimental, use at your own risk.)

+
-g, --gff-annot FILE
+
+

when a GFF file is provided, follow the HGVS 3' rule and right-align variants in transcripts on the forward +strand. In case of overlapping transcripts, the default mode is to left-align the variant. For a +description of the supported GFF3 file format see bcftools csq.

+
--keep-sum TAG[,…​]

keep vector sum constant when splitting multiallelic sites. Only AD tag @@ -3218,7 +3453,11 @@

bcftools norm [OPTIONS] file.vcf.gz

together: If only SNP records should be split or merged, specify snps; if both SNPs and indels should be merged separately into two records, specify both; if SNPs and indels should be merged into a single record, specify -any.

+any. + 

+Note that multiallelic sites with both SNPs and indels will be split into +biallelic sites with both -m -snps and -m -indels.

--multi-overlaps 0|.
@@ -3285,9 +3524,10 @@

bcftools norm [OPTIONS] file.vcf.gz

maximum distance between two records to consider when locally sorting variants which changed position during the realignment

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -3364,9 +3604,10 @@

VCF output options:

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -3613,13 +3854,14 @@

List of plugins coming wi
split-vep
-

extract fields from structured annotations such as INFO/CSQ created by bcftools/csq or VEP. These -can be added as a new INFO field to the VCF or in a custom text format. See +

extract fields from structured annotations such as INFO/CSQ created by VEP or INFO/BCSQ created by +bcftools/csq. These can be added as a new INFO field to the VCF or in a custom text format. See http://samtools.github.io/bcftools/howtos/plugin.split-vep.html for more.

tag2tag
-

Convert between similar tags, such as GL,PL,GP or QR,QA,QS.

+

Convert between similar tags, such as GL,PL,GP or QR,QA,QS or tags with localized alleles e.g. LPL,LAD. +See http://samtools.github.io/bcftools/howtos/plugin.tag2tag.html for more.

trio-dnm2
@@ -3830,6 +4072,12 @@

bcftools query [OPTIONS] file.vcf.gz [file.

learn by example, see below

+
-F, --print-filtered STR
+
+

by default, samples failing -i/-e filtering expressions are suppressed from output +when FORMAT fields are queried (for example %CHROM %POS [ %GT]). With -F, such +fields will still be printed but instead of their actual value, STR will be used.

+
-H, --print-header

print header

@@ -3843,6 +4091,14 @@

bcftools query [OPTIONS] file.vcf.gz [file.

list sample names and exit

+
-N, --disable-automatic-newline
+
+

disable automatic addition of a missing newline character at the end of the formatting +expression. By default, the program checks if the expression contains a newline +and appends it if not, to prevent formatting the entire output into a single +line by mistake. Note that versions prior to 1.18 had no automatic check and newline +had to be included explicitly.
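The automatic newline check can be pictured with a small shell sketch. It assumes the check looks for the two-character escape \n in the format string; ensure_newline is a hypothetical helper that mimics the described behaviour and is not the program's actual code:

```shell
# Append the escape sequence \n to a format string unless it already
# contains one. Illustrative only; bcftools query does this internally
# unless -N is given.
ensure_newline() {
    case "$1" in
        *'\n'*) printf '%s\n' "$1" ;;      # already has \n, leave unchanged
        *)      printf '%s\\n\n' "$1" ;;   # append a literal backslash-n
    esac
}

ensure_newline '%CHROM\t%POS'
ensure_newline '[%SAMPLE %GT\n]'
```

With -N the format string would be used exactly as given, which matches the behaviour described for versions prior to 1.18.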

+
-o, --output FILE

see Common Options

@@ -3913,6 +4169,7 @@

Format:

%TBCSQ Translated FORMAT/BCSQ. See the csq command above for explanation and examples. %TGT Translated genotype (e.g. C/A) %TYPE Variant type (REF, SNP, MNP, INDEL, BND, OTHER) +%VKX VariantKey, biallelic hexadecimal encoding of CHROM,POS,REF,ALT (https://github.com/tecnickcom/variantkey) [] Format fields must be enclosed in brackets to loop over all samples \n new line \t tab character @@ -3976,6 +4233,14 @@

Examples:

bcftools query -f '%AC{1}\n' -i 'AC[1]>10' file.vcf.gz +
+
+
# Print all samples at sites where at least one sample has DP=1 or DP=2. In the second case
+# print only samples with DP=1 or DP=2, the difference is in the logical operator used, || vs |.
+bcftools query -f '[%SAMPLE %GT %DP\n]' -i 'FMT/DP=1 || FMT/DP=2' file.vcf
+bcftools query -f '[%SAMPLE %GT %DP\n]' -i 'FMT/DP=1 |  FMT/DP=2' file.vcf
+
+
@@ -4010,7 +4275,7 @@

bcftools reheader [OPTIONS] file.vcf.gz

-T, --temp-prefix PATH
-

template for temporary file names, used with -f

+

this option is ignored, but left for compatibility with earlier versions of bcftools.

--threads INT
@@ -4248,11 +4513,13 @@

bcftools sort [OPTIONS] file.bcf

-T, --temp-dir DIR
-

Use this directory to store temporary files

+

Use this directory to store temporary files. If the last six characters of the string DIR are XXXXXX, +then these are replaced with a string that makes the directory name unique.
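The XXXXXX suffix follows the same template convention as mktemp(1). The sketch below shows that convention using mktemp directly; it is an analogy under the assumption that mktemp is installed, not a statement of how bcftools sort creates its directory, and the prefix bcftools-sort is made up for the example:

```shell
# Create a uniquely named directory from a template ending in XXXXXX.
# make_unique_dir is an illustrative helper, not a bcftools command.
make_unique_dir() {
    mktemp -d "$1"
}

dir=$(make_unique_dir "${TMPDIR:-/tmp}/bcftools-sort.XXXXXX")
echo "created: $dir"   # the six X characters are replaced by a unique string
rmdir "$dir"
```

Passing a template of this form to -T lets several sort jobs share a parent directory without their temporary files colliding.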

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -4457,9 +4724,10 @@

Output options

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -4468,6 +4736,11 @@

Output options

Subset options:

+
-A, --trim-unseen-alleles
+
+

remove the unseen allele <*> or <NON_REF> at variant sites when the option is given once (-A) or +at all sites when the option is given twice (-AA).

+
-a, --trim-alt-alleles

remove alleles not seen in the genotype fields from the ALT column. Note that if no alternate allele @@ -4660,6 +4933,98 @@

bcftools [--version-only]

+

SCRIPTS

+
+
+

gff2gff

+
+

Attempts to fix a GFF file to be correctly parsed by csq.

+
+
+
+
+
+
zcat in.gff.gz | gff2gff | gzip -c > out.gff.gz
+
+
+
+
+
+
+

plot-vcfstats [OPTIONS] file.vchk […​]

+
+

Script for processing output of bcftools stats. It can merge +results from multiple outputs (useful when running the stats for each +chromosome separately), plot graphs and create a PDF presentation.

+
+
+
+
-m, --merge
+
+

Merge vcfstats files to STDOUT, skip plotting.

+
+
-p, --prefix DIR
+
+

The output directory. This directory will be created if it does not exist.

+
+
-P, --no-PDF
+
+

Skip the PDF creation step.

+
+
-r, --rasterize
+
+

Rasterize PDF images for faster rendering. This is the default and the opposite of -v, --vectors.

+
+
-s, --sample-names
+
+

Use sample names for xticks rather than numeric IDs.

+
+
-t, --title STRING
+
+

Identify files by these titles in plots. The option can be given multiple +times, for each ID in the bcftools stats output. If not +present, the script will use abbreviated source file names for the titles.

+
+
-v, --vectors
+
+

Generate vector graphics for PDF images, the opposite of -r, --rasterize.

+
+
-T, --main-title STRING
+
+

Main title for the PDF.

+
+
+
+
+

Example:

+
+
+
+
+
+
# Generate the stats
+bcftools stats -s - > file.vchk
+
+
+
+
+
# Plot the stats
+plot-vcfstats -p outdir file.vchk
+
+
+
+
+
# The final looks can be customized by editing the generated
+# 'outdir/plot.py' script and re-running manually
+cd outdir && python plot.py && pdflatex summary.tex
+
+
+
+
+
+
+
+

FILTERING EXPRESSIONS

@@ -4669,8 +5034,7 @@

FILTERING EXPRESSIONS

Valid expressions may contain:
  • -

    numerical constants, string constants, file names (this is currently -supported only to filter by the ID column)

    +

    numerical constants, string constants, file names (indicated by the prefix @)

    1, 1.0, 1e-4
    @@ -4804,7 +5168,7 @@ 

    FILTERING EXPRESSIONS

  • -

    TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,bnd,other,overlap). Use the regex +

    TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,bnd,other,overlap, see TERMINOLOGY). Use the regex operator "\~" to require at least one allele of the given type or the equal sign "=" to require that all alleles are of the given type. Compare

    @@ -5052,12 +5416,17 @@

    FILTERING EXPRESSIONS

    -
    ID=@file       .. selects lines with ID present in the file
    +
    ID=@file               .. selects lines with ID present in the file
    +
    +
    +
    +
    +
    ID!=@~/file            .. skip lines with ID present in the ~/file
    -
    ID!=@~/file    .. skip lines with ID present in the ~/file
    +
    INFO/TAG=@file         .. selects lines with INFO/TAG value present in the file
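The semantics of the `@file` examples can be sketched in Python. This is a minimal illustration only, not bcftools code: the record dictionaries, the helper names and the one-ID-per-line file layout are assumptions made for the example.

```python
import io


def load_ids(handle):
    """Read IDs from a file-like object, assuming one ID per line."""
    return {line.strip() for line in handle if line.strip()}


def filter_by_id(records, ids, exclude=False):
    """Mimic ID=@file (exclude=False) or ID!=@file (exclude=True)."""
    return [rec for rec in records if (rec["ID"] in ids) != exclude]


# Hypothetical stand-ins for the ID file and for VCF records.
id_file = io.StringIO("rs1\nrs3\n")
ids = load_ids(id_file)
records = [
    {"POS": 100, "ID": "rs1"},
    {"POS": 200, "ID": "rs2"},
    {"POS": 300, "ID": "rs3"},
]

selected = [rec["ID"] for rec in filter_by_id(records, ids)]
skipped = [rec["ID"] for rec in filter_by_id(records, ids, exclude=True)]
print(selected)
print(skipped)
```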
    @@ -5096,91 +5465,27 @@

    FILTERING EXPRESSIONS

-

SCRIPTS

+

TERMINOLOGY

-
-

gff2gff

-
-

Attempts to fix a GFF file to be correctly parsed by csq.

-
-
-
-
-
-
zcat in.gff.gz | gff2gff | gzip -c > out.gff.gz
-
-
-
-
-
-
-

plot-vcfstats [OPTIONS] file.vchk […​]

-
-

Script for processing output of bcftools stats. It can merge -results from multiple outputs (useful when running the stats for each -chromosome separately), plots graphs and creates a PDF presentation.

-
-
-
-
-m, --merge
-
-

Merge vcfstats files to STDOUT, skip plotting.

-
-
-p, --prefix DIR
-
-

The output directory. This directory will be created if it does not exist.

-
-
-P, --no-PDF
-
-

Skip the PDF creation step.

-
-
-r, --rasterize
-
-

Rasterize PDF images for faster rendering. This is the default and the opposite of -v, --vectors.

-
-
-s, --sample-names
-
-

Use sample names for xticks rather than numeric IDs.

-
-
-t, --title STRING
-
-

Identify files by these titles in plots. The option can be given multiple -times, for each ID in the bcftools stats output. If not -present, the script will use abbreviated source file names for the titles.

-
-
-v, --vectors
-
-

Generate vector graphics for PDF images, the opposite of -r, --rasterize.

-
-
-T, --main-title STRING
-
-

Main title for the PDF.

-
-
-
-

Example:

+

The program and the documentation use the following terminology; multiple terms can be used +interchangeably for the same VCF record type

-
# Generate the stats
-bcftools stats -s - > file.vchk
-
-
-
-
-
# Plot the stats
-plot-vcfstats -p outdir file.vchk
-
-
-
-
-
# The final looks can be customized by editing the generated
-# 'outdir/plot.py' script and re-running manually
-cd outdir && python plot.py && pdflatex summary.tex
-
+
REF   ALT
+---------
+C     .         .. reference allele / non-variant site / ref-only site
+C     T         .. SNP or SNV (single-nucleotide polymorphism or variant), used interchangeably
+CC    TT        .. MNP (multi-nucleotide polymorphism)
+CAAA  C         .. indel, deletion (regardless of length)
+C     CAAA      .. indel, insertion (regardless of length)
+C     <*>       .. gVCF block, the allele <*> is a placeholder for alternate allele possibly missed because of low coverage
+C     <NON_REF> .. synonymous to <*>
+C     *         .. overlapping deletion
+C     <INS>     .. symbolic allele, known also as 'other [than above]'
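The table above can be read as a simple decision procedure over a single REF/ALT pair. The Python sketch below is a simplified illustration under that reading, not the classification code used by bcftools: the function name and the returned labels are invented for the example, and corner cases not shown in the table (breakends, equal-length alleles that differ at one base only) are not handled.

```python
def classify(ref, alt):
    """Classify one REF/ALT pair using the terminology table (simplified)."""
    if alt == ".":
        return "ref"                      # non-variant / ref-only site
    if alt in ("<*>", "<NON_REF>"):
        return "gvcf-block"               # placeholder for an unobserved allele
    if alt == "*":
        return "overlapping-deletion"
    if alt.startswith("<") and alt.endswith(">"):
        return "other"                    # any remaining symbolic allele
    if len(ref) == len(alt):
        return "snp" if len(ref) == 1 else "mnp"
    # Alleles of different length are indels, regardless of the length.
    return "deletion" if len(ref) > len(alt) else "insertion"


for ref, alt in [("C", "."), ("C", "T"), ("CC", "TT"), ("CAAA", "C"),
                 ("C", "CAAA"), ("C", "<*>"), ("C", "*"), ("C", "<INS>")]:
    print(ref, alt, classify(ref, alt))
```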
@@ -5257,7 +5562,7 @@

COPYING

diff --git a/bcftools.html b/bcftools.html index f17e00be8..dca2ffbd5 100644 --- a/bcftools.html +++ b/bcftools.html @@ -50,7 +50,7 @@

DESCRIPTION

VERSION

-

This manual page was last updated 2023-05-30 09:18 BST and refers to bcftools git version 1.17-50-ga8249495+.

+

This manual page was last updated 2024-04-29 08:11 BST and refers to bcftools git version 1.20-6-g5977f1f3+.

@@ -426,9 +426,12 @@

Common Options

Use multithreading with INT worker threads. The option is currently used only for the compression of the output stream, only when --output-type is b or z. Default: 0.

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output files. Can be used only for compressed BCF and VCF output.

+

Automatically index the output files. FMT is optional and can be +one of "tbi" or "csi" depending on output file format. Defaults to +CSI unless specified otherwise. Can be used only for compressed +BCF and VCF output.
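The choice of index format described above can be sketched as follows. This is a hedged illustration in Python, not bcftools code: the function name is hypothetical, the output types `b` and `z` are the compressed BCF and VCF types of `--output-type`, and the rejection of "tbi" for BCF is an assumption based on the phrase "depending on output file format".

```python
def index_format(output_type, fmt=None):
    """Return the index format for a given --output-type and optional FMT."""
    if output_type not in ("b", "z"):
        raise ValueError("indexing needs compressed BCF (b) or VCF (z) output")
    if fmt is None:
        return "csi"                      # documented default
    if fmt not in ("tbi", "csi"):
        raise ValueError("FMT must be 'tbi' or 'csi'")
    if fmt == "tbi" and output_type == "b":
        # Assumption: TBI is only meaningful for compressed VCF output.
        raise ValueError("tbi is not applicable to BCF output")
    return fmt


print(index_format("z"))
print(index_format("z", "tbi"))
print(index_format("b", "csi"))
```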

@@ -487,7 +490,7 @@

bcftools annotate [OPTIONS] FILE

Comma-separated list of columns or tags to carry over from the annotation file (see also -a, --annotations). If the annotation file is not a VCF/BCF, list describes the columns of the annotation file and must include CHROM, -POS (or, alternatively, FROM and TO), and optionally REF and ALT. Unused +POS (or, alternatively, FROM,TO or BEG,END), and optionally REF and ALT. Unused columns which should be ignored can be indicated by "-".  
 
@@ -511,16 +514,50 @@

bcftools annotate [OPTIONS] FILE

To append to existing values (rather than replacing or leaving untouched), use "=TAG" (instead of "TAG" or "+TAG"). To replace only existing values without modifying missing annotations, use "-TAG". +As a special case of this, if position needs to be replaced, mark the column with the new coordinate as "-POS". +(Note that in previous releases this used to be "~POS", now deprecated.) + 

To match the record also by ID or INFO/END, in addition to REF and ALT, use "~ID" or "~INFO/END". -If position needs to be replaced, mark the column with the new position as "~POS". +Note that this works only for ID and POS, for other fields see the description of -i below.  
 
If the annotation file is not a VCF/BCF, all new annotations must be defined via -h, --header-lines.  
 
-See also the -l, --merge-logic option.

+See also the -l, --merge-logic option. + 

+Summary of -c, --columns:

+ + +
+
+
    CHROM,POS,TAG       .. match by chromosome and position, transfer annotation from TAG
+    CHROM,POS,-,TAG     .. same as above, but ignore the third column of the annotation file
+    CHROM,BEG,END,TAG   .. match by region (BEG,END are synonymous to FROM,TO)
+    CHROM,POS,REF,ALT   .. match by CHROM, POS, REF and ALT
+
+    DST_TAG:=SRC_TAG    .. transfer the SRC_TAG using the new name DST_TAG
+    INFO                .. transfer all INFO annotations
+    ^INFO/TAG           .. transfer all INFO annotations except "TAG"
+
+    TAG       .. add or overwrite existing target value if source is not "." and skip otherwise
+    +TAG      .. add or overwrite existing target value only if it is "."
+    .TAG      .. add or overwrite existing target value even if source is "."
+    .+TAG     .. add new but never overwrite existing tag, regardless of its value; can transfer "." if target does not exist
+    -TAG      .. overwrite existing value, never add new if target does not exist
+    =TAG      .. do not overwrite but append value to existing tags
+
+    ~FIELD    .. use this column to match lines with -i/-e expression (see the description of -i below)
+    ~ID       .. in addition to CHROM,POS,REF,ALT match also by ID
+    ~INFO/END .. in addition to CHROM,POS,REF,ALT match also by INFO/END
+
+
+
+
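The six TAG prefixes in the summary differ only in how a source value is combined with the target value. The Python sketch below is one possible reading of that table, not the bcftools implementation: the function name is hypothetical, `None` stands for a tag that is absent from the target record, and the behaviour for combinations the table does not spell out (for example a "." source with `+TAG`, `-TAG` or `=TAG`) is an assumption.

```python
MISSING = "."


def transfer(mode, target, source):
    """Return the target value after applying one source value.

    target is None when the tag does not exist in the target record.
    """
    if mode == "TAG":       # add or overwrite unless the source is "."
        return target if source == MISSING else source
    if mode == "+TAG":      # fill in only when the target is absent or "."
        if target in (None, MISSING) and source != MISSING:
            return source
        return target
    if mode == ".TAG":      # add or overwrite even if the source is "."
        return source
    if mode == ".+TAG":     # add when absent, never touch an existing tag
        return source if target is None else target
    if mode == "-TAG":      # overwrite an existing value, never add
        if target is None or source == MISSING:   # "." source: assumption
            return target
        return source
    if mode == "=TAG":      # append instead of overwriting
        if source == MISSING:                     # assumption
            return target
        if target in (None, MISSING):
            return source
        return f"{target},{source}"
    raise ValueError(f"unknown mode: {mode}")


for mode in ("TAG", "+TAG", ".TAG", ".+TAG", "-TAG", "=TAG"):
    print(mode, transfer(mode, "old", "new"))
```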
-C, --columns-file file

Read the list of columns from a file (normally given via the -c, --columns option). @@ -532,7 +569,7 @@

bcftools annotate [OPTIONS] FILE

-e, --exclude EXPRESSION

exclude sites for which EXPRESSION is true. For valid expressions see -EXPRESSIONS.

+EXPRESSIONS and the extension described in -i, --include below.

--force
@@ -573,8 +610,27 @@

bcftools annotate [OPTIONS] FILE

-i, --include EXPRESSION

include only sites for which EXPRESSION is true. For valid expressions see -EXPRESSIONS.

+EXPRESSIONS. + 

+Additionally, the command bcftools annotate supports expressions updated from the annotation +file dynamically for each record:

+
+
+
+
+
    # The field 'STR' from the -a file is required to match INFO/TAG in VCF. In the first example
+    # the alleles REF,ALT must match, in the second example they are ignored. The option -k is required
+    # to output also records that are not annotated. The third example shows the same concept with
+    # a numerical expression.
+    bcftools annotate -a annots.tsv.gz -c CHROM,POS,REF,ALT,SCORE,~STR -i'TAG={STR}' -k input.vcf
+    bcftools annotate -a annots.tsv.gz -c CHROM,POS,-,-,SCORE,~STR     -i'TAG={STR}' -k input.vcf
+    bcftools annotate -a annots.tsv.gz -c CHROM,POS,-,-,SCORE,~INT     -i'TAG>{INT}' -k input.vcf
+
+
+
+
-k, --keep-sites

keep sites which do not pass -i and -e expressions instead of discarding them

@@ -681,9 +737,10 @@

bcftools annotate [OPTIONS] FILE

"^INFO/FOO,INFO/BAR" (and similarly for FORMAT and FILTER). "INFO" can be abbreviated to "INF" and "FORMAT" to "FMT".

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -720,7 +777,7 @@

bcftools annotate [OPTIONS] FILE

# that INFO/END is already present in the VCF header. bcftools annotate -a annots.tab.gz -c CHROM,POS,~ID,REF,ALT,INFO/END input.vcf - # For more examples see http://samtools.github.io/bcftools/howtos/annotate.html + # For (many) more examples see http://samtools.github.io/bcftools/howtos/annotate.html @@ -814,9 +871,10 @@

File format options:

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -830,6 +888,10 @@

Input/output options:

output all alternate alleles present in the alignments even if they do not appear in any of the genotypes

+
-*, --keep-unseen-allele
+
+

keep the unobserved allele <*> or <NON_REF>, useful mainly for gVCF output

+
-f, --format-fields list

comma-separated list of FORMAT fields to output for each sample. Currently @@ -866,7 +928,7 @@

Input/output options:

-G, --group-samples FILE|-
-

by default, all samples are assumed to come from a single population. This option allows to group samples +

by default, all samples are assumed to come from a single population. This option groups samples into populations and applies the HWE assumption within but not across the populations. FILE is a tab-delimited text file with sample names in the first column and group names in the second column. If - is given instead, no HWE assumption is made at all and single-sample calling is performed. (Note that @@ -1182,9 +1244,10 @@

bcftools concat [OPTIONS] FILE1 FILE2

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -1306,6 +1369,11 @@

bcftools consensus [OPTIONS] FILE

write output to a file

+
--regions-overlap 0|1|2
+
+

how to treat VCF variants overlapping the target region in the fasta file: +see Common Options

+
-s, --samples LIST

apply variants of the listed samples. See also the option -I, --iupac-codes

@@ -1401,9 +1469,10 @@

VCF input options:

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -1740,6 +1809,10 @@

bcftools csq [OPTIONS] FILE

if more are required, see the --ncsq option.

+

Note that the program annotates only records with a functional consequence and +intergenic regions will pass through unchanged.

+
+

The program requires on input a VCF/BCF file, the reference genome in fasta format (--fasta-ref) and genomic features in the GFF3 format downloadable from the Ensembl website (--gff-annot), and outputs an annotated VCF/BCF @@ -1789,7 +1862,7 @@

bcftools csq [OPTIONS] FILE

--force
-

run even if some sanity checks fail. Currently the option allows to skip +

run even if some sanity checks fail. Currently the option enables skipping transcripts in malformatted GFFs with incorrect phase

-g, --gff-annot FILE
@@ -1946,9 +2019,10 @@

bcftools csq [OPTIONS] FILE

and VCF, such as "chrX" vs "X". The chromosome names in the output VCF will match that of the input VCF. The default is to attempt the automatic translation.

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -2141,7 +2215,7 @@

bcftools filter [OPTIONS] FILE

-s, --soft-filter STRING|+

annotate FILTER column with STRING or, with +, a unique filter name generated -by the program ("Filter%d").

+by the program ("Filter%d"). Applies to records that do not meet filter expression.

-S, --set-GTs .|0
@@ -2163,9 +2237,10 @@

bcftools filter [OPTIONS] FILE

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -2178,6 +2253,11 @@

bcftools gtcheck [OPTIONS] [-g ge is checked against the samples in the -g file. Without the -g option, multi-sample cross-check of samples in query.vcf.gz is performed.

+
+

Note that the interpretation of the discordance score depends on the options provided (specifically -e and +-u) and on the available annotations (FORMAT/PL vs FORMAT/GT). +The discordance score can be interpreted as the number of mismatching genotypes if only GT-vs-GT matching is performed.

+
--distinctive-sites NUM[,MEM[,DIR]]
@@ -2191,16 +2271,29 @@

bcftools gtcheck [OPTIONS] [-g ge

Stop after first record to estimate required time.

-
-e, --error-probability INT
+
-e, --exclude [qry|gt]:'EXPRESSION'
+
+

Exclude sites from query file (qry:) or genotype file (gt:) for which EXPRESSION is true. +For valid expressions see EXPRESSIONS.

+
+
-E, --error-probability INT

Interpret genotypes and genotype likelihoods probabilistically. The value of INT represents genotype quality when GT tag is used (e.g. Q=30 represents one error in 1,000 genotypes and Q=40 one error in 10,000 genotypes) and is ignored when PL tag is used (in that case an arbitrary -non-zero integer can be provided). See also the -u, --use option below. If set to 0, -the discordance equals to the number of mismatching genotypes when GT vs GT is compared. -Note that the values with and without -e are not comparable, only values generated -with -e 0 correspond to mismatching genotypes. -If performance is an issue, set to 0 for faster run but less accurate results.

+non-zero integer can be provided). + 

+If -E is set to 0, the discordance score can be interpreted as the number of mismatching genotypes, +but only in the GT-vs-GT matching mode. See the -u, --use option below for additional notes and caveats. + 

+If performance is an issue, set -E 0 for faster run times but less accurate results. + 

Note that in previous versions of bcftools (<=1.18), this option used to be a lowercase -e. It +was changed to make room for the filtering option -e, --exclude, to stay consistent with other +commands.
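The genotype quality values quoted for this option follow the standard Phred scale, where a quality Q corresponds to an error probability of 10^(-Q/10). A short Python check of the two figures given above (the function name is chosen for the example):

```python
def phred_to_error(q):
    """Convert a Phred-scaled quality into an error probability."""
    return 10 ** (-q / 10)


for q in (30, 40):
    p = phred_to_error(q)
    print(f"Q={q}: one error in {round(1 / p):,} genotypes")
```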

-g, --genotypes FILE
@@ -2210,6 +2303,11 @@

bcftools gtcheck [OPTIONS] [-g ge

Homozygous genotypes only, useful with low coverage data (requires -g, --genotypes)

+
-i, --include [qry|gt]:'EXPRESSION'
+
+

Include sites from query file (qry:) or genotype file (gt:) for which EXPRESSION is true. +For valid expressions see EXPRESSIONS.

+
--n-matches INT

Print only top INT matches for each sample, 0 for unlimited. Use negative value @@ -2221,6 +2319,14 @@

bcftools gtcheck [OPTIONS] [-g ge

Disable calculation of HWE probability to reduce memory requirements with comparisons between very large number of sample pairs.

+
-o, --output FILE
+
+

Write to FILE rather than to standard output, where it is written by default.

+
+
-O, --output-type t|z
+
+

Write a plain (t) or compressed (z) text tab-delimited output.

+
-p, --pairs LIST

A comma-separated list of sample pairs to compare. When the -g option is given, the first @@ -2274,8 +2380,13 @@

bcftools gtcheck [OPTIONS] [-g ge
-u, --use TAG1[,TAG2]

specifies which tag to use in the query file (TAG1) and the -g (TAG2) file. -By default, the PL tag is used in the query file and GT in the -g file when -available.

+By default, the PL tag is used in the query file and, when available, the GT tags in the +-g file. + 

+Note that when the requested tag is not available, the program will attempt to use +the other tag. The output includes the number of sites that were matched by the four +possible modes (for example GT-vs-GT or GT-vs-PL).

@@ -2284,10 +2395,10 @@

bcftools gtcheck [OPTIONS] [-g ge
-
   # Check discordance of all samples from B against all sample in A
+
   # Check discordance of all samples from B against all samples in A
    bcftools gtcheck -g A.bcf B.bcf
 
-   # Limit comparisons to the fiven list of samples
+   # Limit comparisons to the given list of samples
    bcftools gtcheck -s gt:a1,a2,a3 -s qry:b1,b2 -g A.bcf B.bcf
 
    # Compare only two pairs a1,b1 and a1,b2
@@ -2322,6 +2433,13 @@ 

Options:

Also display the first INT variant records. By default, no variant records are displayed.

+
-s, --samples INT
+
+

Display the first INT variant records including the last #CHROM header line with samples. +Running with -s 0 alone outputs the #CHROM header line only. Note that +the list of samples, with each sample per line, can be obtained with bcftools query using +the option -l, --list-samples.

+
@@ -2430,6 +2548,10 @@

bcftools isec [OPTIONS] A.vcf.gz B.vcf.gzinclude only sites for which EXPRESSION is true. See discussion of -e, --exclude above.

+
-f, --file-list FILE
+
+

Read file names from FILE, one file name per line.

+
-n, --nfiles [+-=]INT|~BITMAP

output positions present in this many (=), this many or more (+), this @@ -2474,12 +2596,14 @@

bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz
-w, --write LIST
-

list of input files to output given as 1-based indices. With -p and no +

comma-separated list of input files to output given as 1-based indices. With -p and no -w, all files are written.

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file. This is done automatically with the -p option.

+

Automatically index the output file. FMT is optional and defaults +to tbi for vcf.gz and csi for bcf. This is done automatically +with the -p option if the output format is compressed.

@@ -2550,6 +2674,10 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz<
+
--force-no-index
+
+

synonymous to --no-index

+
--force-samples

if the merged files contain duplicate samples names, proceed anyway. @@ -2557,6 +2685,10 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz< as it appeared on the command line to the conflicting sample name (see 2:S3 in the above example).

+
--force-single
+
+

run even if only one file is given on input

+
--print-header

print only merged header and exit

@@ -2605,16 +2737,18 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz<

Sites with many alternate alleles can require extremely large storage space which can exceed the 2GB size limit representable by BCF. This is caused by Number=G tags (such as FORMAT/PL) which store a value for each combination of reference -and alternate alleles. The -L, --local-alleles option allows to replace such tags +and alternate alleles. The -L, --local-alleles option allows replacement of such tags with a localized tag (FORMAT/LPL) which only includes a subset of alternate alleles relevant for that sample. A new FORMAT/LAA tag is added which lists 1-based indices of the alternate alleles relevant (local) for the current sample. The number INT gives the maximum number of alternate alleles that can be included in the PL tag. The default value is 0 which disables the feature and outputs values for all alternate alleles.

-
-m, --merge snps|indels|both|snp-ins-del|all|none|id
+
-m, --merge snps|indels|both|snp-ins-del|all|none|id[,*]
-

The option controls what types of multiallelic records can be created:

+

The option controls what types of multiallelic records can be created. If a single asterisk +* is appended, the unobserved allele <*> or <NON_REF> will be removed at variant sites; +if two asterisks ** are appended, the unobserved allele will be removed at all sites.

@@ -2624,6 +2758,8 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz< -m snps .. allow multiallelic SNP records -m indels .. allow multiallelic indel records -m both .. both SNP and indel records can be multiallelic +-m both,* .. same as above but remove <*> (or <NON_REF>) from variant sites +-m both,** .. same as above but remove <*> (or <NON_REF>) at all sites -m all .. SNP records can be merged with indel records -m snp-ins-del .. allow multiallelic SNVs, insertions, deletions, but don't mix them -m id .. merge by ID @@ -2637,13 +2773,13 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz< alleles, vector fields pertaining to unobserved alleles are set to missing (.) by default. The METHOD is one of . (the default, use missing values), NUMBER (use a constant value, e.g. 0), max (the maximum value observed for other alleles in the sample). When --gvcf option is set, -the rule -M PL:max,AD:0 is implied. This can be overriden with providing -M - or -M PL:.,AD:.. +the rule -M PL:max,AD:0 is implied. This can be overridden with providing -M - or -M PL:.,AD:.. Note that if the unobserved allele is explicitly present as <*> or <NON_REF>, then its corresponding value will be used regardless of -M settings.

--no-index
-

the option allows to merge files without indexing them first. In order for this +

the option allows files to be merged without indexing them first. In order for this option to work, the user must ensure that the input files have chromosomes in the same order and consistent with the order of sequences in the VCF header.

@@ -2675,9 +2811,10 @@

bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz<

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -2817,7 +2954,23 @@

Input options

A new EXPERIMENTAL indel calling model which aims to address some known deficiencies of the current indel calling algorithm. Specifically, it uses diploid reference consensus sequence. Note that in the current version it has the potential to increase sensitivity -but at the cost of decreased specificity

+but at the cost of decreased specificity. +Only works with short-read sequencing technologies.

+ +
--indels-cns
+
+

Another EXPERIMENTAL indel calling method, predating indels-2.0 in +PR form, but merged more recently. It also uses a diploid +reference consensus, but with added parameters and heuristics to +optimise for a variety of sequencing platforms. This is usually +faster and more accurate than the default caller and --indels-2.0, +but has not been tested on non-diploid samples and samples without +approximately even allele frequency.

+
+
--no-indels-cns
+
+

May be used to turn off --indels-cns mode when using one of the +newer profiles that has this enabled by default.

-q, --min-MQ INT
@@ -2991,9 +3144,10 @@

Output options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -3004,15 +3158,70 @@

Options for SNP/I
-X, --config STR
-

Specify a platform specific configuration profile. The profile -should be one of 1.12, illumina, ont or pacbio-ccs. -Settings applied are as follows:

+

Specify a platform-specific configuration profile. Specifying the +profile as "list" will list the available profile names and the +parameters they change. There are profiles named after a release, +which should be used if you wish to ensure forward compatibility +of results. The non-versioned names (e.g. "illumina") will always +point to the most recent set of parameters for that instrument type. +The current values are:

-
1.12           -Q13 -h100 -m1
-illumina       [ default values ]
-ont	           -B -Q5 --max-BQ 30 -I
-pacbio-ccs     -D -Q5 --max-BQ 50 -F0.1 -o25 -e1 -M99999
+
1.12            -Q13 -h100 -m1
+
+
+
+
+
bgi
+bgi-1.20        --indels-cns -B --indel-size 80 -F0.1 --indel-bias 0.9
+                --seqq-offset 120
+
+
+
+
+
illumina-1.18   [ default values ]
+
+
+
+
+
illumina
+illumina-1.20   --indels-cns --seqq-offset 125
+
+
+
+
+
ont             -B -Q5 --max-BQ 30 -I
+
+
+
+
+
ont-sup
+ont-sup-1.20    --indels-cns -B -Q1 --max-BQ 35 --delta-BQ 99 -F0.2
+                -o15 -e1 -h110 --del-bias 0.4 --indel-bias 0.7
+                --poly-mqual --seqq-offset 130 --indel-size 80
+
+
+
+
+
pacbio-ccs-1.18 -D -Q5 --max-BQ 50 -F0.1 -o25 -e1 -M99999
+
+
+
+
+
pacbio-ccs
+pacbio-ccs-1.20  --indels-cns -B -Q5 --max-BQ 50 -F0.1 -o25 -e1 -h300
+                 --delta-BQ 10 --del-bias 0.4 --poly-mqual
+                 --indel-bias 0.9 --seqq-offset 118 --indel-size 80
+                 --score-vs-ref 0.7
+
+
+
+
+
ultima
+ultima-1.20      --indels-cns -B -Q1 --max-BQ 30 --delta-BQ 10 -F0.15
+                 -o20 -e10 -h250 --del-bias 0.3 --indel-bias 0.7
+                 --poly-mqual --seqq-offset 140 --score-vs-ref 0.3
+                 --indel-size 80
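The relationship between versioned and non-versioned profile names can be sketched as a lookup table with aliases. This Python sketch copies the settings from the listing above; the alias mapping is an assumption read off the table layout (a non-versioned name listed next to a versioned one shares its settings), and the names `PROFILES`, `ALIASES` and `resolve` are invented for the example.

```python
PROFILES = {
    "1.12": "-Q13 -h100 -m1",
    "bgi-1.20": "--indels-cns -B --indel-size 80 -F0.1 --indel-bias 0.9 "
                "--seqq-offset 120",
    "illumina-1.18": "",  # default values
    "illumina-1.20": "--indels-cns --seqq-offset 125",
    "ont": "-B -Q5 --max-BQ 30 -I",
    "ont-sup-1.20": "--indels-cns -B -Q1 --max-BQ 35 --delta-BQ 99 -F0.2 "
                    "-o15 -e1 -h110 --del-bias 0.4 --indel-bias 0.7 "
                    "--poly-mqual --seqq-offset 130 --indel-size 80",
    "pacbio-ccs-1.18": "-D -Q5 --max-BQ 50 -F0.1 -o25 -e1 -M99999",
    "pacbio-ccs-1.20": "--indels-cns -B -Q5 --max-BQ 50 -F0.1 -o25 -e1 -h300 "
                       "--delta-BQ 10 --del-bias 0.4 --poly-mqual "
                       "--indel-bias 0.9 --seqq-offset 118 --indel-size 80 "
                       "--score-vs-ref 0.7",
    "ultima-1.20": "--indels-cns -B -Q1 --max-BQ 30 --delta-BQ 10 -F0.15 "
                   "-o20 -e10 -h250 --del-bias 0.3 --indel-bias 0.7 "
                   "--poly-mqual --seqq-offset 140 --score-vs-ref 0.3 "
                   "--indel-size 80",
}

# Non-versioned names point at the most recent versioned profile (assumed).
ALIASES = {
    "bgi": "bgi-1.20",
    "illumina": "illumina-1.20",
    "ont-sup": "ont-sup-1.20",
    "pacbio-ccs": "pacbio-ccs-1.20",
    "ultima": "ultima-1.20",
}


def resolve(name):
    """Return the option string applied by a profile name."""
    return PROFILES[ALIASES.get(name, name)]


print(resolve("illumina"))
print(resolve("pacbio-ccs-1.18"))
```

Pinning a versioned name such as `illumina-1.18` keeps the parameters fixed across releases, whereas `illumina` may resolve to a different set after an upgrade.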
@@ -3058,12 +3267,32 @@

Options for SNP/I 0.75) while higher depth samples or where you favour recall rates over precision may work better with a higher value such as 2.0.

+
--del-bias FLOAT
+
+

Skews the likelihood of deletions over insertions. Defaults to an +even distribution value of 1.0. Lower values imply a higher rate +of false positive deletions (meaning candidate deletions are less +likely to be real).

+
--indel-size INT

Indel window size to use when assessing the quality of candidate indels. Note that although the window size approximately corresponds to the maximum indel size considered, it is not an exact threshold [110]

+
--seqq-offset INT
+
+

Tunes the importance of indel sequence quality per depth. The +final "seqQ" quality used is "offset - 5*MIN(depth,20)". [120]
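The seqQ formula quoted above is easy to evaluate by hand; a small Python sketch (the function name is chosen for the example, and 120 is the documented default offset) shows how the value stops decreasing once the depth reaches 20:

```python
def seqq(offset, depth):
    """Evaluate offset - 5*MIN(depth,20) as given in the option description."""
    return offset - 5 * min(depth, 20)


for depth in (4, 10, 20, 50):
    print(depth, seqq(120, depth))
```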

+
+
--poly-mqual
+
+

Use the lowest quality value within a homopolymer run, instead of +the quality immediately adjacent to the indel. This may be +important for unclocked instruments, particularly ones with a flow +chemistry where runs of bases of identical type are incorporated +together.

+
-I, --skip-indels

Do not perform INDEL calling

@@ -3157,14 +3386,14 @@

bcftools norm [OPTIONS] file.vcf.gz

100 CC C,GG 1/2 # After: - # bcftools norm -a . + # bcftools norm -a --atom-overlaps . 100 C G ./1 100 CC C 1/. 101 C G ./1 # After: - # bcftools norm -a '*' - # bcftools norm -a \* + # bcftools norm -a --atom-overlaps '*' + # bcftools norm -a --atom-overlaps \* 100 C G,* 2/1 100 CC C,* 1/2 101 C G,* 2/1 @@ -3205,6 +3434,12 @@

bcftools norm [OPTIONS] file.vcf.gz

try to proceed with -m- even if malformed tags with incorrect number of fields are encountered, discarding such tags. (Experimental, use at your own risk.)

+
-g, --gff-annot FILE
+
+

when a GFF file is provided, follow the HGVS 3’ rule and right-align variants in transcripts on the forward +strand. In case of overlapping transcripts, the default mode is to left-align the variant. For a +description of the supported GFF3 file format see bcftools csq.

+
--keep-sum TAG[,…​]

keep vector sum constant when splitting multiallelic sites. Only AD tag @@ -3218,7 +3453,11 @@

bcftools norm [OPTIONS] file.vcf.gz

together: If only SNP records should be split or merged, specify snps; if both SNPs and indels should be merged separately into two records, specify both; if SNPs and indels should be merged into a single record, specify -any.

+any. + 

+Note that multiallelic sites with both SNPs and indels will be split into +biallelic sites with both -m -snps and -m -indels.

--multi-overlaps 0|.
@@ -3285,9 +3524,10 @@

bcftools norm [OPTIONS] file.vcf.gz

maximum distance between two records to consider when locally sorting variants which changed position during the realignment

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -3364,9 +3604,10 @@

VCF output options:

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -3613,13 +3854,14 @@

List of plugins coming wi
split-vep
-

extract fields from structured annotations such as INFO/CSQ created by bcftools/csq or VEP. These -can be added as a new INFO field to the VCF or in a custom text format. See +

extract fields from structured annotations such as INFO/CSQ created by VEP or INFO/BCSQ created by +bcftools/csq. These can be added as a new INFO field to the VCF or in a custom text format. See http://samtools.github.io/bcftools/howtos/plugin.split-vep.html for more.

tag2tag
-

Convert between similar tags, such as GL,PL,GP or QR,QA,QS.

+

Convert between similar tags, such as GL,PL,GP or QR,QA,QS or tags with localized alleles e.g. LPL,LAD. +See http://samtools.github.io/bcftools/howtos/plugin.tag2tag.html for more.

trio-dnm2
@@ -3830,6 +4072,12 @@

bcftools query [OPTIONS] file.vcf.gz [file.

learn by example, see below

+
-F, --print-filtered STR
+
+

by default, samples failing -i/-e filtering expressions are suppressed from output +when FORMAT fields are queried (for example %CHROM %POS [ %GT]). With -F, such +fields will still be printed, but STR will be used instead of their actual value.

+
-H, --print-header

print header

@@ -3843,6 +4091,14 @@

bcftools query [OPTIONS] file.vcf.gz [file.

list sample names and exit

+
-N, --disable-automatic-newline
+
+

disable automatic addition of a missing newline character at the end of the formatting +expression. By default, the program checks if the expression contains a newline +and appends it if not, to prevent formatting the entire output into a single +line by mistake. Note that versions prior to 1.18 had no automatic check and newline +had to be included explicitly.
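The newline check described for this option amounts to the following logic. This is a minimal Python sketch, not the bcftools source: the function name is hypothetical, and the format string is assumed to carry the newline as the two-character escape `\n`, as it does when typed on the command line.

```python
def effective_format(fmt, disable_automatic_newline=False):
    """Return the format string as it would be used after the newline check."""
    if disable_automatic_newline or "\\n" in fmt:
        return fmt
    return fmt + "\\n"


print(effective_format("%CHROM\\t%POS"))
print(effective_format("%CHROM\\t%POS\\n"))
print(effective_format("%CHROM\\t%POS", disable_automatic_newline=True))
```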

+
-o, --output FILE

see Common Options

@@ -3913,6 +4169,7 @@

Format:

%TBCSQ Translated FORMAT/BCSQ. See the csq command above for explanation and examples. %TGT Translated genotype (e.g. C/A) %TYPE Variant type (REF, SNP, MNP, INDEL, BND, OTHER) +%VKX VariantKey, biallelic hexadecimal encoding of CHROM,POS,REF,ALT (https://github.com/tecnickcom/variantkey) [] Format fields must be enclosed in brackets to loop over all samples \n new line \t tab character @@ -3976,6 +4233,14 @@

Examples:

bcftools query -f '%AC{1}\n' -i 'AC[1]>10' file.vcf.gz +
+
+
# Print all samples at sites where at least one sample has DP=1 or DP=2. In the second case
+# print only samples with DP=1 or DP=2, the difference is in the logical operator used, || vs |.
+bcftools query -f '[%SAMPLE %GT %DP\n]' -i 'FMT/DP=1 || FMT/DP=2' file.vcf
+bcftools query -f '[%SAMPLE %GT %DP\n]' -i 'FMT/DP=1 |  FMT/DP=2' file.vcf
+
+
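The difference between `||` and `|` in the two commands above can be illustrated with a toy model in Python. The sample tuples and function names are made up for the example, and the sketch models only the documented outcome (which samples are printed), not how bcftools evaluates expressions.

```python
def matches(dp):
    """The per-sample condition FMT/DP=1 or FMT/DP=2."""
    return dp in (1, 2)


def printed_with_site_level_or(samples):
    """'FMT/DP=1 || FMT/DP=2': if any sample matches, print all samples."""
    return list(samples) if any(matches(dp) for _, dp in samples) else []


def printed_with_sample_level_or(samples):
    """'FMT/DP=1 | FMT/DP=2': print only the samples that match."""
    return [(name, dp) for name, dp in samples if matches(dp)]


site = [("s1", 1), ("s2", 7), ("s3", 2)]
print(printed_with_site_level_or(site))
print(printed_with_sample_level_or(site))
```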
@@ -4010,7 +4275,7 @@

bcftools reheader [OPTIONS] file.vcf.gz

-T, --temp-prefix PATH
-

template for temporary file names, used with -f

+

this option is ignored, but left for compatibility with earlier versions of bcftools.

--threads INT
@@ -4248,11 +4513,13 @@

bcftools sort [OPTIONS] file.bcf

-T, --temp-dir DIR
-

Use this directory to store temporary files

+

Use this directory to store temporary files. If the last six characters of the string DIR are XXXXXX, +then these are replaced with a string that makes the directory name unique.
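The XXXXXX convention resembles the template used by POSIX mkdtemp. The Python sketch below only illustrates the name substitution and is not how bcftools does it: the function name, the choice of random characters and the example paths are assumptions, and no directory is created.

```python
import secrets
import string


def unique_dir_name(template):
    """Replace a trailing XXXXXX with six random characters, if present."""
    if not template.endswith("XXXXXX"):
        return template
    alphabet = string.ascii_letters + string.digits
    suffix = "".join(secrets.choice(alphabet) for _ in range(6))
    return template[:-6] + suffix


print(unique_dir_name("/tmp/bcftools.XXXXXX"))
print(unique_dir_name("/tmp/sort-tmp"))
```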

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -4457,9 +4724,10 @@

Output options

see Common Options

-
--write-index
+
-W[FMT], -W[=FMT], --write-index[=FMT]
-

Automatically index the output file

+

Automatically index the output file. FMT is optional and can be +one of "tbi" or "csi" depending on output file format.

@@ -4468,6 +4736,11 @@

Output options

Subset options:

+
-A, --trim-unseen-alleles
+
+

remove the unseen allele <*> or <NON_REF> at variant sites when the option is given once (-A) or +at all sites when the option is given twice (-AA).
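The effect of giving the option once or twice can be sketched on the ALT column alone. This Python illustration makes two assumptions that the text does not state: a "variant site" is taken to be a site with at least one ALT allele other than the unseen allele, and updates to genotype or FORMAT fields are ignored. The function name is hypothetical.

```python
UNSEEN = ("<*>", "<NON_REF>")


def trim_unseen(alts, times):
    """Return the ALT alleles after -A (times=1) or -AA (times=2)."""
    observed = [a for a in alts if a not in UNSEEN]
    is_variant_site = bool(observed)
    if times >= 2 or (times == 1 and is_variant_site):
        return observed
    return list(alts)


print(trim_unseen(["T", "<*>"], 1))
print(trim_unseen(["<*>"], 1))
print(trim_unseen(["<*>"], 2))
```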

+
-a, --trim-alt-alleles

remove alleles not seen in the genotype fields from the ALT column. Note that if no alternate allele @@ -4660,6 +4933,98 @@

bcftools [--version-only]

+

SCRIPTS

+
+
+

gff2gff

+
+

Attempts to fix a GFF file to be correctly parsed by csq.

+
+
+
+
+
+
zcat in.gff.gz | gff2gff | gzip -c > out.gff.gz
+
+
+
+
+
+
+

plot-vcfstats [OPTIONS] file.vchk […​]

+
+

Script for processing the output of bcftools stats. It can merge +results from multiple outputs (useful when running the stats for each +chromosome separately), plot graphs and create a PDF presentation.

+
+
+
+
-m, --merge
+
+

Merge vcfstats files to STDOUT, skip plotting.

+
+
-p, --prefix DIR
+
+

The output directory. This directory will be created if it does not exist.

+
+
-P, --no-PDF
+
+

Skip the PDF creation step.

+
+
-r, --rasterize
+
+

Rasterize PDF images for faster rendering. This is the default and the opposite of -v, --vectors.

+
+
-s, --sample-names
+
+

Use sample names for xticks rather than numeric IDs.

+
+
-t, --title STRING
+
+

Identify files by these titles in plots. The option can be given multiple +times, for each ID in the bcftools stats output. If not +present, the script will use abbreviated source file names for the titles.

+
+
-v, --vectors
+
+

Generate vector graphics for PDF images, the opposite of -r, --rasterize.

+
+
-T, --main-title STRING
+
+

Main title for the PDF.

+
+
+
+
+

Example:

+
+
+
+
+
+
# Generate the stats
+bcftools stats -s - > file.vchk
+
+
+
+
+
# Plot the stats
+plot-vcfstats -p outdir file.vchk
+
+
+
+
+
# The final looks can be customized by editing the generated
+# 'outdir/plot.py' script and re-running manually
+cd outdir && python plot.py && pdflatex summary.tex
+
+
+
+
+
+
+
+

FILTERING EXPRESSIONS

@@ -4669,8 +5034,7 @@

FILTERING EXPRESSIONS

Valid expressions may contain:
  • -

    numerical constants, string constants, file names (this is currently -supported only to filter by the ID column)

    +

    numerical constants, string constants, file names (indicated by the prefix @)

    1, 1.0, 1e-4
    @@ -4804,7 +5168,7 @@ 

    FILTERING EXPRESSIONS

  • -

    TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,bnd,other,overlap). Use the regex +

    TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,bnd,other,overlap, see TERMINOLOGY). Use the regex operator "\~" to require at least one allele of the given type or the equal sign "=" to require that all alleles are of the given type. Compare

    @@ -5052,12 +5416,17 @@

    FILTERING EXPRESSIONS

    -
    ID=@file       .. selects lines with ID present in the file
    +
    ID=@file               .. selects lines with ID present in the file
    +
    +
    +
    +
    +
    ID!=@~/file            .. skip lines with ID present in the ~/file
    -
    ID!=@~/file    .. skip lines with ID present in the ~/file
    +
    INFO/TAG=@file         .. selects lines with INFO/TAG value present in the file
    @@ -5096,91 +5465,27 @@

    FILTERING EXPRESSIONS

-

SCRIPTS

+

TERMINOLOGY

-
-

gff2gff

-
-

Attempts to fix a GFF file to be correctly parsed by csq.

-
-
-
-
-
-
zcat in.gff.gz | gff2gff | gzip -c > out.gff.gz
-
-
-
-
-
-
-

plot-vcfstats [OPTIONS] file.vchk […​]

-
-

Script for processing output of bcftools stats. It can merge -results from multiple outputs (useful when running the stats for each -chromosome separately), plots graphs and creates a PDF presentation.

-
-
-
-
-m, --merge
-
-

Merge vcfstats files to STDOUT, skip plotting.

-
-
-p, --prefix DIR
-
-

The output directory. This directory will be created if it does not exist.

-
-
-P, --no-PDF
-
-

Skip the PDF creation step.

-
-
-r, --rasterize
-
-

Rasterize PDF images for faster rendering. This is the default and the opposite of -v, --vectors.

-
-
-s, --sample-names
-
-

Use sample names for xticks rather than numeric IDs.

-
-
-t, --title STRING
-
-

Identify files by these titles in plots. The option can be given multiple -times, for each ID in the bcftools stats output. If not -present, the script will use abbreviated source file names for the titles.

-
-
-v, --vectors
-
-

Generate vector graphics for PDF images, the opposite of -r, --rasterize.

-
-
-T, --main-title STRING
-
-

Main title for the PDF.

-
-
-
-

Example:

+

The program and the documentation use the following terminology. Multiple terms can be used +interchangeably for the same VCF record type:

-
# Generate the stats
-bcftools stats -s - > file.vchk
-
-
-
-
-
# Plot the stats
-plot-vcfstats -p outdir file.vchk
-
-
-
-
-
# The final looks can be customized by editing the generated
-# 'outdir/plot.py' script and re-running manually
-cd outdir && python plot.py && pdflatex summary.tex
-
+
REF   ALT
+---------
+C     .         .. reference allele / non-variant site / ref-only site
+C     T         .. SNP or SNV (single-nucleotide polymorphism or variant), used interchangeably
+CC    TT        .. MNP (multi-nucleotide polymorphism)
+CAAA  C         .. indel, deletion (regardless of length)
+C     CAAA      .. indel, insertion (regardless of length)
+C     <*>       .. gVCF block, the allele <*> is a placeholder for alternate allele possibly missed because of low coverage
+C     <NON_REF> .. synonymous to <*>
+C     *         .. overlapping deletion
+C     <INS>     .. symbolic allele, known also as 'other [than above]'
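The table above can be read as a simple decision procedure. The Python sketch below covers only the rows listed in the table. It is an illustration with a made-up function name, not the classification code used by bcftools, and it does not handle cases such as breakends or alleles that would first need trimming.

```python
def classify(ref, alt):
    """Classify a single REF/ALT pair following the terminology table."""
    if alt == ".":
        return "ref"                   # non-variant / ref-only site
    if alt in ("<*>", "<NON_REF>"):
        return "gVCF block"            # placeholder for an unseen allele
    if alt == "*":
        return "overlapping deletion"
    if alt.startswith("<") and alt.endswith(">"):
        return "other"                 # symbolic allele such as <INS>
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    # Different lengths: an indel, regardless of its length.
    return "deletion" if len(ref) > len(alt) else "insertion"


for ref, alt in [("C", "."), ("C", "T"), ("CC", "TT"), ("CAAA", "C"),
                 ("C", "CAAA"), ("C", "<*>"), ("C", "*"), ("C", "<INS>")]:
    print(ref, alt, classify(ref, alt))
```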
@@ -5257,7 +5562,7 @@

COPYING

diff --git a/howtos/FAQ.html b/howtos/FAQ.html index 6681b6060..8b5d93c53 100644 --- a/howtos/FAQ.html +++ b/howtos/FAQ.html @@ -4,7 +4,7 @@ - + Frequently Asked Questions @@ -83,6 +83,36 @@

Frequently Asked Questions

+
+
'XYZ' is not defined in the header, assuming Type=String
+

The VCF specification recommends that all INFO and +FORMAT tags that appear throughout the file body are defined in the VCF header.

+
+
+

Fix the header using the reheader command

+
+
+
+
# Write out the header to be modified
+bcftools view -h old.vcf > header.txt
+
+# Edit the header using your favorite text editor and add the missing definition, e.g.
+#   ##INFO=<ID=XYZ,Number=1,Type=Integer,Description="Describe the tag">
+vi header.txt
+
+# Reheader the file
+bcftools reheader -h header.txt -o new.vcf old.vcf
+
+
+
+

Why do you have to do it? Although the VCF specification allows undefined tags, HTSlib and BCFtools internally +treat VCF as BCF, where all tags must be defined in the header. This is because of the way BCF is designed: +the tags throughout the BCF file are represented as pointers to the dictionary of tags stored in the header. +We work around this problem by adding missing definitions on the fly. Note this can work for read-only operations, but +will still lead to problems when writing the file out as BCF: even though the reader +updated its internal structures with a dummy definition and continued reading, the writer was not +aware of the new tag when the header was written.
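To find out which definitions are missing before editing the header, a check along the following lines may help. This is a minimal Python sketch working on plain VCF text. The function name is hypothetical, only INFO tags are examined, and it is not how HTSlib detects the problem internally.

```python
import re


def undefined_info_tags(vcf_lines):
    """Return INFO tags used in data lines but not defined in the header.

    Assumes an uncompressed, tab-delimited VCF where the header lines
    precede the data lines.
    """
    defined, missing = set(), set()
    for line in vcf_lines:
        if line.startswith("##INFO=<"):
            m = re.search(r"ID=([^,>]+)", line)
            if m:
                defined.add(m.group(1))
        elif not line.startswith("#"):
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8 or fields[7] == ".":
                continue
            for item in fields[7].split(";"):
                tag = item.split("=", 1)[0]  # flags have no '=' part
                if tag not in defined:
                    missing.add(tag)
    return sorted(missing)


example = [
    "##fileformat=VCFv4.3",
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t64334\t.\tA\tC,T\t.\t.\tAC=1,1;XYZ=5",
]
print(undefined_info_tags(example))  # ['XYZ']
```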

+
Incorrect number of fields at chr1:1234567

This error is triggered when the number of values in the data line does not match @@ -110,7 +140,7 @@

Frequently Asked Questions

-

The error above is printed when different number of values is encoutered, for example AC=1 or AC=1,1,1 in the example above.

+

The error above is printed when a different number of values is encountered, for example AC=1 or AC=1,1,1 in the example above.
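The counting rule can be sketched in a few lines of Python. This is an illustration only, with hypothetical function names: it models Number=A (one value per ALT allele), Number=R (one value per REF and ALT allele) and fixed integer counts, and deliberately leaves out Number=G and Number=. as well as records with a missing ALT.

```python
def expected_values(number, n_alt):
    """Expected number of comma-separated values, or None if not modeled."""
    if number == "A":
        return n_alt          # one value per ALT allele
    if number == "R":
        return n_alt + 1      # one value per allele, REF included
    if number.isdigit():
        return int(number)    # fixed count, e.g. Number=1
    return None               # "G" and "." are not modeled here


def check_info_value(number, alt, value):
    """True if `value` has the number of entries implied by `number`."""
    n_alt = len(alt.split(","))
    expected = expected_values(number, n_alt)
    return expected is None or len(value.split(",")) == expected


print(check_info_value("A", "C,T", "1,1"))    # True
print(check_info_value("A", "C,T", "1"))      # False
print(check_info_value("A", "C,T", "1,1,1"))  # False
```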

Other such definitions are Number=R (there must be as many values as there are REF+ALT alleles in total), diff --git a/howtos/FAQ.txt b/howtos/FAQ.txt index ffbd3e39b..ddbf26032 100644 --- a/howtos/FAQ.txt +++ b/howtos/FAQ.txt @@ -4,8 +4,34 @@ include::header.inc[] Frequently Asked Questions -------------------------- -.*Incorrect number of fields at chr1:1234567* +.*'XYZ' is not defined in the header, assuming Type=String* +[#undefined-tag] +The link:https://samtools.github.io/hts-specs/VCFv4.3.pdf[VCF specification] recommends that all INFO and +FORMAT tags that appear throughout the file body are defined in the VCF header. + +Fix the header using the reheader command +---- +# Write out the header to be modified +bcftools view -h old.vcf > header.txt +# Edit the header using your favorite text editor and add the missing definition, eg +# ##INFO= +vi header.txt + +# Reheader the file +bcftools reheader -h header.txt -o new.vcf old.vcf +---- + +Why do you have to do it? Although VCF specification allows undefined tags, HTSlib and BCFtools internally +treat VCF as BCF, where all tags must be defined in the header. This is because of the way BCF is designed: +the tags throughout the BCF file are represented as pointers to the dictionary of tags stored in the header. +We work around this problem by adding missing definitions on the fly. Note this can work for read-only operations, but +will still lead to problems when writing the file out as BCF: even though the reader +updated its internal structures with a dummy definition and continued reading, the writer was not +aware about the new tag when the header was written. + + +.*Incorrect number of fields at chr1:1234567* [#incorrect-nfields] This error is triggered when the number of values in the data line does not match its definition in the header. For example, one may see an error like @@ -20,7 +46,7 @@ and expects a value for each ALT allele, for example ---- chr1 64334 . A C,T . . 
AC=1,1 GT 0/1 0/1 ---- -The error above is printed when different number of values is encoutered, for example `AC=1` or `AC=1,1,1` in the example above. +The error above is printed when different number of values is encountered, for example `AC=1` or `AC=1,1,1` in the example above. Other such definitions are `Number=R` (there must be as many values as there are REF+ALT alleles in total), and `Number=G` (this is more complicated, see the section 1.4.2 of the link:http://samtools.github.io/hts-specs/VCFv4.3.pdf[VCF specification]). diff --git a/howtos/bcftools.txt b/howtos/bcftools.txt index 29a1d4003..f62ff0981 100644 --- a/howtos/bcftools.txt +++ b/howtos/bcftools.txt @@ -408,7 +408,7 @@ Add or remove annotations. # Annotate from a tab-delimited file with regions (1-based coordinates, inclusive) tabix -s1 -b2 -e3 annots.tab.gz - bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG inut.vcf + bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf # Annotate from a bed file (0-based coordinates, half-closed, half-open intervals) bcftools annotate -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf @@ -1022,7 +1022,7 @@ See the usage examples below. # %TBCSQ .. print consequences in both haplotypes in separate columns # %TBCSQ{0} .. print the first haplotype only # %TBCSQ{1} .. print the second haplotype only - # %TBCSQ{*} .. print a list of unique consquences present in either haplotype + # %TBCSQ{*} .. print a list of unique consequences present in either haplotype bcftools query -f'[%CHROM\t%POS\t%SAMPLE\t%TBCSQ\n]' out.bcf ---- @@ -2069,7 +2069,7 @@ Extracts fields from VCF or BCF files and outputs them in user-defined format. 
%SAMPLE Sample name %POS0 POS in 0-based coordinates %END End position of the REF allele - %END0 End position of the REF allele in 0-based cordinates + %END0 End position of the REF allele in 0-based coordinates \n new line \t tab character diff --git a/howtos/cnv-calling.html b/howtos/cnv-calling.html index 8e3bd529f..8438c19b6 100644 --- a/howtos/cnv-calling.html +++ b/howtos/cnv-calling.html @@ -181,7 +181,7 @@

Detecting subchromosomal CNVs

-
bcftools cnv -c conrol_sample -s query_sample -o outdir/ -p 0 file.vcf
+
bcftools cnv -c control_sample -s query_sample -o outdir/ -p 0 file.vcf
diff --git a/howtos/cnv-calling.txt b/howtos/cnv-calling.txt index e6437c2ec..4a40ab3bb 100644 --- a/howtos/cnv-calling.txt +++ b/howtos/cnv-calling.txt @@ -81,7 +81,7 @@ differences between two samples. This greatly helps to reduce the number of false calls and also allows one to distinguish between normal and novel copy number variation. The command is ---- -bcftools cnv -c conrol_sample -s query_sample -o outdir/ -p 0 file.vcf +bcftools cnv -c control_sample -s query_sample -o outdir/ -p 0 file.vcf ---- The ``-p 0`` option tells the program to automatically call matplotlib and produce plots like the one in this example: diff --git a/howtos/index.html b/howtos/index.html index 64d2b8139..adc2d9d74 100644 --- a/howtos/index.html +++ b/howtos/index.html @@ -97,7 +97,7 @@

About BCFtools

BCFtools is a program for variant calling and manipulating files in the Variant Call Format (VCF) and its binary counterpart BCF. All commands work transparently with both VCFs and BCFs, both uncompressed and BGZF-compressed. -In order to avoid tedious repetion, throughout this document we will use +In order to avoid tedious repetition, throughout this document we will use "VCF" and "BCF" interchangeably, unless specifically noted.

diff --git a/howtos/index.txt b/howtos/index.txt index 7318b68dd..de8ed53ec 100644 --- a/howtos/index.txt +++ b/howtos/index.txt @@ -15,7 +15,7 @@ https://github.com/samtools/bcftools/issues[github]. BCFtools is a program for variant calling and manipulating files in the Variant Call Format (VCF) and its binary counterpart BCF. All commands work transparently with both VCFs and BCFs, both uncompressed and BGZF-compressed. -In order to avoid tedious repetion, throughout this document we will use +In order to avoid tedious repetition, throughout this document we will use "VCF" and "BCF" interchangeably, unless specifically noted. Most commands accept VCF, bgzipped VCF and BCF with filetype detected diff --git a/howtos/plugin.fixref.html b/howtos/plugin.fixref.html index 10ae8a3c3..1d95e8de3 100644 --- a/howtos/plugin.fixref.html +++ b/howtos/plugin.fixref.html @@ -155,7 +155,7 @@

Plugin fixref

In the most extreme case when nothing else is working, one can simply force -the unambigous alleles onto the forward strand and drop the ambigous genotypes.

+the unambiguous alleles onto the forward strand and drop the ambiguous genotypes.

diff --git a/howtos/plugin.fixref.txt b/howtos/plugin.fixref.txt index b5997e03f..58a68f781 100644 --- a/howtos/plugin.fixref.txt +++ b/howtos/plugin.fixref.txt @@ -54,7 +54,7 @@ bcftools sort fixref.bcf -Ob -o fixref.sorted.bcf In the most extreme case when nothing else is working, one can simply force -the unambigous alleles onto the forward strand and drop the ambigous genotypes. +the unambiguous alleles onto the forward strand and drop the ambiguous genotypes. ---- bcftools +fixref test.bcf -Ob -o output.bcf -- -f ref.fa -m flip -d ---- diff --git a/howtos/plugin.setGT.html b/howtos/plugin.setGT.html new file mode 100644 index 000000000..3109caf13 --- /dev/null +++ b/howtos/plugin.setGT.html @@ -0,0 +1,157 @@ + + + + + + + +Plugin setGT + + + + +
+ +
+
+

Plugin setGT

+
+
+

The plugin +setGT allows genotypes to be edited.

+
+
+

The list of plugin-specific options can be obtained by running +bcftools +setGT -h, which will print the following usage page:

+
+
+
+
About: Sets genotypes. The target genotypes can be specified as:
+           ./.     .. completely missing ("." or "./.", depending on ploidy)
+           ./x     .. partially missing (e.g., "./0" or ".|1" but not "./.")
+           .       .. partially or completely missing
+           a       .. all genotypes
+           b       .. heterozygous genotypes failing two-tailed binomial test (example below)
+           q       .. select genotypes using -i/-e options
+           r:FLOAT .. select randomly a proportion of FLOAT genotypes (can be combined with other modes)
+       and the new genotype can be one of:
+           .       .. missing ("." or "./.", keeps ploidy)
+           0       .. reference allele (e.g. 0/0 or 0, keeps ploidy)
+           c:GT    .. custom genotype (e.g. 0/0, 0, 0/1, m/M, 0/X overrides ploidy)
+           m       .. minor (the second most common) allele as determined from INFO/AC or FMT/GT (e.g. 1/1 or 1, keeps ploidy)
+           M       .. major allele as determined from INFO/AC or FMT/GT (e.g. 1/1 or 1, keeps ploidy)
+           X       .. allele with bigger read depth as determined from FMT/AD
+           p       .. phase genotype (0/1 becomes 0|1)
+           u       .. unphase genotype and sort by allele (1|0 becomes 0/1)
+Usage: bcftools +setGT [General Options] -- [Plugin Options]
+Options:
+   run "bcftools plugin" for a list of common options
+
+Plugin options:
+   -e, --exclude EXPR        Exclude a genotype if true (requires -t q)
+   -i, --include EXPR        include a genotype if true (requires -t q)
+   -n, --new-gt TYPE         Genotypes to set, see above
+   -s, --seed INT            Random seed to use with -t r [0]
+   -t, --target-gt TYPE      Genotypes to change, see above
+
+Example:
+   # set missing genotypes ("./.") to phased ref genotypes ("0|0")
+   bcftools +setGT in.vcf -- -t . -n 0p
+
+   # set missing genotypes with DP>0 and GQ>20 to ref genotypes ("0/0")
+   bcftools +setGT in.vcf -- -t q -n 0 -i 'GT="." && FMT/DP>0 && GQ>20'
+
+   # set partially missing genotypes to completely missing
+   bcftools +setGT in.vcf -- -t ./x -n .
+
+   # set heterozygous genotypes to 0/0 if binom.test(nAlt,nRef+nAlt,0.5)<1e-3
+   bcftools +setGT in.vcf -- -t "b:AD<1e-3" -n 0
+
+   # force unphased heterozygous genotype if binom.test(nAlt,nRef+nAlt,0.5)>0.1
+   bcftools +setGT in.vcf -- -t ./x -n c:'m/M'
+
+
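The usage page above refers to binom.test(nAlt,nRef+nAlt,0.5). As a rough guide to what such a threshold means, here is a Python sketch of a two-tailed exact binomial test in the style of R's binom.test, which sums the probabilities of all outcomes no more likely than the observed one. Whether the plugin computes the p-value in exactly this way is an assumption, and the function name is made up.

```python
from math import comb


def binom_two_sided(n_alt, n_total, p=0.5):
    """Two-tailed exact binomial p-value for n_alt successes in n_total trials."""
    pmf = [comb(n_total, k) * p**k * (1 - p)**(n_total - k)
           for k in range(n_total + 1)]
    observed = pmf[n_alt]
    # Small relative tolerance so that ties are counted despite rounding.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-7))
    return min(1.0, total)


# A balanced heterozygote: 5 ALT reads out of 10.
print(binom_two_sided(5, 10))
# A strongly skewed one: 1 ALT read out of 20, well below a 1e-3 cutoff.
print(binom_two_sided(1, 20))
```

With a cutoff such as 1e-3, the second genotype would count as failing the test while the first would not.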
+
+

Feedback

+
+

We welcome your feedback, please help us improve this page by +either opening an issue on github or editing it directly and sending +a pull request.

+
+
+
+
+
+
+ + + \ No newline at end of file diff --git a/howtos/plugin.setGT.txt b/howtos/plugin.setGT.txt new file mode 100644 index 000000000..45837110e --- /dev/null +++ b/howtos/plugin.setGT.txt @@ -0,0 +1,60 @@ +include::header.inc[] + + +Plugin setGT +------------ + +The plugin `+setGT` allows to edit genotypes + +The list of plugin-specific options can be obtained by running +`bcftools +setGT -h`, which will print the following usage page: +---- +About: Sets genotypes. The target genotypes can be specified as: + ./. .. completely missing ("." or "./.", depending on ploidy) + ./x .. partially missing (e.g., "./0" or ".|1" but not "./.") + . .. partially or completely missing + a .. all genotypes + b .. heterozygous genotypes failing two-tailed binomial test (example below) + q .. select genotypes using -i/-e options + r:FLOAT .. select randomly a proportion of FLOAT genotypes (can be combined with other modes) + and the new genotype can be one of: + . .. missing ("." or "./.", keeps ploidy) + 0 .. reference allele (e.g. 0/0 or 0, keeps ploidy) + c:GT .. custom genotype (e.g. 0/0, 0, 0/1, m/M, 0/X overrides ploidy) + m .. minor (the second most common) allele as determined from INFO/AC or FMT/GT (e.g. 1/1 or 1, keeps ploidy) + M .. major allele as determined from INFO/AC or FMT/GT (e.g. 1/1 or 1, keeps ploidy) + X .. allele with bigger read depth as determined from FMT/AD + p .. phase genotype (0/1 becomes 0|1) + u .. 
unphase genotype and sort by allele (1|0 becomes 0/1) +Usage: bcftools +setGT [General Options] -- [Plugin Options] +Options: + run "bcftools plugin" for a list of common options + +Plugin options: + -e, --exclude EXPR Exclude a genotype if true (requires -t q) + -i, --include EXPR include a genotype if true (requires -t q) + -n, --new-gt TYPE Genotypes to set, see above + -s, --seed INT Random seed to use with -t r [0] + -t, --target-gt TYPE Genotypes to change, see above + +Example: + # set missing genotypes ("./.") to phased ref genotypes ("0|0") + bcftools +setGT in.vcf -- -t . -n 0p + + # set missing genotypes with DP>0 and GQ>20 to ref genotypes ("0/0") + bcftools +setGT in.vcf -- -t q -n 0 -i 'GT="." && FMT/DP>0 && GQ>20' + + # set partially missing genotypes to completely missing + bcftools +setGT in.vcf -- -t ./x -n . + + # set heterozygous genotypes to 0/0 if binom.test(nAlt,nRef+nAlt,0.5)<1e-3 + bcftools +setGT in.vcf -- -t "b:AD<1e-3" -n 0 + + # force unphased heterozygous genotype if binom.test(nAlt,nRef+nAlt,0.5)>0.1 + bcftools +setGT in.vcf -- -t ./x -n c:'m/M' +---- + + +include::footer.inc[] + + diff --git a/howtos/plugins.html b/howtos/plugins.html index f9d1d421e..fad338ac4 100644 --- a/howtos/plugins.html +++ b/howtos/plugins.html @@ -234,7 +234,7 @@

List of plugins

Prune sites by missingness, allele frequency or linkage disequilibrium. Alternatively, annotate sites with r2, Lewontin’s D' (PMID:19433632), Ragsdale’s D (PMID:31697386).

-
setGT
+
setGT

Sets genotypes according to the specified criteria and filtering expressions. For example, missing genotypes can be set to ref, but much more than that.

diff --git a/howtos/plugins.txt b/howtos/plugins.txt index 98d3032ce..76e84fbec 100644 --- a/howtos/plugins.txt +++ b/howtos/plugins.txt @@ -76,7 +76,7 @@ parental-origin:: determine parental origin of a CNV region prune:: Prune sites by missingness, allele frequency or linkage disequilibrium. Alternatively, annotate sites with r2, Lewontin's D' (PMID:19433632), Ragsdale's D (PMID:31697386). -setGT:: Sets genotypes according to the specified criteria and filtering expressions. For example, missing genotypes can be set to ref, but much more than that. +link:plugin.setGT.html[setGT]:: Sets genotypes according to the specified criteria and filtering expressions. For example, missing genotypes can be set to ref, but much more than that. smpl-stats:: calculates basic per-sample stats. The usage and format is similar to ``indel-stats`` and ``trio-stats``. diff --git a/howtos/query.html b/howtos/query.html index 79a9d64a0..e6b922413 100644 --- a/howtos/query.html +++ b/howtos/query.html @@ -111,7 +111,7 @@

Extracting information from VCFs

-

In this example, the -f otion defines the output format. The %POS string +

In this example, the -f option defines the output format. The %POS string indicates that for each VCF line we want the POS column printed. The \n stands for a newline character, a notation commonly used in the world of computer programming. Any characters without a special meaning diff --git a/howtos/query.txt b/howtos/query.txt index d4ba4fee4..a2bff8bb1 100644 --- a/howtos/query.txt +++ b/howtos/query.txt @@ -25,7 +25,7 @@ bcftools query -l file.bcf | wc -l ---- bcftools query -f '%POS\n' file.bcf ---- -In this example, the `-f` otion defines the output format. The `%POS` string +In this example, the `-f` option defines the output format. The `%POS` string indicates that for each VCF line we want the POS column printed. The `\n` stands for a newline character, a notation commonly used in the world of computer programming. Any characters without a special meaning diff --git a/howtos/roh-calling.html b/howtos/roh-calling.html index 5a7d71f1f..f52f98880 100644 --- a/howtos/roh-calling.html +++ b/howtos/roh-calling.html @@ -230,7 +230,7 @@

Troubleshooting

-

If the number of the processed sites is too low, check what was the reason for exluding +

If the number of the processed sites is too low, check what was the reason for excluding them. This command should give the number of sites that were processed:

diff --git a/howtos/roh-calling.txt b/howtos/roh-calling.txt index f3c46351e..344cb8878 100644 --- a/howtos/roh-calling.txt +++ b/howtos/roh-calling.txt @@ -148,7 +148,7 @@ program. For example in this run many sites were filtered: Number of lines: total/processed: 599218/37730 ---- -If the number of the processed sites is too low, check what was the reason for exluding +If the number of the processed sites is too low, check what was the reason for excluding them. This command should give the number of sites that were processed: ----