VCF parsing failing test
See comment in vcf.t for a full description of the bug. Fixing this
will require modifications to TreeTime.

The `read_vcf` function is used in augur commands ancestral, refine,
sequence-traits, translate and tree.
jameshadfield committed Dec 12, 2023
1 parent 84343d8 commit 0d4b80e
Showing 2 changed files with 54 additions and 0 deletions.
39 changes: 39 additions & 0 deletions tests/functional/ancestral/cram/vcf.t
@@ -0,0 +1,39 @@
Setup

$ source "$TESTDIR"/_setup.sh

$ export DATA="$TESTDIR/../data/simple-genome"

This ~should~ be the same as the first test in general.t, however
with VCF input instead of a FASTA MSA.
The output will not have the full sequence attached to every node,
but it will have the reference sequence attached.

BUG: The SNPs at nt 33 are encoded in the VCF as:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample_A sample_B sample_C
1 33 . A C,G . . . GT 1 2 0
where ALT 1 ("C") is on sample_A and ALT 2 ("G") is on sample_B.
ALT 2 is not being parsed by Augur (TreeTime), which results in
a changed mutation profile at pos 33:
. **FASTA input** **VCF input**
. |---G33C-- sample_A |---A33C-- sample_A
. --A33G-| -------|
. |--------- sample_B |--------- sample_B
.
Because of this bug, the following test fails.

$ ${AUGUR} ancestral \
> --tree $DATA/tree.nwk \
> --alignment $DATA/snps.vcf \
> --vcf-reference $DATA/reference.fasta \
> --output-node-data "nt_muts.vcf-input.ref-seq.json" \
> --output-vcf "nt_muts.vcf-input.ref-seq.vcf" \
> --inference marginal > /dev/null


$ python3 "$TESTDIR/../../../../scripts/diff_jsons.py" \
> "$DATA/nt_muts.ref-seq.json" \
> "nt_muts.vcf-input.ref-seq.json" \
> --exclude-regex-paths "root\['nodes'\]\['.+'\]\['sequence'\]"
{}
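
The comparison above ignores every path matching `root['nodes'][...]['sequence']`, since VCF-derived output does not attach a full sequence to each node. A minimal standard-library sketch of that idea follows; it is not the implementation of `diff_jsons.py`, the helper name `strip_node_sequences` is hypothetical, and the two dictionaries are toy data whose `muts` values are taken from the diagram in the test comment.

```python
import copy


def strip_node_sequences(node_data):
    """Return a copy of node data with each node's 'sequence' entry removed.

    Hypothetical helper: mirrors the effect of excluding
    root['nodes'][<name>]['sequence'] from a comparison.
    """
    stripped = copy.deepcopy(node_data)
    for node in stripped.get("nodes", {}).values():
        node.pop("sequence", None)
    return stripped


# Toy data (assumption): only the field needed to show the mismatch at nt 33.
fasta_result = {"nodes": {"sample_A": {"muts": ["G33C"], "sequence": "ACGT"}}}
vcf_result = {"nodes": {"sample_A": {"muts": ["A33C"]}}}

# With sequences ignored, any remaining difference is a real discrepancy.
same = strip_node_sequences(fasta_result) == strip_node_sequences(vcf_result)
print(same)
```

With the bug present, the two results still differ after sequences are stripped, which is why the test does not produce the expected empty diff.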

15 changes: 15 additions & 0 deletions tests/functional/ancestral/data/simple-genome/snps.vcf
@@ -0,0 +1,15 @@
##fileformat=VCFv4.3
##contig=<ID=1,length=50>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample_A sample_B sample_C
1 5 . T C . . . GT 1 1 1
1 7 . T G . . . GT 1 1 0
1 14 . C T . . . GT 1 1 0
1 18 . C T . . . GT 0 0 1
1 28 . A N . . . GT 0 0 1
1 29 . A N . . . GT 0 0 1
1 30 . A N . . . GT 0 0 1
1 33 . A C,G . . . GT 1 2 0
1 39 . C T . . . GT 1 0 0
1 42 . G A . . . GT 0 1 0
1 43 . A T . . . GT 1 1 0
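
To make the expected handling of the multi-allelic record at position 33 concrete, here is a minimal sketch of how a haploid `GT` index selects a base: index 0 is `REF`, and index n is the n-th comma-separated entry of `ALT`. This is an illustration only, not the code in TreeTime or Augur's `read_vcf`; the function name `resolve_haploid_genotypes` is hypothetical, and it assumes one allele index per sample as in the fixture above.

```python
def resolve_haploid_genotypes(vcf_line, sample_names):
    """Map each sample's haploid GT index to the base it denotes.

    Hypothetical helper, assuming haploid genotypes (a single allele
    index per sample) and that GT is the first FORMAT field.
    """
    # str.split() with no argument handles both tabs and runs of spaces.
    fields = vcf_line.split()
    pos, ref, alt = int(fields[1]), fields[3], fields[4]
    # Allele 0 is REF; alleles 1..n are the comma-separated ALT entries.
    alleles = [ref] + alt.split(",")
    calls = {}
    for name, genotype in zip(sample_names, fields[9:]):
        gt_index = genotype.split(":")[0]
        # "." marks a missing call; leave it unresolved.
        calls[name] = None if gt_index == "." else alleles[int(gt_index)]
    return pos, calls


samples = ["sample_A", "sample_B", "sample_C"]
pos, calls = resolve_haploid_genotypes("1 33 . A C,G . . . GT 1 2 0", samples)
print(pos, calls)
```

Under these assumptions sample_A resolves to C, sample_B to G, and sample_C to the reference A, which is the profile the FASTA-based test expects.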
