Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not correct indels processing when generating insilico genome #11

Open
michael-kotliar opened this issue Apr 10, 2018 · 1 comment
Open

Comments

@michael-kotliar
Copy link

michael-kotliar commented Apr 10, 2018

According to VCFv4.2 specification

For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event...

Although, it's noticed that this padding base is not required, all of the INDEL files from Sanger, as long as the files mentioned as testing data for MEA, follow this padding rule. We can see it from the identical first base pair for REF and ALT columns:
(example is taken from 129S1_SvImJ.mgp.v5.indels.dbSNP142.normed.vcf.gz)

chr1    3000748 .       T       TTTTG   182.00  PASS    AC1=1;AC=2;AF1=1;AN=2;DP4=0,0,16,16;DP=40;INDEL;MQ=38;VDB=0.0281        GT:GQ:DP:SP:PL:FI       .       .       .       .
chr1    3000763 .       G       GT      98.50   PASS    AC1=1;AC=2;AF1=1;AN=2;DP4=0,0,9,6;DP=22;INDEL;MQ=54;VDB=0.0404  GT:GQ:DP:SP:PL:FI       .       .       .       .       .
chr1    3000899 .       TGCG    T       217.00  PASS    AC1=1;AC=2;AF1=1;AN=2;DP4=1,0,13,17;DP=36;INDEL;MDV=99;MQ=49;MSD=31;PV0=0.45;PV1=1;PV2=0.19;PV3=1;PV4=0.45,1,0.19,1;QD=0.008
chr1    3000900 .       G       GCC     209.50  PASS    AC1=1;AC=4;AF1=1;AN=4;DP4=0,0,38,37;DP=90;INDEL;MQ=50;VDB=0.0399        GT:GQ:DP:SP:PL:FI       .       .       1/1:99:42:0:

It means that the actual insertion/deletion for the above mentioned INDEL file are TTTG, T, GCG, CC

Running alea.jar for SNP files returns correct results, as long as it's just enough to substitute REF to ALT, but INDEL processing adds extra base pairs to output fasta file, which is not correct.

Testing example:

java -jar /opt/mea/bin/alea.jar insilico --input-fasta=reference_genome.fa --input-vcf=indels.vcf --strain=129S1_SvImJ --output-fasta=output.fa

Reference fasta file reference_genome.fa:

>1
caatgataccttggtcaaggaaggaataaagaatgaaattaaagactttt
tagagtttaatgaaaatgaagccacaacatacccaaacctatgggacaca
atgaaagcatttctaagagggaaactcatagctctgagtgcctccaagaa
gaaatgggagaaagcacatactagcagcttaacaacacatctaaaagctc
tagaaaaaaaggatgcaaattcacccaagaggagtagacggcaggaaata
atcaaactcaggggtgaaatcaaccaagtggaaacaagaagaactattca
aagaattaaccaaactcggagttggttctttgagaaaatcaacaagatag
ataaacccttagctagactcactagagggtctagtgacacagggacaaca

Indels file indels.vcf (without header)

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	129S1_SvImJ
1	60	.	A	AG	92	PASS	INDEL;DP4=0,0,5,0;DP=5;CSQ=A||||intergenic_variant||||||||	GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI	1/1:19:5:0.2:134,19,0:120,15,0:2:32:5:0,0,5,0:0:-0.590765:.:1
1	110	.	T	TAAAAA	228	PASS	INDEL;DP4=19,7,12,26;DP=64;CSQ=TTTTT||||intergenic_variant||||||||	GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI	1/1:19:64:0:278,19,0:255,11,0:2:57:38:19,7,12,26:27:-0.693143:.:1
1	210	.	AGGAT	A	228	PASS	INDEL;DP4=1,0,25,21;DP=47;CSQ=-||||intergenic_variant||||||||	GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI	1/1:127:47:0:294,140,0:255,124,0:2:60:46:1,0,25,21:0:-0.693147:.:1

Expected output:

>1
caatgataccttggtcaaggaaggaataaagaatgaaattaaagactttt
tagagtttaa    +++ G ++++++    tgaaaatgaagccacaacatacccaaacctatgggacaca
atgaaagcat    +++ AAAAA ++    ttctaagagggaaactcatagctctgagtgcctccaagaa
gaaatgggagaaagcacatactagcagcttaacaacacatctaaaagctc
tagaaaaaaa    --- g̶g̶a̶t̶ --     gcaaattcacccaagaggagtagacggcaggaaata
atcaaactcaggggtgaaatcaaccaagtggaaacaagaagaactattca
aagaattaaccaaactcggagttggttctttgagaaaatcaacaagatag
ataaacccttagctagactcactagagggtctagtgacacagggacaaca

Received output:

>1
caatgataccttggtcaaggaaggaataaagaatgaaattaaagactttttagagtttaaAGtgaaaatgaagccacaac
atacccaaacctatgggacacaatgaaagcatTAAAAAttctaagagggaaactcatagctctgagtgcctccaagaaga
aatgggagaaagcacatactagcagcttaacaacacatctaaaagctctagaaaaaaaAgcaaattcacccaagaggagt
agacggcaggaaataatcaaactcaggggtgaaatcaaccaagtggaaacaagaagaactattcaaagaattaaccaaac
tcggagttggttctttgagaaaatcaacaagatagataaacccttagctagactcactagagggtctagtgacacaggga

As we can see, INDEL processing was made exactly the same way as if it was SNP file (REF has been completely substituted by ALT)

  • added AG instead of G
  • added TAAAAA instead of AAAAA
  • added A instead of NULL
@michael-kotliar
Copy link
Author

Input files for testing example
indels.vcf.gz
reference_genome.fa.gz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant