One nucleotide shift for every SNP Positions #169

DrMahsaRahnama76 · 2024-12-10T22:08:34Z

Hello,

I ran parsnp version 2.1.1 and 2.0.6, but both of them found wrong locations for SNP positions. For example, it should be 12730 but these versions found 12731. I made the VCF file from .ggr file by this code: harvesttools -i parsnp.ggr -V parsnp.vcf.
I, also, ran parsnp version 1.5.4 and 1.5.3, but although they showed the correct posiotions, they did not make the .mblocks file and I need the SNP matrix for my project.

I appriciate your help.

Best,
Mahsa

DrMahsaRahnama76 · 2024-12-12T19:49:52Z

Hi,

The issue is although you mentioned "use 0-based start-end coords for .xmfa. Fixes #169", still the output of pasnp 2.1.1 is one nucleotide shift for every SNP Position. I tried to use parsnp 2.1.1 for 6 isolates from NCBI ( SRR863393, SRR863392, SRR1264965, SRR1514990, SRR1586566, and SRR1586589) with the reference sequence of NZ_CP016581.1. I got the VCF file from parsnp.ggr and the positions are like
12731
21098
34097
62195
157793
164691
218493 ...

instead of:
12730
21097
34096
62194
70249
157792
164690
218492 ...

In addition, I need the exact positions of bases in parsnp.snps.mblocks file which is used to build parsnp.tree file. The VCF file I got showes 189 positions, while parsnp.snps.mblocks includes 120 bases for each isolate. Can you please tell me how I can have access to the exact positions of bases that are used to build parsnp.tree?

Thank you so much.
Best,
Mahsa

bkille · 2024-12-12T19:59:13Z

Hi @DrMahsaRahnama76 ,

Yes, the fix will apply to the next release of Parsnp (2.1.2). However, this issue unfortunately may be with HarvestTools, which expects the xmfa file to be 0-indexed when it should be 1-indexed. I'll reopen this for now until it is fixed.

As far as the mblocks file goes, the reason for the 120 vs 189 difference is that the SNPs used to build the tree are a filtered subset (for example, no indels). There's currently no way to recover the coordinates from the mblocks file alone.

Best,
Bryce

fish2022Jul · 2025-01-08T17:13:34Z

Could you update parsnp in bioconda from 2.1.1 to 2.1.2, please?

Kadsae · 2025-01-14T14:44:36Z

Hi,

do you have a time window for when this issue with HarvestTools may be adressed and fixed? Unfortunately an upgrade to parsnp 2.1.2 did not fix the issue mentioned above.

Also, the results we gathered from parsnp2 not only differ from the results gathered with parsnp1, but most of the SNPs also get flagged as homoplastic SNPs instead of unique ones. Maybe this issue is connected?

Kind regards,
Kadsae

bkille · 2025-01-16T21:11:00Z

@fish2022Jul, Parsnp v2.1.2 is on conda now. Please let me know if you run into any issues.

@Kadsae, to clarify, you are seeing off by one errors in v2.1.2 as well? The coordinates in 2.1.2 should be shifted (corrected) relative to other Parsnp 2 releases:

==> vcf-coords-2.1.1/parsnp.vcf <==
#CHROM  POS     ID      REF     ALT
England1.split.fna.ref  926     CTCAAGTTTG.GTTACTTCTG   G       A
England1.split.fna.ref  1518    TGGTTTAATG.CGTTGCGTGA   C       T
England1.split.fna.ref  2395    GTTTGTGTGC.GTTCCTGACT   G       A
England1.split.fna.ref  3136    TCTGGTGCCC.ACGACACGTA   A       G

==> vcf-coords-2.1.2/parsnp.vcf <==
#CHROM  POS     ID      REF     ALT
England1.split.fna.ref  925     CTCAAGTTTG.GTTACTTCTG   G       A
England1.split.fna.ref  1517    TGGTTTAATG.CGTTGCGTGA   C       T
England1.split.fna.ref  2394    GTTTGTGTGC.GTTCCTGACT   G       A
England1.split.fna.ref  3135    TCTGGTGCCC.ACGACACGTA   A       G

Also, I've create a new issue (#171) regarding your second question.

DrMahsaRahnama76 · 2025-01-17T16:38:50Z

Hi,

I ran version 2.1.2, but the problem still exists. Can you please tell me which version of Harvesttools you used? You mentioned in your last comment that the positions are only different and ALT are the same, but in my results, I got the wrong ALT according to their wrong positions. For example, it showed 12731 with ALT G instead of 12730 with ALT A.

Thank you.
Best,
Mahsa

Kadsae · 2025-01-20T12:16:40Z

Hi @bkille,

thank you for the fast responses and the eagerness to solve my issue(s).

For some more clarification: We have an inhouse database with around 800 bacterial genomes.
For testing I ran the parsnp command with standard parameters on this database with parsnp versions 1.6.2, 2.1.1 and 2.1.2.

Command:
parsnp -p $THREADS -v --vcf -C 1000 -e -u -d $GENOMES -r $REF -c -o $OUTDIR

For v1.6.2 I got 9719 SNPs reported, in comparison to published SNPs at the same positions.
For v2.1.1 I got 5998 SNPs reported, in comparison to published SNPs exactly off by one (+1).
For v2.1.2 I got 6300 SNPs reported, in comparison to published SNPs exactly off by one (+1).

During the v2.1.2 run, one partition failed - therefore 50 genomes got excluded from the analysis.

So yes, it seems the nucleotide shift is still there.

Best,
Kadsae

bkille · 2025-01-23T05:55:19Z

Hi @Kadsae,

This is very strange. Other than some error reporting improvements, the only change between v2.1.1 and v2.1.2 is the use of the
##FormatVersion Mauve header instead of ##FormatVersion Parsnp in the xmfa output file. This is what signals Harvest Tools to treat the xmfa as 0-indexed. I'm surprised to see you run into a partition failure with one version but not another.

With respect to the nucleotide shift issue, I created separate conda envs for v1.6.2, v2.1.1, and v2.1.2 and ran on example data. The coordinates of v2.1.2 match v1.6.2 and are indeed -1 relative to v2.1.1.

>$ tail example-out-v*/*.vcf -n 5 | cut -f1-5
==> example-out-v1.6.2/parsnp.vcf <==
gi|471258596|gb|KC164505.2|     29722   CTAAGGAGCA.TCGTGTGCAA   T       G
gi|471258596|gb|KC164505.2|     29742   GGTACTCAGC.GCACTCGCAC   G       A
gi|471258596|gb|KC164505.2|     29804   TGATTAGTGT.CACTCAAAGT   C       T
gi|471258596|gb|KC164505.2|     29844   TTGTGTTTGG.TAACCCCATC   T       C
gi|471258596|gb|KC164505.2|     29847   TGTTTGGTAA.CCCCATCTCA   C       T

==> example-out-v2.1.1/parsnp.vcf <==
England1.fna.ref        29723   CTAAGGAGCA.TCGTGTGCAA   T       G
England1.fna.ref        29743   GGTACTCAGC.GCACTCGCAC   G       A
England1.fna.ref        29805   TGATTAGTGT.CACTCAAAGT   C       T
England1.fna.ref        29845   TTGTGTTTGG.TAACCCCATC   T       C
England1.fna.ref        29848   TGTTTGGTAA.CCCCATCTCA   C       T

==> example-out-v2.1.2/parsnp.vcf <==
England1.fna.ref        29722   CTAAGGAGCA.TCGTGTGCAA   T       G
England1.fna.ref        29742   GGTACTCAGC.GCACTCGCAC   G       A
England1.fna.ref        29804   TGATTAGTGT.CACTCAAAGT   C       T
England1.fna.ref        29844   TTGTGTTTGG.TAACCCCATC   T       C
England1.fna.ref        29847   TGTTTGGTAA.CCCCATCTCA   C       T

Are you using the version from Bioconda, or a locally built version? My only guess is that if you are using a locally built Parsnp install, the parsnp_core binary hasn't been rebuilt in order to output XMFA files w/ the Mauve header, other than that its not immediately clear what could be causing this discrepancy...

Please feel free to email me if you'd prefer to discuss more offline.

Kadsae · 2025-01-23T09:52:00Z

Hi @bkille,

I did indeed use a locally built version for my last tests. This time I created new conda envs and used the bioconda versions of parsnp v1.6.2 and parsnp 2.1.2 respectively.

Using the same dataset as last time (800 bacterial genomes), I got the same results as last time. Surprisingly, I did not run into a partition failure this time.

I also did another test and tried to experiment with the parameter partition-size. First, I created a subset of 50 of our 800 bacterial genomes, then I ran v1.6.2 and v2.1.2 - once with the flag -no-partition and once with the flag -min-partition-size set to 2 partitions.

To my surprise, the results from v1.6.2 and v2.1.2 without partitioning match! The run with partitioning again showed SNPs off by +1 to the other runs. So it seems that the error with the nucleotide shifts has something to do with the partitioning of parsnp2 for bigger datasets.

Another interesting aspect is that v1.6.2 reported 2703 SNPs, v2.1.2 w/o partitioning 2786 SNPs and v2.1.2 w/ partitioning 2211 SNPs. You mentioned in #171 your findings showed that parsnp2 should find more SNPs than parsnp1 and without the partitioning rule this statement seems to be correct.

Lastly, I also checked the status of the SNPs across all three runs and for v1.6.2 and v2.1.2 w/o partitioning close to all of the SNPs are unique, while the run of v2.1.2 w/ partitioning again shows the problem reported in #171 (many homoplastic SNPs).

Summarized, I think both problems, this one and #171, are connected and have something to do with the partitioning-flag.

I hope I could help a little. If you want more info or discuss more offline feel free to email me as well.

All the best,
Kadsae

bkille closed this as completed in 87bba27 Dec 10, 2024

bkille reopened this Dec 12, 2024

bkille mentioned this issue Jan 16, 2025

homoplastic vs unique SNPs #171

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

One nucleotide shift for every SNP Positions #169

One nucleotide shift for every SNP Positions #169

DrMahsaRahnama76 commented Dec 10, 2024

DrMahsaRahnama76 commented Dec 12, 2024

bkille commented Dec 12, 2024

fish2022Jul commented Jan 8, 2025

Kadsae commented Jan 14, 2025

bkille commented Jan 16, 2025

DrMahsaRahnama76 commented Jan 17, 2025

Kadsae commented Jan 20, 2025 •

edited

Loading

bkille commented Jan 23, 2025

Kadsae commented Jan 23, 2025

One nucleotide shift for every SNP Positions #169

One nucleotide shift for every SNP Positions #169

Comments

DrMahsaRahnama76 commented Dec 10, 2024

DrMahsaRahnama76 commented Dec 12, 2024

bkille commented Dec 12, 2024

fish2022Jul commented Jan 8, 2025

Kadsae commented Jan 14, 2025

bkille commented Jan 16, 2025

DrMahsaRahnama76 commented Jan 17, 2025

Kadsae commented Jan 20, 2025 • edited Loading

bkille commented Jan 23, 2025

Kadsae commented Jan 23, 2025

Kadsae commented Jan 20, 2025 •

edited

Loading