Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One nucleotide shift for every SNP Positions #169

Open
DrMahsaRahnama76 opened this issue Dec 10, 2024 · 9 comments
Open

One nucleotide shift for every SNP Positions #169

DrMahsaRahnama76 opened this issue Dec 10, 2024 · 9 comments

Comments

@DrMahsaRahnama76
Copy link

Hello,

I ran parsnp version 2.1.1 and 2.0.6, but both of them found wrong locations for SNP positions. For example, it should be 12730 but these versions found 12731. I made the VCF file from .ggr file by this code: harvesttools -i parsnp.ggr -V parsnp.vcf.
I, also, ran parsnp version 1.5.4 and 1.5.3, but although they showed the correct posiotions, they did not make the .mblocks file and I need the SNP matrix for my project.

I appriciate your help.

Best,
Mahsa

@bkille bkille closed this as completed in 87bba27 Dec 10, 2024
@DrMahsaRahnama76
Copy link
Author

Hi,

The issue is although you mentioned "use 0-based start-end coords for .xmfa. Fixes #169", still the output of pasnp 2.1.1 is one nucleotide shift for every SNP Position. I tried to use parsnp 2.1.1 for 6 isolates from NCBI ( SRR863393, SRR863392, SRR1264965, SRR1514990, SRR1586566, and SRR1586589) with the reference sequence of NZ_CP016581.1. I got the VCF file from parsnp.ggr and the positions are like
12731
21098
34097
62195
157793
164691
218493 ...

instead of:
12730
21097
34096
62194
70249
157792
164690
218492 ...

In addition, I need the exact positions of bases in parsnp.snps.mblocks file which is used to build parsnp.tree file. The VCF file I got showes 189 positions, while parsnp.snps.mblocks includes 120 bases for each isolate. Can you please tell me how I can have access to the exact positions of bases that are used to build parsnp.tree?

Thank you so much.
Best,
Mahsa

@bkille
Copy link
Contributor

bkille commented Dec 12, 2024

Hi @DrMahsaRahnama76 ,

Yes, the fix will apply to the next release of Parsnp (2.1.2). However, this issue unfortunately may be with HarvestTools, which expects the xmfa file to be 0-indexed when it should be 1-indexed. I'll reopen this for now until it is fixed.

As far as the mblocks file goes, the reason for the 120 vs 189 difference is that the SNPs used to build the tree are a filtered subset (for example, no indels). There's currently no way to recover the coordinates from the mblocks file alone.

Best,
Bryce

@bkille bkille reopened this Dec 12, 2024
@fish2022Jul
Copy link

Could you update parsnp in bioconda from 2.1.1 to 2.1.2, please?

@Kadsae
Copy link

Kadsae commented Jan 14, 2025

Hi,

do you have a time window for when this issue with HarvestTools may be adressed and fixed? Unfortunately an upgrade to parsnp 2.1.2 did not fix the issue mentioned above.

Also, the results we gathered from parsnp2 not only differ from the results gathered with parsnp1, but most of the SNPs also get flagged as homoplastic SNPs instead of unique ones. Maybe this issue is connected?

Kind regards,
Kadsae

@bkille
Copy link
Contributor

bkille commented Jan 16, 2025

@fish2022Jul, Parsnp v2.1.2 is on conda now. Please let me know if you run into any issues.

@Kadsae, to clarify, you are seeing off by one errors in v2.1.2 as well? The coordinates in 2.1.2 should be shifted (corrected) relative to other Parsnp 2 releases:

==> vcf-coords-2.1.1/parsnp.vcf <==
#CHROM  POS     ID      REF     ALT
England1.split.fna.ref  926     CTCAAGTTTG.GTTACTTCTG   G       A
England1.split.fna.ref  1518    TGGTTTAATG.CGTTGCGTGA   C       T
England1.split.fna.ref  2395    GTTTGTGTGC.GTTCCTGACT   G       A
England1.split.fna.ref  3136    TCTGGTGCCC.ACGACACGTA   A       G

==> vcf-coords-2.1.2/parsnp.vcf <==
#CHROM  POS     ID      REF     ALT
England1.split.fna.ref  925     CTCAAGTTTG.GTTACTTCTG   G       A
England1.split.fna.ref  1517    TGGTTTAATG.CGTTGCGTGA   C       T
England1.split.fna.ref  2394    GTTTGTGTGC.GTTCCTGACT   G       A
England1.split.fna.ref  3135    TCTGGTGCCC.ACGACACGTA   A       G

Also, I've create a new issue (#171) regarding your second question.

@DrMahsaRahnama76
Copy link
Author

Hi,

I ran version 2.1.2, but the problem still exists. Can you please tell me which version of Harvesttools you used? You mentioned in your last comment that the positions are only different and ALT are the same, but in my results, I got the wrong ALT according to their wrong positions. For example, it showed 12731 with ALT G instead of 12730 with ALT A.

Thank you.
Best,
Mahsa

@Kadsae
Copy link

Kadsae commented Jan 20, 2025

Hi @bkille,

thank you for the fast responses and the eagerness to solve my issue(s).

For some more clarification: We have an inhouse database with around 800 bacterial genomes.
For testing I ran the parsnp command with standard parameters on this database with parsnp versions 1.6.2, 2.1.1 and 2.1.2.

Command:
parsnp -p $THREADS -v --vcf -C 1000 -e -u -d $GENOMES -r $REF -c -o $OUTDIR

For v1.6.2 I got 9719 SNPs reported, in comparison to published SNPs at the same positions.
For v2.1.1 I got 5998 SNPs reported, in comparison to published SNPs exactly off by one (+1).
For v2.1.2 I got 6300 SNPs reported, in comparison to published SNPs exactly off by one (+1).

During the v2.1.2 run, one partition failed - therefore 50 genomes got excluded from the analysis.

So yes, it seems the nucleotide shift is still there.

Best,
Kadsae

@bkille
Copy link
Contributor

bkille commented Jan 23, 2025

Hi @Kadsae,

This is very strange. Other than some error reporting improvements, the only change between v2.1.1 and v2.1.2 is the use of the
##FormatVersion Mauve header instead of ##FormatVersion Parsnp in the xmfa output file. This is what signals Harvest Tools to treat the xmfa as 0-indexed. I'm surprised to see you run into a partition failure with one version but not another.

With respect to the nucleotide shift issue, I created separate conda envs for v1.6.2, v2.1.1, and v2.1.2 and ran on example data. The coordinates of v2.1.2 match v1.6.2 and are indeed -1 relative to v2.1.1.

>$ tail example-out-v*/*.vcf -n 5 | cut -f1-5
==> example-out-v1.6.2/parsnp.vcf <==
gi|471258596|gb|KC164505.2|     29722   CTAAGGAGCA.TCGTGTGCAA   T       G
gi|471258596|gb|KC164505.2|     29742   GGTACTCAGC.GCACTCGCAC   G       A
gi|471258596|gb|KC164505.2|     29804   TGATTAGTGT.CACTCAAAGT   C       T
gi|471258596|gb|KC164505.2|     29844   TTGTGTTTGG.TAACCCCATC   T       C
gi|471258596|gb|KC164505.2|     29847   TGTTTGGTAA.CCCCATCTCA   C       T

==> example-out-v2.1.1/parsnp.vcf <==
England1.fna.ref        29723   CTAAGGAGCA.TCGTGTGCAA   T       G
England1.fna.ref        29743   GGTACTCAGC.GCACTCGCAC   G       A
England1.fna.ref        29805   TGATTAGTGT.CACTCAAAGT   C       T
England1.fna.ref        29845   TTGTGTTTGG.TAACCCCATC   T       C
England1.fna.ref        29848   TGTTTGGTAA.CCCCATCTCA   C       T

==> example-out-v2.1.2/parsnp.vcf <==
England1.fna.ref        29722   CTAAGGAGCA.TCGTGTGCAA   T       G
England1.fna.ref        29742   GGTACTCAGC.GCACTCGCAC   G       A
England1.fna.ref        29804   TGATTAGTGT.CACTCAAAGT   C       T
England1.fna.ref        29844   TTGTGTTTGG.TAACCCCATC   T       C
England1.fna.ref        29847   TGTTTGGTAA.CCCCATCTCA   C       T

Are you using the version from Bioconda, or a locally built version? My only guess is that if you are using a locally built Parsnp install, the parsnp_core binary hasn't been rebuilt in order to output XMFA files w/ the Mauve header, other than that its not immediately clear what could be causing this discrepancy...

Please feel free to email me if you'd prefer to discuss more offline.

@Kadsae
Copy link

Kadsae commented Jan 23, 2025

Hi @bkille,

I did indeed use a locally built version for my last tests. This time I created new conda envs and used the bioconda versions of parsnp v1.6.2 and parsnp 2.1.2 respectively.

Using the same dataset as last time (800 bacterial genomes), I got the same results as last time. Surprisingly, I did not run into a partition failure this time.

I also did another test and tried to experiment with the parameter partition-size. First, I created a subset of 50 of our 800 bacterial genomes, then I ran v1.6.2 and v2.1.2 - once with the flag -no-partition and once with the flag -min-partition-size set to 2 partitions.

To my surprise, the results from v1.6.2 and v2.1.2 without partitioning match! The run with partitioning again showed SNPs off by +1 to the other runs. So it seems that the error with the nucleotide shifts has something to do with the partitioning of parsnp2 for bigger datasets.

Another interesting aspect is that v1.6.2 reported 2703 SNPs, v2.1.2 w/o partitioning 2786 SNPs and v2.1.2 w/ partitioning 2211 SNPs. You mentioned in #171 your findings showed that parsnp2 should find more SNPs than parsnp1 and without the partitioning rule this statement seems to be correct.

Lastly, I also checked the status of the SNPs across all three runs and for v1.6.2 and v2.1.2 w/o partitioning close to all of the SNPs are unique, while the run of v2.1.2 w/ partitioning again shows the problem reported in #171 (many homoplastic SNPs).

Summarized, I think both problems, this one and #171, are connected and have something to do with the partitioning-flag.

I hope I could help a little. If you want more info or discuss more offline feel free to email me as well.

All the best,
Kadsae

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants