We should strip starting and trailing Ns in sequences #3666

Open
theosanderson opened this issue Feb 11, 2025 · 1 comment
Labels
preprocessing Issues related to the preprocessing component

Comments

@theosanderson
Member

https://pathoplexus.org/seq/PP_0011CQ4.2, the 2025 Ebola sequence, starts and ends with a short run of Ns, even in the unaligned sequence, because that is how it was submitted. In contrast, other Ebola sequences I examined (e.g. https://pathoplexus.org/seq/PP_000034P.4) have no leading or trailing Ns in the unaligned sequence (perhaps due to processing at INSDC, or perhaps not).

This has two consequences:

  • it creates an inconsistency between this sequence (and any like it) and other sequences where these Ns have been trimmed, and that inconsistency doesn't reflect any biological or technical difference
  • it means that the "length" of this sequence doesn't reflect the real usable length (though in this case the two are pretty similar!). In an extreme case, where only a short contiguous region was sequenced and the rest of the genome was filled in with Ns, the reported "genome length" could be very misleading.

IMO we should trim Ns at the start and end of sequences early in the preprocessing pipeline.
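
As a minimal sketch of what that trimming step could look like (the function name `trim_terminal_ns` is hypothetical and not part of the existing preprocessing code, and treating lowercase `n` the same as `N` is an assumption):

```python
def trim_terminal_ns(seq: str) -> tuple[str, int, int]:
    """Strip leading and trailing runs of N from a nucleotide sequence.

    Returns (trimmed sequence, leading Ns removed, trailing Ns removed).
    Internal Ns are left untouched. Hypothetical helper for illustration.
    """
    # Assumption: lowercase 'n' is treated like 'N'.
    without_leading = seq.lstrip("Nn")
    leading = len(seq) - len(without_leading)
    trimmed = without_leading.rstrip("Nn")
    trailing = len(without_leading) - len(trimmed)
    return trimmed, leading, trailing


example = "NNNNACGTNNACGTNN"
trimmed, leading, trailing = trim_terminal_ns(example)
print(trimmed, leading, trailing)  # ACGTNNACGT 4 2
print(len(example), len(trimmed))  # 16 10
```

Returning the two counts alongside the trimmed sequence would let a pipeline record how much was removed, and the second `print` shows the gap between the submitted length and the usable length described above.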

theosanderson added the discussion label (Open questions) on Feb 11, 2025
@dpark01

dpark01 commented Feb 12, 2025

My own opinion is that it would make sense to auto-trim contiguous stretches of leading and trailing Ns, as they carry no information, and this would be more consistent with how all other genome repositories handle things.

NCBI GenBank will actually reject submissions where the first or last "line" of the FASTA exceeds 40% N content (I'm not sure how long a "line" is, especially the last line of a FASTA, but that's just what table2asn does). I don't recommend doing that per se, but there's no reason not to chop off pure-N runs at the edges.
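
For comparison, a rough sketch of a check in the spirit of that 40% rule. This is not table2asn's actual logic: the 60-character window standing in for a FASTA "line" is an assumption, as are the function names, and a real final FASTA line is usually shorter than the line width.

```python
def n_fraction(segment: str) -> float:
    """Fraction of characters in the segment that are N (case-insensitive)."""
    if not segment:
        return 0.0
    return sum(base in "Nn" for base in segment) / len(segment)


def passes_terminal_n_check(seq: str, window: int = 60, max_fraction: float = 0.4) -> bool:
    """Return True if neither terminal window exceeds max_fraction N content.

    `window` is a hypothetical stand-in for the length of a FASTA line.
    """
    first = seq[:window]
    last = seq[-window:]
    return n_fraction(first) <= max_fraction and n_fraction(last) <= max_fraction


print(passes_terminal_n_check("ACGT" * 30))             # True
print(passes_terminal_n_check("N" * 30 + "ACGT" * 30))  # False
```

A check like this rejects a sequence outright, whereas trimming pure-N runs at the edges keeps the sequence and only removes the uninformative ends.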

theosanderson added the preprocessing label (Issues related to the preprocessing component) and removed the discussion label (Open questions) on Feb 13, 2025