We should strip starting and trailing `N`s in sequences #3666

theosanderson · 2025-02-11T14:21:13Z

The 2025 Ebola sequence, https://pathoplexus.org/seq/PP_0011CQ4.2, starts and ends with a short run of Ns, even in the unaligned sequence, because that's how it was submitted. In contrast, other Ebola sequences I examined (e.g. https://pathoplexus.org/seq/PP_000034P.4) have no leading or trailing Ns in the unaligned sequence (perhaps due to processing at INSDC, or perhaps not).

This has two consequences:

- It creates an inconsistency between this sequence (and any like it) and other sequences where these Ns are trimmed, and that inconsistency doesn't reflect any biological or technical difference.
- It means that the "length" of this sequence doesn't reflect the real usable length (though in this case the two are pretty similar!). In an extreme case, where only a short contiguous stretch was sequenced and the rest of the genome was filled in with Ns, the reported "genome length" could be very misleading.

IMO we should trim Ns at the start and end of sequences early in the preprocessing pipeline.
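
For illustration, a minimal sketch of what such a trimming step might look like. The function name `strip_terminal_ns` is hypothetical and not taken from the existing preprocessing code, and treating lowercase `n` the same as `N` is an assumption:

```python
def strip_terminal_ns(seq: str) -> str:
    """Remove contiguous runs of N (or n) from both ends of a sequence.

    Internal Ns are left untouched; a sequence made up only of Ns
    becomes the empty string.
    """
    return seq.strip("Nn")


# A short example where padding inflates the apparent length.
padded = "NNNNACGTNNACGTNNN"
trimmed = strip_terminal_ns(padded)
print(len(padded), len(trimmed))  # 17 10
```

Only the terminal runs are removed, so the internal `NN` survives, and the length computed after trimming is closer to the usable length described above.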


dpark01 · 2025-02-12T20:06:45Z

My own opinion is that it would make sense to auto-trim contiguous stretches of leading and trailing Ns, as they carry no information, and this would be more consistent with how other genome repositories handle things.

NCBI GenBank will actually reject submissions where the first or last "line" of the FASTA exceeds 40% N content (I'm not sure how long a "line" is, especially the last line of a FASTA, but that's just what table2asn does). I don't recommend doing that per se, but there's no reason not to chop off pure-N runs at the edges.
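
To make the difference between the two approaches concrete, here is a rough sketch of an edge-line check of the kind described above. It is not table2asn's actual logic: the function names are hypothetical, and the 60-character line width is purely an assumption, since the line length is not known here.

```python
def n_fraction(segment: str) -> float:
    """Fraction of characters in the segment that are N (case-insensitive)."""
    if not segment:
        return 0.0
    return segment.upper().count("N") / len(segment)


def edge_lines_exceed(seq: str, line_width: int = 60, threshold: float = 0.4) -> bool:
    """Return True if the first or last line has an N fraction above threshold.

    line_width is an assumed FASTA wrap width, not a documented value.
    """
    lines = [seq[i:i + line_width] for i in range(0, len(seq), line_width)]
    if not lines:
        return False
    return n_fraction(lines[0]) > threshold or n_fraction(lines[-1]) > threshold


padded = "N" * 30 + "ACGT" * 30
print(edge_lines_exceed(padded))             # True
print(edge_lines_exceed(padded.strip("N")))  # False
```

Under these assumptions, a sequence that would be flagged by the edge-line check passes once the pure-N run at the edge is chopped off, without rejecting the submission.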

theosanderson added the discussion Open questions label Feb 11, 2025

theosanderson added preprocessing Issues related to the preprocessing component and removed discussion Open questions labels Feb 13, 2025
