We should strip starting and trailing Ns in sequences
#3666
Labels
preprocessing
Issues related to the preprocessing component
https://pathoplexus.org/seq/PP_0011CQ4.2, the 2025 Ebola sequence, starts and ends with a short run of Ns, even in the unaligned sequence, because that is how it was submitted. In contrast, the other Ebola sequences I examined (e.g. https://pathoplexus.org/seq/PP_000034P.4) have no starting or trailing Ns in the unaligned sequence (perhaps due to processing at INSDC, perhaps not).
This has two consequences:
- Sequences differ depending on whether Ns are trimmed, in a way that doesn't reflect any biological or technical difference.
- With untrimmed Ns there could be a very misleading "genome length" in this case.

IMO we should trim Ns at the start and end of sequences early in the preprocessing pipeline.
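
A minimal sketch of what that trimming step could look like, assuming the unaligned sequence is available as a plain string at that point in the pipeline. `trim_terminal_ns` is a hypothetical helper name for illustration, not an existing function in the preprocessing code, and whether lowercase `n` should also be stripped is an assumption here.

```python
def trim_terminal_ns(sequence: str) -> str:
    """Remove runs of N (upper or lower case) from both ends of a sequence.

    Internal Ns are left untouched, since they may mark real coverage gaps.
    Hypothetical helper for illustration only.
    """
    # str.strip with a character set removes any of those characters
    # from both ends until a different character is reached.
    return sequence.strip("Nn")


raw = "NNNACGTNNACGTNN"
trimmed = trim_terminal_ns(raw)
print(trimmed)                 # ACGTNNACGT
print(len(raw), len(trimmed))  # 15 10
```

Computing the reported genome length from the trimmed string rather than the submitted one would avoid the misleading length described above. A sequence consisting only of Ns comes back as an empty string, so the caller would need to decide how to handle that case.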