Speed up fai_retrieve() by reading whole lines at once #1799
While looking around for functions that could benefit from SIMD techniques a few weeks ago, I looked at `fai_retrieve()`'s search for (uncommon) non-graphic characters. However, I soon realised that it already knows exactly where these characters are, so it doesn't need to search or check for them at all!

Because `fai_retrieve()` is given only well-formatted input containing lines of the same length, it already knows exactly where the base and non-graphic characters are. So in general the interval to be read consists of a partial first line, some number of whole lines, and a partial last line. It can therefore be read a line at a time instead of a character at a time, with special handling for the partial first and last lines, and discarding the terminator characters at the end of each line read.
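The line-at-a-time idea can be sketched roughly as follows. This is an illustration only, not the code in this PR: `retrieve_by_line` is a hypothetical name, it copies from an in-memory FASTA image rather than reading through BGZF, and it assumes the caller has already checked that `end` does not exceed the sequence length. The parameters `offset`, `line_blen` and `line_len` are meant to mirror the corresponding `.fai` index fields (byte offset of the first base, bases per line, bytes per line including the terminator).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: copy bases [beg, end) of one sequence out of an
 * in-memory FASTA image, one line at a time. Returns a NUL-terminated
 * string that the caller must free(), or NULL if allocation fails. */
static char *retrieve_by_line(const char *data, size_t offset,
                              size_t line_blen, size_t line_len,
                              size_t beg, size_t end)
{
    assert(line_blen > 0 && line_len >= line_blen && beg <= end);

    size_t remaining = end - beg;
    char *out = malloc(remaining + 1), *dst = out;
    if (!out) return NULL;

    /* Byte position of base `beg`: skip the whole lines before it, then
     * step to its column within its own line. */
    const char *src = data + offset
                    + (beg / line_blen) * line_len + beg % line_blen;

    /* Partial first line: only the bases from `beg` to the end of its line. */
    size_t n = line_blen - beg % line_blen;

    for (;;) {
        if (n > remaining) n = remaining;   /* partial last line */
        memcpy(dst, src, n);
        dst += n;
        remaining -= n;
        if (remaining == 0) break;
        src += n + (line_len - line_blen);  /* discard the line terminator */
        n = line_blen;                      /* later lines are whole lines */
    }
    *dst = '\0';
    return out;
}
```

For example, with the image `">seq\nACGTA\nCGTAC\nGTACG\nTA\n"` (so `offset` 5, `line_blen` 5, `line_len` 6), requesting bases 3 up to but not including 12 copies `TA`, then `CGTAC`, then `GT`, giving `TACGTACGT` without inspecting any individual character.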
Timings in seconds for queries for regions of various lengths, with the iteration counts chosen so that develop took about a minute on each “repeatedly querying the same locus” test:
Opened as a draft PR because this will cause test failures until #1798 is fixed.