Refactor the parsers for better compatibility and memory usage #97
Hi again!
This PR refactors the code and interface of the alignment loading in the I/O handlers:
1. A more generic interface for format handlers
I added a `BaseFormatHandler::LoadAlignment(std::istream&)` method to all format handlers, which takes care of parsing the file content without opening or closing the file. The original `BaseFormatHandler::LoadAlignment(std::string &filename)` is still supported, but now simply opens and checks the file before passing it to the other method.

This also made it possible to move some file-handling code that was common to all the handlers straight into the `BaseFormatHandler` base method implementation.

This is needed in Pytrimal so that I can use the parser implementations with Python file-like objects, which may be provided by the user and do not necessarily correspond to a file opened somewhere on the local filesystem.
2. A rewrite of the `readLine` util and calling methods

The way `readLine` was implemented had three problems:

- [...] `getline` call, so a new buffer needs to be allocated every time.
- [...] `string::erase` one character at a time to remove leading whitespace is very inefficient (probably not so bad in most cases, because leading whitespace is rare, but it may happen).

This is unfortunate because this method is called in the hot loops of all parsers, and since parsers are likely to read more than one line per alignment, there is room to reuse buffers. When I profiled the FASTA reader on `example.015.AA.bctoNOG.ENOG41099F3.fasta`, it showed that `malloc` was called 975,152 times.

To address this, I changed `readLine` so that:

- [...] `string` buffer as a second argument. It will clear the buffer (with `string::clear`), and then use it to call `getline`. The cleared string typically does not need to reallocate before receiving its new content.
- [...] `std::string::find_first_not_of` and `std::string::find_last_not_of` allows trimming the line without actually copying data.
- [...] `buffer` is modified.

These changes lead to far fewer deallocations and reallocations, since the string buffer memory is recycled and one unneeded copy is eliminated. In addition, most memory management is done by the buffer directly: this removes many instances of manual deallocation in the code of all parsers, because the deallocation just happens when the buffer goes out of scope.
Using the same file as before, `malloc` is now called only 16,665 times.

This is not really performance critical, but I figured I'd give it a shot while updating the I/O code; it also simplifies the code in several format handlers. The actual performance effect depends heavily on the `malloc` implementation on the host machine, but usually the less time is spent inside the allocator, the better 😄