Refactor the parsers for better compatibility and memory usage #97

althonos · 2024-02-14T17:27:17Z

Hi again!

This PR refactors the code and interface of the alignment loading in the I/O handlers:

1. A more generic interface for format handlers

I added a BaseFormatHandler::LoadAlignment(std::istream&) method to all format handlers, which takes care of parsing the file content without opening/closing the file. The original BaseFormatHandler::LoadAlignment(std::string &filename) is still supported but simply takes care of opening and checking the file before passing it to the other method.

This also made it possible to move the file-handling code that was duplicated across the different handlers straight into the BaseFormatHandler base method implementation.

This is needed in Pytrimal so that I can use the parser implementations with Python file-like objects, which may be provided by the user and do not necessarily correspond to a file opened somewhere on the local filesystem.
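The two-overload design described above can be sketched roughly as follows. This is a minimal illustration and not the actual trimAl code: the class name `SketchHandler`, the `long` return type, the line-counting body and the helper `loadFromMemory` are all assumptions made for the example. Only the idea of a filename overload that opens and checks the file before delegating to an `std::istream&` overload comes from the description.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical handler: names and return types are assumptions for illustration.
class SketchHandler {
public:
    virtual ~SketchHandler() = default;

    // Stream overload: parses the content and never opens or closes a file.
    // As a stand-in for real parsing, it counts the non-empty lines.
    virtual long LoadAlignment(std::istream& in) {
        long lines = 0;
        std::string buffer;
        while (std::getline(in, buffer)) {
            if (!buffer.empty()) ++lines;
        }
        return lines;
    }

    // Filename overload: opens and checks the file, then delegates
    // to the stream overload. Returns -1 if the file cannot be opened.
    long LoadAlignment(const std::string& filename) {
        std::ifstream file(filename);
        if (!file.is_open()) return -1;
        return LoadAlignment(file);
    }
};

// In-memory content goes through the same parser; no file on disk is involved.
long loadFromMemory(SketchHandler& handler, const std::string& text) {
    std::istringstream in(text);
    return handler.LoadAlignment(in);
}
```

With this shape, a binding only needs to wrap its data source in an `std::istream` to reuse every parser; how Pytrimal does that wrapping is not shown here.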

2. A rewrite of the `readLine` util and calling methods.

The previous implementation of readLine had three problems:

- It has no way to reuse the memory used internally by the getline call, so a new buffer needs to be allocated every time.
- Its use of string::erase to remove leading whitespace one character at a time is very inefficient (which is probably not so bad in most cases because leading whitespace is rare, but it may happen).
- It allocates a new line on every call, which is deallocated later instead of being recycled.

This is unfortunate because this method is called in the hot loops of all parsers, and since parsers are likely to read more than one line per alignment, there is room to reuse buffers. When I profiled the FASTA reader on example.015.AA.bctoNOG.ENOG41099F3.fasta, the profile showed that malloc was called 975,152 times.

To address this, I changed readLine so that:

- It takes a string buffer as a second argument. It clears the buffer (with string::clear) and then uses it to call getline. A cleared string typically does not need to reallocate before receiving its new content.
- It uses std::string::find_first_not_of and std::string::find_last_not_of, which allow trimming the line without actually copying data.
- The returned pointer points into the provided buffer rather than to a new copy, so the caller doesn't need to free any memory. The pointer is invalidated on the next call or when the buffer is modified.

These changes lead to far fewer deallocations and reallocations, since the string buffer memory is recycled and one unneeded copy is eliminated. In addition, most memory management is now done by the buffer itself: this removes many instances of manual deallocation in the code of all parsers, because the deallocation simply happens when the buffer goes out of scope.

Using the same file as before, malloc is now called only 16,665 times.

This is not really performance-critical, but I figured I'd give it a shot while updating the I/O code; it also simplifies the code in several format handlers. The actual performance effect depends heavily on the malloc implementation of the host machine, but usually the less time is spent inside the allocator, the better 😄


nicodr97

Wonderful!
I just marked some commented-out lines that I guess can be removed.

source/FormatHandling/mega_interleaved_state.cpp

source/FormatHandling/pir_state.cpp

source/FormatHandling/mega_interleaved_state.cpp

althonos · 2024-02-27T23:20:02Z

I removed them now, thanks for the review 😄

althonos added 5 commits February 14, 2024 14:03

Update LoadAlignment of BaseFormatHandler to accept istream as …

c70f4d3

…arguments

Record alignment filename in BaseFormatHandler::LoadAlignment

1528f18

Update all format handler implementations with LoadAlignment(istream&)

fc6adc1

Remove duplicated readLine code for istream and ifstream

d980f86

Update utils::readLine to avoid allocating memory on every line

e6990ef

nicodr97 requested changes Feb 27, 2024


Remove outdated comment lines from parser code

f6892a9

althonos requested a review from nicodr97 February 27, 2024 23:21

nicodr97 approved these changes Feb 28, 2024


nicodr97 merged commit 3b3018c into inab:2.0_RC Feb 28, 2024
