Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the parsers for better compatibility and memory usage #97

Merged
merged 6 commits into from
Feb 28, 2024
Merged

Refactor the parsers for better compatibility and memory usage #97

merged 6 commits into from
Feb 28, 2024

Conversation

althonos
Copy link

Hi again!

This PR refactors the code and interface of the alignment loading in the I/O handlers:

1. A more generic interface for format handlers

I added a BaseFormatHandler::LoadAlignment(std::istream&) method to all format handlers, which takes care of parsing the file content without opening/closing the file. The original BaseFormatHandler::LoadAlignment(std::string &filename) is still supported but simply takes care of opening and checking the file before passing it to the other method.

This also allowed to refactor some common code for handling files between all the different handlers straight into the BaseFormatHandler base method implementation.

This is something that's needed in Pytrimal so I can use the parser implementations with a Python file-like objects that may be provided by the user and not necessarily corresponds to a file opened somewhere on the local filesystem.

2. A rewrite of the readLine util and calling methods.

The way readLine was currently implemented has three problems:

  1. It has no way to reuse memory used internally in the getline call, so a new buffer needs to be allocated every time.
  2. Its use of string::erase one character at a time to remove leading whitespaces is very inefficient (which is probably not so bad in most of the cases because leading whitespaces are rare, but they may happen).
  3. It allocates a new line on every call, which will be deallocated later instead of being recycled.

This is unfortunate because this method is called in hot loops of all parsers, and since parsers are likely to read more than one line per alignment there is room to reuse buffers. When I profiled the FASTA reader on example.015.AA.bctoNOG.ENOG41099F3.fasta, it showed that malloc was called 975,152 times:
image

To address this, I changed readLine so that:

  • It takes a string buffer as a second argument. It will clear the buffer (with string::clear), and then use it to call getline. The cleared strings typically does not need to reallocate before receiving its new content.
  • std::string::find_first_not_of and std::string::find_last_not_of allows trimming the line without actually copying data.
  • The return pointer points to the provided buffer rather than a new copy, so the caller doesn't need to free any memory. The pointer is invalidated on the next call or when buffer is modified.

These changes lead to way less deallocation/reallocations since the string buffer memory is recycled and one unneeded copy is eliminated. In addition most memory management is done by the buffer directly: this removes many instances of manual deallocation in the code of all parsers, because the deallocation just happens when the buffer goes out of scope.

Using the same file as before, malloc is now called only 16,665 times:
image

This is not really performance critical but I figured I'd give it a shot while updating the I/O code, it also simplifies the code in several format handlers. The actual performance effect depends heavily on the malloc on the host machine but usually the less time is spend outside of the allocator, the better 😄

Copy link
Collaborator

@nicodr97 nicodr97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful!
I just marked some commented lines that I guess that can be removed.

@althonos
Copy link
Author

I removed them now, thanks for the review 😄

@althonos althonos requested a review from nicodr97 February 27, 2024 23:21
@nicodr97 nicodr97 merged commit 3b3018c into inab:2.0_RC Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants