
Large increase in memory usage in latest version #120

rob-p opened this issue Feb 5, 2025 · 10 comments

rob-p commented Feb 5, 2025

Hi @jguhlin,

Thanks for all of the recent changes to minimap2-rs. The new builder is much harder to misuse, and with the new index handling, I'm able to get rid of the hacks I had before when using the index from many threads.

However, one issue I've noticed is that while it produces identical output to the previous version (when modified to account for the read name for tie-breaking), the memory usage is considerably higher, especially as more threads are used. For example, in oarfish, when mapping about 5M nanopore reads to the human transcriptome, the previous version used just north of 5G of RAM with 32 threads (accounting for all other data structures in the program as well), while with the new minimap2-rs version it uses over 10G of RAM. Given the temporary allocation of multiple Arcs wrapping the read name, I'd expect a small increase, but nowhere near this much. This seems much more likely to be (1) a much larger per-thread buffer, (2) a subtle memory leak, as the usage seems to increase slowly but steadily over time, or (3) a combination of both.

I've tracked the issue down to minimap2-rs, as when I remove the actual calls to map, the memory usage is identical between versions, so the culprit must be something happening in, or triggered by, the map function. I'm happy to help investigate, but wanted to see if you had any immediate ideas.

Note: The big difference appears in the newest version; it is not present in 1.21 but is present in 1.22. So whatever is causing the difference was introduced between 1.21 and 1.22.

Thanks!
Rob

Cc @zzare-umd

rob-p commented Feb 5, 2025

Just a small update here: in addition to localizing this increase to the map function, it seems to happen only when I build the aligner with the with_cigar function (which I need). So I presume this may have something to do with how the CIGAR strings are generated, and perhaps memory not being freed or being over-allocated?
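
For concreteness, this is roughly the kind of setup I mean. It's a sketch rather than the exact oarfish code, "transcripts.fa" is a placeholder path, and the map() argument list here is approximate and may differ between versions:

// Sketch only, not the exact oarfish code.
let aligner = Aligner::builder()
    .map_ont()     // ONT preset
    .with_cigar()  // <- removing this call makes the memory growth disappear
    .with_index("transcripts.fa", None)
    .expect("failed to build index");

// Per-read mapping; the exact map() signature varies between versions,
// roughly (seq, cs, md, max_frag_len, extra_flags, query_name).
let seq: &[u8] = b"ACGTACGTACGT";
let name: &[u8] = b"read_1";
let mappings = aligner
    .map(seq, false, false, None, None, Some(name))
    .expect("mapping failed");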

rob-p commented Feb 7, 2025

Hi @jguhlin,

Just another update. I seem to have found the change responsible for the large increase in memory usage, but I don't yet understand why. In v1.21 and earlier, ThreadLocalBuffer is defined like this:

/// ThreadLocalBuffer for minimap2 memory management
#[derive(Debug)]
struct ThreadLocalBuffer {
    buf: *mut mm_tbuf_t,
    max_uses: usize,
    uses: usize,
}

impl ThreadLocalBuffer {
    pub fn new() -> Self {
        let buf = unsafe { mm_tbuf_init() };
        Self {
            buf,
            max_uses: 15,
            uses: 0,
        }
    }
    /// Return the buffer, checking how many times it has been borrowed.
    /// Free the memory of the old buffer and reinitialise a new one If
    /// num_uses exceeds max_uses.
    pub fn get_buf(&mut self) -> *mut mm_tbuf_t {
        if self.uses > self.max_uses {
            // println!("renewing threadbuffer");
            self.free_buffer();
            let buf = unsafe { mm_tbuf_init() };
            self.buf = buf;
            self.uses = 0;
        }
        self.uses += 1;
        self.buf
    }

    fn free_buffer(&mut self) {
        unsafe { mm_tbuf_destroy(self.buf) };
    }
}

/// Handle destruction of thread local buffer properly.
impl Drop for ThreadLocalBuffer {
    fn drop(&mut self) {
        unsafe { mm_tbuf_destroy(self.buf) };
        //self.free_buffer();
    }
}

impl Default for ThreadLocalBuffer {
    fn default() -> Self {
        Self::new()
    }
}

But in the newer version, you removed this (hack?) of refreshing the buffer every 15 uses, and the definition looks like:

/// ThreadLocalBuffer for minimap2 memory management
#[derive(Debug)]
struct ThreadLocalBuffer {
    buf: *mut mm_tbuf_t,
    // max_uses: usize,
    // uses: usize,
}

impl ThreadLocalBuffer {
    pub fn new() -> Self {
        let buf = unsafe { mm_tbuf_init() };
        Self {
            buf,
            // max_uses: 15,
            // uses: 0,
        }
    }
    /// Return the buffer, checking how many times it has been borrowed.
    /// Free the memory of the old buffer and reinitialise a new one If
    /// num_uses exceeds max_uses.
    pub fn get_buf(&mut self) -> *mut mm_tbuf_t {
        /* if self.uses > self.max_uses {
            // println!("renewing threadbuffer");
            self.free_buffer();
            let buf = unsafe { mm_tbuf_init() };
            self.buf = buf;
            self.uses = 0;
        }
        self.uses += 1; */
        self.buf
    }

    fn free_buffer(&mut self) {
        unsafe { mm_tbuf_destroy(self.buf) };
    }
}

/// Handle destruction of thread local buffer properly.
impl Drop for ThreadLocalBuffer {
    fn drop(&mut self) {
        // unsafe { mm_tbuf_destroy(self.buf) };
        self.free_buffer();
    }
}

impl Default for ThreadLocalBuffer {
    fn default() -> Self {
        Self::new()
    }
}

This new version leads to substantially higher memory usage than the previous one. In fact, if I just revert this implementation of ThreadLocalBuffer in 1.23, the memory usage goes back to being very similar to 1.21, so this seems to account for almost the entire discrepancy. Any idea what might be going on here? What was the point of the hack in the first place? And why would keeping one buffer around lead to substantially increased memory use?

jguhlin commented Feb 7, 2025

Hey! First, sorry for the delay. My move has gone terribly and will likely result in a court case. So still recovering from that and just put my desk together an hour ago. Second, thanks again for digging into it.

I have no clue where that hack came from, so I removed it recently. I can't imagine I thought of it myself, as it came in during my very early FFI days, but I can't find where I would have copied it from either. I've even looked at the mappy code in minimap2 from a few years ago, which was my original inspiration for this, and it isn't there.

So, my memory has failed me. We can add the hack back in, but since minimap2 doesn't need it, I wonder what the difference is that's causing problems here....

jguhlin commented Feb 7, 2025

Maybe it's related to this?

https://github.com/lh3/minimap2/blob/618d33515e5853c4576d5a3d126fdcda28f0e8a4/map.c#L367

which isn't represented in the Rust code; the count-to-15 must have been a stand-in for it.

I'll look into porting that functionality over, as it looks more principled (too much memory? reinit, rather than just counting to 15).
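
Just to sketch the policy I mean (capacity-based reinit instead of a fixed use count), in plain Rust with a Vec standing in for the kalloc arena — illustration only, not the FFI code:

/// Illustration only: a Vec<u8> stands in for the kalloc arena so the reset
/// policy is visible without any FFI.
struct CapBuffer {
    arena: Vec<u8>,
    cap_bytes: usize, // analogous to mapopt.cap_kalloc
}

impl CapBuffer {
    fn new(cap_bytes: usize) -> Self {
        Self { arena: Vec::new(), cap_bytes }
    }

    /// Hand out the arena, but reinitialise it first if it has grown past the
    /// cap -- i.e. "too much memory? reinit" rather than "count to 15".
    fn get_buf(&mut self) -> &mut Vec<u8> {
        if self.arena.capacity() > self.cap_bytes {
            self.arena = Vec::new(); // drop the old allocation entirely
        }
        &mut self.arena
    }
}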

rob-p commented Feb 7, 2025

Hi @jguhlin,

Sorry to hear about your move! Things have been going absolutely horribly here in the US, so digging into oarfish and minimap2-rs is a beautiful diversion.

This is a great find, and it logically seems closely related. My only immediate question would be: since we're passing the buffer into mm_map, which calls mm_map_frag, presumably minimap2 itself would be triggering this reset anyway? This seems like a promising place to look, though. Let me know what you find and I'll take a peek as well.

Thanks!
Rob

jguhlin commented Feb 7, 2025

I've noticed. May not visit for a few years, though families probably won't let that happen.

That's true, and makes sense. It should be triggering from within minimap2. I'll plan to restore the hack early next week, and then work on a real fix. It could be something else cleaning up frees in the result processing. Something is still missing....

There's no way I can pass a bumpalo arena to it, is there? Haha.

rob-p commented Feb 7, 2025

Hi @jguhlin,

I don't think there's a way to pass a bumpalo, given mm2 uses kmalloc (I realize you were mostly joking).

On the plus side, I gained some insight. Up through v1.21, you had "the hack" in place. This effectively kept total memory usage in check as the thread count grew: even though a buffer refresh wasn't triggered based on size, it was triggered every 15 calls to get_buf(), which is quite frequent, so no thread-local buffer was likely to stay large for long. Certainly, you didn't see situations where multiple threads' buffers got large at the same time.

On the other hand, the default value for cap_kalloc used by minimap2 is ~500M. This means that for 32 threads, if each buffer grew as large as possible, you could end up at 16G! Now, I never saw that, but what I did see was multiple threads growing quite large buffers, leading to a total increase in usage of ~4G for our test dataset. So the threads were using a lot of memory, but never past the cap_kalloc bound, and therefore the cleanup within mm_map_frag was never being triggered.

To address this, I set the cap_kalloc in the mapopts to a smaller value. Specifically, I used this:

    let per_thread_cap_kalloc = (1_000_000_000_f64 / (args.threads as f64)).ceil() as i64;
    aligner.mapopt.cap_kalloc = per_thread_cap_kalloc;

This ensures that, even if all buffers are full simultaneously, we only use ~1G of extra RAM in total. Obviously this is adjustable, and the optimal policy may be different (e.g. you may want to allow >1G if you have a huge number of threads, like 128).

Nonetheless, this effectively addresses the memory usage I am seeing. So, I think it's not necessary to revert to the hack, but rather to note this behavior in the documentation somewhere so that users know they may have to change the default if they are using many threads.
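
For instance, something along these lines — a hypothetical helper just to show the arithmetic, with a floor so very high thread counts don't starve each thread's arena:

/// Hypothetical helper: split a total extra-RAM budget across threads, but
/// never let the per-thread cap drop below a floor (each thread's kalloc
/// arena still needs some headroom).
fn per_thread_cap_kalloc(total_budget_bytes: u64, threads: usize, floor_bytes: u64) -> i64 {
    let per_thread = (total_budget_bytes as f64 / threads as f64).ceil() as u64;
    per_thread.max(floor_bytes) as i64
}

// e.g. ~1G total across 32 threads, but never below 64M per thread:
// aligner.mapopt.cap_kalloc = per_thread_cap_kalloc(1_000_000_000, 32, 64_000_000);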

I wanted to raise one other question, but if you'd prefer me to move this to another issue so you can close this, please let me know. I noticed that in the CS and CIGAR string generation, for each alignment there is the allocation of a new km. In other words:


// This solves a weird segfault...
let km = km_init();

let _cs_len = mm_gen_cs(
    km,
    &mut cs_string,
    &mut m_cs_string,
    idx,
    mm_reg1_const_ptr,
    seq.as_ptr() as *const libc::c_char,
    true.into(),
);

let _cs_string = std::ffi::CStr::from_ptr(cs_string)
    .to_str()
    .unwrap()
    .to_string();

libc::free(cs_string as *mut c_void);
km_destroy(km);

That km_init and km_destroy happen for every alignment, which is a lot of allocations. Given that kalloc is designed to be fast for many small allocations, this may not be a huge deal, but I wonder why it's necessary and whether it can be avoided, to reduce the overhead of generating CS/CIGAR strings.
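
Naively, I'd expect something along these lines to work — a rough, untested sketch using the same minimap2-sys items as the snippet above (km_init, km_destroy, mm_gen_cs, mm_idx_t, mm_reg1_t); whether it reintroduces the segfault is exactly the question:

// Rough sketch, untested: one kalloc arena per read instead of one per alignment.
unsafe fn cs_strings_for_read(
    idx: *const mm_idx_t,
    regs: &[mm_reg1_t],
    seq: &[u8],
) -> Vec<String> {
    let km = km_init(); // one arena for the whole read
    let mut cs_string: *mut libc::c_char = std::ptr::null_mut();
    let mut m_cs_string: libc::c_int = 0;
    let mut out = Vec::with_capacity(regs.len());
    for reg in regs {
        let cs_len = mm_gen_cs(
            km,
            &mut cs_string,
            &mut m_cs_string,
            idx,
            reg as *const mm_reg1_t,
            seq.as_ptr() as *const libc::c_char,
            true.into(),
        );
        let cs = std::slice::from_raw_parts(cs_string as *const u8, cs_len as usize);
        out.push(String::from_utf8_lossy(cs).into_owned());
    }
    libc::free(cs_string as *mut libc::c_void); // the string buffer is reused/realloc'd across calls
    km_destroy(km); // destroyed once per read instead of once per alignment
    out
}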

Thanks!
Rob

jguhlin commented Feb 9, 2025

let per_thread_cap_kalloc = (1_000_000_000_f64 / (args.threads as f64)).ceil() as i64;
aligner.mapopt.cap_kalloc = per_thread_cap_kalloc;

Awesome. I'll make a note of it, and likely provide a function to set it. Great find!

For the CIGAR strings, there was a weird segfault and this was the only way I found to fix it. But it's definitely a hack. I'll try to dig into it again later this week and see if I can fix the actual issue.

jguhlin commented Feb 9, 2025

@rob-p I assume you also saw this slight blunder? #119

I was setting best_n to 1 for some reason, instead of letting it follow minimap2's default. So expect some results to change depending on how you were setting things.

rob-p commented Feb 10, 2025

Hi @jguhlin,

Yes, I did see that issue. Luckily, we're explicitly setting best_n (it needs to be a high value in oarfish anyway, to catch reads that multimap to many transcripts), so we weren't affected by the change in the default.
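
Concretely, something along these lines — the value here is illustrative, not the exact oarfish setting:

// Illustrative value only, not the exact oarfish setting.
aligner.mapopt.best_n = 100;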
