
Replace mmap with file io in merkle tree hash calculation #3547

Merged
HaoranYi merged 7 commits into anza-xyz:master from cli_hash_bins on Feb 4, 2025

Conversation

@HaoranYi commented Nov 8, 2024

Problem

We have noticed that the performance of the block-producing process degrades
during hash calculation. Part of this is due to the stress that the accounts
hash threads put on memory and disk I/O.

When we compute the merkle tree hash, we use mmap to store the extracted
accounts' hashes. Mmap is heavy on resource usage, such as memory and disk
I/O, and puts stress on the whole system.

In this PR, we propose switching to file I/O, which is less resource
intensive, for merkle tree hash computation.

Studies on mainnet with this PR show that file I/O uses less memory and puts
less stress on disk I/O. The "pure" hash computation time with file I/O is a
little longer than with mmap, but we also avoid the mmap drop time, and that
saving more than offsets the extra time spent on hash calculation. Thus, the
overall time for computing the hash is smaller.

Note that there is an upcoming lattice hash feature, which will be the
ultimate solution for hashing, i.e. it removes all merkle tree hash
calculation. However, before that feature is activated, we can still use this
PR as an interim enhancement for merkle tree hash computation.
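To illustrate the tradeoff, here is a minimal sketch (not the PR's actual code; the helper names and the 32-byte hash type are assumptions, and the mmap side assumes the memmap2 crate). The mmap approach must size the backing buffer up front, while the file I/O approach simply streams hashes through a buffered writer:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::Path;

const HASH_LEN: usize = 32;
type AccountHash = [u8; HASH_LEN];

// mmap approach: the map size must be chosen (and possibly over-allocated)
// up front, and dropping the map at the end can be expensive.
fn write_hashes_mmap(path: &Path, hashes: &[AccountHash]) -> std::io::Result<()> {
    let file = File::options().read(true).write(true).create(true).open(path)?;
    file.set_len((hashes.len() * HASH_LEN) as u64)?;
    let mut map = unsafe { memmap2::MmapMut::map_mut(&file)? };
    for (i, hash) in hashes.iter().enumerate() {
        map[i * HASH_LEN..(i + 1) * HASH_LEN].copy_from_slice(hash);
    }
    map.flush()
}

// file-io approach: stream through a BufWriter; no size estimate is needed
// and there is no large mapping to tear down on drop.
fn write_hashes_file_io(path: &Path, hashes: &[AccountHash]) -> std::io::Result<()> {
    let mut writer = BufWriter::new(File::create(path)?);
    for hash in hashes {
        writer.write_all(hash)?;
    }
    writer.flush()
}
```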

Summary of Changes

  • replace mmap with file io for merkle tree hash calculation

Fixes #

@HaoranYi marked this pull request as draft November 8, 2024 17:22
@HaoranYi changed the title from "cli hash bins" to "Replace mmap with file io in merkle tree hash calculation" Nov 8, 2024
@HaoranYi force-pushed the cli_hash_bins branch 2 times, most recently from f8d3ca6 to d9b0c8a on November 9, 2024 02:50
@HaoranYi (Author) commented Nov 11, 2024

Performance comparison

mmap, 64K bins (pink)
mmap, 4K bins (orange)
file io, 64K bins (blue)

[chart: performance comparison]

@HaoranYi (Author) commented Nov 11, 2024

Summary

  • Smaller bins use more memory and are a bit slower in hash time, but are faster in drop time.
  • File I/O uses less memory; it takes a bit longer to compute the hash, but its saving on drop time is larger than mmap's, which makes it overall faster than mmap with the same number of bins.
  • File I/O puts less load on disk I/O.

@HaoranYi (Author) commented

rebase to pick up #3589

@HaoranYi force-pushed the cli_hash_bins branch 2 times, most recently from beb8a1f to f6c5c61 on November 15, 2024 15:41
@@ -1160,16 +1255,15 @@ impl AccountsHasher<'_> {
        let binner = PubkeyBinCalculator24::new(bins);

        // working_set hold the lowest items for each slot_group sorted by pubkey descending (min_key is the last)
-       let (mut working_set, max_inclusive_num_pubkeys) = Self::initialize_dedup_working_set(
+       let (mut working_set, _max_inclusive_num_pubkeys) = Self::initialize_dedup_working_set(
@HaoranYi (Author) commented

max_inclusive_num_pubkeys is an estimate of the upper bound for the hash file size. It is only required when we use mmap: creating an mmap requires specifying the initial size up front, which could be over-allocated. After switching to the file writer, we don't need this any more.

@HaoranYi marked this pull request as ready for review January 15, 2025 16:11
@@ -619,6 +540,180 @@ impl AccountsHasher<'_> {
(num_hashes_per_chunk, levels_hashed, three_level)
}

// This function is called at the top lover level to compute the merkle. It


This fn is copied from fn compute_merkle_root_from_slices<'b, F, T>(...). Do we still need the other fn?

@HaoranYi (Author) replied

Yes, we do.
compute_merkle_root_from_slices is still used when we compute the merkle tree at level 2 and above, where we already have all the data in memory.

The comments here may be helpful for understanding:

// This function is called at the top level to compute the merkle hash. It
// takes a closure that returns an owned vec of hash data at the leaf level
// of the merkle tree. The input data for this bottom level are read from a
// file. For non-leaf nodes, where the input data is already in memory, we
// will use `compute_merkle_root_from_slices`, which is a version that takes
// a borrowed slice of hash data instead.
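As a toy illustration of the shape described in that comment (hypothetical names and a 64-bit stand-in "hash", not the agave implementation): the leaf level pulls owned chunks through a closure that reads from a file, and once a level is fully in memory, a borrowed-slice reducer takes over.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

type H = u64; // toy stand-in for a 32-byte hash

fn combine(a: H, b: H) -> H {
    let mut state = DefaultHasher::new();
    (a, b).hash(&mut state);
    state.finish()
}

// Upper levels: the data is already in memory, so borrow it.
fn merkle_from_slice(level: &[H]) -> H {
    if level.len() <= 1 {
        return level.first().copied().unwrap_or_default();
    }
    let next: Vec<H> = level
        .chunks(2)
        .map(|pair| if pair.len() == 2 { combine(pair[0], pair[1]) } else { pair[0] })
        .collect();
    merkle_from_slice(&next)
}

// Leaf level: a closure yields owned chunks (e.g. freshly read from a file).
fn merkle_from_chunks<F>(total: usize, mut get_chunk: F) -> H
where
    F: FnMut(usize) -> Vec<H>, // returns owned hash data starting at an index
{
    let mut leaves = Vec::with_capacity(total);
    let mut index = 0;
    while index < total {
        let chunk = get_chunk(index); // the file read happens inside the closure
        if chunk.is_empty() {
            break; // defensive: avoid looping forever on a short file
        }
        index += chunk.len();
        leaves.extend(chunk);
    }
    merkle_from_slice(&leaves)
}
```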

.unwrap();

let mut result_bytes: Vec<u8> = vec![];
data.read_to_end(&mut result_bytes).unwrap();


Note that this allocates what could be a large Vec, depending on the hashing tuning parameters (num bins, etc.).


We have found in pop testing that large memory allocations can cause us to OOM.

@HaoranYi (Author) replied

Good point.

We could add a cap on how many bytes we load here. The downside is that we may need to call this function multiple times.

@HaoranYi (Author) replied

I committed a change to cap the file read buffer size to 64M.
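For reference, a minimal sketch of such a cap (hypothetical names; only the 64M figure comes from the PR discussion): each call reads at most the cap, and the caller loops, advancing the offset, until an empty buffer signals EOF.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

const MAX_BUFFER_SIZE: u64 = 64 * 1024 * 1024; // 64M cap per read

// Read at most MAX_BUFFER_SIZE bytes starting at `offset`. An empty Vec
// signals EOF, so callers loop, advancing `offset` by the returned length.
fn read_capped(file: &mut File, offset: u64) -> std::io::Result<Vec<u8>> {
    file.seek(SeekFrom::Start(offset))?;
    let mut buf = Vec::new();
    file.take(MAX_BUFFER_SIZE).read_to_end(&mut buf)?;
    Ok(buf)
}
```

The downside noted above follows directly: with the cap in place, a single logical read may now take several `read_capped` calls.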

@jeffwashington commented

I'm certainly happy with, and supportive of, the idea of the change here. What is the estimate for the lattice hash creation? Even then, we may need to use this code to create a full lattice hash from scratch initially, or for full accountsdb verification.


// initial fetch - could return entire slice
let data_bytes = get_hash_slice_starting_at_index(0);
let data: &[T] = bytemuck::cast_slice(&data_bytes);
@HaoranYi (Author) replied

@jeffwashington The caller casts the data to the generic type T here.

for _k in 0..end {
    if data_index >= data_len {
        // we exhausted our data, fetch next slice starting at i
        data_bytes = get_hash_slice_starting_at_index(i);
@HaoranYi (Author) replied

@jeffwashington With a cap on how many bytes we load each time, we may find that we need to load the data more times.

@brooksprumo left a comment

I still need to go through the merkle tree code.

Heh, trying to fit the file io impl into the existing mmap api is really clunky... That's probably the right choice though, as this code will go away after the accounts lt hash is activated.

@brooksprumo self-requested a review January 17, 2025 16:13
@HaoranYi (Author) commented

@jeffwashington, @brooksprumo I pushed several commits to address your reviews.
Can you take a look again?
FYI I have also restarted my node to test with the updated commits.

@brooksprumo left a comment

There are some suboptimal pieces here, but doing it "right" would be much more intrusive. At the same time, nothing here seems egregiously bad. So assuming the mnb test runs show good results, I'd be on board with the changes.

@HaoranYi (Author) commented

Performance Comparison

  • blue: this PR
  • red: canary-mc3

Result summary

  1. Much less time for "drop_hash" files. (mmap is expensive to drop)
  2. A small increase in "hash" calc time.
  3. Overall less time for "total_hash".
  4. Less total "used" memory too.
  5. More disk spike during hash calc.

Overall, this PR looks like a "win" on performance and resource usage.

[chart: hash_time drop_us]
[chart: hash_time hash_us]
[chart: hash_time total_us]
[chart: disk rw]
[chart: mem]

@brooksprumo commented

  1. Much less time for "drop_hash" files. (mmap is expensive to drop)

I still don't understand why this PR takes 5 seconds to drop the file io cache files. Do you have more info here?

  2. A small increase in "hash" calc time.
  3. Overall less time for "total_hash".

Lower overall is definitely the important one, esp if it's around one hundred seconds!

  4. Less total "used" memory too.

The memory usage doesn't look any different to me. Neither node is at steady state, so comparing absolute used bytes isn't particularly useful here. That said, the deltas on each seem about the same, so that's good. Especially since the current mmap version won't show up in used bytes at all.

  5. More disk spike during hash calc.

Which disk metric specifically is this? It is likely fine. Also, the canary nodes do vary quite widely even between each other.

Overall, this PR looks like a "win" on performance and resource usage.

Definitely agree 🚀🚀🚀

@brooksprumo left a comment

:shipit:

Code looks good to me. Overall this is a win. I'd love to get answers to my specific questions from above, too. Please get sign off from jwash/alessandro too before merging.

@HaoranYi (Author) commented Jan 25, 2025

  1. I believe the 5 seconds are spent deleting the temp directory and all the hash files, ~64K in total. We can probably use io_uring to speed up the deletion (see the sketch below).
  2. The disk metric is "time_io_ms", which includes both read and write time.
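Not the PR's code, and substituting plain threads for the suggested io_uring: a hedged sketch of overlapping the unlink syscalls across a small pool, which gets much of the same benefit for tens of thousands of small files.

```rust
use std::path::PathBuf;
use std::thread;

// Best-effort parallel deletion of many small files. Assumes num_threads > 0.
fn delete_files_parallel(files: Vec<PathBuf>, num_threads: usize) {
    let batch_size = files.len().div_ceil(num_threads).max(1);
    thread::scope(|scope| {
        for batch in files.chunks(batch_size) {
            scope.spawn(move || {
                for path in batch {
                    // ignore errors (e.g. a file already removed elsewhere)
                    let _ = std::fs::remove_file(path);
                }
            });
        }
    });
}
```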

@jeffwashington left a comment

lgtm

@HaoranYi merged commit cb62abf into anza-xyz:master Feb 4, 2025
47 checks passed
@HaoranYi deleted the cli_hash_bins branch February 4, 2025 15:21