Replace mmap with file io in merkle tree hash calculation #3547
Conversation
rebase to pick up #3589
@@ -1160,16 +1255,15 @@ impl AccountsHasher<'_> {
let binner = PubkeyBinCalculator24::new(bins);

// working_set hold the lowest items for each slot_group sorted by pubkey descending (min_key is the last)
let (mut working_set, max_inclusive_num_pubkeys) = Self::initialize_dedup_working_set(
let (mut working_set, _max_inclusive_num_pubkeys) = Self::initialize_dedup_working_set(
max_inclusive_num_pubkeys is an estimate of the upper bound for the hash file size. It is only required when we use mmap: creating an mmap requires specifying the initial size, which could be over-allocated. After switching to the file writer, we don't need this anymore.
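For context, a file-backed mmap has to be sized before it is mapped, which is why the estimate was needed in the first place. A minimal sketch of that constraint, assuming the memmap2 crate (the helper name, path, and sizing are illustrative, not the PR's code):

```rust
use memmap2::MmapMut;
use std::fs::OpenOptions;

// Hypothetical helper: create a file-backed mmap for hash data. The backing
// file must be sized up front, so we have to commit to an upper-bound
// estimate (e.g. derived from max_inclusive_num_pubkeys), possibly
// over-allocating. A plain file writer has no such requirement.
fn create_hash_mmap(path: &std::path::Path, estimated_bytes: u64) -> std::io::Result<MmapMut> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    file.set_len(estimated_bytes)?; // size must be known (or guessed) here
    // SAFETY: assumes no other process resizes or mutates the file while mapped.
    unsafe { MmapMut::map_mut(&file) }
}
```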
accounts-db/src/accounts_hash.rs
Outdated
@@ -619,6 +540,180 @@ impl AccountsHasher<'_> {
(num_hashes_per_chunk, levels_hashed, three_level)
}

// This function is called at the top lover level to compute the merkle. It
This fn is copied from fn compute_merkle_root_from_slices<'b, F, T>. Do we still need the other fn?
Yes, we do.
compute_merkle_root_from_slices is still used when we compute the merkle tree at level 2 and above, where all the data is already in memory.
The comments here may be helpful for understanding:
// This function is called at the top level to compute the merkle hash. It
// takes a closure that returns an owned vec of hash data at the leaf level
// of the merkle tree. The input data for this bottom level are read from a
// file. For non-leaf nodes, where the input data is already in memory, we
// will use `compute_merkle_root_from_slices`, which is a version that takes
// a borrowed slice of hash data instead.
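For readers following along, here is a minimal sketch of the shape of this leaf-level helper. The function name, FANOUT value, and use of solana_sdk::hash are illustrative assumptions, not the actual implementation; only the closure-based fetching mirrors the PR:

```rust
use solana_sdk::hash::{hashv, Hash};

const FANOUT: usize = 16; // illustrative fanout; the real code uses its own constant

/// Sketch: build the parent level above the leaves from hash data that lives
/// in a file. The closure returns an *owned* Vec of hashes starting at the
/// given global index (it may return only part of the remaining data, in
/// which case we call it again), so the whole leaf level never has to be
/// resident in memory at once.
fn compute_leaf_level_from_file(
    total_hashes: usize,
    mut get_hash_slice_starting_at_index: impl FnMut(usize) -> Vec<Hash>,
) -> Vec<Hash> {
    let mut parents = Vec::with_capacity(total_hashes.div_ceil(FANOUT));
    let mut buf: Vec<Hash> = Vec::new(); // current owned chunk read from the file
    let mut buf_start = 0; // global index of buf[0]

    let mut i = 0;
    while i < total_hashes {
        let group_end = (i + FANOUT).min(total_hashes);
        let mut group: Vec<Hash> = Vec::with_capacity(FANOUT);
        while i < group_end {
            if i - buf_start >= buf.len() {
                // exhausted the current chunk; fetch the next one from the file
                buf = get_hash_slice_starting_at_index(i);
                buf_start = i;
            }
            group.push(buf[i - buf_start]);
            i += 1;
        }
        let refs: Vec<&[u8]> = group.iter().map(|h| h.as_ref()).collect();
        parents.push(hashv(&refs));
    }
    // the parent level is now fully in memory; the real code hands the
    // remaining levels to `compute_merkle_root_from_slices`
    parents
}
```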
accounts-db/src/accounts_hash.rs
Outdated
.unwrap();

let mut result_bytes: Vec<u8> = vec![];
data.read_to_end(&mut result_bytes).unwrap();
Note that this allocates what could be a large Vec. The size depends on hashing tuning parameters (number of bins, etc.).
We have found in pop testing that large memory allocations can cause us to OOM.
Good point.
We could add a cap on how many bytes we load here. The downside is that we may need to call this function multiple times.
I committed a change to cap the file read buffer size to 64M.
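Roughly, the capped read looks like the following sketch (the helper name and the exact constant are illustrative; the real plumbing lives in accounts_hash.rs):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Illustrative cap: read at most 64 MiB of hash data per call instead of
/// reading the whole file into a single Vec.
const MAX_BUFFER_SIZE: usize = 64 * 1024 * 1024;

/// Sketch: read up to `MAX_BUFFER_SIZE` bytes starting at `byte_offset`.
/// Callers that need more data call again with a larger offset.
fn read_hash_bytes_capped(file: &mut File, byte_offset: u64) -> std::io::Result<Vec<u8>> {
    file.seek(SeekFrom::Start(byte_offset))?;
    let mut buf = vec![0u8; MAX_BUFFER_SIZE];
    let mut total = 0;
    // `read` may return short counts, so loop until the buffer is full or EOF
    while total < buf.len() {
        let n = file.read(&mut buf[total..])?;
        if n == 0 {
            break; // EOF
        }
        total += n;
    }
    buf.truncate(total);
    Ok(buf)
}
```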
I'm certainly happy with and supportive of the idea of the change here. What is the estimate for the lattice hash creation? Even then, we may need to use this code to create a full lattice hash from scratch initially, or for full accountsdb verification.
accounts-db/src/accounts_hash.rs
Outdated
// initial fetch - could return entire slice
let data_bytes = get_hash_slice_starting_at_index(0);
let data: &[T] = bytemuck::cast_slice(&data_bytes);
@jeffwashington The caller casts the data to the generic type T here.
accounts-db/src/accounts_hash.rs
Outdated
for _k in 0..end {
    if data_index >= data_len {
        // we exhausted our data, fetch next slice starting at i
        data_bytes = get_hash_slice_starting_at_index(i);
@jeffwashington With a cap on how many bytes we load each time, we may find that we need to load from the file more times.
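The refetch pattern in this hunk boils down to something like the following simplified sketch (everything except get_hash_slice_starting_at_index is a made-up name for illustration, not the PR's code):

```rust
// Simplified sketch of the consumer side: hold the current owned byte buffer,
// view it as `&[T]`, and refetch from the file whenever it is exhausted.
// Assumes `T` has alignment 1 (e.g. a 32-byte hash), so `cast_slice` cannot
// panic on alignment.
fn for_each_hash<T: bytemuck::Pod>(
    total: usize,
    mut get_hash_slice_starting_at_index: impl FnMut(usize) -> Vec<u8>,
    mut visit: impl FnMut(&T),
) {
    // initial fetch - with the read cap this may cover only part of the file
    let mut data_bytes = get_hash_slice_starting_at_index(0);
    let mut data_index = 0; // index into the current `&[T]` view
    for i in 0..total {
        let mut data: &[T] = bytemuck::cast_slice(&data_bytes);
        if data_index >= data.len() {
            // we exhausted the current buffer; fetch the next slice starting at i
            data_bytes = get_hash_slice_starting_at_index(i);
            data = bytemuck::cast_slice(&data_bytes);
            data_index = 0;
        }
        visit(&data[data_index]);
        data_index += 1;
    }
}
```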
I still need to go through the merkle tree code.
Heh, trying to fit the file io impl into the existing mmap api is really clunky... That's probably the right choice though, as this code will go away after the accounts lt hash is activated.
@jeffwashington, @brooksprumo I pushed several commits to address your reviews.
There are some suboptimal pieces here, but doing it "right" would be much more intrusive. At the same time, nothing here seems egregiously bad. So assuming the mnb test runs show good results, I'd be on board with the changes.
Performance Comparison
Result summary
Overall, this PR looks like a "win" on performance and resource usage.
[Comparison table of hash_time: drop_us, hash_time: hash_us, hash_time: total_us, disk: rw, and mem]
I still don't understand why this PR takes 5 seconds to drop the file io cache files. Do you have more info here?
Lower overall is definitely the important one, especially if it's around one hundred seconds!
The memory usage doesn't look any different to me. Neither node is at steady state, so comparing absolute used bytes isn't particularly useful here. That said, the deltas on each seem about the same, so that's good. Especially since the current mmap version won't show up in used bytes at all.
Which disk metric specifically is this? It is likely fine. Also, the canary nodes vary quite widely between each other.
Definitely agree 🚀🚀🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good to me. Overall this is a win. I'd love to get answers to my specific questions from above, too. Please get sign off from jwash/alessandro too before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Problem
We have noticed that the performance of the block-producing process degrades during hash calculation. Part of this is due to the stress that the accounts hash threads put on memory and disk I/O.
When we compute the merkle tree hash, we use mmap to store the extracted accounts' hashes. Mmap is heavy on resources such as memory and disk I/O, and puts stress on the whole system.
In this PR, we propose switching to file I/O, which is less resource intensive, for the merkle tree hash computation.
Studies on mainnet with this PR show that file I/O uses less memory and puts less stress on disk I/O. The "pure" hash computation time with file I/O is a little longer than with mmap, but we also avoid the mmap drop time, and that saving more than offsets the extra time spent on hash calculation. Overall, the total time to compute the hash is smaller.
Note that there is an upcoming lattice hash feature, which will be the ultimate solution for hashing, i.e. it removes the merkle tree hash calculation entirely. However, before that feature is activated, this PR serves as an interim enhancement for merkle tree hash computation.
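At a high level, the change replaces the up-front-sized mmap scratch buffer with an ordinary scratch file written through a buffered writer and read back in bounded chunks. A simplified illustration of the write side (names and types are illustrative, not the actual accounts-db code):

```rust
use std::fs::{File, OpenOptions};
use std::io::{BufWriter, Write};
use std::path::Path;

/// Sketch: stream extracted 32-byte account hashes into a scratch file.
/// Unlike an mmap-backed buffer, the file grows as it is written, so no
/// size estimate or over-allocation is needed.
fn write_hashes_to_scratch_file(path: &Path, hashes: &[[u8; 32]]) -> std::io::Result<File> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    let mut writer = BufWriter::new(file);
    for hash in hashes {
        writer.write_all(hash)?;
    }
    // flush buffered data and return the file for the read/merkle phase
    writer.into_inner().map_err(|e| e.into_error())
}
```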
Summary of Changes
Fixes #