Unthrottle repair requests #4485
Conversation
Haven't read the code but small nit - can you update the PR title to indicate that you're tuning the request side (as opposed to the serve side)?
First pass - looks great!
You have this
let repair_request = ShredRepairType::HighestShred(*slot, *received);
if let Entry::Vacant(entry) = outstanding_repairs.entry(repair_request) {
    entry.insert(timestamp());
    Some(repair_request)
} else {
    None
}
repeated in a bunch of places. Maybe you could introduce a struct OutstandingRepairRequests or something with request_highest_shred_if_needed(slot, i), request_orphan_if_needed(slot), etc. which return Option<ShredRepairType>, and then you can if let and filter_map on that? Not sure exactly what the best API would be, but it looks like it can be unspaghettified. That said, the PR is in the middle of a lot of spaghetti, so feel free to ignore this suggestion.
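Something along these lines, purely as a rough sketch - the enum variants, request_if_needed, and the timestamp helper below are illustrative stand-ins for the real codebase types, not their actual definitions:
use std::collections::{hash_map::Entry, HashMap};

// Illustrative stand-ins; the real ShredRepairType and timestamp() already
// exist in the codebase and their exact variants/signatures may differ.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum ShredRepairType {
    Shred(u64, u64),
    HighestShred(u64, u64),
    Orphan(u64),
}

fn timestamp() -> u64 {
    use std::time::{SystemTime, UNIX_EPOCH};
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64
}

/// Hypothetical wrapper owning the request -> insertion-time map, so the
/// "only request if not already outstanding" check lives in one place.
#[derive(Default)]
struct OutstandingRepairRequests {
    requests: HashMap<ShredRepairType, u64>,
}

impl OutstandingRepairRequests {
    fn request_if_needed(&mut self, request: ShredRepairType) -> Option<ShredRepairType> {
        // Same pattern as the snippet above: only emit the request if there is
        // no outstanding entry for it yet.
        if let Entry::Vacant(entry) = self.requests.entry(request) {
            entry.insert(timestamp());
            Some(request)
        } else {
            None
        }
    }

    fn request_highest_shred_if_needed(&mut self, slot: u64, index: u64) -> Option<ShredRepairType> {
        self.request_if_needed(ShredRepairType::HighestShred(slot, index))
    }

    fn request_orphan_if_needed(&mut self, slot: u64) -> Option<ShredRepairType> {
        self.request_if_needed(ShredRepairType::Orphan(slot))
    }
}
Call sites could then filter_map candidate slots/indices through these helpers instead of repeating the entry dance.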
Then we should probably add some tests that check that generate_whatever_repair(..., outstanding) called twice doesn't re-generate the same repair requests? And maybe a way to stub timestamp() so you can test expiration too?
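Reusing the hypothetical wrapper sketched above, the dedup check might look roughly like this (the real tests would drive the actual per-type generators against blockstore state):
#[test]
fn test_outstanding_repairs_are_not_regenerated() {
    let mut outstanding = OutstandingRepairRequests::default();

    // First pass over a missing shred should produce a request...
    assert!(outstanding.request_highest_shred_if_needed(42, 7).is_some());

    // ...but a second pass while that request is still outstanding should not.
    assert!(outstanding.request_highest_shred_if_needed(42, 7).is_none());

    // A different slot/index is unaffected by the dedup.
    assert!(outstanding.request_highest_shred_if_needed(42, 8).is_some());
}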
Added
Added coverage for this for each repair type.
Punted on this one for now. I don't think we really need to stub - we could afford to just wait the 100ms. The annoying part is that the logic is inside the main run loop, and spinning this up is heavy. The nice thing is we have some implicit coverage of the request timeout retry. How do I know? Because I messed up the polarity on a first draft and
2.1? 😋
FYI, this commit has significantly and redundantly increased outgoing repair requests during steady state (meaning the node is already in sync with the cluster, not in a catch-up period). I am also concerned about how this plays out during cluster-wide repair spikes if all nodes increase their repairs by this much. The left chunk is this code; the right is the commit before it.
Confirming I've seen much of the same thing. I'm actively running experiments to tune the repair delay. I suspect the increase is largely from decreasing the average effective delay from 250ms to 200ms. Behzad, you mentioned seeing increases in both redundant requests and receives. The receives part I understand. I haven't observed (and can't explain) a large uptick in redundant requests - only in overall, unnecessary requests. I say unnecessary because turbine ultimately delivers them before repair. Have you seen lots of duplicate repair requests?
Collecting some repair data for a node in SG on mainnet. Most data collection periods were ~1 hour (some were much longer):
Next step will be repeating experiments in another geo or 2. I suspect we can safely increase |
Problem
Repair is a large bottleneck during catch-up. We cannot feed shreds to replay fast enough, and so nodes tend to fall behind during this period. The main problem is the repair request loop, which will request at most 512 shreds and then sleep for 100ms.
During steady-state operation, this design also means we are on average ~50ms late in requesting repairs for late shreds.
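For context, the throttled request loop has roughly this shape (illustrative pseudocode; the constants and helper names below are stand-ins, not the actual repair service code):
const MAX_REPAIRS_PER_ITERATION: usize = 512; // at most 512 requests per pass
const REPAIR_LOOP_SLEEP_MS: u64 = 100;        // then sleep 100ms

fn repair_request_loop() {
    loop {
        // Scan for missing shreds and build at most MAX_REPAIRS_PER_ITERATION requests.
        let repairs = generate_repairs(MAX_REPAIRS_PER_ITERATION);
        send_repairs(repairs);

        // This fixed sleep is the throttle: catch-up is capped at 512 requests per
        // 100ms, and a shred detected as missing just after this point waits up to
        // 100ms (~50ms on average) before it can even be requested.
        std::thread::sleep(std::time::Duration::from_millis(REPAIR_LOOP_SLEEP_MS));
    }
}

// Placeholders for the real blockstore scan and network send.
fn generate_repairs(_max: usize) -> Vec<u64> { Vec::new() }
fn send_repairs(_repairs: Vec<u64>) {}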
Summary of Changes
- Removed the best_orphans map, as it is superseded by this new repair request map + a simple repair counter
Testing Notes
This metrics capture shows several things: