increase repair delays #4643
Conversation
Force-pushed from 126f366 to 87d52ce
To add 50c here: maybe the right approach is to auto-tune the repair timeout? A one-size-fits-all approach may not suit all validators under all cluster load conditions. Instead, we can keep track of the last 5-10 seconds of repair activity and, if it does not go well, adjust the timeouts accordingly. I.e., if too many repair shreds end up also getting delivered by turbine, we bump the repair timeout by 5 ms until the percentage of useless repairs is sufficiently low. Similarly, if too many repair requests time out before delivery, we bump up the repair timeout. If neither happens, we gradually decrease both timeouts by e.g. 1 ms every 10 seconds. This way both parameters will consistently float near their optimal values. Of course, ranges would need to be constrained to avoid runaway effects.
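A minimal sketch of that feedback loop, assuming illustrative thresholds (10% bad outcomes, 5 ms up / 1 ms down steps) and a hypothetical `RepairTimeoutTuner` type; this is not code from this PR:

```rust
use std::time::Duration;

struct RepairTimeoutTuner {
    timeout: Duration,
    min_timeout: Duration,
    max_timeout: Duration,
}

impl RepairTimeoutTuner {
    /// `useless_ratio`: fraction of repaired shreds that turbine had already
    /// delivered; `timed_out_ratio`: fraction of repair requests that expired
    /// before any response arrived. Both measured over the last few seconds.
    fn update(&mut self, useless_ratio: f64, timed_out_ratio: f64) {
        if useless_ratio > 0.10 || timed_out_ratio > 0.10 {
            // Too many wasted or expired repairs: back off by 5 ms.
            self.timeout += Duration::from_millis(5);
        } else {
            // Healthy window: creep back down by 1 ms.
            self.timeout = self.timeout.saturating_sub(Duration::from_millis(1));
        }
        // Constrain the range to avoid runaway effects.
        self.timeout = self.timeout.clamp(self.min_timeout, self.max_timeout);
    }
}

fn main() {
    let mut tuner = RepairTimeoutTuner {
        timeout: Duration::from_millis(200),
        min_timeout: Duration::from_millis(100),
        max_timeout: Duration::from_millis(400),
    };
    // e.g. 15% of repaired shreds also arrived via turbine -> back off.
    tuner.update(0.15, 0.02);
    assert_eq!(tuner.timeout, Duration::from_millis(205));
}
```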
I agree. Dynamic values are obviously a little trickier and (more importantly) can be harder to reason about, but load and geolocation impact the "right" value. As for the control mechanism, the tricky part is going to be defining what counts as "good" and "bad" behavior. I don't think we can just use the redundant repair rate; I think we want to optimize for (1) block delivery times and (2) fewest repair requests. Repair success rate generally floats around 2-3% because the most common case is needing just a single shred via repair to reach the recovery threshold for an erasure batch.

As a thought experiment: when we detect we're missing 40/64 shreds and 20/32 data shreds for some erasure batch, what should we do? We may currently request repairs for all 20 data shreds. At best (from a repair success rate perspective) only 8/20 will be successful, because at that point we will have reached the recovery threshold. Should we only request 8? That seems like a bad tradeoff for block times, because waiting for the slowest of 8 is going to be worse than waiting for the fastest 8 out of 20.

In any event, I believe increasing these values will reduce unnecessary repairs by quite a bit in the short term and provide some breathing room to work on the right longer-term design.
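For reference, the arithmetic in the thought experiment above, as a small standalone sketch (the function and its counts are illustrative, not the actual blockstore API):

```rust
// Reed-Solomon recovery needs `num_data` shreds total (data or coding) out of
// the erasure batch, so the minimum useful number of repairs is the gap to
// that threshold, not the number of missing data shreds.
fn min_repairs_needed(num_data: usize, _num_coding: usize, received: usize) -> usize {
    num_data.saturating_sub(received)
}

fn main() {
    // Batch of 32 data + 32 coding shreds; missing 40/64 overall => 24 received.
    let needed = min_repairs_needed(32, 32, 24);
    assert_eq!(needed, 8); // only 8 of the 20 requested data-shred repairs can "succeed"
    println!("minimum repairs to reach recovery threshold: {needed}");
}
```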
lgtm assuming this is well tested.
Collected multiple days of data from 2 different nodes on mainnet. That said, I will definitely need to monitor metrics on testnet as this code is adopted to make sure block times don't increase and that repair rates decline as expected.
I have a few ideas for how this problem could potentially be addressed. Fundamentally, this is a variation of the multi-armed bandit problem, where we are trying to figure out the optimal strategy by observing responses to our actions in the presence of noise. There are many agents that can work well under these conditions, assuming we can figure out the reward metric we want to maximize. We would need to build some primitive simulations to explore and test, but generally this is a fairly well-studied subject. If you think it is worth investing time into, of course.
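A very rough sketch of framing timeout selection as a bandit, assuming an epsilon-greedy policy over a few candidate timeouts; the reward function, arm set, and RNG here are placeholders, not a proposal from this PR:

```rust
struct BanditArm {
    timeout_ms: u64,
    pulls: u64,
    mean_reward: f64,
}

struct EpsilonGreedy {
    arms: Vec<BanditArm>,
    epsilon: f64,
    rng_state: u64, // tiny xorshift state, must be non-zero; good enough for a sketch
}

impl EpsilonGreedy {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// Pick a timeout: explore with probability epsilon, otherwise exploit
    /// the arm with the best observed mean reward.
    fn choose(&mut self) -> usize {
        let r = (self.next_u64() % 1_000) as f64 / 1_000.0;
        if r < self.epsilon {
            (self.next_u64() as usize) % self.arms.len()
        } else {
            self.arms
                .iter()
                .enumerate()
                .max_by(|a, b| a.1.mean_reward.partial_cmp(&b.1.mean_reward).unwrap())
                .map(|(i, _)| i)
                .unwrap()
        }
    }

    /// Fold an observed reward into the chosen arm's running mean.
    fn observe(&mut self, arm: usize, reward: f64) {
        let a = &mut self.arms[arm];
        a.pulls += 1;
        a.mean_reward += (reward - a.mean_reward) / a.pulls as f64;
    }
}

fn main() {
    let mut agent = EpsilonGreedy {
        arms: vec![100u64, 200, 300]
            .into_iter()
            .map(|timeout_ms| BanditArm { timeout_ms, pulls: 0, mean_reward: 0.0 })
            .collect(),
        epsilon: 0.1,
        rng_state: 0x9E3779B97F4A7C15,
    };
    let arm = agent.choose();
    // The reward would come from whatever metric we settle on, e.g. a blend of
    // block delivery time and useless-repair rate.
    agent.observe(arm, 1.0);
    println!("picked timeout {} ms", agent.arms[arm].timeout_ms);
}
```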
Problem
After the changes from #4485, the repair rate is overly aggressive.
There are two main sources of change:
The second problem will not be solved by this PR. There is a separate discussion to fix that portion.
Summary of Changes
Data
See this comment for data on repair tuning testing: #4485 (comment)
Collecting some repair data for a node in SG on mainnet. Most data collection periods were ~1 hour (some were much longer):
Collecting some repair data for a node in HK on mainnet. Most data collection periods were ~1 hour (some were much longer):
Here's a common sequence seen when running with 200ms delay:
[2025-01-26T19:04:28.080781179Z ERROR solana_core::repair::repair_service] #BW: slot 316547008 index 952 shred repair
[2025-01-26T19:04:28.116726260Z ERROR solana_ledger::shred] #BW: slot 316547008 index 952 fetch - repair = false
[2025-01-26T19:04:28.116803935Z ERROR solana_ledger::blockstore] #BW: slot 316547008 index 952 insert turbine
[2025-01-26T19:04:28.299676948Z ERROR solana_ledger::shred] #BW: slot 316547008 index 952 fetch - repair = true
[2025-01-26T19:04:28.299764151Z ERROR solana_ledger::blockstore] #BW: slot 316547008 index 952 insert repair exists
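In the sequence above, the shred arrives via turbine roughly 36 ms after the repair request is sent, and the repaired copy that lands about 180 ms later is already present. A hedged sketch of the kind of gating the 200 ms delay implies (the constant name and surrounding logic are assumed for illustration, not copied from the repair service):

```rust
use std::time::{Duration, Instant};

// Only issue a repair request for a missing shred once it has been outstanding
// longer than the defer threshold, giving turbine a chance to deliver it first.
const DEFER_REPAIR_THRESHOLD: Duration = Duration::from_millis(200);

fn should_request_repair(first_seen_missing: Instant, now: Instant) -> bool {
    now.duration_since(first_seen_missing) >= DEFER_REPAIR_THRESHOLD
}

fn main() {
    let first_seen_missing = Instant::now();
    // ... later, in the repair loop:
    let request_now = should_request_repair(first_seen_missing, Instant::now());
    println!("request repair now? {request_now}");
}
```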
Two things to note: