increase repair delays #4643

Merged: 1 commit merged into anza-xyz:master from repair_tuning on Feb 11, 2025

Conversation

@bw-solana commented Jan 27, 2025

Problem

After the changes from #4485, the repair rate is overly aggressive.

There are two main sources of change:

  1. The delay was always 200ms, but the repair loop only ran once every 100ms. This means that, on average, repairs would get triggered when shreds were ~250ms late (the 200ms threshold plus an average 50ms wait for the next loop iteration; see the quick calculation after this list).
  2. There is a bug where the reference tick is 0 for the final set of shreds, which effectively makes them eligible for repair immediately. This was likely partially masked by the 100ms repair period, because we would have had to receive a shred from this group (to establish the reference tick) and still have more than half of the shreds left to insert when the repair period came up. With more prompt repair triggering, this window is much easier to hit.
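
A quick back-of-the-envelope check of the first point, as a sketch; the 200ms/100ms figures come from the description above, and the uniform-arrival assumption is mine:

```rust
// With a 200ms defer threshold and a repair loop that wakes every 100ms, a
// shred becomes repair-eligible at a random point within the loop period, so
// on average it waits another ~50ms for the next loop iteration.
fn main() {
    let defer_threshold_ms = 200.0;
    let loop_period_ms = 100.0;
    let avg_extra_wait_ms = loop_period_ms / 2.0; // expected wait for the next tick
    let avg_lateness_ms = defer_threshold_ms + avg_extra_wait_ms;
    println!("average shred lateness when repair triggers: {avg_lateness_ms} ms"); // 250 ms
}
```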

The second problem will not be solved by this PR. There is a separate discussion to fix that portion.

Summary of Changes

  • Increase the repair delay (DEFER_REPAIR_THRESHOLD) from 200ms to 250ms
  • Increase the repair request timeout (REPAIR_REQUEST_TIMEOUT_MS) from 100ms to 150ms (see the sketch after this list)
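
For reference, a minimal sketch of the two tuned values as Rust constants; the names mirror the tuning parameters in the tables below, but the exact types and where they live in the repair code are assumptions:

```rust
use std::time::Duration;

// Sketch only: the real definitions in the repair service may differ in type
// and placement; the values reflect this PR's changes.

/// Extra time to wait past a shred's expected arrival before requesting repair
/// (raised from 200ms to 250ms).
pub const DEFER_REPAIR_THRESHOLD: Duration = Duration::from_millis(250);

/// Time to wait before re-requesting a shred whose repair request is still
/// outstanding (raised from 100ms to 150ms).
pub const REPAIR_REQUEST_TIMEOUT_MS: u64 = 150;
```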

Data

See this comment for data on repair tuning testing: #4485 (comment)

Collecting some repair data for a node in SG on mainnet. Most data collection periods were ~1 hour (some were much longer):

| DEFER_REPAIR_THRESHOLD (ms) | REPAIR_REQUEST_TIMEOUT_MS | Repairs per second (runtime) | Block full time (ms) | Successful repair % |
| --- | --- | --- | --- | --- |
| 200 | 100 | 110 | 475 | 2.5 |
| 250 | 100 | 68 | 475 | 2.7 |
| 300 | 100 | 52 | 485 | 2.9 |
| 300 | 150 | 52 | 490 | 3.2 |
| 250 | 150 | 58 | 475 | 2.95 |
| 200 | 150 | 65 | 475 | 2.5 |


Collecting some repair data for a node in HK on mainnet. Most data collection periods were ~1 hour (some were much longer):

| DEFER_REPAIR_THRESHOLD (ms) | REPAIR_REQUEST_TIMEOUT_MS | Repairs per second (runtime) | Block full time (ms) | Successful repair % |
| --- | --- | --- | --- | --- |
| 200 | 100 | 110 | 453 | 1.95 |
| 250 | 100 | 83 | 456 | 1.95 |
| 250 | 150 | 83 | 456 | 1.9 |
| 200 | 150 | 117 | 453 | 2.0 |
| 300 | 200 | 59 | 465 | 2.5 |
| 500 | 200 | 17 | 500 | 3.0 |

Here's a common sequence seen when running with 200ms delay:
```
[2025-01-26T19:04:28.080781179Z ERROR solana_core::repair::repair_service] #BW: slot 316547008 index 952 shred repair
[2025-01-26T19:04:28.116726260Z ERROR solana_ledger::shred] #BW: slot 316547008 index 952 fetch - repair = false
[2025-01-26T19:04:28.116803935Z ERROR solana_ledger::blockstore] #BW: slot 316547008 index 952 insert turbine
[2025-01-26T19:04:28.299676948Z ERROR solana_ledger::shred] #BW: slot 316547008 index 952 fetch - repair = true
[2025-01-26T19:04:28.299764151Z ERROR solana_ledger::blockstore] #BW: slot 316547008 index 952 insert repair exists
```

Two things to note:

  1. The repair was triggered ~36ms before the turbine shred arrived. With the 250ms delay, we wouldn't have requested this repair.
  2. The repair request took ~220ms to arrive after the request was made. This highlights how long these can take and why we shouldn't be overly aggressive in retrying. Additionally, only a handful of repairs generally need to land to have enough shreds to perform erasure recovery.

@bw-solana marked this pull request as ready for review on January 27, 2025 16:44
@alexpyattaev

To add my 50c here: maybe the right approach is to auto-tune the repair timeout? A one-size-fits-all approach may not suit all validators under all cluster load conditions. Instead, we can keep track of the last 5-10 seconds of repair and, if it does not go well, adjust the timeouts accordingly. I.e., if too many repair shreds end up getting delivered by turbine anyway, we bump the repair timeout by 5 ms until the percentage of useless repairs is sufficiently low. Similarly, if too many repair requests time out before delivery, we bump up the repair timeout. If neither of the two happens, we gradually decrease both timeouts by e.g. 1 ms every 10 seconds. This way both parameters will consistently float near their optimal values. Of course, the ranges would need to be constrained to avoid runaway effects.
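
A rough sketch of the feedback loop described above; this is not part of the PR, and the thresholds, step sizes, and bounds are illustrative assumptions:

```rust
use std::time::Duration;

// Illustrative auto-tuner: adjust the repair delay and request timeout based
// on how the last observation window went. All numbers here are placeholders.
struct RepairTimeoutTuner {
    defer_repair_threshold: Duration,
    repair_request_timeout: Duration,
}

impl RepairTimeoutTuner {
    /// Call once per observation window (e.g. every 10 seconds) with the
    /// fraction of repaired shreds that turbine also delivered ("useless"
    /// repairs) and the fraction of repair requests that timed out.
    fn adjust(&mut self, useless_repair_ratio: f64, request_timeout_ratio: f64) {
        const STEP_UP: Duration = Duration::from_millis(5);
        const STEP_DOWN: Duration = Duration::from_millis(1);
        const MIN: Duration = Duration::from_millis(100);
        const MAX: Duration = Duration::from_millis(500);

        if useless_repair_ratio > 0.05 {
            // Too many repairs arrive via turbine anyway: wait longer before repairing.
            self.defer_repair_threshold = (self.defer_repair_threshold + STEP_UP).min(MAX);
        } else if request_timeout_ratio > 0.05 {
            // Too many requests never return in time: give responses more time.
            self.repair_request_timeout = (self.repair_request_timeout + STEP_UP).min(MAX);
        } else {
            // Everything looks healthy: slowly probe back down toward lower latency.
            self.defer_repair_threshold =
                self.defer_repair_threshold.saturating_sub(STEP_DOWN).max(MIN);
            self.repair_request_timeout =
                self.repair_request_timeout.saturating_sub(STEP_DOWN).max(MIN);
        }
    }
}
```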

@bw-solana (Author)

> To add my 50c here: maybe the right approach is to auto-tune the repair timeout? A one-size-fits-all approach may not suit all validators under all cluster load conditions. [...]

I agree. Dynamic values are obviously a little trickier and (more importantly) can be harder to reason about, but load and geo location impact the "right" value.

As far as the control mechanism goes, the tricky part is going to be defining what counts as "good" and "bad" behavior. I don't think we can just use the redundant repair rate. I think we want to optimize for (1) block delivery times and (2) the fewest repair requests.

Repair success rate generally floats around 2-3% because the most common case is requiring just a single shred via repair to reach the recover threshold for an erasure batch.

As a thought experiment, when we detect we're missing 40/64 shreds and 20/32 data shreds for some erasure batch, what should we do? We may currently request repairs for all 20 data shreds. At best (from a repair success rate perspective) only 8/20 will be successful because at this point we will reach recover threshold. Should we only request 8? This seems like a bad tradeoff for block times because waiting for the slowest out of 8 is going to be worse than waiting for the fastest 8 out of 20.
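
The arithmetic behind that thought experiment, written out as a sketch (the batch geometry of 32 data + 32 coding shreds, with any 32 of 64 sufficient for recovery, is the standard setup; the 40/64 and 20/32 figures are from the comment above):

```rust
fn main() {
    let batch_size = 64u32;         // 32 data + 32 coding shreds
    let recovery_threshold = 32u32; // any 32 of the 64 shreds allow recovery
    let missing_total = 40u32;      // shreds not yet received in the batch
    let missing_data = 20u32;       // data shreds not yet received

    let received = batch_size - missing_total;        // 24 shreds on hand
    let still_needed = recovery_threshold - received; // 8 more, of any kind

    // Requesting all 20 missing data shreds means at most 8 of those requests
    // can count as "successful" before recovery kicks in; the rest become
    // redundant, which is why raw repair success rate is a poor objective.
    println!("received {received}, need {still_needed} more, requested {missing_data}");
}
```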

In any event, I believe increasing these values will reduce unnecessary repairs by quite a bit in the short term and provide some breathing room to work on the right longer-term design.


@behzadnouri left a comment


lgtm assuming this is well tested.

@bw-solana (Author)

> lgtm assuming this is well tested.

Collected multiple days of data from 2 different nodes on mainnet.

That said, I will definitely need to monitor metrics on testnet as this code is adopted, to make sure block times don't increase and repair rates decline as expected.

@bw-solana merged commit d3e1514 into anza-xyz:master on Feb 11, 2025
66 checks passed
@bw-solana deleted the repair_tuning branch on February 11, 2025 17:12
@alexpyattaev

> As a thought experiment, when we detect we're missing 40/64 shreds and 20/32 data shreds for some erasure batch, what should we do? [...]

I have a few ideas for how this problem could potentially be addressed. Fundamentally, this is a variation of the multi-armed bandit problem, where we are trying to figure out the optimal strategy by observing the responses to our actions in the presence of noise. There are many agents that can work very well under these conditions, assuming we can figure out the reward metric we want to maximize. I will need to build some primitive simulations to explore and test, but generally these things are a fairly well-studied subject. If you think it is worth investing time into, of course.
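
As a sketch of that framing, here is a toy UCB1 bandit over a handful of candidate timeouts; the candidate arms and the reward definition are assumptions, not anything from the PR or the repair code:

```rust
// Toy UCB1 bandit: each "arm" is a candidate repair timeout, and each
// observation window reports a reward (e.g. some blend of block-completion
// time and redundant-repair rate, per the discussion above).
struct TimeoutBandit {
    arms_ms: Vec<u64>,     // candidate timeouts, e.g. [100, 150, 200, 250]
    pulls: Vec<u64>,       // how many windows each arm has been tried
    mean_reward: Vec<f64>, // running average reward per arm
}

impl TimeoutBandit {
    fn new(arms_ms: Vec<u64>) -> Self {
        let n = arms_ms.len();
        Self { arms_ms, pulls: vec![0; n], mean_reward: vec![0.0; n] }
    }

    /// Pick the timeout for the next window: exploit arms that have done well,
    /// but keep exploring arms that have rarely been tried.
    fn select(&self) -> u64 {
        let total: u64 = self.pulls.iter().sum();
        let best = (0..self.arms_ms.len())
            .max_by(|&a, &b| self.ucb(a, total).total_cmp(&self.ucb(b, total)))
            .expect("at least one arm");
        self.arms_ms[best]
    }

    fn ucb(&self, arm: usize, total: u64) -> f64 {
        if self.pulls[arm] == 0 {
            return f64::INFINITY; // try every arm at least once
        }
        let bonus = (2.0 * (total as f64).ln() / self.pulls[arm] as f64).sqrt();
        self.mean_reward[arm] + bonus
    }

    /// Record the reward observed while running with `timeout_ms`.
    fn update(&mut self, timeout_ms: u64, reward: f64) {
        if let Some(arm) = self.arms_ms.iter().position(|&t| t == timeout_ms) {
            self.pulls[arm] += 1;
            self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.pulls[arm] as f64;
        }
    }
}
```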
