increase repair delays #4643

Merged: 1 commit merged into anza-xyz:master from repair_tuning on Feb 11, 2025

Conversation

@bw-solana commented Jan 27, 2025

Problem

After the changes from #4485, the repair rate is overly aggressive.

There are two main sources of change:

  1. The delay was always 200ms, but the repair loop only ran once every 100ms. This means that, on average, repairs would get triggered when shreds were ~250ms late (the 200ms threshold plus an average 50ms wait for the next loop iteration; see the quick calculation after this list).
  2. There is a bug where the reference tick is 0 for the final set of shreds, which effectively makes them eligible for repair immediately. This was likely partially masked by the 100ms repair period, because we would have had to receive a shred from this group (to establish the reference tick) and still have more than half of the shreds left to insert when the repair period came up. With more prompt repair triggering, this window is much easier to hit.
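
A quick back-of-the-envelope check of the first point, as a sketch; the 200ms/100ms figures come from the description above, and the uniform-arrival assumption is mine:

```rust
// With a 200ms defer threshold and a repair loop that wakes every 100ms, a
// shred becomes repair-eligible at a random point within the loop period, so
// on average it waits another ~50ms for the next loop iteration.
fn main() {
    let defer_threshold_ms = 200.0;
    let loop_period_ms = 100.0;
    let avg_extra_wait_ms = loop_period_ms / 2.0; // expected wait for the next tick
    let avg_lateness_ms = defer_threshold_ms + avg_extra_wait_ms;
    println!("average shred lateness when repair triggers: {avg_lateness_ms} ms"); // 250 ms
}
```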

The second problem will not be solved by this PR. There is a separate discussion to fix that portion.

Summary of Changes

  • Increase the repair delay (DEFER_REPAIR_THRESHOLD) from 200ms to 250ms
  • Increase the repair request timeout (REPAIR_REQUEST_TIMEOUT_MS) from 100ms to 150ms (see the sketch after this list)
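
For reference, a minimal sketch of the two tuned values as Rust constants; the names mirror the tuning parameters in the tables below, but the exact types and where they live in the repair code are assumptions:

```rust
use std::time::Duration;

// Sketch only: the real definitions in the repair service may differ in type
// and placement; the values reflect this PR's changes.

/// Extra time to wait past a shred's expected arrival before requesting repair
/// (raised from 200ms to 250ms).
pub const DEFER_REPAIR_THRESHOLD: Duration = Duration::from_millis(250);

/// Time to wait before re-requesting a shred whose repair request is still
/// outstanding (raised from 100ms to 150ms).
pub const REPAIR_REQUEST_TIMEOUT_MS: u64 = 150;
```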

Data

See this comment for data on repair tuning testing: #4485 (comment)

Collecting some repair data for a node in SG on mainnet. Most data collection periods were ~1 hour (some were much longer):

| DEFER_REPAIR_THRESHOLD (ms) | REPAIR_REQUEST_TIMEOUT_MS | Repairs per second (runtime) | Block full time (ms) | Successful repair % |
| --- | --- | --- | --- | --- |
| 200 | 100 | 110 | 475 | 2.5 |
| 250 | 100 | 68 | 475 | 2.7 |
| 300 | 100 | 52 | 485 | 2.9 |
| 300 | 150 | 52 | 490 | 3.2 |
| 250 | 150 | 58 | 475 | 2.95 |
| 200 | 150 | 65 | 475 | 2.5 |


Collecting some repair data for a node in HK on mainnet. Most data collection periods were ~1 hour (some were much longer):

| DEFER_REPAIR_THRESHOLD (ms) | REPAIR_REQUEST_TIMEOUT_MS | Repairs per second (runtime) | Block full time (ms) | Successful repair % |
| --- | --- | --- | --- | --- |
| 200 | 100 | 110 | 453 | 1.95 |
| 250 | 100 | 83 | 456 | 1.95 |
| 250 | 150 | 83 | 456 | 1.9 |
| 200 | 150 | 117 | 453 | 2.0 |
| 300 | 200 | 59 | 465 | 2.5 |
| 500 | 200 | 17 | 500 | 3.0 |

Here's a common sequence seen when running with 200ms delay:
```
[2025-01-26T19:04:28.080781179Z ERROR solana_core::repair::repair_service] #BW: slot 316547008 index 952 shred repair
[2025-01-26T19:04:28.116726260Z ERROR solana_ledger::shred] #BW: slot 316547008 index 952 fetch - repair = false
[2025-01-26T19:04:28.116803935Z ERROR solana_ledger::blockstore] #BW: slot 316547008 index 952 insert turbine
[2025-01-26T19:04:28.299676948Z ERROR solana_ledger::shred] #BW: slot 316547008 index 952 fetch - repair = true
[2025-01-26T19:04:28.299764151Z ERROR solana_ledger::blockstore] #BW: slot 316547008 index 952 insert repair exists
```

Two things to note:

  1. The repair was triggered ~36ms before the turbine shred arrived. With the 250ms delay, we wouldn't have requested this repair.
  2. The repair request took ~220ms to arrive after the request was made. This highlights how long these can take and why we shouldn't be overly aggressive in retrying. Additionally, only a handful of repairs generally need to land to have enough shreds to perform erasure recovery.

@bw-solana marked this pull request as ready for review on January 27, 2025 16:44
@alexpyattaev

To add my 50c here: maybe the right approach is to auto-tune the repair timeout? A one-size-fits-all approach may not suit all validators under all cluster load conditions. Instead, we can keep track of the last 5-10 seconds of repair and, if it does not go well, adjust the timeouts accordingly. I.e., if too many repair shreds end up getting delivered by turbine anyway, we bump the repair timeout by 5 ms until the percentage of useless repairs is sufficiently low. Similarly, if too many repair requests time out before delivery, we bump up the repair timeout. If neither of the two happens, we gradually decrease both timeouts by e.g. 1 ms every 10 seconds. This way both parameters will consistently float near their optimal values. Of course, the ranges would need to be constrained to avoid runaway effects.
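
A rough sketch of the feedback loop described above; this is not part of the PR, and the thresholds, step sizes, and bounds are illustrative assumptions:

```rust
use std::time::Duration;

// Illustrative auto-tuner: adjust the repair delay and request timeout based
// on how the last observation window went. All numbers here are placeholders.
struct RepairTimeoutTuner {
    defer_repair_threshold: Duration,
    repair_request_timeout: Duration,
}

impl RepairTimeoutTuner {
    /// Call once per observation window (e.g. every 10 seconds) with the
    /// fraction of repaired shreds that turbine also delivered ("useless"
    /// repairs) and the fraction of repair requests that timed out.
    fn adjust(&mut self, useless_repair_ratio: f64, request_timeout_ratio: f64) {
        const STEP_UP: Duration = Duration::from_millis(5);
        const STEP_DOWN: Duration = Duration::from_millis(1);
        const MIN: Duration = Duration::from_millis(100);
        const MAX: Duration = Duration::from_millis(500);

        if useless_repair_ratio > 0.05 {
            // Too many repairs arrive via turbine anyway: wait longer before repairing.
            self.defer_repair_threshold = (self.defer_repair_threshold + STEP_UP).min(MAX);
        } else if request_timeout_ratio > 0.05 {
            // Too many requests never return in time: give responses more time.
            self.repair_request_timeout = (self.repair_request_timeout + STEP_UP).min(MAX);
        } else {
            // Everything looks healthy: slowly probe back down toward lower latency.
            self.defer_repair_threshold =
                self.defer_repair_threshold.saturating_sub(STEP_DOWN).max(MIN);
            self.repair_request_timeout =
                self.repair_request_timeout.saturating_sub(STEP_DOWN).max(MIN);
        }
    }
}
```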

@bw-solana (Author)

> To add my 50c here: maybe the right approach is to auto-tune the repair timeout? A one-size-fits-all approach may not suit all validators under all cluster load conditions. [...]

I agree. Dynamic values are obviously a little trickier and (more importantly) can be harder to reason about, but load and geo location impact the "right" value.

As far as the control mechanism goes, the tricky part is going to be defining what counts as "good" and "bad" behavior. I don't think we can just use the redundant repair rate. I think we want to optimize for (1) block delivery times and (2) the fewest repair requests.

Repair success rate generally floats around 2-3% because the most common case is requiring just a single shred via repair to reach the recover threshold for an erasure batch.

As a thought experiment, when we detect we're missing 40/64 shreds and 20/32 data shreds for some erasure batch, what should we do? We may currently request repairs for all 20 data shreds. At best (from a repair success rate perspective) only 8/20 will be successful because at this point we will reach recover threshold. Should we only request 8? This seems like a bad tradeoff for block times because waiting for the slowest out of 8 is going to be worse than waiting for the fastest 8 out of 20.
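
The arithmetic behind that thought experiment, written out as a sketch (the batch geometry of 32 data + 32 coding shreds, with any 32 of 64 sufficient for recovery, is the standard setup; the 40/64 and 20/32 figures are from the comment above):

```rust
fn main() {
    let batch_size = 64u32;         // 32 data + 32 coding shreds
    let recovery_threshold = 32u32; // any 32 of the 64 shreds allow recovery
    let missing_total = 40u32;      // shreds not yet received in the batch
    let missing_data = 20u32;       // data shreds not yet received

    let received = batch_size - missing_total;        // 24 shreds on hand
    let still_needed = recovery_threshold - received; // 8 more, of any kind

    // Requesting all 20 missing data shreds means at most 8 of those requests
    // can count as "successful" before recovery kicks in; the rest become
    // redundant, which is why raw repair success rate is a poor objective.
    println!("received {received}, need {still_needed} more, requested {missing_data}");
}
```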

In any event, I believe increasing these values will reduce unnecessary repairs by quite a bit in the short term and provide some breathing room to work on the right longer-term design.


@behzadnouri left a comment


lgtm assuming this is well tested.

@bw-solana (Author)

> lgtm assuming this is well tested.

Collected multiple days of data from 2 different nodes on mainnet.

That said, I will definitely need to monitor metrics on testnet as this code is adopted, to make sure block times don't increase and repair rates decline as expected.

@bw-solana merged commit d3e1514 into anza-xyz:master on Feb 11, 2025
66 checks passed
@bw-solana deleted the repair_tuning branch on February 11, 2025 17:12
@alexpyattaev

> As a thought experiment, when we detect we're missing 40/64 shreds and 20/32 data shreds for some erasure batch, what should we do? [...]

I have a few ideas for how this problem could potentially be addressed. Fundamentally, this is a variation of the multi-armed bandit problem, where we are trying to figure out the optimal strategy by observing the responses to our actions in the presence of noise. There are many agents that can work very well under these conditions, assuming we can figure out the reward metric we want to maximize. I will need to build some primitive simulations to explore and test, but generally these things are a fairly well-studied subject. If you think it is worth investing time into, of course.
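
As a sketch of that framing, here is a toy UCB1 bandit over a handful of candidate timeouts; the candidate arms and the reward definition are assumptions, not anything from the PR or the repair code:

```rust
// Toy UCB1 bandit: each "arm" is a candidate repair timeout, and each
// observation window reports a reward (e.g. some blend of block-completion
// time and redundant-repair rate, per the discussion above).
struct TimeoutBandit {
    arms_ms: Vec<u64>,     // candidate timeouts, e.g. [100, 150, 200, 250]
    pulls: Vec<u64>,       // how many windows each arm has been tried
    mean_reward: Vec<f64>, // running average reward per arm
}

impl TimeoutBandit {
    fn new(arms_ms: Vec<u64>) -> Self {
        let n = arms_ms.len();
        Self { arms_ms, pulls: vec![0; n], mean_reward: vec![0.0; n] }
    }

    /// Pick the timeout for the next window: exploit arms that have done well,
    /// but keep exploring arms that have rarely been tried.
    fn select(&self) -> u64 {
        let total: u64 = self.pulls.iter().sum();
        let best = (0..self.arms_ms.len())
            .max_by(|&a, &b| self.ucb(a, total).total_cmp(&self.ucb(b, total)))
            .expect("at least one arm");
        self.arms_ms[best]
    }

    fn ucb(&self, arm: usize, total: u64) -> f64 {
        if self.pulls[arm] == 0 {
            return f64::INFINITY; // try every arm at least once
        }
        let bonus = (2.0 * (total as f64).ln() / self.pulls[arm] as f64).sqrt();
        self.mean_reward[arm] + bonus
    }

    /// Record the reward observed while running with `timeout_ms`.
    fn update(&mut self, timeout_ms: u64, reward: f64) {
        if let Some(arm) = self.arms_ms.iter().position(|&t| t == timeout_ms) {
            self.pulls[arm] += 1;
            self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.pulls[arm] as f64;
        }
    }
}
```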
