Unthrottle repair requests #4485

Merged
bw-solana merged 2 commits from repair_client_throttling into anza-xyz:master on Jan 17, 2025

Conversation

@bw-solana commented Jan 15, 2025

Problem

Repair is a large bottleneck during catch-up. We cannot feed shreds to replay fast enough, and so nodes tend to fall behind during this period. The main problem is the repair request loop, which will request at most 512 shreds and then sleep for 100ms.

During normal runtime, this design also means repair requests for late shreds are delayed by ~50ms on average.
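
To make the bottleneck concrete, here is a minimal sketch of the throttled request loop described above; the constant and function names are assumptions for illustration, not the actual repair service code.

    use std::{thread, time::Duration};

    // Assumed names illustrating the pre-change behavior described above.
    const MAX_REPAIR_LENGTH: usize = 512; // at most this many shreds requested per pass
    const REPAIR_MS: u64 = 100; // fixed sleep between passes

    // Placeholder: the real service walks blockstore forks to find missing shreds.
    fn generate_repairs(max_requests: usize) -> Vec<(u64, u64)> {
        (0..max_requests as u64).map(|i| (0, i)).collect() // (slot, shred index) pairs
    }

    // Placeholder: the real service sends repair packets to peers.
    fn send_repair_requests(requests: &[(u64, u64)]) {
        println!("sent {} repair requests", requests.len());
    }

    fn main() {
        for _ in 0..3 {
            // Each pass requests at most MAX_REPAIR_LENGTH shreds...
            let requests = generate_repairs(MAX_REPAIR_LENGTH);
            send_repair_requests(&requests);
            // ...then sleeps, capping throughput at roughly 512 shreds per 100ms
            // and delaying a newly needed request by up to 100ms (~50ms on average).
            thread::sleep(Duration::from_millis(REPAIR_MS));
        }
    }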

Summary of Changes

  • Drastically reduce the sleep between iterations of the repair service loop, from 100ms down to 1ms
  • Remember which shreds have been requested in the past and the timestamp at which each request was made (see the sketch after this list)
  • Do not re-request the same repair until the previous request times out (100ms have passed)
  • Get rid of the best_orphans map, as it is superseded by the new repair request map plus a simple repair counter
  • Update existing unit tests to check the outstanding request length for each repair type, including verifying that we don't duplicate requests
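
A minimal sketch of the request bookkeeping described in the list above, assuming a plain HashMap keyed by the repair request with the request timestamp as the value; the type and constant names here are simplified stand-ins rather than the exact merged code.

    use std::collections::{hash_map::Entry, HashMap};
    use std::time::{SystemTime, UNIX_EPOCH};

    // Assumed name for the 100ms retry timeout described above.
    const REPAIR_REQUEST_TIMEOUT_MS: u64 = 100;

    // Stand-in for the timestamp() helper: milliseconds since the unix epoch.
    fn timestamp() -> u64 {
        SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64
    }

    // Simplified stand-in for ShredRepairType (the real enum has more variants).
    #[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
    enum RepairRequest {
        HighestShred { slot: u64, shred_index: u64 },
    }

    // Drop entries whose request has timed out so they become eligible for a retry.
    fn purge_expired(outstanding: &mut HashMap<RepairRequest, u64>) {
        let now = timestamp();
        outstanding.retain(|_, requested_at| now.saturating_sub(*requested_at) < REPAIR_REQUEST_TIMEOUT_MS);
    }

    // Only emit a request if there is no outstanding (un-expired) request for the same shred.
    fn request_if_needed(
        outstanding: &mut HashMap<RepairRequest, u64>,
        request: RepairRequest,
    ) -> Option<RepairRequest> {
        match outstanding.entry(request) {
            Entry::Vacant(entry) => {
                entry.insert(timestamp());
                Some(request)
            }
            Entry::Occupied(_) => None,
        }
    }

    fn main() {
        let mut outstanding = HashMap::new();
        let request = RepairRequest::HighestShred { slot: 42, shred_index: 7 };
        assert!(request_if_needed(&mut outstanding, request).is_some()); // first request goes out
        assert!(request_if_needed(&mut outstanding, request).is_none()); // duplicate is suppressed
        purge_expired(&mut outstanding); // after 100ms the entry expires and a retry is allowed
        println!("outstanding requests: {}", outstanding.len());
    }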

Testing Notes

[image: metrics capture]
This metrics capture shows several things:

  1. We are requesting ~16k repairs per second during startup. This is a ~4x increase over the current code.
  2. The requests per second drop dramatically once we catch up to the point where we started receiving turbine shreds. This verifies that we don't keep spamming requests once repairs are no longer needed.
  3. We are busy looking for, assembling, and sending out repair requests >95% of the time during startup. This seems fine. We have CPU to burn and getting the repaired shreds to feed replay is the most important thing.
  4. After catching up, repair service is busy ~60% of the time and tails off all the way down to <10% during steady state. This confirms that even after reducing the sleep interval from 100ms to 1ms, we're not burning a bunch of extra CPU.
  5. We can see this node is catching up to the cluster even during the repair phase. This is in contrast to current code where we fall even further behind during this phase. Note this was measured by comparing highest slot of turbine shreds received (cluster) vs. highest slot replayed (node).
  6. The overall repair catch-up time is a little under 5 minutes (on average this seems to take ~4 minutes after this change, and mostly depends on how long ledger load takes, i.e. how far behind we start). This is ~3x faster than before the change.
  7. While this isn't captured in the graph, replay_total_elapsed times are ~170ms during the repair phase compared to ~160ms during the next phase (where shreds are already sitting in blockstore). So sourcing shreds via repair has only a very small impact on replay efficiency. Previously, replay spent most of its time waiting on shreds from repair and took ~500ms per block.


@bw-solana force-pushed the repair_client_throttling branch from be61c3a to 334abba on January 16, 2025 04:08
@bw-solana force-pushed the repair_client_throttling branch from 637a1a7 to d9b8221 on January 16, 2025 22:18
@bw-solana marked this pull request as ready for review on January 16, 2025 23:15
@steviez commented Jan 17, 2025

Haven't read the code, but small nit: can you update the PR title to indicate that you're tuning the request side (as opposed to the serve side)?

@bw-solana mentioned this pull request on Jan 17, 2025
@bw-solana changed the title from "Unthrottle repair" to "Unthrottle repair requests" on Jan 17, 2025
@alessandrod left a comment

First pass - looks great!

You have this

            let repair_request = ShredRepairType::HighestShred(*slot, *received);
            if let Entry::Vacant(entry) = outstanding_repairs.entry(repair_request) {
                entry.insert(timestamp());
                Some(repair_request)
            } else {
                None
            }

repeated in a bunch of places. Maybe you could introduce a struct OutstandingRepairRequests or something with request_highest_shred_if_needed(slot, i) request_orphan_if_needed(slot) etc which return Option<ShredRepairType> and then you can if let and filter_map on that? Not sure exactly what the best API would be but looks like it can be unspaghettified. That said, the PR is in the middle of a lot of spaghetti, so feel free to ignore this suggestion.

Then we should probably add some tests that check that generate_whatever_repair(..., outstanding) called twice doesn't re-generate the same repair requests? And maybe a way to stub timestamp() so you can test expiration too?

@bw-solana (Author) replied:

> Maybe you could introduce a struct OutstandingRepairRequests or something with request_highest_shred_if_needed(slot, i) request_orphan_if_needed(slot) etc which return Option<ShredRepairType> and then you can if let and filter_map on that?

Added request_repair_if_needed to abstract this. I made it a little more generic so it can be used across all of the ShredRepairType variants.
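
For reference, a compact sketch of the shape such a helper might take, made generic over the key so the same function serves any ShredRepairType variant; this is an illustration derived from the snippet quoted above, not necessarily the exact merged signature.

    use std::collections::{hash_map::Entry, HashMap};
    use std::hash::Hash;
    use std::time::{SystemTime, UNIX_EPOCH};

    // Stand-in for the timestamp() helper: milliseconds since the unix epoch.
    fn timestamp() -> u64 {
        SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64
    }

    // Returns Some(request) and records the request time only if no request for the
    // same key is already outstanding; otherwise returns None.
    fn request_repair_if_needed<K: Eq + Hash + Copy>(
        outstanding_repairs: &mut HashMap<K, u64>,
        repair_request: K,
    ) -> Option<K> {
        if let Entry::Vacant(entry) = outstanding_repairs.entry(repair_request) {
            entry.insert(timestamp());
            Some(repair_request)
        } else {
            None
        }
    }

    // The "called twice doesn't re-generate the same request" property is then easy to assert:
    fn main() {
        let mut outstanding: HashMap<(u64, u64), u64> = HashMap::new();
        assert!(request_repair_if_needed(&mut outstanding, (42, 7)).is_some());
        assert!(request_repair_if_needed(&mut outstanding, (42, 7)).is_none());
    }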

> Then we should probably add some tests that check that generate_whatever_repair(..., outstanding) called twice doesn't re-generate the same repair requests?

Added coverage for this for each repair type.

> And maybe a way to stub timestamp() so you can test expiration too?

Punted on this one for now. I don't think we really need to stub - we could afford to just wait the 100ms. The annoying part is that the logic is inside of the main run loop, and spinning this up is heavy.

The nice thing is we have some implicit coverage of the request timeout retry. How do I know? Because I messed up the polarity on a first draft and test_fork_choice_refresh_old_votes started failing because it relies on some retried repair requests succeeding.

@bw-solana requested a review from alessandrod on January 17, 2025 03:43
@alessandrod commented:

2.1? 😋

@bw-solana requested a review from wen-coding on January 17, 2025 14:19
@bw-solana merged commit e6ed940 into anza-xyz:master on Jan 17, 2025
47 checks passed
@behzadnouri commented:

FYI, this commit has significantly and redundantly increased outgoing repair requests during steady state (meaning the node is already in sync with the cluster, not in the catch-up period).
Shreds inserted into blockstore from the repair path have also redundantly increased by 3x (again during steady state, not the catch-up period).

I am also concerned about how this works out during cluster-wide repair spikes if all nodes increase their repairs by this much.

The left chunk is with this code; the right is the commit before it.

[charts: repaired_ratio, repair_total]

@bw-solana (Author) replied:


Confirming I've seen much of the same thing. Actively running experiments to tune the repair delay. I suspect the increase is largely from decreasing the average effective delay from 250ms to 200ms.

Behzad, you mentioned seeing increases in both redundant requests and receives. The receives part I understand. I haven't observed (and can't explain) a large uptick in redundant requests - only in overall, unnecessary requests. I say unnecessary because turbine ultimately delivers them before repair. Have you seen lots of duplicate repair requests?

@bw-solana (Author) commented Jan 24, 2025

Collecting some repair data for a node in SG on mainnet. Most data collection periods were ~1 hour (some were much longer):

| DEFER_REPAIR_THRESHOLD (ms) | REPAIR_REQUEST_TIMEOUT_MS | Repairs per second (runtime) | Block full time (ms) | Successful repair % |
|---|---|---|---|---|
| 200 | 100 | 110 | 475 | 2.5 |
| 250 | 100 | 68 | 475 | 2.7 |
| 300 | 100 | 52 | 485 | 2.9 |
| 300 | 150 | 52 | 490 | 3.2 |
| 250 | 150 | 58 | 475 | 2.95 |
| 200 | 150 | 65 | 475 | 2.5 |


The next step will be repeating the experiments in another geo or two.

I suspect we can safely increase DEFER_REPAIR_THRESHOLD to 250 and REPAIR_REQUEST_TIMEOUT_MS to 150, reducing the repair rate quite a bit without increasing block times.
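
As a rough illustration, the proposed retune amounts to something like the following; the constant names follow the table above, while their exact types and locations in the codebase are assumptions.

    use std::time::Duration;

    // Sketch of the proposed values from the experiments above, not merged code.
    // Roughly: how long to give turbine before a recent missing shred is repaired (baseline 200ms).
    const DEFER_REPAIR_THRESHOLD: Duration = Duration::from_millis(250);
    // How long an outstanding repair request suppresses a re-request (baseline 100ms).
    const REPAIR_REQUEST_TIMEOUT_MS: u64 = 150;

    fn main() {
        println!(
            "defer repair threshold: {:?}, repair request timeout: {} ms",
            DEFER_REPAIR_THRESHOLD, REPAIR_REQUEST_TIMEOUT_MS
        );
    }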
