Unthrottle repair requests #4485

Merged
bw-solana merged 2 commits from repair_client_throttling into anza-xyz:master on Jan 17, 2025

Conversation

@bw-solana commented Jan 15, 2025

Problem

Repair is a large bottleneck during catch-up. We cannot feed shreds to replay fast enough, and so nodes tend to fall behind during this period. The main problem is the repair request loop, which will request at most 512 shreds and then sleep for 100ms.

During normal runtime, this design also means repair requests for late shreds are delayed by ~50ms on average.
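
To make the bottleneck concrete, here is a minimal sketch of the throttled request loop described above; the constant and function names are assumptions for illustration, not the actual repair service code.

    use std::{thread, time::Duration};

    // Assumed names illustrating the pre-change behavior described above.
    const MAX_REPAIR_LENGTH: usize = 512; // at most this many shreds requested per pass
    const REPAIR_MS: u64 = 100; // fixed sleep between passes

    // Placeholder: the real service walks blockstore forks to find missing shreds.
    fn generate_repairs(max_requests: usize) -> Vec<(u64, u64)> {
        (0..max_requests as u64).map(|i| (0, i)).collect() // (slot, shred index) pairs
    }

    // Placeholder: the real service sends repair packets to peers.
    fn send_repair_requests(requests: &[(u64, u64)]) {
        println!("sent {} repair requests", requests.len());
    }

    fn main() {
        for _ in 0..3 {
            // Each pass requests at most MAX_REPAIR_LENGTH shreds...
            let requests = generate_repairs(MAX_REPAIR_LENGTH);
            send_repair_requests(&requests);
            // ...then sleeps, capping throughput at roughly 512 shreds per 100ms
            // and delaying a newly needed request by up to 100ms (~50ms on average).
            thread::sleep(Duration::from_millis(REPAIR_MS));
        }
    }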

Summary of Changes

  • Drastically reduce the sleep between iterations of the repair service loop, from 100ms down to 1ms
  • Remember which shreds have been requested in the past and the timestamp at which each request was made (see the sketch after this list)
  • Do not re-request the same repair until the previous request times out (100ms have passed)
  • Get rid of the best_orphans map, as it is superseded by the new repair request map plus a simple repair counter
  • Update existing unit tests to check the outstanding request length for each repair type, including verifying that we don't duplicate requests
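
A minimal sketch of the request bookkeeping described in the list above, assuming a plain HashMap keyed by the repair request with the request timestamp as the value; the type and constant names here are simplified stand-ins rather than the exact merged code.

    use std::collections::{hash_map::Entry, HashMap};
    use std::time::{SystemTime, UNIX_EPOCH};

    // Assumed name for the 100ms retry timeout described above.
    const REPAIR_REQUEST_TIMEOUT_MS: u64 = 100;

    // Stand-in for the timestamp() helper: milliseconds since the unix epoch.
    fn timestamp() -> u64 {
        SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64
    }

    // Simplified stand-in for ShredRepairType (the real enum has more variants).
    #[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
    enum RepairRequest {
        HighestShred { slot: u64, shred_index: u64 },
    }

    // Drop entries whose request has timed out so they become eligible for a retry.
    fn purge_expired(outstanding: &mut HashMap<RepairRequest, u64>) {
        let now = timestamp();
        outstanding.retain(|_, requested_at| now.saturating_sub(*requested_at) < REPAIR_REQUEST_TIMEOUT_MS);
    }

    // Only emit a request if there is no outstanding (un-expired) request for the same shred.
    fn request_if_needed(
        outstanding: &mut HashMap<RepairRequest, u64>,
        request: RepairRequest,
    ) -> Option<RepairRequest> {
        match outstanding.entry(request) {
            Entry::Vacant(entry) => {
                entry.insert(timestamp());
                Some(request)
            }
            Entry::Occupied(_) => None,
        }
    }

    fn main() {
        let mut outstanding = HashMap::new();
        let request = RepairRequest::HighestShred { slot: 42, shred_index: 7 };
        assert!(request_if_needed(&mut outstanding, request).is_some()); // first request goes out
        assert!(request_if_needed(&mut outstanding, request).is_none()); // duplicate is suppressed
        purge_expired(&mut outstanding); // after 100ms the entry expires and a retry is allowed
        println!("outstanding requests: {}", outstanding.len());
    }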

Testing Notes

[image: metrics capture]
This metrics capture shows several things:

  1. We are requesting ~16k repairs per second during startup. This is a ~4x increase over the current code.
  2. The requests per second drop dramatically once we catch up to the point where we started receiving turbine shreds. This verifies that we don't keep spamming requests once repairs are no longer needed.
  3. We are busy looking for, assembling, and sending out repair requests >95% of the time during startup. This seems fine. We have CPU to burn and getting the repaired shreds to feed replay is the most important thing.
  4. After catching up, repair service is busy ~60% of the time and tails off all the way down to <10% during steady state. This confirms that even after reducing the sleep interval from 100ms to 1ms, we're not burning a bunch of extra CPU.
  5. We can see this node is catching up to the cluster even during the repair phase. This is in contrast to current code where we fall even further behind during this phase. Note this was measured by comparing highest slot of turbine shreds received (cluster) vs. highest slot replayed (node).
  6. The overall repair catch-up time is a little under 5 minutes (on average this seems to take ~4 minutes after this change, and mostly depends on how long ledger load takes, i.e. how far behind we start). This is ~3x faster than before the change.
  7. While this isn't captured in the graph, replay_total_elapsed times are ~170ms during the repair phase compared to ~160ms during the next phase (where shreds are already sitting in blockstore). So sourcing shreds via repair has only a very small impact on replay efficiency. Previously, replay spent most of its time waiting on shreds from repair and took ~500ms per block.


@bw-solana force-pushed the repair_client_throttling branch from be61c3a to 334abba on January 16, 2025 04:08
@bw-solana force-pushed the repair_client_throttling branch from 637a1a7 to d9b8221 on January 16, 2025 22:18
@bw-solana marked this pull request as ready for review on January 16, 2025 23:15
@steviez commented Jan 17, 2025

Haven't read the code, but small nit: can you update the PR title to indicate that you're tuning the request side (as opposed to the serve side)?

@bw-solana mentioned this pull request on Jan 17, 2025
@bw-solana changed the title from "Unthrottle repair" to "Unthrottle repair requests" on Jan 17, 2025
@alessandrod left a comment

First pass - looks great!

You have this

            let repair_request = ShredRepairType::HighestShred(*slot, *received);
            if let Entry::Vacant(entry) = outstanding_repairs.entry(repair_request) {
                entry.insert(timestamp());
                Some(repair_request)
            } else {
                None
            }

repeated in a bunch of places. Maybe you could introduce a struct OutstandingRepairRequests or something with request_highest_shred_if_needed(slot, i) request_orphan_if_needed(slot) etc which return Option<ShredRepairType> and then you can if let and filter_map on that? Not sure exactly what the best API would be but looks like it can be unspaghettified. That said, the PR is in the middle of a lot of spaghetti, so feel free to ignore this suggestion.

Then we should probably add some tests that check that generate_whatever_repair(..., outstanding) called twice doesn't re-generate the same repair requests? And maybe a way to stub timestamp() so you can test expiration too?

@bw-solana (Author) replied:

> Maybe you could introduce a struct OutstandingRepairRequests or something with request_highest_shred_if_needed(slot, i) request_orphan_if_needed(slot) etc which return Option<ShredRepairType> and then you can if let and filter_map on that?

Added request_repair_if_needed to abstract this. I made it a little more generic so it can be used across all of the ShredRepairType variants.
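
For reference, a compact sketch of the shape such a helper might take, made generic over the key so the same function serves any ShredRepairType variant; this is an illustration derived from the snippet quoted above, not necessarily the exact merged signature.

    use std::collections::{hash_map::Entry, HashMap};
    use std::hash::Hash;
    use std::time::{SystemTime, UNIX_EPOCH};

    // Stand-in for the timestamp() helper: milliseconds since the unix epoch.
    fn timestamp() -> u64 {
        SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64
    }

    // Returns Some(request) and records the request time only if no request for the
    // same key is already outstanding; otherwise returns None.
    fn request_repair_if_needed<K: Eq + Hash + Copy>(
        outstanding_repairs: &mut HashMap<K, u64>,
        repair_request: K,
    ) -> Option<K> {
        if let Entry::Vacant(entry) = outstanding_repairs.entry(repair_request) {
            entry.insert(timestamp());
            Some(repair_request)
        } else {
            None
        }
    }

    // The "called twice doesn't re-generate the same request" property is then easy to assert:
    fn main() {
        let mut outstanding: HashMap<(u64, u64), u64> = HashMap::new();
        assert!(request_repair_if_needed(&mut outstanding, (42, 7)).is_some());
        assert!(request_repair_if_needed(&mut outstanding, (42, 7)).is_none());
    }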

> Then we should probably add some tests that check that generate_whatever_repair(..., outstanding) called twice doesn't re-generate the same repair requests?

Added coverage for this for each repair type.

> And maybe a way to stub timestamp() so you can test expiration too?

Punted on this one for now. I don't think we really need to stub - we could afford to just wait the 100ms. The annoying part is that the logic is inside of the main run loop, and spinning this up is heavy.

The nice thing is we have some implicit coverage of the request timeout retry. How do I know? Because I messed up the polarity on a first draft and test_fork_choice_refresh_old_votes started failing because it relies on some retried repair requests succeeding.

@bw-solana requested a review from alessandrod on January 17, 2025 03:43
@alessandrod commented:

2.1? 😋

@bw-solana requested a review from wen-coding on January 17, 2025 14:19
@bw-solana merged commit e6ed940 into anza-xyz:master on Jan 17, 2025
47 checks passed
@behzadnouri commented:

FYI, this commit has significantly and redundantly increased outgoing repair requests during steady state (meaning the node is already in sync with the cluster, not in the catch-up period).
Shreds inserted into blockstore from the repair path have also redundantly increased by 3x (again during steady state, not the catch-up period).

I am also concerned about how this works out during cluster-wide repair spikes if all nodes increase their repairs by this much.

The left chunk is with this code; the right is the commit before it.

[charts: repaired_ratio, repair_total]

@bw-solana (Author) replied:


Confirming I've seen much of the same thing. Actively running experiments to tune the repair delay. I suspect the increase is largely from decreasing the average effective delay from 250ms to 200ms.

Behzad, you mentioned seeing increases in both redundant requests and receives. The receives part I understand. I haven't observed (and can't explain) a large uptick in redundant requests - only in overall, unnecessary requests. I say unnecessary because turbine ultimately delivers them before repair. Have you seen lots of duplicate repair requests?

@bw-solana (Author) commented Jan 24, 2025

Collecting some repair data for a node in SG on mainnet. Most data collection periods were ~1 hour (some were much longer):

| DEFER_REPAIR_THRESHOLD (ms) | REPAIR_REQUEST_TIMEOUT_MS | Repairs per second (runtime) | Block full time (ms) | Successful repair % |
|---|---|---|---|---|
| 200 | 100 | 110 | 475 | 2.5 |
| 250 | 100 | 68 | 475 | 2.7 |
| 300 | 100 | 52 | 485 | 2.9 |
| 300 | 150 | 52 | 490 | 3.2 |
| 250 | 150 | 58 | 475 | 2.95 |
| 200 | 150 | 65 | 475 | 2.5 |


The next step will be repeating the experiments in another geo or two.

I suspect we can safely increase DEFER_REPAIR_THRESHOLD to 250 and REPAIR_REQUEST_TIMEOUT_MS to 150, reducing the repair rate quite a bit without increasing block times.
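
As a rough illustration, the proposed retune amounts to something like the following; the constant names follow the table above, while their exact types and locations in the codebase are assumptions.

    use std::time::Duration;

    // Sketch of the proposed values from the experiments above, not merged code.
    // Roughly: how long to give turbine before a recent missing shred is repaired (baseline 200ms).
    const DEFER_REPAIR_THRESHOLD: Duration = Duration::from_millis(250);
    // How long an outstanding repair request suppresses a re-request (baseline 100ms).
    const REPAIR_REQUEST_TIMEOUT_MS: u64 = 150;

    fn main() {
        println!(
            "defer repair threshold: {:?}, repair request timeout: {} ms",
            DEFER_REPAIR_THRESHOLD, REPAIR_REQUEST_TIMEOUT_MS
        );
    }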
