Reapply PR#31: optimize retry pool #5113

KirillLykov · 2025-03-02T19:28:48Z

Problem

Transactions not included in the retry pool on full utilization

This PR adds back #31.

Summary of Changes

do not insert transactions with zero max_retries into the retry pool
remove transactions that have reached max_retries in the same iteration of the loop
dynamically select the sleep time between iterations based on last_sent_time in TransactionInfo

mergify · 2025-03-03T13:46:25Z

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

KirillLykov · 2025-03-03T18:06:03Z

send-transaction-service/src/send_transaction_service.rs

@@ -298,9 +308,17 @@ impl SendTransactionService {
                    {
                        // take a lock of retry_transactions and move the batch to the retry set.
                        let mut retry_transactions = retry_transactions.lock().unwrap();
-                        let transactions_to_retry = transactions.len();
+                        let mut transactions_to_retry: usize = 0;


The semantics of transactions_to_retry changed: it is now the number of transactions that haven't reached the retry limit.
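
To illustrate the new meaning of the counter, here is a minimal sketch, not the PR's actual code. `TxInfo`, `move_to_retry_pool`, and the `u64` signature key are hypothetical stand-ins for the service's real types. It assumes a transaction is skipped when its retry count has already reached its limit, which includes the `max_retries == 0` case.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Simplified, hypothetical stand-in for the service's TransactionInfo.
#[derive(Debug, Clone)]
pub struct TxInfo {
    pub retries: usize,
    pub max_retries: Option<usize>,
}

/// Moves a batch into the retry pool and returns how many transactions
/// were actually added, i.e. those that can still be retried.
pub fn move_to_retry_pool(
    batch: Vec<(u64, TxInfo)>,
    retry_pool: &mut HashMap<u64, TxInfo>,
) -> usize {
    let mut transactions_to_retry: usize = 0;
    for (signature, info) in batch {
        // Skip transactions whose retry budget is already used up;
        // this also covers max_retries == 0.
        let exhausted = info.max_retries.map_or(false, |max| info.retries >= max);
        if exhausted {
            continue;
        }
        // Count only entries that were not already in the pool.
        if let Entry::Vacant(slot) = retry_pool.entry(signature) {
            slot.insert(info);
            transactions_to_retry += 1;
        }
    }
    transactions_to_retry
}
```

With this shape the counter can be smaller than the batch length, which is the semantic change the comment points out.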

KirillLykov · 2025-03-03T18:23:36Z

send-transaction-service/src/send_transaction_service.rs

@@ -369,6 +388,17 @@ impl SendTransactionService {
                        stats,
                    );
                    stats_report.report();
+
+                    // to send transactions as soon as possible we adjust retry interval
+                    retry_interval_ms = retry_interval_ms_default


This is just retry_interval_ms_default - (ms since last send). It prevents sleeping longer than retry_interval_ms_default, because some time has already passed between the moment we sent and the current moment.
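
The arithmetic can be sketched as follows. This is an illustration under assumptions, not the code in the diff: `next_sleep` is a hypothetical helper, and it works with `Duration` and `Instant` rather than raw milliseconds. Saturating operations keep the result from underflowing when more than the default interval has already elapsed.

```rust
use std::time::{Duration, Instant};

/// Returns how long to sleep before the next retry pass:
/// the default interval minus the time elapsed since the last send,
/// clamped at zero.
pub fn next_sleep(
    retry_interval_default: Duration,
    last_sent_time: Option<Instant>,
    now: Instant,
) -> Duration {
    match last_sent_time {
        // Nothing has been sent yet, so use the full default interval.
        None => retry_interval_default,
        Some(sent) => {
            let elapsed = now.saturating_duration_since(sent);
            retry_interval_default.saturating_sub(elapsed)
        }
    }
}
```

For example, with a 2000 ms default and a send that happened 500 ms ago, the loop would sleep for 1500 ms instead of the full 2000 ms.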

steveluscher

Thanks for this fix, @fanatid! I understand this PR, but really wish it was 3 PRs that each did one of the things in the description. Don't bother splitting it up now, but just allow me to lodge a complaint that it would have been easier to review as three tiny changes.

Unrelated to this PR: there are things I don't understand about this subsystem, that maybe the reader can help me work through.

I don't understand why we send transactions in batches. The actual sender (eg. the QUIC sender) sends them in a loop, one at a time, which doesn't sound like a batch to me. Why not just send each transaction immediately upon receiving it or observing its retry interval elapse?
Why do we call send_transactions_in_batch in two places? I feel like this code could be made much simpler if we had a single queue, whose entries were roughly of the shape (send_deadline, transaction_info), and a queue processor.
- Retrying a transaction would involve decrementing its remaining retries and throwing it on the end of the queue with a new send_deadline.
- Expiring a transaction would involve doing nothing. Just don't send it or re-add it to the queue.
- The queue processor could sleep until the next send deadline, then consume all entries of the queue whose send deadline has been exceeded.
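
A rough sketch of the single-queue idea above, to make the proposal concrete. Everything here is hypothetical (`QueueEntry`, `SendQueue`, the `u64` id in place of a real transaction), and it uses a min-heap ordered by deadline rather than a plain FIFO, on the assumption that new transactions and retries may arrive with deadlines out of order.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

/// Hypothetical queue entry. Field order matters: the derived ordering
/// compares `send_deadline` first.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct QueueEntry {
    pub send_deadline: Instant,
    pub id: u64,
    pub retries_remaining: usize,
}

#[derive(Default)]
pub struct SendQueue {
    // `Reverse` turns the max-heap into a min-heap on the deadline.
    heap: BinaryHeap<Reverse<QueueEntry>>,
}

impl SendQueue {
    pub fn push(&mut self, entry: QueueEntry) {
        self.heap.push(Reverse(entry));
    }

    /// How long the processor may sleep until the earliest deadline.
    pub fn time_until_next(&self, now: Instant) -> Option<Duration> {
        self.heap
            .peek()
            .map(|Reverse(e)| e.send_deadline.saturating_duration_since(now))
    }

    /// Pops every entry whose deadline has passed and returns the ids to
    /// send. Entries with retries left are re-queued with a new deadline;
    /// entries without retries are simply dropped after this send.
    pub fn process_due(&mut self, now: Instant, retry_interval: Duration) -> Vec<u64> {
        let mut to_send = Vec::new();
        let mut requeue = Vec::new();
        while self
            .heap
            .peek()
            .map_or(false, |Reverse(e)| e.send_deadline <= now)
        {
            let Reverse(entry) = self.heap.pop().expect("peeked entry exists");
            to_send.push(entry.id);
            if entry.retries_remaining > 0 {
                requeue.push(QueueEntry {
                    send_deadline: now + retry_interval,
                    retries_remaining: entry.retries_remaining - 1,
                    ..entry
                });
            }
        }
        // Re-queue after the loop so a zero interval cannot spin forever.
        for entry in requeue {
            self.push(entry);
        }
        to_send
    }
}
```

A processor loop would then alternate between sleeping for `time_until_next` and calling `process_due`, which replaces both the batching and the separate retry pass.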

steveluscher · 2025-03-04T01:28:16Z

send-transaction-service/src/send_transaction_service.rs

+    fn get_max_retries(
+        &self,
+        default_max_retries: Option<usize>,
+        service_max_retries: usize,
+    ) -> Option<usize> {
+        self.max_retries
+            .or(default_max_retries)
+            .map(|max_retries| max_retries.min(service_max_retries))
+    }


I'm not actually that bullish on extracting this logic because I think it makes things harder to read, but if we're going to then we should catch all of the other places this happens in this file (eg. L463-466).

I think you could delete a ton of code if, instead of tracking max_retries and retries, you changed TransactionInfo to track retries_remaining.
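
Roughly what that could look like, as a sketch only. `RetryBudget` and `try_consume` are hypothetical names, and the constructor reuses the resolution order from the `get_max_retries` hunk above (per-transaction value, then default, capped by the service maximum). `None` is assumed to mean "no limit".

```rust
/// Hypothetical replacement for the (max_retries, retries) pair:
/// only the remaining budget is stored. `None` means no limit.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RetryBudget {
    pub retries_remaining: Option<usize>,
}

impl RetryBudget {
    /// Resolves the budget once, at creation time.
    pub fn new(
        max_retries: Option<usize>,
        default_max_retries: Option<usize>,
        service_max_retries: usize,
    ) -> Self {
        let retries_remaining = max_retries
            .or(default_max_retries)
            .map(|n| n.min(service_max_retries));
        Self { retries_remaining }
    }

    /// Consumes one retry. Returns false once the budget is exhausted,
    /// in which case the transaction should be dropped.
    pub fn try_consume(&mut self) -> bool {
        match &mut self.retries_remaining {
            None => true,
            Some(0) => false,
            Some(n) => {
                *n -= 1;
                true
            }
        }
    }
}
```

The cap is applied once instead of at every comparison site, so the retry loop only needs a single `try_consume` check per transaction.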

steveluscher · 2025-03-04T02:32:35Z

Before landing this, can someone flesh out the PR description a bit? I'm not sure that future readers will know what ‘transactions not included in the retry pool on full utilization’ means (I don't).

optimize retry pool (reapply PR#31)

b05628a

KirillLykov requested a review from fanatid March 2, 2025 19:28

KirillLykov mentioned this pull request Mar 3, 2025

Add support of tpu-client-next to validator #3454

Open

10 tasks

KirillLykov added the v2.2 Backport to v2.2 branch label Mar 3, 2025

KirillLykov commented Mar 3, 2025

KirillLykov marked this pull request as ready for review March 3, 2025 18:50

KirillLykov requested a review from steveluscher March 3, 2025 18:50

steveluscher approved these changes Mar 4, 2025
