
feat(relayer): don't retry old messages #5455

Open · daniel-savu wants to merge 4 commits into main

Conversation

daniel-savu (Contributor)

Description

We've recently noticed that the submitter attempts to process many old messages at the same time because their next_retry_attempts are very similar. These old messages are unlikely to be processable and in fact starve new, "healthy" messages, sometimes lowering throughput.

There was an attempt to make this less of an issue in #5416, but we decided old messages are not worth retrying at all, given that we offer the ability to deliver messages on demand via the hyperlane CLI.

Not retrying old messages means we don't even have to push them to the submitter when loaded from the db - we just skip them.

This PR:

  • introduces a MAX_MESSAGE_RETRIES env var that defaults to 70 if not set. The default of 70 is picked under the assumption that the first 48 retries take about 1 day (formula), and the remaining 22 retries take at least 21 * 22 / 2 hours (see the formula), which is roughly 11 days. In total that is almost 2 weeks, which @nambrot confirmed is how long we should retry for. If the default turns out to be wrong, a higher value can be set via the env var to override it. I also manually checked the queues for the arbitrum and base destinations: no message had a retry count of 100, but most old ones had more than 70 attempts.
  • skips messages read from the db whose retry count is >= MAX_MESSAGE_RETRIES
  • for messages already loaded in the submitter, sets Duration::from_secs(u32::MAX) as the delay until the next retry attempt, which is far enough into the future that the message is never attempted again (a minimal sketch of both behaviors follows this list)
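
A minimal sketch of those last two bullets, using illustrative names and a placeholder backoff rather than the PR's actual code (the real logic lives in `PendingMessage` and `calculate_msg_backoff`):

```rust
use std::time::{Duration, Instant};

// Messages read from the db that already exceeded the cap are skipped outright.
fn should_load_from_db(num_retries: u32, max_retries: Option<u32>) -> bool {
    !matches!(max_retries, Some(max) if num_retries >= max)
}

// Messages already loaded in the submitter get a next attempt so far in the
// future that they are effectively never retried; otherwise a normal backoff
// (a placeholder here) applies.
fn next_attempt_after(num_retries: u32, max_retries: Option<u32>) -> Instant {
    if matches!(max_retries, Some(max) if num_retries >= max) {
        Instant::now() + Duration::from_secs(u32::MAX as u64)
    } else {
        Instant::now() + Duration::from_secs(60 * u64::from(num_retries.max(1)))
    }
}
```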

Prepare queues are expected to get much smaller because of the skipping, which hopefully also lowers memory pressure.

Drive-by changes

  • Makes the db field in MessageContext a trait object for easier future mocking

Related issues

Backward compatibility

Yes - if MAX_MESSAGE_RETRIES is set to a very high value, it's as if this feature is disabled
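
For reference, a hedged sketch of what the fallback might look like; `DEFAULT_MAX_MESSAGE_RETRIES` appears in the diff below, but the parsing details here are assumptions:

```rust
const DEFAULT_MAX_MESSAGE_RETRIES: u32 = 70;

// Setting MAX_MESSAGE_RETRIES to a very large value (e.g. u32::MAX) makes the
// cap unreachable in practice, which is what "disabled" means above.
fn max_message_retries() -> u32 {
    std::env::var("MAX_MESSAGE_RETRIES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(DEFAULT_MAX_MESSAGE_RETRIES)
}
```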

Testing

Unit tests cover the message backoff logic and the inner logic of maybe_from_persisted_retries. The env var loading isn't tested, but it is very similar to how we load env vars elsewhere - longer term it should be turned into a proper agent config setting (the other ad hoc env vars too).


changeset-bot bot commented Feb 12, 2025

⚠️ No Changeset found

Latest commit: 37ba577

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types



codecov bot commented Feb 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.53%. Comparing base (d6724c4) to head (39a7746).
Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5455   +/-   ##
=======================================
  Coverage   77.53%   77.53%           
=======================================
  Files         103      103           
  Lines        2110     2110           
  Branches      190      190           
=======================================
  Hits         1636     1636           
  Misses        453      453           
  Partials       21       21           
| Components | Coverage Δ |
| --- | --- |
| core | 87.80% <ø> (ø) |
| hooks | 79.39% <ø> (ø) |
| isms | 83.68% <ø> (ø) |
| token | 91.27% <ø> (ø) |
| middlewares | 79.80% <ø> (ø) |

PendingMessage::calculate_msg_backoff(num_retries).map(|dur| Instant::now() + dur);
pm.num_retries = num_retries;
pm.next_attempt_after = next_attempt_after;
fn get_message_status(
daniel-savu (Contributor, Author)

this logic is unchanged, just refactored out of from_persisted_retries for easier unit testing

Collaborator

Nice!

Comment on lines +556 to +561
// Skip this message if it has been retried too many times
if let Some(max_retries) = retries_before_skipping {
if num_retries >= max_retries {
return None;
}
}
daniel-savu (Contributor, Author)

this is new logic, the rest is refactored out of from_persisted_retries

fn fmt<'a>(&self, f: &mut std::fmt::Formatter<'a>) -> std::fmt::Result;
}

impl HyperlaneDb for Db {
daniel-savu (Contributor, Author)

verbose but that's just mockall unfortunately

msg,
self.destination_ctxs[&destination].clone(),
app_context,
Some(DEFAULT_MAX_MESSAGE_RETRIES),
daniel-savu (Contributor, Author)

I should read the env var here as well

.origin_db
fn maybe_get_num_retries(
origin_db: Arc<dyn HyperlaneDb>,
message: HyperlaneMessage,
Contributor

This method does not need to take ownership of message; I wonder if passing it as a reference would be sufficient.

pm.next_attempt_after = next_attempt_after;
fn get_message_status(
origin_db: Arc<dyn HyperlaneDb>,
message: HyperlaneMessage,
Contributor

This method does not need to take ownership of message; I wonder if passing it as a reference would be sufficient.

retries_before_skipping,
)?;

let message_status = Self::get_message_status(ctx.origin_db.clone(), message.clone());
Contributor

We can avoid the clone here if get_message_status does not take ownership of its message parameter.
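
A self-contained illustration of the point raised in these threads, with placeholder types standing in for `HyperlaneDb` and `HyperlaneMessage`: if the helper borrows the message, the caller can drop the clone.

```rust
use std::sync::Arc;

#[derive(Clone, Default)]
struct Msg {
    nonce: u32, // stand-in for HyperlaneMessage
}

fn takes_ownership(_db: Arc<()>, msg: Msg) -> u32 {
    msg.nonce
}

fn takes_reference(_db: Arc<()>, msg: &Msg) -> u32 {
    msg.nonce
}

fn main() {
    let db = Arc::new(());
    let msg = Msg::default();
    let _a = takes_ownership(db.clone(), msg.clone()); // clone forced on the caller
    let _b = takes_reference(db, &msg); // no clone needed
}
```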

Comment on lines +874 to +878
const SECS_PER_MINUTE: u64 = 60;
const MINS_PER_HOUR: u64 = 60;
const HOURS_PER_DAY: u64 = 24;
const DAYS_PER_WEEK: u64 = 7;
const SECS_PER_WEEK: u64 = SECS_PER_MINUTE * MINS_PER_HOUR * HOURS_PER_DAY * DAYS_PER_WEEK;
Contributor

Suggested change
const SECS_PER_MINUTE: u64 = 60;
const MINS_PER_HOUR: u64 = 60;
const HOURS_PER_DAY: u64 = 24;
const DAYS_PER_WEEK: u64 = 7;
const SECS_PER_WEEK: u64 = SECS_PER_MINUTE * MINS_PER_HOUR * HOURS_PER_DAY * DAYS_PER_WEEK;
let duration = Duration::from_secs(chrono::Duration::weeks(1).num_seconds() as u64)

Comment on lines +748 to +753
use crate::msg::pending_message::DEFAULT_MAX_MESSAGE_RETRIES;
use hyperlane_base::db::*;
use hyperlane_core::*;
use std::{fmt::Debug, sync::Arc};

use super::PendingMessage;
@ameten (Contributor) commented Feb 13, 2025

nit

Suggested change
use crate::msg::pending_message::DEFAULT_MAX_MESSAGE_RETRIES;
use hyperlane_base::db::*;
use hyperlane_core::*;
use std::{fmt::Debug, sync::Arc};
use super::PendingMessage;
use std::{fmt::Debug, sync::Arc};
use hyperlane_base::db::*;
use hyperlane_core::*;
use crate::msg::pending_message::DEFAULT_MAX_MESSAGE_RETRIES;
use super::PendingMessage;

Comment on lines +889 to +910
fn dummy_db_with_retries(retries: u32) -> MockDb {
let mut db = MockDb::new();
db.expect_retrieve_pending_message_retry_count_by_message_id()
.returning(move |_| Ok(Some(retries)));
db
}

fn assert_get_num_retries(
mock_retries: u32,
expected_retries: Option<u32>,
retries_before_skipping: Option<u32>,
) {
let db = dummy_db_with_retries(mock_retries);
let num_retries = PendingMessage::maybe_get_num_retries(
Arc::new(db),
HyperlaneMessage::default(),
retries_before_skipping,
);

// the computed retry count should match the expectation for the given `retries_before_skipping`
assert_eq!(num_retries, expected_retries);
}
@ameten (Contributor) commented Feb 13, 2025

nit: move these auxiliary methods to the bottom of the file

@@ -516,44 +517,37 @@ impl PendingOperation for PendingMessage {

impl PendingMessage {
/// Constructor that tries reading the retry count from the HyperlaneDB in order to recompute the `next_attempt_after`.
/// If the message has been retried more than `max_retries`, it will return `None`.
Collaborator

nit: max_retries seems stale

@@ -562,15 +556,45 @@ impl PendingMessage {
0
}
};
// Skip this message if it has been retried too many times
if let Some(max_retries) = retries_before_skipping {
Collaborator

feels a little awkward for this function to be the place where the decision of whether to skip is made?

I would've thought that we'd have get_num_retries always return an accurate u32 (easier to reason about, imo), and then in maybe_from_persisted_retries we make the judgement call whether to ignore the message based on that number. The benefit is that the retry-skipping logic isn't intertwined across multiple places.
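
A rough sketch of that alternative shape, with placeholder types rather than the PR's code: the lookup always yields a number, and only the caller decides whether to skip.

```rust
// `persisted` stands in for the db lookup result; the returned Option<u32>
// stands in for Option<PendingMessage>.
fn get_num_retries(persisted: Option<u32>) -> u32 {
    persisted.unwrap_or(0)
}

fn maybe_from_persisted_retries(
    persisted: Option<u32>,
    retries_before_skipping: Option<u32>,
) -> Option<u32> {
    let num_retries = get_num_retries(persisted);
    // the skip decision lives here, in exactly one place
    if matches!(retries_before_skipping, Some(max) if num_retries >= max) {
        return None;
    }
    Some(num_retries)
}
```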

@@ -651,6 +675,10 @@ impl PendingMessage {
/// given the number of retries.
/// `pub(crate)` for testing purposes
pub(crate) fn calculate_msg_backoff(num_retries: u32) -> Option<Duration> {
let max_retries = std::env::var("MAX_MESSAGE_RETRIES")
Collaborator

Is there any way to make this not need to read/parse the env var over and over again? We call this many times, especially upon startup -- i.e. add a max_retries: Option<u32> param or something.

would also probably be compatible with pulling out this logic to get access to the max retries for calling maybe_from_persisted_retries as well
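
A sketch of that suggestion under an assumed signature (not the PR's actual code): the caller resolves the cap once and threads it through, so the env var is never parsed in the hot path.

```rust
use std::time::Duration;

// Hypothetical parameterized variant: the caller reads MAX_MESSAGE_RETRIES once
// (at startup, or cached in a OnceLock / stored on the message context) and
// passes it down instead of this function re-reading the env var on every call.
fn calculate_msg_backoff(num_retries: u32, max_retries: Option<u32>) -> Option<Duration> {
    if num_retries >= max_retries.unwrap_or(u32::MAX) {
        return None;
    }
    // ... the existing backoff computation would follow here
    Some(Duration::from_secs(60))
}
```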

// Skip this message if it has been retried too many times
if let Some(max_retries) = retries_before_skipping {
if num_retries >= max_retries {
return None;
Collaborator

curious if you think we should log or not (I don't have a strong opinion)

The same goes for when a message already in the prep queue becomes undeliverable because it hits this max retry count.
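
If logging is wanted, something along these lines would fit the relayer's existing `tracing` setup; the field names and helper are illustrative, not from the PR.

```rust
// Assumes the `tracing` crate is available, as it is elsewhere in the relayer.
fn should_skip(num_retries: u32, max_retries: u32) -> bool {
    if num_retries >= max_retries {
        tracing::debug!(num_retries, max_retries, "Skipping message past max retries");
        return true;
    }
    false
}
```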

}
};
retries_before_skipping: Option<u32>,
) -> Option<Self> {
Collaborator

This makes me realize that we wouldn't be able to use the API to trigger a retry for one of these. Is that intentional, or should we prefer to put them in the prep queue but never retry them unless an API call asks for it?

@@ -651,6 +675,10 @@ impl PendingMessage {
/// given the number of retries.
/// `pub(crate)` for testing purposes
pub(crate) fn calculate_msg_backoff(num_retries: u32) -> Option<Duration> {
let max_retries = std::env::var("MAX_MESSAGE_RETRIES")
Collaborator

I probably would've done the same here fwiw but noting that our preference for ad hoc env vars is a result of our settings logic being so hard to work with that it's easier to add little "debt" pieces like this. Makes me a little concerned
