From b6aab046046d6dad14d61b29de7c5e4974928fe2 Mon Sep 17 00:00:00 2001 From: Mark Tyneway Date: Wed, 19 Mar 2025 17:04:28 -0600 Subject: [PATCH 1/6] design doc: interop monitoring Begins to flesh out a plan for how we go about monitoring and alerting for the interop release. --- protocol/interop-monitoring.md | 132 +++++++++++++++++++++++++++++++++ 1 file changed, 132 insertions(+) create mode 100644 protocol/interop-monitoring.md diff --git a/protocol/interop-monitoring.md b/protocol/interop-monitoring.md new file mode 100644 index 00000000..f781cf8a --- /dev/null +++ b/protocol/interop-monitoring.md @@ -0,0 +1,132 @@ +# Interop Monitoring + +| | | +| ------------------ | -------------------------------------------------- | +| Author | _Mark Tyneway_ | +| Created at | _2025-03-19_ | +| Initial Reviewers | _Reviewer Name 1, Reviewer Name 2_ | +| Need Approval From | _Reviewer Name_ | +| Status | _Draft_ | + +## Purpose + + + + + +This document is meant to align on a strategy for monitoring interop. Given assumptions in +the [cloud topology](https://github.com/ethereum-optimism/design-docs/pull/218), it is generally +not possible to guarantee that invalid `ExecutingMessage`s do not finalize without multiple +implementations of `op-supervisor`. With only a single implementation, a bug becomes consensus. +In the worst case, this can mint an infinite amount of ether. Given this risk, we need a monitoring, +alerting and runbook for handling invalid `ExecutingMessage`s being included in the chain. + +## Summary + + + +We implement a monitoring service that validates all of the `ExecutingMessage` logs +produce by the entire cluster and validates them against transaction access lists +and remote nodes. We use this service to alert oncall engineers as well as automatically +pausing the batcher/transaction ingress if an invalid `ExecutingMessage` is included. + +## Problem Statement + Context + + + +We want to be alterted when there is an invalid `ExecutingMessage`. 
We implement preventative +measures, but the downside risk is existential if an invalid `ExecutingMessage` finalizes, +so we need to have ways to detect and prevent that. + +## Proposed Solution + + + +A service is implemented that is able to observe all blocks and logs produced by the cluster. +It is responsible for doing two things: +- Guaranteeing that every executing message has a corresponding access list entry +- Double checking that every executing message is valid + +### ExecutingMessages and Access Lists + +The [access list](https://github.com/ethereum-optimism/design-docs/blob/9e919c5b173fe8fc89949b012f6f70a0bc3247f6/protocol/interop-access-list.md) +design guarantees the fact that all executing messages can be validated without the need to execute the transaction. Any calls to the `CrossL2Inbox` +that do not include the statically declared executing message in the access list will revert rather than needing to be dropped. This prevents +a denial of service attack where the MEV searcher can simply produce an invalid `ExecutingMessage` after their MEV attempt fails. + +Given that the decided upon approach depends strictly on the current EVM resource pricing via storage slot cost introspection, we should have +monitoring to alert us if someone is able to trick the `CrossL2Inbox` into producing an `ExecutingMessage` when the access list entry is +not declared. We think this is impossible, but given this is such a critical security property, it is important to monitor. + +### Double Checking Message Validity + +#### Unsafe Blocks + +We want to utilize the cloud architecture in [this doc](https://github.com/ethereum-optimism/design-docs/pull/218) to ensure that +no invalid `ExecutingMessage`s are ever included in a block. No matter what tradeoffs we make, it is impossible to guarantee there +will not be a contingent reorg because an unsafe head reorg can happen after all cross chain transaction validity checks passed. 
+ +We want to be able to detect when an invalid `ExecutingMessage` is included in an unsafe block and trigger an altert to the +oncall engineering team. Additionally, we should consider pausing the batcher automatically if an invalid `ExecutingMessage` +has been detected in an unsafe block and triggering an unsafe head reorg. It is preferable to not waste blobs and trigger +an unsafe head reorg by batch submitting the invalid block. The problem with this is that it is indistinguishable from +a malicious sequencer triggering a reorg to extract MEV. Either way an unsafe head reorg is going to happen, its just whether +or not its due to the data being posted and then resulting in a replacement deposits only block or if its manually done +by the sequencer offchain. + +We may also want to consider a way to alert partners in the interop set ahead of time that an unsafe head reorg is coming +if an invalid `ExecutingMessage` is observed in an unsafe block. If they turn off their cross chain message ingress fast enough, +it could be possible that they can prevent a contingent reorg. The liveness of the chain can continue with no issues until the +remote chain goes through its unsafe head reorg, then it can open up its cross chain message ingress again. + +#### Safe/Finalized Blocks + +If an invalid `ExecutingMessage` ends up in a safe block, that means that a bug becomes consensus (without multiclient). +This is very bad. Our monitoring should be able to observe this. The worst case thing that can happen in this case is +an attacker uses a fast liquidity bridge like Across to quickly send funds out of the cluster after the finalization of +an invalid `ExecutingMessage` that mints a ton of ether out of thin air. If we detect an invalid `ExecutingMessage` +finalizing to be safe/finalized, we should pause all transaction ingress to the sequencer and effectively stop producing +blocks. 
This reduces the chances that the attacker is able to initiate a fast bridge out of the cluster. This strategy +is aligned with our philosophy of favoring safety over liveness. + +### Resource Usage + + + +A new service needs to be implemented and operated in the cloud. This service +can be stateless, it mostly needs to do consistent network access to full nodes. +The cost of the full nodes is going to be the majority of the cost in operating +this service. + +### Single Point of Failure and Multi Client Considerations + + + +This is meant to detect the single source of failure with the `op-supervisor`. Having client diversity +for monitoring would be a nice to have. + +## Alternatives Considered + + + +No real alternatives considered. + +## Risks & Uncertainties + + + +- If an invalid `ExecutingMessage` finalizes, it would be a very bad look to roll back the chain. It may be the best solution +- Need to observe the latency of validating the `ExecutingMessage`s to ensure that this is all feasible \ No newline at end of file From 69fddd5b70be30d3b5f7950c26937006667b41f1 Mon Sep 17 00:00:00 2001 From: axelKingsley Date: Wed, 7 May 2025 16:16:39 -0500 Subject: [PATCH 2/6] Editorial Pass ; Specify Monitor Design --- protocol/interop-monitoring.md | 138 +++++++++++++++++---------------- 1 file changed, 70 insertions(+), 68 deletions(-) diff --git a/protocol/interop-monitoring.md b/protocol/interop-monitoring.md index f781cf8a..f94f5e3e 100644 --- a/protocol/interop-monitoring.md +++ b/protocol/interop-monitoring.md @@ -1,63 +1,75 @@ -# Interop Monitoring +# Interop Monitoring Service | | | | ------------------ | -------------------------------------------------- | -| Author | _Mark Tyneway_ | +| Author | _Mark Tyneway, Axel Kingsley_ | | Created at | _2025-03-19_ | -| Initial Reviewers | _Reviewer Name 1, Reviewer Name 2_ | -| Need Approval From | _Reviewer Name_ | -| Status | _Draft_ | ## Purpose - +This document is meant to align on a strategy for monitoring interop 
and propose a +Monitoring Service for Executing Messages. - +## Summary + Problem Statement + Context -This document is meant to align on a strategy for monitoring interop. Given assumptions in -the [cloud topology](https://github.com/ethereum-optimism/design-docs/pull/218), it is generally -not possible to guarantee that invalid `ExecutingMessage`s do not finalize without multiple +Given assumptions in the [cloud topology](https://github.com/ethereum-optimism/design-docs/pull/218), +it is generally not possible to guarantee that invalid `ExecutingMessage`s do not finalize without multiple implementations of `op-supervisor`. With only a single implementation, a bug becomes consensus. In the worst case, this can mint an infinite amount of ether. Given this risk, we need a monitoring, alerting and runbook for handling invalid `ExecutingMessage`s being included in the chain. -## Summary +We want to be alterted when there is an invalid `ExecutingMessage`. We are implementing preventative +measures, but the downside risk is existential if an invalid `ExecutingMessage` finalizes, +so we need to have ways to detect and prevent that. - +## Proposed Solution -We implement a monitoring service that validates all of the `ExecutingMessage` logs +We should implement a monitoring service that validates all of the `ExecutingMessage` logs produce by the entire cluster and validates them against transaction access lists -and remote nodes. We use this service to alert oncall engineers as well as automatically +and remote nodes. We use this service to alert oncall engineers as well as potentially automatically pausing the batcher/transaction ingress if an invalid `ExecutingMessage` is included. -## Problem Statement + Context +This "Cross Message Monitor" should have the following features: - +### Monitoring Strategies like `dispute-mon` -We want to be alterted when there is an invalid `ExecutingMessage`. 
We implement preventative -measures, but the downside risk is existential if an invalid `ExecutingMessage` finalizes, -so we need to have ways to detect and prevent that. +Dispute Monitor is a service already implemented and deployed for tracking Fault Proof Disputes. +Rather than just be a simple alert when games are invalid, it serves up various statistics that an operator +can refer to in order to determine network health: +- How many games are being monitored, how many of each status +- How many Incorrect Forecasts or Incorrect Results +- Warning and Error Logs from the Monitor -## Proposed Solution +Cross Message Monitor can crib directly from these statistics, but focused on Interop: +- How many Executing Messages emitted by the CrossL2Inbox per block per chain +- How many `Executing Message`s Messages point at each Chain in the Superchain +- How many `Executing Message`s are known valid, per safety level +- How many `Executing Message`s are known invalid, per safety level +- How many `Executing Message`s are not yet known valid/invalid, per safety level +- How many `Executing Message`s *changed validity* over time (indicating remote reorg) +- How many `Executing Message`s were resolved via Block Replacement + +Almost all `Executing Message` Metrics emitted by the Cross Message Monitor should have dimensions: +- What chain the `Executing Message` in question is on +- What chain the `Executing Message` is referring to (the chain of the initiating message) +- Timestamp of Block + +### Long Term Monitoring of `Executing Message`s - +Executing Messages can change validity over the course of the Unsafe Chain, +data is not allways sufficiently available to validate `Executing Message`s, and transitive `Executing Message`s can +cause cascades of Valid/Invalid messages. -A service is implemented that is able to observe all blocks and logs produced by the cluster. 
-It is responsible for doing two things: -- Guaranteeing that every executing message has a corresponding access list entry -- Double checking that every executing message is valid +Therefore, it is insufficent to check a message just once. Instead, every Executing Message +detected by the Cross Message Monitor will be considered an ongoing process, like games are +for the Dispute Monitor. From the time the `Executing Message` is discovered, until the `Executing Message` is included by a +Cross Safe block height which is now L1 finalized, the `Executing Message` should be repeatedly re-checked. -### ExecutingMessages and Access Lists +This means that when the status of the `Executing Message` flips, special alerts can be emitted to indicate +a remote reorg has likely occured. Or, when a single invalid message creates a cascade of +invalidation, each `Executing Message` can resolve individually. + +### Access List Confirmation The [access list](https://github.com/ethereum-optimism/design-docs/blob/9e919c5b173fe8fc89949b012f6f70a0bc3247f6/protocol/interop-access-list.md) design guarantees the fact that all executing messages can be validated without the need to execute the transaction. Any calls to the `CrossL2Inbox` @@ -68,13 +80,14 @@ Given that the decided upon approach depends strictly on the current EVM resourc monitoring to alert us if someone is able to trick the `CrossL2Inbox` into producing an `ExecutingMessage` when the access list entry is not declared. We think this is impossible, but given this is such a critical security property, it is important to monitor. -### Double Checking Message Validity +Each message can be checked for this once, when it is detected and added to the monitoring set. -#### Unsafe Blocks +### Alert Behaviors -We want to utilize the cloud architecture in [this doc](https://github.com/ethereum-optimism/design-docs/pull/218) to ensure that -no invalid `ExecutingMessage`s are ever included in a block. 
No matter what tradeoffs we make, it is impossible to guarantee there -will not be a contingent reorg because an unsafe head reorg can happen after all cross chain transaction validity checks passed. +Though it will need evaluation over time, we already know the sorts of operator responses we want when certain situations are detected +by the monitor. + +#### Unsafe Blocks We want to be able to detect when an invalid `ExecutingMessage` is included in an unsafe block and trigger an altert to the oncall engineering team. Additionally, we should consider pausing the batcher automatically if an invalid `ExecutingMessage` @@ -91,42 +104,31 @@ remote chain goes through its unsafe head reorg, then it can open up its cross c #### Safe/Finalized Blocks -If an invalid `ExecutingMessage` ends up in a safe block, that means that a bug becomes consensus (without multiclient). -This is very bad. Our monitoring should be able to observe this. The worst case thing that can happen in this case is -an attacker uses a fast liquidity bridge like Across to quickly send funds out of the cluster after the finalization of -an invalid `ExecutingMessage` that mints a ton of ether out of thin air. If we detect an invalid `ExecutingMessage` -finalizing to be safe/finalized, we should pause all transaction ingress to the sequencer and effectively stop producing -blocks. This reduces the chances that the attacker is able to initiate a fast bridge out of the cluster. This strategy -is aligned with our philosophy of favoring safety over liveness. +If an invalid `ExecutingMessage` ends up in a safe block, it is an expectation of the Protocol that the block is Invalid, +and must be replaced with a Deposit Only Block. This situation should page the operator to monitor the situation, and every +individual invalid `Executing Message` in a Safe Block should be very easy to see and monitor individually. 
The operator is monitoring +to ensure a Block Replacement occurs and the invalid messges are no longer known to the chain. -### Resource Usage - - +If Cross-Validation should promote the block to Cross-Safe, this is an all-hands-on-deck consensus bug, which would naturally +have its own alerts associated in addition to the prior expectation of an operator monitoring the situation. -A new service needs to be implemented and operated in the cloud. This service -can be stateless, it mostly needs to do consistent network access to full nodes. -The cost of the full nodes is going to be the majority of the cost in operating -this service. +### Resource Usage -### Single Point of Failure and Multi Client Considerations +This new service will need minimal CPU/Disk and can be stateless. It will need a connection to one of each Node +for the Superchain it is monitoring. - +## Summary of Solution -This is meant to detect the single source of failure with the `op-supervisor`. Having client diversity -for monitoring would be a nice to have. +Create `xmsg-mon` in the image of `dispute-mon` to track all in-flight Executing Messages for a Superchain, for their entire +Unsafe -> Safe -> Finalized lifecycle. Create Alerting against it which pages operators when Invalid Messages advance into blocks. ## Alternatives Considered - +No real alternatives considered. Monitoring should happen as a matter of course when deploying new services. -No real alternatives considered. +Having additional Cross-Validation software besides Supervisor would lessen the criticality of this software. ## Risks & Uncertainties - - -- If an invalid `ExecutingMessage` finalizes, it would be a very bad look to roll back the chain. It may be the best solution -- Need to observe the latency of validating the `ExecutingMessage`s to ensure that this is all feasible \ No newline at end of file +- The Monitoring Service may be insufficent, and we may not catch what we need to. 
Real experience will inform updates to this service. +- The speed of the Monitor may be insufficent for operators to take meaningful action \ No newline at end of file From 344040c4edc6a36b953077320ba6b9efcb03ea4a Mon Sep 17 00:00:00 2001 From: axelKingsley Date: Thu, 8 May 2025 13:55:19 -0500 Subject: [PATCH 3/6] comments --- protocol/interop-monitoring.md | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/protocol/interop-monitoring.md b/protocol/interop-monitoring.md index f94f5e3e..a5b048e4 100644 --- a/protocol/interop-monitoring.md +++ b/protocol/interop-monitoring.md @@ -90,18 +90,21 @@ by the monitor. #### Unsafe Blocks We want to be able to detect when an invalid `ExecutingMessage` is included in an unsafe block and trigger an altert to the -oncall engineering team. Additionally, we should consider pausing the batcher automatically if an invalid `ExecutingMessage` -has been detected in an unsafe block and triggering an unsafe head reorg. It is preferable to not waste blobs and trigger -an unsafe head reorg by batch submitting the invalid block. The problem with this is that it is indistinguishable from -a malicious sequencer triggering a reorg to extract MEV. Either way an unsafe head reorg is going to happen, its just whether -or not its due to the data being posted and then resulting in a replacement deposits only block or if its manually done -by the sequencer offchain. +oncall engineering team. It is preferable to not waste blobs and trigger an unsafe head reorg by batch submitting the invalid block as soon as possible, +therefore the operator may want to accelerate batch submission when this alert arrives. Unless nodes on the network are +able to accept an Unsafe->Unsafe block replacement (and they are not), the Sequencer's only path forward is to see the +invalid block commited to L1, at which point it will be replaced. Doing this faster will minimize reorg sizes. 
We may also want to consider a way to alert partners in the interop set ahead of time that an unsafe head reorg is coming if an invalid `ExecutingMessage` is observed in an unsafe block. If they turn off their cross chain message ingress fast enough, it could be possible that they can prevent a contingent reorg. The liveness of the chain can continue with no issues until the remote chain goes through its unsafe head reorg, then it can open up its cross chain message ingress again. +Finally, when Invalid Messages occur, it is prudent to shut off additional Executing Messages. Admin APIs should be established which: +- Shut off Executing Message Ingress at `proxyd` +- Force remove Executing Messages from block builder mempools. +These triggers should occur automatically when an invalid `ExecutingMessage` is discovered at the Unsafe Block stage, in order to reduce cascades. + #### Safe/Finalized Blocks If an invalid `ExecutingMessage` ends up in a safe block, it is an expectation of the Protocol that the block is Invalid, @@ -122,6 +125,9 @@ for the Superchain it is monitoring. Create `xmsg-mon` in the image of `dispute-mon` to track all in-flight Executing Messages for a Superchain, for their entire Unsafe -> Safe -> Finalized lifecycle. Create Alerting against it which pages operators when Invalid Messages advance into blocks. +Furthermore, Admin APIs should be established to shut off `proxyd` and `mempool` acceptance of Executing Messages, to swiftly respond +when the Monitoring Service detects invalid messages in blocks. + ## Alternatives Considered No real alternatives considered. Monitoring should happen as a matter of course when deploying new services. 
From 50107785b538d38e84073ead5bc60514ed646448 Mon Sep 17 00:00:00 2001 From: Axel Kingsley Date: Wed, 4 Jun 2025 09:39:02 -0500 Subject: [PATCH 4/6] in-review typo fixes Co-authored-by: George Knee --- protocol/interop-monitoring.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/protocol/interop-monitoring.md b/protocol/interop-monitoring.md index a5b048e4..812cd06a 100644 --- a/protocol/interop-monitoring.md +++ b/protocol/interop-monitoring.md @@ -15,8 +15,8 @@ Monitoring Service for Executing Messages. Given assumptions in the [cloud topology](https://github.com/ethereum-optimism/design-docs/pull/218), it is generally not possible to guarantee that invalid `ExecutingMessage`s do not finalize without multiple implementations of `op-supervisor`. With only a single implementation, a bug becomes consensus. -In the worst case, this can mint an infinite amount of ether. Given this risk, we need a monitoring, -alerting and runbook for handling invalid `ExecutingMessage`s being included in the chain. +In the worst case, this can mint an infinite amount of ether. Given this risk, we need to have monitoring, +alerting, and a runbook for handling invalid `ExecutingMessage`s being included in the chain. We want to be alterted when there is an invalid `ExecutingMessage`. We are implementing preventative measures, but the downside risk is existential if an invalid `ExecutingMessage` finalizes, @@ -25,7 +25,7 @@ so we need to have ways to detect and prevent that. ## Proposed Solution We should implement a monitoring service that validates all of the `ExecutingMessage` logs -produce by the entire cluster and validates them against transaction access lists +produced by the entire cluster and validates them against transaction access lists and remote nodes. We use this service to alert oncall engineers as well as potentially automatically pausing the batcher/transaction ingress if an invalid `ExecutingMessage` is included. 
@@ -41,15 +41,15 @@ can refer to in order to determine network health: - Warning and Error Logs from the Monitor Cross Message Monitor can crib directly from these statistics, but focused on Interop: -- How many Executing Messages emitted by the CrossL2Inbox per block per chain -- How many `Executing Message`s Messages point at each Chain in the Superchain +- How many `Executing Message`s are emitted by the `CrossL2Inbox` per block per chain +- How many `Executing Message`s Messages point at each Chain in the dependency set - How many `Executing Message`s are known valid, per safety level - How many `Executing Message`s are known invalid, per safety level - How many `Executing Message`s are not yet known valid/invalid, per safety level - How many `Executing Message`s *changed validity* over time (indicating remote reorg) - How many `Executing Message`s were resolved via Block Replacement -Almost all `Executing Message` Metrics emitted by the Cross Message Monitor should have dimensions: +Almost all `Executing Message` metrics emitted by the Cross Message Monitor should have dimensions: - What chain the `Executing Message` in question is on - What chain the `Executing Message` is referring to (the chain of the initiating message) - Timestamp of Block @@ -110,7 +110,7 @@ These triggers should occur automatically when an invalid `ExecutingMessage` is If an invalid `ExecutingMessage` ends up in a safe block, it is an expectation of the Protocol that the block is Invalid, and must be replaced with a Deposit Only Block. This situation should page the operator to monitor the situation, and every individual invalid `Executing Message` in a Safe Block should be very easy to see and monitor individually. The operator is monitoring -to ensure a Block Replacement occurs and the invalid messges are no longer known to the chain. +to ensure a Block Replacement occurs and the invalid messages are no longer part of the canonical chain. 
If Cross-Validation should promote the block to Cross-Safe, this is an all-hands-on-deck consensus bug, which would naturally have its own alerts associated in addition to the prior expectation of an operator monitoring the situation. @@ -137,4 +137,5 @@ Having additional Cross-Validation software besides Supervisor would lessen the ## Risks & Uncertainties - The Monitoring Service may be insufficent, and we may not catch what we need to. Real experience will inform updates to this service. -- The speed of the Monitor may be insufficent for operators to take meaningful action \ No newline at end of file +- The Monitoring Service may cause a lot of RPC traffic and generate a lot of data, putting strain on the infrastructure. +- The speed of the Monitoring Service may be insufficent for operators to take meaningful action \ No newline at end of file From 111d35fba83e6851cc78de5f8f12830d7d503a58 Mon Sep 17 00:00:00 2001 From: axelKingsley Date: Tue, 10 Jun 2025 10:02:58 -0500 Subject: [PATCH 5/6] updates --- protocol/interop-monitoring.md | 62 +++++++++++++++++++++------------- 1 file changed, 38 insertions(+), 24 deletions(-) diff --git a/protocol/interop-monitoring.md b/protocol/interop-monitoring.md index 812cd06a..72bc1db0 100644 --- a/protocol/interop-monitoring.md +++ b/protocol/interop-monitoring.md @@ -13,23 +13,23 @@ Monitoring Service for Executing Messages. ## Summary + Problem Statement + Context Given assumptions in the [cloud topology](https://github.com/ethereum-optimism/design-docs/pull/218), -it is generally not possible to guarantee that invalid `ExecutingMessage`s do not finalize without multiple +it is generally not possible to guarantee that invalid `Executing Message`s do not finalize without multiple implementations of `op-supervisor`. With only a single implementation, a bug becomes consensus. In the worst case, this can mint an infinite amount of ether. 
Given this risk, we need to have monitoring, -alerting, and a runbook for handling invalid `ExecutingMessage`s being included in the chain. +alerting, and a runbook for handling invalid `Executing Message`s being included in the chain. -We want to be alterted when there is an invalid `ExecutingMessage`. We are implementing preventative -measures, but the downside risk is existential if an invalid `ExecutingMessage` finalizes, +We want to be alterted when there is an invalid `Executing Message`. We are implementing preventative +measures, but the downside risk is existential if an invalid `Executing Message` finalizes, so we need to have ways to detect and prevent that. ## Proposed Solution -We should implement a monitoring service that validates all of the `ExecutingMessage` logs -produced by the entire cluster and validates them against transaction access lists +We should implement a monitoring service that validates all of the `Executing Message` logs +produced by the entire Superchain and validates them against transaction access lists and remote nodes. We use this service to alert oncall engineers as well as potentially automatically -pausing the batcher/transaction ingress if an invalid `ExecutingMessage` is included. +pausing the batcher/transaction ingress if an invalid `Executing Message` is included. 
-This "Cross Message Monitor" should have the following features: +This "Executing Message Monitor" should have the following features: ### Monitoring Strategies like `dispute-mon` @@ -40,16 +40,15 @@ can refer to in order to determine network health: - How many Incorrect Forecasts or Incorrect Results - Warning and Error Logs from the Monitor -Cross Message Monitor can crib directly from these statistics, but focused on Interop: +Executing Message Monitor can crib directly from these statistics, but focused on Interop: - How many `Executing Message`s are emitted by the `CrossL2Inbox` per block per chain - How many `Executing Message`s Messages point at each Chain in the dependency set - How many `Executing Message`s are known valid, per safety level - How many `Executing Message`s are known invalid, per safety level - How many `Executing Message`s are not yet known valid/invalid, per safety level - How many `Executing Message`s *changed validity* over time (indicating remote reorg) -- How many `Executing Message`s were resolved via Block Replacement -Almost all `Executing Message` metrics emitted by the Cross Message Monitor should have dimensions: +Almost all `Executing Message` metrics emitted by the Executing Message Monitor should have dimensions: - What chain the `Executing Message` in question is on - What chain the `Executing Message` is referring to (the chain of the initiating message) - Timestamp of Block @@ -61,9 +60,9 @@ data is not allways sufficiently available to validate `Executing Message`s, and cause cascades of Valid/Invalid messages. Therefore, it is insufficent to check a message just once. Instead, every Executing Message -detected by the Cross Message Monitor will be considered an ongoing process, like games are +detected by the Executing Message Monitor will be considered an ongoing process, like games are for the Dispute Monitor. 
From the time the `Executing Message` is discovered, until the `Executing Message` is included by a -Cross Safe block height which is now L1 finalized, the `Executing Message` should be repeatedly re-checked. +Cross-Safe block height which is now L1 finalized, the `Executing Message` should be repeatedly re-checked. This means that when the status of the `Executing Message` flips, special alerts can be emitted to indicate a remote reorg has likely occured. Or, when a single invalid message creates a cascade of @@ -74,11 +73,11 @@ invalidation, each `Executing Message` can resolve individually. The [access list](https://github.com/ethereum-optimism/design-docs/blob/9e919c5b173fe8fc89949b012f6f70a0bc3247f6/protocol/interop-access-list.md) design guarantees the fact that all executing messages can be validated without the need to execute the transaction. Any calls to the `CrossL2Inbox` that do not include the statically declared executing message in the access list will revert rather than needing to be dropped. This prevents -a denial of service attack where the MEV searcher can simply produce an invalid `ExecutingMessage` after their MEV attempt fails. +failing Interop Transactions from putting unpaid load onto the block builder. Given that the decided upon approach depends strictly on the current EVM resource pricing via storage slot cost introspection, we should have -monitoring to alert us if someone is able to trick the `CrossL2Inbox` into producing an `ExecutingMessage` when the access list entry is -not declared. We think this is impossible, but given this is such a critical security property, it is important to monitor. +monitoring to alert us if someone is able to trick the `CrossL2Inbox` into producing an `Executing Message` when the access list entry is +not declared or differs from the Executing Message. We think this is impossible, but given this is such a critical security property, it is important to monitor. 
Each message can be checked for this once, when it is detected and added to the monitoring set. @@ -87,27 +86,25 @@ Each message can be checked for this once, when it is detected and added to the Though it will need evaluation over time, we already know the sorts of operator responses we want when certain situations are detected by the monitor. -#### Unsafe Blocks +[**Note: this section is better detailed through the Interop: AutoStop design**](https://github.com/ethereum-optimism/design-docs/pull/287) -We want to be able to detect when an invalid `ExecutingMessage` is included in an unsafe block and trigger an altert to the +We want to be able to detect when an invalid `Executing Message` is included in an unsafe block and trigger an alert to the oncall engineering team. It is preferable to not waste blobs and trigger an unsafe head reorg by batch submitting the invalid block as soon as possible, therefore the operator may want to accelerate batch submission when this alert arrives. Unless nodes on the network are able to accept an Unsafe->Unsafe block replacement (and they are not), the Sequencer's only path forward is to see the invalid block commited to L1, at which point it will be replaced. Doing this faster will minimize reorg sizes. We may also want to consider a way to alert partners in the interop set ahead of time that an unsafe head reorg is coming -if an invalid `ExecutingMessage` is observed in an unsafe block. If they turn off their cross chain message ingress fast enough, +if an invalid `Executing Message` is observed in an unsafe block. If they turn off their cross chain message ingress fast enough, it could be possible that they can prevent a contingent reorg. The liveness of the chain can continue with no issues until the remote chain goes through its unsafe head reorg, then it can open up its cross chain message ingress again. Finally, when Invalid Messages occur, it is prudent to shut off additional Executing Messages.
Admin APIs should be established which: - Shut off Executing Message Ingress at `proxyd` - Force remove Executing Messages from block builder mempools. -These triggers should occur automatically when an invalid `ExecutingMessage` is discovered at the Unsafe Block stage, in order to reduce cascades. +These triggers should occur automatically when an invalid `Executing Message` is discovered at the Unsafe Block stage, in order to reduce cascades. -#### Safe/Finalized Blocks - -If an invalid `ExecutingMessage` ends up in a safe block, it is an expectation of the Protocol that the block is Invalid, +If an invalid `Executing Message` ends up in a safe block, it is an expectation of the Protocol that the block is Invalid, and must be replaced with a Deposit Only Block. This situation should page the operator to monitor the situation, and every individual invalid `Executing Message` in a Safe Block should be very easy to see and monitor individually. The operator is monitoring to ensure a Block Replacement occurs and the invalid messages are no longer part of the canonical chain. @@ -120,13 +117,30 @@ have its own alerts associated in addition to the prior expectation of an operat This new service will need minimal CPU/Disk and can be stateless. It will need a connection to one of each Node for the Superchain it is monitoring. +The service may use significant memory to store the ongoing statuses of potentially many Executing Messages across the chains +through their life-cycle. + +### Availability and Reliability + +This service must be able to detect *all* interop messages during their lifecycle. To that end, the service must be able to +backfill blocks on startup, so that temporary outages do not create blind spots in monitoring. + +Only one monitor needs to be running if the backfill system works appropriately. Otherwise, a secondary backup monitor +may be advisable to keep gaps from forming. 
+ +## Monitoring Expiry + +This service will need a way to prune old Executing Messages from being monitored once the lifecycle is over. To do that, +the monitoring service should pay attention to the *Finalized L2 Heads* of each chain, and stop monitoring Executing Messages +which were created prior to that finalized head. + ## Summary of Solution Create `xmsg-mon` in the image of `dispute-mon` to track all in-flight Executing Messages for a Superchain, for their entire Unsafe -> Safe -> Finalized lifecycle. Create Alerting against it which pages operators when Invalid Messages advance into blocks. Furthermore, Admin APIs should be established to shut off `proxyd` and `mempool` acceptance of Executing Messages, to swiftly respond -when the Monitoring Service detects invalid messages in blocks. +when the Monitoring Service detects invalid messages in blocks. (See: Interop AutoStop) ## Alternatives Considered From e79b94705152b2d436fe34638fa5afd76db8d083 Mon Sep 17 00:00:00 2001 From: axelKingsley Date: Fri, 13 Jun 2025 13:25:32 -0500 Subject: [PATCH 6/6] review feedback --- protocol/interop-monitoring.md | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/protocol/interop-monitoring.md b/protocol/interop-monitoring.md index 72bc1db0..0c4755ea 100644 --- a/protocol/interop-monitoring.md +++ b/protocol/interop-monitoring.md @@ -43,16 +43,28 @@ can refer to in order to determine network health: Executing Message Monitor can crib directly from these statistics, but focused on Interop: - How many `Executing Message`s are emitted by the `CrossL2Inbox` per block per chain - How many `Executing Message`s Messages point at each Chain in the dependency set -- How many `Executing Message`s are known valid, per safety level -- How many `Executing Message`s are known invalid, per safety level -- How many `Executing Message`s are not yet known valid/invalid, per safety level +- How many `Executing Message`s are known valid +- How many 
`Executing Message`s are known invalid +- How many `Executing Message`s are not yet known valid/invalid - How many `Executing Message`s *changed validity* over time (indicating remote reorg) +By tracking these metrics individually, we can see at a glance the state of Cross-Validation, and identify underlying issues quickly. +For example, if the Executing Messages on a given chain start showing up invalid, it may indicate a failure of Tx filtering. +Or, if the *Initiating Messages* for a chain show a pattern of invalidity, it may indicate that the Initiating chain is equivocating or reorging. + +In particular, a change between Valid and Invalid status is especially noteworthy, as it demonstrates a high likelihood of a reorg. + +Because these metrics are dimensioned across both the Executing and Initiating side, we can tell whether the issue lies with the producer +or the consumer. + Almost all `Executing Message` metrics emitted by the Executing Message Monitor should have dimensions: - What chain the `Executing Message` in question is on - What chain the `Executing Message` is referring to (the chain of the initiating message) - Timestamp of Block +Additionally, we should alert when either the Monitor itself or an underlying Node is down, to let operators know +when we are flying blind. + ### Long Term Monitoring of `Executing Message`s Executing Messages can change validity over the course of the Unsafe Chain, @@ -112,6 +124,11 @@ to ensure a Block Replacement occurs and the invalid messages are no longer part If Cross-Validation should promote the block to Cross-Safe, this is an all-hands-on-deck consensus bug, which would naturally have its own alerts associated in addition to the prior expectation of an operator monitoring the situation. +#### Clear Logs +When an issue arises that would generate an alert, the Monitor should also print clearly actionable logs which can be checked.
+This would take the form of individual Invalid messages, or individual Invalid->Valid state transitions. Then operators can proceed to triage +with high-precision data. + ### Resource Usage This new service will need minimal CPU/Disk and can be stateless. It will need a connection to one of each Node