From b6aab046046d6dad14d61b29de7c5e4974928fe2 Mon Sep 17 00:00:00 2001
From: Mark Tyneway
Date: Wed, 19 Mar 2025 17:04:28 -0600
Subject: [PATCH 1/3] design doc: interop monitoring

Begins to flesh out a plan for how we go about monitoring and alerting
for the interop release.
---
 protocol/interop-monitoring.md | 132 +++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 protocol/interop-monitoring.md

diff --git a/protocol/interop-monitoring.md b/protocol/interop-monitoring.md
new file mode 100644
index 00000000..f781cf8a
--- /dev/null
+++ b/protocol/interop-monitoring.md
@@ -0,0 +1,132 @@
+# Interop Monitoring

| | |
| ------------------ | -------------------------------------------------- |
| Author | _Mark Tyneway_ |
| Created at | _2025-03-19_ |
| Initial Reviewers | _Reviewer Name 1, Reviewer Name 2_ |
| Need Approval From | _Reviewer Name_ |
| Status | _Draft_ |

## Purpose

This document is meant to align on a strategy for monitoring interop. Given assumptions in
the [cloud topology](https://github.com/ethereum-optimism/design-docs/pull/218), it is generally
not possible to guarantee that invalid `ExecutingMessage`s do not finalize without multiple
implementations of `op-supervisor`. With only a single implementation, a bug becomes consensus.
In the worst case, this can mint an infinite amount of ether. Given this risk, we need monitoring,
alerting, and a runbook for handling invalid `ExecutingMessage`s being included in the chain.

## Summary

We implement a monitoring service that validates all of the `ExecutingMessage` logs
produced by the entire cluster and validates them against transaction access lists
and remote nodes. We use this service to alert oncall engineers as well as automatically
pausing the batcher/transaction ingress if an invalid `ExecutingMessage` is included.

## Problem Statement + Context

We want to be alerted when there is an invalid `ExecutingMessage`. We implement preventative
measures, but the downside risk is existential if an invalid `ExecutingMessage` finalizes,
so we need to have ways to detect and prevent that.

## Proposed Solution

A service is implemented that is able to observe all blocks and logs produced by the cluster.
It is responsible for doing two things:
- Guaranteeing that every executing message has a corresponding access list entry
- Double checking that every executing message is valid

### ExecutingMessages and Access Lists

The [access list](https://github.com/ethereum-optimism/design-docs/blob/9e919c5b173fe8fc89949b012f6f70a0bc3247f6/protocol/interop-access-list.md)
design guarantees that all executing messages can be validated without the need to execute the transaction. Any calls to the `CrossL2Inbox`
that do not include the statically declared executing message in the access list will revert rather than needing to be dropped. This prevents
a denial-of-service attack where an MEV searcher can simply produce an invalid `ExecutingMessage` after their MEV attempt fails.

Given that the decided upon approach depends strictly on the current EVM resource pricing via storage slot cost introspection, we should have
monitoring to alert us if someone is able to trick the `CrossL2Inbox` into producing an `ExecutingMessage` when the access list entry is
not declared. We think this is impossible, but given this is such a critical security property, it is important to monitor.
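As a concrete illustration of this check, below is a minimal sketch in Go, written against go-ethereum types. The `crossL2InboxAddr` parameter and the `expectedAccessListKeys` helper are placeholders for this document; the real key derivation is the checksum scheme defined in the interop access-list design, and the actual service would plug that in.

```go
// Sketch: cross-check every CrossL2Inbox ExecutingMessage log in a block
// against the access list of the transaction that emitted it.
package monitor

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
)

// expectedAccessListKeys is a placeholder for the protocol-defined mapping
// from an ExecutingMessage log to the storage keys the transaction must declare.
func expectedAccessListKeys(l *types.Log) []common.Hash {
	// ... derive the checksum/lookup keys from l.Topics and l.Data ...
	return nil
}

// checkAccessListCoverage returns CrossL2Inbox logs whose required access-list
// entries are missing from their transaction. Any result here means the
// "revert instead of drop" property did not hold.
func checkAccessListCoverage(block *types.Block, receipts []*types.Receipt, crossL2InboxAddr common.Address) []*types.Log {
	// Index transactions by hash so each log can be matched to its access list.
	txs := make(map[common.Hash]*types.Transaction)
	for _, tx := range block.Transactions() {
		txs[tx.Hash()] = tx
	}

	var violations []*types.Log
	for _, receipt := range receipts {
		for _, l := range receipt.Logs {
			if l.Address != crossL2InboxAddr {
				continue
			}
			tx, ok := txs[l.TxHash]
			if !ok {
				violations = append(violations, l)
				continue
			}
			// Collect every storage key declared by the transaction.
			declared := make(map[common.Hash]struct{})
			for _, tuple := range tx.AccessList() {
				for _, key := range tuple.StorageKeys {
					declared[key] = struct{}{}
				}
			}
			// Every key the protocol requires for this log must be declared.
			for _, want := range expectedAccessListKeys(l) {
				if _, ok := declared[want]; !ok {
					violations = append(violations, l)
					break
				}
			}
		}
	}
	return violations
}
```

Anything returned by `checkAccessListCoverage` indicates that the revert-instead-of-drop guarantee failed and should page the oncall immediately.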
### Double Checking Message Validity

#### Unsafe Blocks

We want to utilize the cloud architecture in [this doc](https://github.com/ethereum-optimism/design-docs/pull/218) to ensure that
no invalid `ExecutingMessage`s are ever included in a block. No matter what tradeoffs we make, it is impossible to guarantee there
will not be a contingent reorg because an unsafe head reorg can happen after all cross-chain transaction validity checks have passed.

We want to be able to detect when an invalid `ExecutingMessage` is included in an unsafe block and trigger an alert to the
oncall engineering team. Additionally, we should consider pausing the batcher automatically if an invalid `ExecutingMessage`
has been detected in an unsafe block and triggering an unsafe head reorg. It is preferable to not waste blobs and trigger
an unsafe head reorg by batch submitting the invalid block. The problem with this is that it is indistinguishable from
a malicious sequencer triggering a reorg to extract MEV. Either way an unsafe head reorg is going to happen; it's just whether
or not it's due to the data being posted and then resulting in a replacement deposits-only block or if it's manually done
by the sequencer offchain.

We may also want to consider a way to alert partners in the interop set ahead of time that an unsafe head reorg is coming
if an invalid `ExecutingMessage` is observed in an unsafe block. If they turn off their cross-chain message ingress fast enough,
it may be possible for them to prevent a contingent reorg. The liveness of the chain can continue with no issues until the
remote chain goes through its unsafe head reorg, and then it can open up its cross-chain message ingress again.

#### Safe/Finalized Blocks

If an invalid `ExecutingMessage` ends up in a safe block, that means that a bug becomes consensus (without multiclient).
This is very bad. Our monitoring should be able to observe this. The worst case is that
an attacker uses a fast liquidity bridge like Across to quickly send funds out of the cluster after the finalization of
an invalid `ExecutingMessage` that mints a ton of ether out of thin air. If we detect an invalid `ExecutingMessage`
becoming safe/finalized, we should pause all transaction ingress to the sequencer and effectively stop producing
blocks. This reduces the chances that the attacker is able to initiate a fast bridge out of the cluster. This strategy
is aligned with our philosophy of favoring safety over liveness.

### Resource Usage

A new service needs to be implemented and operated in the cloud. This service
can be stateless; it mostly needs consistent network access to full nodes.
The cost of the full nodes is going to be the majority of the cost of operating
this service.

### Single Point of Failure and Multi Client Considerations

This is meant to detect the single point of failure that is `op-supervisor`. Having client diversity
for monitoring would be a nice-to-have.

## Alternatives Considered

No real alternatives considered.

## Risks & Uncertainties

- If an invalid `ExecutingMessage` finalizes, it would be a very bad look to roll back the chain.
It may be the best solution available.
- Need to observe the latency of validating the `ExecutingMessage`s to ensure that this is all feasible.
\ No newline at end of file

From 69fddd5b70be30d3b5f7950c26937006667b41f1 Mon Sep 17 00:00:00 2001
From: axelKingsley
Date: Wed, 7 May 2025 16:16:39 -0500
Subject: [PATCH 2/3] Editorial Pass ; Specify Monitor Design

---
 protocol/interop-monitoring.md | 138 +++++++++++++++++----------------
 1 file changed, 70 insertions(+), 68 deletions(-)

diff --git a/protocol/interop-monitoring.md b/protocol/interop-monitoring.md
index f781cf8a..f94f5e3e 100644
--- a/protocol/interop-monitoring.md
+++ b/protocol/interop-monitoring.md
@@ -1,63 +1,75 @@
-# Interop Monitoring
+# Interop Monitoring Service

| | |
| ------------------ | -------------------------------------------------- |
-| Author | _Mark Tyneway_ |
+| Author | _Mark Tyneway, Axel Kingsley_ |
| Created at | _2025-03-19_ |
-| Initial Reviewers | _Reviewer Name 1, Reviewer Name 2_ |
-| Need Approval From | _Reviewer Name_ |
-| Status | _Draft_ |

## Purpose

+This document is meant to align on a strategy for monitoring interop and propose a
+Monitoring Service for Executing Messages.

+## Summary + Problem Statement + Context

-This document is meant to align on a strategy for monitoring interop. Given assumptions in
-the [cloud topology](https://github.com/ethereum-optimism/design-docs/pull/218), it is generally
-not possible to guarantee that invalid `ExecutingMessage`s do not finalize without multiple
+Given assumptions in the [cloud topology](https://github.com/ethereum-optimism/design-docs/pull/218),
+it is generally not possible to guarantee that invalid `ExecutingMessage`s do not finalize without multiple
implementations of `op-supervisor`. With only a single implementation, a bug becomes consensus.
In the worst case, this can mint an infinite amount of ether. Given this risk, we need monitoring,
alerting, and a runbook for handling invalid `ExecutingMessage`s being included in the chain.

+We want to be alerted when there is an invalid `ExecutingMessage`. We are implementing preventative
+measures, but the downside risk is existential if an invalid `ExecutingMessage` finalizes,
so we need to have ways to detect and prevent that.

-## Summary
+## Proposed Solution

-We implement a monitoring service that validates all of the `ExecutingMessage` logs
+We should implement a monitoring service that validates all of the `ExecutingMessage` logs
produced by the entire cluster and validates them against transaction access lists
-and remote nodes. We use this service to alert oncall engineers as well as automatically
+and remote nodes. We use this service to alert oncall engineers as well as potentially automatically
pausing the batcher/transaction ingress if an invalid `ExecutingMessage` is included.

-## Problem Statement + Context
+This "Cross Message Monitor" should have the following features:

+### Monitoring Strategies like `dispute-mon`

-We want to be alerted when there is an invalid `ExecutingMessage`. We implement preventative
-measures, but the downside risk is existential if an invalid `ExecutingMessage` finalizes,
-so we need to have ways to detect and prevent that.
+Dispute Monitor is a service already implemented and deployed for tracking Fault Proof Disputes.
+Rather than just being a simple alert when games are invalid, it serves up various statistics that an operator
+can refer to in order to determine network health:
+- How many games are being monitored, how many of each status
+- How many Incorrect Forecasts or Incorrect Results
+- Warning and Error Logs from the Monitor

-## Proposed Solution
+Cross Message Monitor can crib directly from these statistics, but focused on Interop:
+- How many Executing Messages are emitted by the CrossL2Inbox per block per chain
+- How many `Executing Message`s point at each Chain in the Superchain
+- How many `Executing Message`s are known valid, per safety level
+- How many `Executing Message`s are known invalid, per safety level
+- How many `Executing Message`s are not yet known valid/invalid, per safety level
+- How many `Executing Message`s *changed validity* over time (indicating a remote reorg)
+- How many `Executing Message`s were resolved via Block Replacement
+
+Almost all `Executing Message` Metrics emitted by the Cross Message Monitor should have dimensions:
+- What chain the `Executing Message` in question is on
+- What chain the `Executing Message` is referring to (the chain of the initiating message)
+- Timestamp of Block
+
+### Long Term Monitoring of `Executing Message`s

+Executing Messages can change validity over the course of the Unsafe Chain,
+data is not always sufficiently available to validate `Executing Message`s, and transitive `Executing Message`s can
+cause cascades of Valid/Invalid messages.

-A service is implemented that is able to observe all blocks and logs produced by the cluster.
-It is responsible for doing two things:
-- Guaranteeing that every executing message has a corresponding access list entry
-- Double checking that every executing message is valid
+Therefore, it is insufficient to check a message just once. Instead, every Executing Message
+detected by the Cross Message Monitor will be considered an ongoing process, like games are
+for the Dispute Monitor. From the time the `Executing Message` is discovered until it is included at a
+Cross Safe block height which is now L1 finalized, the `Executing Message` should be repeatedly re-checked.

-### ExecutingMessages and Access Lists
+This means that when the status of the `Executing Message` flips, special alerts can be emitted to indicate
+that a remote reorg has likely occurred. Or, when a single invalid message creates a cascade of
+invalidation, each `Executing Message` can resolve individually.
+
+### Access List Confirmation

The [access list](https://github.com/ethereum-optimism/design-docs/blob/9e919c5b173fe8fc89949b012f6f70a0bc3247f6/protocol/interop-access-list.md)
design guarantees that all executing messages can be validated without the need to execute the transaction. Any calls to the `CrossL2Inbox`
@@ -68,13 +80,14 @@ Given that the decided upon approach depends strictly on the current EVM resourc
monitoring to alert us if someone is able to trick the `CrossL2Inbox` into producing an `ExecutingMessage` when the access list entry is
not declared. We think this is impossible, but given this is such a critical security property, it is important to monitor.

-### Double Checking Message Validity
+Each message can be checked for this once, when it is detected and added to the monitoring set.
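To make the long-term monitoring lifecycle described above concrete, here is a minimal sketch in Go of how each tracked `Executing Message` could be re-checked from discovery until its block is cross-safe and L1-finalized. The `Validity` states and the `checkValidity` helper are illustrative placeholders; a real implementation would query `op-supervisor` and remote nodes and emit the metrics listed above rather than log lines.

```go
// Sketch of the per-message tracking loop for the Cross Message Monitor.
package monitor

import (
	"context"
	"log"
	"time"
)

type Validity int

const (
	ValidityUnknown Validity = iota
	ValidityValid
	ValidityInvalid
)

// TrackedMessage is one ExecutingMessage watched from discovery until its
// block is cross-safe and L1-finalized.
type TrackedMessage struct {
	Chain        uint64 // chain the executing message was emitted on
	InitChain    uint64 // chain of the initiating message it points at
	BlockNumber  uint64
	LogIndex     uint
	LastValidity Validity
	Finalized    bool
}

// checkValidity is a placeholder for querying the initiating message on the
// remote chain and the local safety level via op-supervisor.
func checkValidity(ctx context.Context, m *TrackedMessage) (Validity, bool, error) {
	// ... query remote node / op-supervisor ...
	return ValidityUnknown, false, nil
}

// track re-checks a message until it is finalized, alerting when its validity
// flips (a likely sign of a remote reorg) or resolves to invalid.
func track(ctx context.Context, m *TrackedMessage, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for !m.Finalized {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
		validity, finalized, err := checkValidity(ctx, m)
		if err != nil {
			log.Printf("check failed for msg on chain %d block %d: %v", m.Chain, m.BlockNumber, err)
			continue
		}
		if m.LastValidity != ValidityUnknown && validity != m.LastValidity {
			log.Printf("validity flipped for msg on chain %d (init chain %d): likely remote reorg", m.Chain, m.InitChain)
		}
		if validity == ValidityInvalid {
			log.Printf("INVALID executing message on chain %d block %d", m.Chain, m.BlockNumber)
		}
		m.LastValidity = validity
		m.Finalized = finalized
	}
}
```

In practice, each status transition in this loop would also increment the per-chain metrics described above rather than only logging.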
-#### Unsafe Blocks
+### Alert Behaviors

-We want to utilize the cloud architecture in [this doc](https://github.com/ethereum-optimism/design-docs/pull/218) to ensure that
-no invalid `ExecutingMessage`s are ever included in a block. No matter what tradeoffs we make, it is impossible to guarantee there
-will not be a contingent reorg because an unsafe head reorg can happen after all cross-chain transaction validity checks have passed.
+Though it will need evaluation over time, we already know the sorts of operator responses we want when certain situations are detected
+by the monitor.
+
+#### Unsafe Blocks

We want to be able to detect when an invalid `ExecutingMessage` is included in an unsafe block and trigger an alert to the
oncall engineering team. Additionally, we should consider pausing the batcher automatically if an invalid `ExecutingMessage`
has been detected in an unsafe block and triggering an unsafe head reorg. It is preferable to not waste blobs and trigger
an unsafe head reorg by batch submitting the invalid block. The problem with this is that it is indistinguishable from
a malicious sequencer triggering a reorg to extract MEV. Either way an unsafe head reorg is going to happen; it's just whether
or not it's due to the data being posted and then resulting in a replacement deposits-only block or if it's manually done
by the sequencer offchain.

We may also want to consider a way to alert partners in the interop set ahead of time that an unsafe head reorg is coming
if an invalid `ExecutingMessage` is observed in an unsafe block. If they turn off their cross-chain message ingress fast enough,
it may be possible for them to prevent a contingent reorg. The liveness of the chain can continue with no issues until the
remote chain goes through its unsafe head reorg, and then it can open up its cross-chain message ingress again.

#### Safe/Finalized Blocks

-If an invalid `ExecutingMessage` ends up in a safe block, that means that a bug becomes consensus (without multiclient).
-This is very bad. Our monitoring should be able to observe this. The worst case is that
-an attacker uses a fast liquidity bridge like Across to quickly send funds out of the cluster after the finalization of
-an invalid `ExecutingMessage` that mints a ton of ether out of thin air. If we detect an invalid `ExecutingMessage`
-becoming safe/finalized, we should pause all transaction ingress to the sequencer and effectively stop producing
-blocks. This reduces the chances that the attacker is able to initiate a fast bridge out of the cluster. This strategy
-is aligned with our philosophy of favoring safety over liveness.
+If an invalid `ExecutingMessage` ends up in a safe block, it is an expectation of the Protocol that the block is Invalid,
+and must be replaced with a Deposit Only Block. This situation should page the operator to monitor the situation, and every
+individual invalid `Executing Message` in a Safe Block should be very easy to see and monitor individually. The operator is monitoring
+to ensure a Block Replacement occurs and the invalid messages are no longer known to the chain.

-### Resource Usage
+If Cross-Validation should promote the block to Cross-Safe, this is an all-hands-on-deck consensus bug, which would naturally
+have its own alerts associated, in addition to the prior expectation of an operator monitoring the situation.

-A new service needs to be implemented and operated in the cloud. This service
-can be stateless; it mostly needs consistent network access to full nodes.
-The cost of the full nodes is going to be the majority of the cost of operating
-this service.
+### Resource Usage

-### Single Point of Failure and Multi Client Considerations
+This new service will need minimal CPU/Disk and can be stateless. It will need a connection to one of each Node
+for the Superchain it is monitoring.

+## Summary of Solution

-This is meant to detect the single point of failure that is `op-supervisor`. Having client diversity
-for monitoring would be a nice-to-have.
+Create `xmsg-mon` in the image of `dispute-mon` to track all in-flight Executing Messages for a Superchain, for their entire
+Unsafe -> Safe -> Finalized lifecycle. Create Alerting against it which pages operators when Invalid Messages advance into blocks.

## Alternatives Considered

+No real alternatives considered.
Monitoring should happen as a matter of course when deploying new services.
-No real alternatives considered.
+Having additional Cross-Validation software besides Supervisor would lessen the criticality of this software.

## Risks & Uncertainties

-- If an invalid `ExecutingMessage` finalizes, it would be a very bad look to roll back the chain. It may be the best solution available.
-- Need to observe the latency of validating the `ExecutingMessage`s to ensure that this is all feasible.
\ No newline at end of file
+- The Monitoring Service may be insufficient, and we may not catch what we need to. Real experience will inform updates to this service.
+- The speed of the Monitor may be insufficient for operators to take meaningful action.
\ No newline at end of file

From 344040c4edc6a36b953077320ba6b9efcb03ea4a Mon Sep 17 00:00:00 2001
From: axelKingsley
Date: Thu, 8 May 2025 13:55:19 -0500
Subject: [PATCH 3/3] comments

---
 protocol/interop-monitoring.md | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/protocol/interop-monitoring.md b/protocol/interop-monitoring.md
index f94f5e3e..a5b048e4 100644
--- a/protocol/interop-monitoring.md
+++ b/protocol/interop-monitoring.md
@@ -90,18 +90,21 @@ by the monitor.

#### Unsafe Blocks

We want to be able to detect when an invalid `ExecutingMessage` is included in an unsafe block and trigger an alert to the
-oncall engineering team. Additionally, we should consider pausing the batcher automatically if an invalid `ExecutingMessage`
-has been detected in an unsafe block and triggering an unsafe head reorg. It is preferable to not waste blobs and trigger
-an unsafe head reorg by batch submitting the invalid block. The problem with this is that it is indistinguishable from
-a malicious sequencer triggering a reorg to extract MEV. Either way an unsafe head reorg is going to happen; it's just whether
-or not it's due to the data being posted and then resulting in a replacement deposits-only block or if it's manually done
-by the sequencer offchain.
+oncall engineering team. It is preferable to not waste blobs and trigger an unsafe head reorg by batch submitting the invalid block as soon as possible;
+therefore, the operator may want to accelerate batch submission when this alert arrives. Unless nodes on the network are
+able to accept an Unsafe->Unsafe block replacement (and they are not), the Sequencer's only path forward is to see the
+invalid block committed to L1, at which point it will be replaced. Doing this faster will minimize reorg sizes.

We may also want to consider a way to alert partners in the interop set ahead of time that an unsafe head reorg is coming
if an invalid `ExecutingMessage` is observed in an unsafe block. If they turn off their cross-chain message ingress fast enough,
it may be possible for them to prevent a contingent reorg. The liveness of the chain can continue with no issues until the
remote chain goes through its unsafe head reorg, and then it can open up its cross-chain message ingress again.

+Finally, when Invalid Messages occur, it is prudent to shut off additional Executing Messages. Admin APIs should be established that:
+- Shut off Executing Message Ingress at `proxyd`
+- Force remove Executing Messages from block builder mempools.
+These triggers should occur automatically when an invalid `ExecutingMessage` is discovered at the Unsafe Block stage, in order to reduce cascades.
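A sketch of what that automatic response could look like is shown below, assuming the proposed Admin APIs exist. The RPC method names `proxyd_setExecutingMessageIngress` and `admin_dropExecutingMessages` are hypothetical placeholders; this document only proposes that endpoints with this behavior be established.

```go
// Sketch of the monitor's automatic response path for invalid unsafe-block messages.
package monitor

import (
	"context"
	"log"

	"github.com/ethereum/go-ethereum/rpc"
)

// pauseExecutingMessages is invoked when an invalid ExecutingMessage is
// observed in an unsafe block. It best-effort disables new executing-message
// ingress at proxyd and asks each block builder to drop pending executing
// messages from its mempool.
func pauseExecutingMessages(ctx context.Context, proxydAdminURL string, builderAdminURLs []string) {
	if client, err := rpc.DialContext(ctx, proxydAdminURL); err == nil {
		defer client.Close()
		// Hypothetical proxyd admin method: reject txs that touch the CrossL2Inbox.
		if err := client.CallContext(ctx, nil, "proxyd_setExecutingMessageIngress", false); err != nil {
			log.Printf("failed to pause ingress at proxyd: %v", err)
		}
	} else {
		log.Printf("failed to dial proxyd admin endpoint: %v", err)
	}

	for _, url := range builderAdminURLs {
		client, err := rpc.DialContext(ctx, url)
		if err != nil {
			log.Printf("failed to dial builder %s: %v", url, err)
			continue
		}
		// Hypothetical builder admin method: evict executing messages from the mempool.
		if err := client.CallContext(ctx, nil, "admin_dropExecutingMessages"); err != nil {
			log.Printf("failed to drop executing messages at %s: %v", url, err)
		}
		client.Close()
	}
}
```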
+ #### Safe/Finalized Blocks If an invalid `ExecutingMessage` ends up in a safe block, it is an expectation of the Protocol that the block is Invalid, @@ -122,6 +125,9 @@ for the Superchain it is monitoring. Create `xmsg-mon` in the image of `dispute-mon` to track all in-flight Executing Messages for a Superchain, for their entire Unsafe -> Safe -> Finalized lifecycle. Create Alerting against it which pages operators when Invalid Messages advance into blocks. +Furthermore, Admin APIs should be established to shut off `proxyd` and `mempool` acceptance of Executing Messages, to swiftly respond +when the Monitoring Service detects invalid messages in blocks. + ## Alternatives Considered No real alternatives considered. Monitoring should happen as a matter of course when deploying new services.