# OP-Supervisor: Failure Modes and Recovery Path Analysis

| | |
|--------|--------------|
| Author | Axel Kingsley |
| Created at | 2025-03-26 |
| Needs Approval From | |
| Other Reviewers | |
| Status | Draft |

## Introduction

This document covers the Supervisor as a new consensus-critical component. By its nature, the analysis also involves interop protocol behaviors, as well as new behavior modalities in op-node.

## Interop Protocol Terms and Behavior Explainer

To set context, here is a basic overview of how Interop works at the protocol level:

- **There are Two Chains, A and B** which participate in interop with one another.
- Any time a log is emitted on A, it can be referenced on B.
- And likewise, any logs on B can be referenced on A.
- The way a chain references logs from other chains is by creating an *Executing Message*.
  - Executing Messages are emitted as Logs by the Cross-L2-Inbox contract when called.
  - Each Executing Message contains indexing information and a hash (illustrated in the sketch below).
- By the rules of the protocol, Executing Messages are always valid on the canonical chain because:
  - If they are valid, the block which contains them stands as-is.
  - If they are *invalid*, then the entire block which contains them is also invalid, and so nodes should derive a Replacement Block.

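To make the validity rule concrete, the sketch below (in Go, as the OP Stack is written in Go) shows what checking an Executing Message against an indexed log store could look like. The `Identifier`, `ExecutingMessage`, and `LogDB` shapes are illustrative stand-ins, not the Supervisor's actual types or API.

```go
package main

import "fmt"

// Identifier locates the initiating log that an Executing Message references.
// The fields mirror the interop message-identifier concept: which chain,
// which block, and which log within that block.
type Identifier struct {
	ChainID     uint64
	BlockNumber uint64
	LogIndex    uint32
}

// ExecutingMessage is what the Cross-L2-Inbox emits: a pointer to the
// initiating log, plus a hash commitment to that log's contents.
type ExecutingMessage struct {
	ID      Identifier
	LogHash [32]byte
}

// LogDB is a hypothetical stand-in for the Supervisor's log database,
// mapping log positions to the hash of the log recorded there.
type LogDB map[Identifier][32]byte

// Check returns (valid, known): a message is definitively valid, definitively
// invalid, or not yet checkable because the referenced chain hasn't been
// indexed to the referenced height.
func (db LogDB) Check(msg ExecutingMessage, indexedHeight uint64) (valid, known bool) {
	if msg.ID.BlockNumber > indexedHeight {
		return false, false // data not yet synced: no verdict either way
	}
	h, ok := db[msg.ID]
	if !ok {
		return false, true // positively invalid: no initiating log at that position
	}
	return h == msg.LogHash, true // valid only if the hash commitment matches
}

func main() {
	db := LogDB{}
	id := Identifier{ChainID: 901, BlockNumber: 12, LogIndex: 0}
	db[id] = [32]byte{0xab}

	valid, known := db.Check(ExecutingMessage{ID: id, LogHash: [32]byte{0xab}}, 100)
	fmt.Println(valid, known) // true true: the initiating log exists and matches
}
```
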
To track this, Nodes now consider twice as many heads as before (a small ordering sketch follows the list):

- Local Unsafe: Represents the basic P2P gossiped chain with no interop validity considered.
- Cross Unsafe: Represents the farthest point within the Local Unsafe chain which optimistically appears to be valid with the Cross Unsafe data of other chains.
- Local Safe: Represents the basic L1 derived chain with no interop validity considered.
- Cross Safe: Represents the farthest point within the Local Safe chain which can be validated with Cross Safe data of other chains from this L1 source.
- Finalization: As always, the L2 data is finalized when the L1 data from which it is derived becomes Finalized. However, Finalization only meaningfully applies to the *Cross Safe* head, because the local safe chain may be discovered invalid post-finalization if data that was awaited turns out to be invalid.

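In steady state these heads obey a simple ordering: each cross head trails its local counterpart, safe trails unsafe, and finality never runs ahead of cross-safe. A minimal sketch, with illustrative field names rather than the real op-node types:

```go
package main

import "fmt"

// Heads tracks one chain's safety heads by block number.
// Field names are illustrative, not the actual op-node types.
type Heads struct {
	LocalUnsafe uint64
	CrossUnsafe uint64
	LocalSafe   uint64
	CrossSafe   uint64
	Finalized   uint64 // only meaningful for cross-safe data
}

// Ordered checks the steady-state ordering: cross trails local, safe trails
// unsafe, and finality never runs ahead of cross-safe. (Transient states
// during resets and replacements can momentarily violate this.)
func (h Heads) Ordered() bool {
	return h.Finalized <= h.CrossSafe &&
		h.CrossSafe <= h.LocalSafe &&
		h.CrossSafe <= h.CrossUnsafe &&
		h.CrossUnsafe <= h.LocalUnsafe &&
		h.LocalSafe <= h.LocalUnsafe
}

func main() {
	h := Heads{LocalUnsafe: 120, CrossUnsafe: 118, LocalSafe: 100, CrossSafe: 97, Finalized: 90}
	fmt.Println(h.Ordered()) // true
}
```
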
While the protocol rules of interop create a very simple and effective method of keeping interoperation secure, they also create an incentive for chain operators to *not* include these transactions in their blocks. If an included message turns out to be invalid, the promotion from Local Safe to Cross Safe will fail and their chain will experience a reorg in order to replace the invalid block.

So, the Supervisor was developed to effectively serve this information in a scalable, secure way:

- The Supervisor manages one "Managed Node" for Chain A, and one for Chain B (the control surface is sketched after this list)
  - A Managed Node is an Op-Node set to a special behavior mode to serve a Supervisor
  - Managed Nodes receive the next L1 block from the Supervisor
  - Managed Nodes report their derivation results to the Supervisor for Indexing
  - Managed Nodes may be controlled and reset by the Supervisor to maintain sync
- (There is a different "Mode" from Managed Mode which is not yet implemented, and is described in more detail starting at FM5)
- As Nodes A and B receive unsafe blocks over P2P, they report them to the Supervisor
  - The Supervisor uses this to sync all the receipts to a log database
- The Supervisor hands down the L1 block to Managed Nodes A and B
- Each Node derives zero or more blocks from the L1 input
- As new L2 blocks are derived from the L1, the L1:L2 link is recorded to a Derivation Database
- As new data arrives in the Supervisor, safety promotion routines re-evaluate all given data
  - Unsafe Heads advance as their receipts are indexed
  - Cross Unsafe is evaluated any time new data is available
  - Safe heads advance as their derivation is reported
  - Cross Safe is evaluated any time new data is available
- If *positively invalid* messages are discovered in Safe data, the Supervisor directs the given Node to Replace the block with a Deposit-Only block

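The control relationship above can be pictured as two interfaces: what the Supervisor calls on a Managed Node, and what gets reported back. This is a rough sketch under assumed names; the real surface is the managed-mode RPC, and all of these signatures are hypothetical.

```go
package main

// BlockRef minimally identifies a block (illustrative type).
type BlockRef struct {
	Number uint64
	Hash   [32]byte
}

// ManagedNode is the direction of control the Supervisor exercises over each
// op-node it manages. Signatures are hypothetical, not the real managed-mode RPC.
type ManagedNode interface {
	// ProvideL1 hands the node the next L1 block to derive from.
	ProvideL1(next BlockRef) error
	// Reset forces the node back to known-good heads, to arbitrary depth.
	Reset(unsafeHead, safeHead, finalizedHead BlockRef) error
	// InvalidateBlock replaces a block with a deposit-only Replacement Block.
	InvalidateBlock(block BlockRef) error
}

// SupervisorAPI is the direction Managed Nodes (and block builders) call into.
type SupervisorAPI interface {
	// UnsafeBlockReceived reports a block seen over P2P, so receipts get indexed.
	UnsafeBlockReceived(chainID uint64, block BlockRef) error
	// DerivationUpdate records the L1:L2 link in the Derivation Database.
	DerivationUpdate(chainID uint64, l1, l2 BlockRef) error
	// CheckMessage answers validity queries before a message is included in a block.
	CheckMessage(chainID uint64, logHash [32]byte) (bool, error)
}

func main() {}
```
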
In this way, the Supervisor is an implementation of the Interop Protocol: it evaluates the validity of interop messages, and initiates the block replacement specified in the protocol.

The Supervisor *also* serves the data it computes over RPC, so other components can check the validity of interop messages *before* they are included in a block, to avoid invalid messages.

The Supervisor is as important as our Consensus Node itself: it is the primary and sole implementation of a serious aspect of the OP Spec, and if it behaves incorrectly it could mislead an entire network.

# Failure Modes (FMs)

There are three conceptual ways for the Supervisor to fail:

- It is unresponsive
- It is incorrect
- It is destructive to the Nodes connected to it in some other way

All Failure Modes below are subtypes of these. Incorrect responses are much worse than no responses, as they may mislead the network. At worst, a network which is misled may have fraudulent interop messages played on it, resulting in arbitrary damage to that network. I say "misled" because all invalid interop messages are invalid by the protocol, and this fact can be checked using L1 data. But because there is limited client diversity, bugs may have greater impact.

## FM1a: Supervisor Is Totally Unavailable

- Description
  - The Supervisor is entirely unavailable, as if it were disconnected.
  - No Message/Block Validity could be determined for any Node connected to this Supervisor.
  - Any Node in Managed Mode would be unable to advance its chain state at all besides local-unsafe.
  - Any Node in Standard Mode would be unable to advance cross-unsafe and cross-safe, but could still advance local-unsafe and local-safe. And because Validity can't be checked, the Node may prefer not to trust Safe data either.
- Risk Assessment
  - Low Impact, High Likelihood.
  - Software fails; ports become blocked. At some point an interruption will take down a Supervisor.
  - We have redundancy solutions being designed [here](https://github.com/ethereum-optimism/design-docs/pull/218/files?short_path=88594e4#diff-88594e47f0a70261441a7452448ef1f240c7c0f15b9132c7789b2ee2d0e07bd2), which will make Supervisor failures less impactful for Chain Operators.
  - Other Node Operators can adopt similar redundancy measures, but may follow their own infrastructure designs. Operators should treat the Supervisor as consensus-critical, and manage it similarly to how they would manage their L1 source.

## FM1b: Supervisor Is Corrupted

- Description
  - Not only is the Supervisor unavailable, its database has been corrupted or destroyed.
  - In this scenario, all of the above applies, and recovery is delayed by our ability to restore a working Supervisor.
- Risk Assessment
  - Low Impact, Low Likelihood.
  - The Supervisor has robust database trimming to clear invalid states and partial writes.
  - This code, like all Supervisor code, could be tested more.
- Mitigations
  - The Supervisor has a feature by which databases can be replicated over HTTP. If a Supervisor needs to be quickly synced, it can bootstrap its databases from an existing Supervisor.
  - Supervisors themselves should be made redundant, so that if one goes down, another may serve in its place. Supervisor *replacement* is not something that is well tested; more likely you'd switch to the backup Supervisor *and* its Managed Nodes.

## FM2a: Supervisor Incorrectly Determines Message Validity in Response to Sequencer - Unsafe Block Only

- Description
  - If the Supervisor were to claim an Executing Message is valid when it actually isn't, Sequencers may use this determination to include an invalid message in their local-unsafe chain.
  - Assuming the Supervisor correctly evaluates the cross-promotion (a later step), it will experience a cross-unsafe head stall at this error point, because of the cross-invalid message.
  - The Sequencer does not care about cross-unsafe chain stalling, because unsafe data is subject to change, and as far as it knows, all the messages it has included are valid, so the *expectation* is that other dependent data will unstick it.
  - Eventually, the block is published to the L1 and becomes local-safe data.
  - When the Supervisor attempts to promote the local-safe data to cross-safe, it discovers the invalid message and issues an invalidation and replacement.
  - The Replacement Block is applied in place of the block which contained the invalid message, and the chain has now reorg'd out all blocks from the invalid message to the safe head (effectively resetting the chain back to the stalled cross-unsafe head).
- Risk Assessment
  - Medium Impact, Medium Likelihood.
  - A reorg which affects the Safe Chain implicitly invalidates the Unsafe Chain as well, causing disruption to users.
  - The message validity code in the Supervisor is the most core aspect of its implementation. It has reasonable unit testing for the database, accessors, APIs, and all validity checking code. It has some E2E tests, and some situationally comprehensive Action tests, as well as local Kurtosis and Devnet exposure. However, this component is not battle tested.
  - *Even when* the Supervisor behaves totally correctly, this case may occur if some cross-unsafe data is used to build a block which later becomes invalid due to a reorg on the referenced chain. In this situation, the same outcome is felt by the network.
- Mitigations
  - The Sequencer could detect a cross-unsafe head stall and issue a reorg on the unsafe chain in order to avoid invalid L1 inclusions (see the sketch after this list). Depending on the heuristic used, this could create regular unsafe reorgs with a low threshold, or larger, less common ones. This also saves operators from wasted L1 fees when a batch would contain unwanted data.
  - When promoting local-unsafe to cross-unsafe, the Supervisor can additionally detect whether the data it is stalled on references data that is already settled as cross-safe. If so, it can proactively notify the Sequencer that the current chain has no hope of becoming valid, creating a more eager reorg point.
  - The Batcher can decline to post beyond the current cross-unsafe head. This will avoid the publishing of bad data so the sequencer may reorg it out, saving the replacement-based reorg. If it went on long enough, the Batcher would prevent any new data from being posted to L1, effectively creating a safe-head stall until the sequencer resolved the issue. This *could* be a preferred scenario for some chains.
  - We need to develop and use production-realistic networks in large scale testing to exercise failure cases and get confidence that the system behaves and recovers as expected.

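A minimal sketch of the stall-detection heuristic from the first mitigation above. The gap threshold and the types are hypothetical; the point is only that the Sequencer can bound how far local-unsafe runs ahead of cross-unsafe:

```go
package main

import "fmt"

// StallDetector sketches the first mitigation above: if cross-unsafe has not
// kept up with local-unsafe, assume the gap hides a cross-invalid message and
// reorg back to the cross-unsafe head before batches reach L1. The threshold
// heuristic is hypothetical.
type StallDetector struct {
	MaxGap uint64 // how far local-unsafe may run ahead before acting
}

// ShouldReorg reports whether the Sequencer should drop its unsafe chain back
// to the cross-unsafe head instead of letting the Batcher publish it.
func (d StallDetector) ShouldReorg(localUnsafe, crossUnsafe uint64) bool {
	return localUnsafe-crossUnsafe > d.MaxGap
}

func main() {
	d := StallDetector{MaxGap: 30}       // low threshold: frequent but shallow unsafe reorgs
	fmt.Println(d.ShouldReorg(150, 110)) // true: a gap of 40 exceeds the threshold
	fmt.Println(d.ShouldReorg(150, 130)) // false: within tolerance, keep building
}
```
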
## FM2b: Supervisor Doesn't Catch Block Invalidation of Safe Data

- Description
  - FM2a has occurred, but additionally, the Supervisor doesn't catch the invalid message when promoting from local-safe to cross-safe.
  - An output root that builds on this incorrect cross-safe head is published to L1.
  - At this point, any validators who rely on the Supervisor are following an incorrectly derived chain: blocks between the invalid message and the end of the batch should have been replaced by deposit-only blocks.
  - Any validators who *do not* rely on the failing Supervisor will see the correct chain, but there are currently no alternative implementations to use.
  - An output root posted from this incorrect state would be open to being fault-proven.
- Risk Assessment
  - Higher Impact than FM2a, Low Likelihood.
- Recovery
  - Within the 12h Safe Head window, if the sequencer is repaired and rewound, it could correctly interpret the batch data to replace a block, and would then have the job of rebuilding the chain from that point. All users would need to upgrade to a version without this derivation bug, and resync the chain from the failed position. Operators who *did not* upgrade and resync would be left on a dead branch that is no longer being updated.

## FM3: The Supervisor Issues a Reset to the Sequencer

- Description
  - The Supervisor manages Nodes in Managed Mode, meaning they listen to the Supervisor for signals for what derivation activities to take next.
  - One such activity is to reset the node to specific heads.
  - Due to some misbehavior, the Supervisor could issue a reset to the Sequencer.
  - Due to the way the Supervisor navigates and negotiates resets to Nodes, it has the potential to reset to an arbitrary depth.
  - If this happened, the Sequencer would indeed reset, and the ability for the Sequencer to advance the chain would be broken, effectively causing an unsafe head stall until it could recover.
- Risk Assessment
  - Medium Impact (due to mitigations), Medium Likelihood.
  - Currently, the code used to issue resets to Managed Nodes is insufficiently tested. This is due to limitations in our ability to construct lifelike-yet-erroneous scenarios for Nodes to sync against. As far as it has been battle-tested thus far, Managed Nodes are stable (for example, the "Denver Devnet" has run for 40 days with minimal operations, while supporting real builder traffic).
- Mitigations
  - If a reset would significantly roll back the Sequencer, a chain with a Conductor Set *should* be able to identify that the Node is unhealthy and elect a new Active Sequencer. In this case, there would be no interruption to the chain, as the tip is continued by the new Active Sequencer.
- Recovery
  - With respect to the Node, an arbitrary amount of re-sync may be required.
  - Recovery to the network depends entirely on the impact of the node outage.

## FM4a: Managed Nodes are at Different Heights

- Description
  - The Supervisor is managing one Managed Node per chain, and using the data reported from their derivation to calculate cross-safety.
  - While the Node for Chain A has derived to some height (L1 block 100), the Node for Chain B is still processing earlier L1 blocks (L1 block 90).
  - During this period, cross-safety can't be fully calculated beyond L1 block 90 for *any chain which references Chain B*, directly or indirectly, while the referenced data is still unsynced.
  - Any chain which does not need Chain B data from the unsynced region can process normally.
- Risk Assessment
  - Zero Impact, Certainty.
  - This is just the natural consequence of syncing from multiple data sources at once: you can't know what you don't yet know.
  - The Supervisor already takes this lack of data into account when responding to queries and advancing safety (a toy illustration of the height bound follows).

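As a toy illustration: cross-safety for a chain can only be evaluated up to the height of the least-synced chain it depends on. This helper is purely illustrative, not Supervisor code:

```go
package main

import "fmt"

// crossSafetyBound returns the highest L1 height at which cross-safety can be
// fully evaluated for a chain: the minimum of its own derived height and the
// derived heights of every chain it depends on. (Hypothetical helper.)
func crossSafetyBound(ownHeight uint64, depHeights ...uint64) uint64 {
	bound := ownHeight
	for _, h := range depHeights {
		if h < bound {
			bound = h
		}
	}
	return bound
}

func main() {
	// Chain A has derived to L1 block 100, but its dependency Chain B is at 90:
	fmt.Println(crossSafetyBound(100, 90)) // 90: cross-safety waits for Chain B
}
```
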
## FM4b: A Managed Node Stalls or Lags Significantly

- Description
  - The Supervisor is managing one Managed Node per chain, and using the data reported from their derivation to calculate cross-safety.
  - For some reason, a Managed Node is not able to sync the chain quickly, or at all (perhaps the node is down or is failing).
  - Without the data from this Node, the Supervisor cannot advance safety for any other chain which depends on this stalled chain.
  - The Supervisor also won't be able to answer any validity questions about the un-synced portion of the chain (which is why safety doesn't advance).
  - Assuming they take some dependency on the un-synced chain, other chains will eventually stall their cross-unsafe and cross-safe chains.
- Risk Assessment
  - Low Impact, Low Likelihood.
  - The Supervisor knows the boundaries of the data it can and can't report on, and won't answer *incorrectly* just because it doesn't have data.
  - If the Managed Nodes are Sequencers who need interop data to build blocks, they will be unable to validate interop messages, and will therefore not include them in block building.
- Mitigation
  - The Supervisor supports a feature in which *multiple* Nodes of a given chain can be connected as Managed Nodes. If one Node goes down, syncing can continue with the backup (see the sketch after this list).
  - This feature is mostly ready in the happy path, but there are known gaps in managing secondary Managed Nodes during block replacements and resets.
  - This feature *also* needs much more robust testing in lifelike environments. Previous development cycles were spent in the devnet trying to enable this feature, which was slow and risky. To get this feature working well, we need to leverage our new network-wide testing abilities.

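A minimal sketch of the failover idea in the mitigation above, with hypothetical types; the real feature also has to handle block replacements and resets, which is where the known gaps are:

```go
package main

import (
	"errors"
	"fmt"
)

// nodeSource is a hypothetical view of one Managed Node's health.
type nodeSource struct {
	name  string
	alive bool
}

// pickActive models the failover described above: the Supervisor keeps syncing
// a chain from the first healthy Managed Node, falling back to backups.
func pickActive(nodes []nodeSource) (nodeSource, error) {
	for _, n := range nodes {
		if n.alive {
			return n, nil
		}
	}
	return nodeSource{}, errors.New("no live Managed Node for this chain; safety cannot advance")
}

func main() {
	nodes := []nodeSource{{name: "primary", alive: false}, {name: "backup", alive: true}}
	n, err := pickActive(nodes)
	fmt.Println(n.name, err) // backup <nil>
}
```
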
## FM5: Increased Operational/Infrastructure Load Results in Fewer Nodes

- Description
  - In the current design, validating a given chain in the Superchain requires validation of *all* chains. Conceptually, this is unavoidable because inter-chain message validity requires inter-chain validation.
  - The way a user would achieve this today is by running one Node (`op-geth` and `op-node`) for each Chain in the Interoperating Set, and additionally running a Supervisor to manage these Managed Nodes.
  - If operators want redundancy, they are advised to create entire redundant stacks (N Nodes and a Supervisor).
  - Operators may not appreciate the increased burden and could decline to validate the network at all.
- Risk Assessment
  - Sliding Scale Impact and Likelihood.
  - Network security is not based on the number of validators, but fewer people running the OP Stack means less validation, mindshare, etc.
- Mitigation
  - There is a feature which we have not yet had a chance to implement called "Standard Mode", which is a counterpart to "Managed Mode".
  - In Standard Mode, a Node *is not* managed by a Supervisor, and instead runs its own L1 derivation.
  - Periodically, the Node reaches out to *some* trusted Supervisor to fetch the Cross-Heads and any Replacement signals (sketched below).
  - The Node uses this data to confirm that the chain it has derived is also cross-safe, and to handle invalid blocks.
  - With Standard Mode, operators would only *need* to run a single Node, in exchange for the interop aspects of validation becoming a trusted activity. We expect this Mode to be valuable in offsetting operator burden in cases where fully trustless validation isn't critical.

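A sketch of how a Standard Mode node might fold the trusted Supervisor's answer into its own derived state. Since Standard Mode is not yet implemented, every name here is an assumption:

```go
package main

import "fmt"

// CrossHeads is the data a Standard Mode node might periodically fetch from a
// trusted Supervisor. The shape is illustrative: Standard Mode is not yet
// implemented, so there is no real API to mirror.
type CrossHeads struct {
	CrossSafe uint64
	Replaced  []uint64 // L2 blocks replaced with deposit-only blocks
}

// reconcile merges the trusted answer into independently derived state. The
// node only advances cross-safe up to what it has derived itself, and surfaces
// any replacement signals for local handling.
func reconcile(localSafe uint64, trusted CrossHeads) (crossSafe uint64, replacements []uint64) {
	crossSafe = trusted.CrossSafe
	if crossSafe > localSafe {
		crossSafe = localSafe // never mark blocks we haven't derived as cross-safe
	}
	return crossSafe, trusted.Replaced
}

func main() {
	crossSafe, repl := reconcile(205, CrossHeads{CrossSafe: 210})
	fmt.Println(crossSafe, repl) // 205 []
}
```
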
## FM6: Standard Mode Nodes Trust a Malicious Supervisor

- Description
  - A Node is in Standard Mode, doing derivation on the L1 independently.
  - At some moment, it reaches out to a trusted Supervisor endpoint to query for Cross Safety and Replacement information.
  - The Supervisor is incorrect, or even adversarial.
  - The result depends on the way in which the data is incorrect:
    - If the data ignores a block Invalidation/Replacement, the Node will be deceived into following a chain where an invalid interop message exists, allowing for invalid interop behaviors (like inappropriate minting) *on that Node* (and presumably all Nodes who also trust this Supervisor).
    - If the data claims a block should be replaced, the Node would similarly perform the replacement, leaving this Node at that state.
    - If the data simply isn't consistently applicable to what the Node independently derived, the Node has nothing to be deceived about, but also can't make forward progress, and would need to halt at this point.
- Risk Assessment
  - Low Impact, Unknown Likelihood.
  - If there were a tactical reason to use a Trusted Supervisor Endpoint to fool a Node as part of a larger exploit, this would be an attractive thing to attempt.
  - Most entities who would provide a Trusted Endpoint have an intrinsic incentive to keep their endpoint valid and secure, much as an RPC provider like Alchemy doesn't want the node you connect to to lie.
  - At most, a dishonest Supervisor can mislead the Nodes connected to it. So long as block producers *are not* using one of these Trusted Endpoints, block production is unaffected.
- Mitigation
  - Standard Mode could allow for *multiple* Supervisor Endpoints to be specified; the Node could confirm that all endpoints agree, preventing dishonesty from one party from deceiving the Node (a minimal sketch follows).
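
A minimal sketch of that agreement check, under the same caveat that multi-endpoint Standard Mode is a proposal, not an implemented feature:

```go
package main

import (
	"errors"
	"fmt"
)

// agreeOnCrossSafe sketches the multi-endpoint mitigation: ask every
// configured Supervisor for the cross-safe head and proceed only if they all
// agree. (Hypothetical helper; multi-endpoint Standard Mode does not exist yet.)
func agreeOnCrossSafe(answers []uint64) (uint64, error) {
	if len(answers) == 0 {
		return 0, errors.New("no Supervisor endpoints configured")
	}
	for _, a := range answers[1:] {
		if a != answers[0] {
			return 0, errors.New("Supervisor endpoints disagree: halt rather than trust either")
		}
	}
	return answers[0], nil
}

func main() {
	head, err := agreeOnCrossSafe([]uint64{210, 210, 210})
	fmt.Println(head, err) // 210 <nil>

	_, err = agreeOnCrossSafe([]uint64{210, 209})
	fmt.Println(err) // endpoints disagree, so the Node halts
}
```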