- The message validity code in the Supervisor is the most core aspect of its implementation. It has reasonable unit testing for database, accessors, APIs, and all validity checking code. It has some E2E tests, and some situationally comprehensive Action tests, as well as local Kurtosis and Devnet exposure. However, this component is not battle tested.
- *Even when* the Supervisor behaves totally correctly, this case may occur if some cross-unsafe data is used to build a block which later becomes invalid due to a reorg on the referenced chain. In this situation, the same outcome is felt by the network.
- Mitigations
- The Sequencer *should* detect a cross-unsafe head stall and issue a reorg on the unsafe chain in order to avoid invalid L1 inclusions. Depending on the heuristic used, this could create regular unsafe reorgs with a low threshold, or larger, less common ones. This also saves operators from wasted L1 fees when a batch would contain unwanted data.
- When promoting local-unsafe to cross-unsafe, the Supervisor can additionally detect whether the data it is stalled on is already cross-safe. If it is, it can proactively notify the Sequencer that the current chain has no hope of being valid, creating an earlier reorg point.
- The Batcher can decline to post beyond the current cross-unsafe head. This avoids publishing bad data, so the Sequencer may reorg it out, saving the replacement-based reorg. If it went on long enough, the Batcher would prevent any new data from being posted to L1, effectively creating a safe-head stall until the Sequencer resolved the issue. This *could* be a preferred scenario for some chains.
- We need to develop and use production-realistic networks in large-scale testing to exercise failure cases and gain confidence that the system behaves and recovers as expected.
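The stall-detection and batcher mitigations above can be sketched as follows. This is a minimal illustrative model, not the real op-stack API: the `StallDetector` class, `Heads` shape, and threshold values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Heads:
    local_unsafe: int   # newest block the sequencer has built
    cross_unsafe: int   # newest block verified against remote chains

class StallDetector:
    """Flags when cross-unsafe lags local-unsafe by too many blocks.

    A small threshold yields frequent, shallow unsafe reorgs; a large
    one yields rarer but deeper reorgs (the trade-off noted above).
    """
    def __init__(self, max_lag_blocks: int):
        self.max_lag = max_lag_blocks

    def should_reorg(self, heads: Heads) -> bool:
        return heads.local_unsafe - heads.cross_unsafe > self.max_lag

def batcher_post_limit(heads: Heads) -> int:
    # The Batcher declines to publish past the cross-unsafe head, so data
    # that may later be invalidated never reaches L1.
    return heads.cross_unsafe

detector = StallDetector(max_lag_blocks=10)
heads = Heads(local_unsafe=120, cross_unsafe=105)
assert detector.should_reorg(heads)       # lag of 15 > 10: reorg the unsafe chain
assert batcher_post_limit(heads) == 105   # batcher stops at the cross-unsafe head
```

The single `max_lag_blocks` knob captures the heuristic choice described above: tuning it trades reorg frequency against reorg depth.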
- Currently, the code used to issue resets to Managed Nodes is insufficiently tested. This is due to limitations in our ability to construct lifelike-yet-erroneous scenarios for Nodes to sync against. As far as it has been battle-tested thus far, Managed Nodes are stable (for example, the “Denver Devnet” has run for 40 days with minimal operations, while supporting real builder traffic).
- Mitigations
- If a reset would significantly roll back the Sequencer, a chain with a Conductor Set *should* be able to identify that the Node is unhealthy and elect a new Active Sequencer. In this case, there would be no interruption to the chain, as the tip is continued by the new Active Sequencer.
- We will be cleaning up the number of Node<>Supervisor messages in current operation, which will allow us to hook closer metrics to this. If a Sequencer ever gets a reset signal, it may be worthy of an alert on its own (even if the reset is due to a legitimate reason, resets are rare enough to be tracked).
- Recovery
- With respect to the Node, an arbitrary amount of re-sync may be required.
- Recovery to the network depends entirely on the impact of the node outage.
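The mitigation above — failing over when a reset rolls the Sequencer back too far, and alerting on every reset signal — could look roughly like this. The function name, rollback threshold, and action labels are illustrative assumptions, not the Conductor's actual interface.

```python
def handle_reset(active_head: int, reset_target: int, max_rollback: int) -> list[str]:
    """Decide what to do when the active sequencer receives a reset signal.

    Resets are rare even when legitimate, so every one raises an alert.
    If the reset would roll the sequencer back further than max_rollback
    blocks, the Conductor Set treats the node as unhealthy and elects a
    new Active Sequencer so the chain tip continues uninterrupted.
    """
    actions = ["alert"]  # always worth tracking, per the mitigation above
    if active_head - reset_target > max_rollback:
        actions.append("elect_new_sequencer")
    return actions

# Shallow reset: alert only; deep reset: alert and fail over.
assert handle_reset(active_head=500, reset_target=490, max_rollback=50) == ["alert"]
assert handle_reset(active_head=500, reset_target=100, max_rollback=50) == ["alert", "elect_new_sequencer"]
```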
- Description
- The Supervisor is managing one Managed Node per chain, and using the data reported from their derivation to calculate cross-safety.
- For some reason, a Managed Node is not able to sync the chain quickly, or at all (perhaps the node is down or is failing).
- Without the data from this Node, the Supervisor cannot advance safety for any other chain which depends on this stalled chain.
- The Supervisor also won’t be able to answer any validity questions about the un-synced portion of the chain (which is why safety doesn’t advance).
- Assuming they take some dependency on the un-syncing chain, other chains will eventually stall their cross-unsafe and cross-safe heads.
- Risk Assessment
- This feature is mostly ready in the happy path, but there are known gaps in managing secondary Managed Nodes during block replacements and resets.
- This feature *also* needs much more robust testing in lifelike environments. Previous development cycles were spent in the devnet trying to enable this feature, which was slow and risky. To get this feature working well, we need to leverage our new network-wide testing abilities.

## FM4c: Supervisor has Insufficient Performance

- Description
- Like FM4b, updates are not happening on the Supervisor quickly enough and it is falling behind.
- In this instance, however, it is due to Supervisor performance, not the Nodes.
- This is a liveness threat to the Supervisor only.
- Nodes that rely on the Supervisor may not be able to get all the queries they need answered, leading to protectively dropped interop transactions.
- Risk Assessment
- Low Impact, Low Likelihood
- The Supervisor does not do strenuous calculations, mostly just DB lookups.
- Nodes and their Execution Engines are likely to be the bottleneck, because they have to process the Gas of the block.
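The stall propagation in FM4b and FM4c can be captured in a toy model: a chain's cross-safe head can only advance as far as every chain it depends on has synced and been verified. This sketch uses simple block heights and made-up chain names; the real protocol compares more than a single number per chain.

```python
def cross_safe_head(chain: str, local_safe: dict, synced: dict, deps: dict) -> int:
    """A chain's cross-safe head is its own local-safe head, capped by how
    far each dependency's Managed Node has synced (and been verified)."""
    return min(local_safe[chain], *(synced[d] for d in deps[chain]))

local_safe = {"A": 100, "B": 100}   # what each chain derived locally
synced     = {"A": 100, "B": 40}    # B's Managed Node is stalled at height 40
deps       = {"A": ["B"], "B": ["A"]}

# A depends on B, so A's cross-safety stalls at B's synced height,
# even though A itself is fully synced:
assert cross_safe_head("A", local_safe, synced, deps) == 40
```

This is why a single stalled (or slow) node, or a slow Supervisor, eventually stalls cross-safety for every chain that depends on it.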
## FM5: Increased Operational/Infrastructure Load Results in Fewer Nodes

- Description
- In the current design, validating a given chain in the Superchain requires validation of *all* chains. Conceptually, this is unavoidable because inter-chain message validity requires inter-chain validation.
- The way a user would achieve this today is by running one Node (`op-geth` and `op-node`) for each Chain in the Interoperating Set, and additionally running a Supervisor to manage these Managed Nodes.
- If operators want redundancy, they are advised to create entire redundant stacks (N Nodes and a Supervisor).
- Operators may not appreciate the increased burden and could decline to validate the network at all.
- Risk Assessment
- Sliding Scale Impact and Likelihood
- Network security is not based on the number of validators, but fewer people running the OP Stack means less validation, mindshare, etc.
- In Standard Mode, a Node *is not* managed by a Supervisor, and instead runs its own L1 derivation.
- Periodically, the Node reaches out to *some* trusted Supervisor to fetch the Cross-Heads and any Replacement signals.
- The Node uses this data to confirm that the chain it has derived is also cross-safe, and to handle invalid blocks.
- With Standard Mode, operators would only *need* to run a single Node, in exchange for the interop aspects of validation being a trusted activity. We expect this Mode to be valuable for offsetting operator burden in cases where fully trustless validation isn't critical.
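The Standard Mode reconciliation step described above can be sketched as follows. The `reconcile` function and its data shapes are hypothetical; a real node would consume the Supervisor's RPC rather than plain dicts.

```python
def reconcile(local_derived: dict, cross_head: int, replacements: dict) -> dict:
    """Apply trusted Cross-Head and Replacement data to locally derived blocks.

    local_derived: height -> block hash (from the node's own L1 derivation)
    cross_head:    highest height the trusted Supervisor reports as cross-safe
    replacements:  height -> replacement block hash (invalid-block signals)
    Returns height -> (hash, label), label being 'cross-safe' or 'local-only'.
    """
    view = {}
    for height, block_hash in local_derived.items():
        block_hash = replacements.get(height, block_hash)  # handle invalid blocks
        label = "cross-safe" if height <= cross_head else "local-only"
        view[height] = (block_hash, label)
    return view

view = reconcile({1: "0xaa", 2: "0xbb", 3: "0xcc"}, cross_head=2, replacements={2: "0xdd"})
assert view[1] == ("0xaa", "cross-safe")
assert view[2] == ("0xdd", "cross-safe")   # block 2 was replaced per the Supervisor
assert view[3] == ("0xcc", "local-only")   # beyond the trusted cross-head
```

Note that everything the node cannot verify itself — the cross-head and the replacement set — comes from the trusted endpoint, which is exactly the exposure FM6 examines.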
## FM6: Standard Mode Nodes Trust a Malicious Supervisor

- The result depends on the way in which the data is incorrect:
- If the data ignores a block Invalidation/Replacement, the Node will be deceived into following a chain where an invalid interop message exists, allowing for invalid interop behaviors (like inappropriate minting) *on that Node* (and presumably on all Nodes that also trust this Supervisor).
- If the data claims a block should be replaced, the Node would similarly perform the replacement, leaving the Node in that state.
- If the data simply isn't consistent with what the Node independently derived, the Node has nothing to be deceived about, but also can't make forward progress, and would need to halt at this point.
- Risk Assessment
- Low Impact, Unknown Likelihood
- If there were a tactical reason to use a Trusted Supervisor Endpoint to fool a Node as part of a larger exploit, this would be an attractive thing to attempt.
- Most entities who would provide a Trusted Endpoint have an intrinsic incentive to keep their endpoint valid and secure, much as a provider like Alchemy doesn't want the Node you connect to to lie.
- At most, a dishonest Supervisor can mislead the Nodes connected to it. So long as block producers *are not* using one of these Trusted Endpoints, block production is unaffected.
- Mitigation
- Standard Mode could allow for *multiple* Supervisor Endpoints to be specified; the Node could confirm that all endpoints agree, preventing dishonesty from one party from deceiving the Node.
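The multi-endpoint mitigation amounts to a unanimity check before trusting any cross-safety data. A minimal sketch, with endpoints modeled as plain dicts rather than real Supervisor RPC responses:

```python
def agreed_view(views: list[dict]):
    """Return the common view if every queried Supervisor endpoint agrees,
    or None if any endpoint disagrees (the Node should halt rather than
    follow a view that a single dishonest party could have planted)."""
    if not views:
        return None
    first = views[0]
    return first if all(v == first for v in views[1:]) else None

honest    = {"cross_head": 90, "replacements": {}}
dishonest = {"cross_head": 95, "replacements": {}}  # e.g. hides a replacement

assert agreed_view([honest, honest]) == honest   # unanimous: safe to use
assert agreed_view([honest, dishonest]) is None  # disagreement: trust nothing
```

Requiring unanimity means one honest endpoint is enough to prevent deception, at the cost that one faulty endpoint can stall the Node — the same liveness/safety trade-off noted in the halt case above.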
# Action Item Summary

Across all these Failure Modes, the following are explicitly identified improvements and mitigations we should make soon:

- Alternative implementations should exist to catch instances where the Supervisor has a bug.