feat: Operator Doppelgänger Protection #692
base: unstable
Conversation
Add configuration options for operator doppelgänger protection:
- --operator-dg: Enable/disable the feature (default: true)
- --operator-dg-wait-epochs: Epochs to wait in monitor mode (default: 2)
- --operator-dg-fresh-k: Freshness threshold for twin detection (default: 3)
This implements the configuration layer for issue sigp#627, allowing operators to detect if their operator ID is already active elsewhere and shut down to prevent equivocation.
Related to sigp#627
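The three flags above map onto a plain config struct; a minimal sketch, assuming hypothetical names (`OperatorDgConfig` and its fields are illustrative, not the PR's actual types):

```rust
/// Illustrative sketch of the configuration described in the commit message.
/// Field names mirror the CLI flags; defaults match the documented values.
#[derive(Debug, Clone)]
pub struct OperatorDgConfig {
    /// --operator-dg: enable/disable the feature.
    pub enabled: bool,
    /// --operator-dg-wait-epochs: epochs to wait in monitor mode.
    pub wait_epochs: u64,
    /// --operator-dg-fresh-k: freshness threshold for twin detection.
    pub fresh_k: u64,
}

impl Default for OperatorDgConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            wait_epochs: 2,
            fresh_k: 3,
        }
    }
}
```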
Add a new service module for detecting operator doppelgängers:
- State machine with Monitor/Active modes
- Track recent max consensus height per committee
- Freshness threshold (K) to prevent false positives from replays
- Check for single-signer messages with own operator ID
- Comprehensive unit tests for state machine logic
The service detects if the operator's ID appears in fresh SSV messages during monitor mode, indicating another instance is already running.
Part of sigp#627
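The per-committee height tracking and freshness threshold described above can be sketched roughly as follows (type and method names are assumptions based on the commit message, with plain `u64` ids standing in for the real committee and height types):

```rust
use std::collections::HashMap;

/// Minimal sketch of the freshness tracking: a message is "fresh" if its
/// height is within `fresh_k` of the maximum height observed so far for
/// its committee, which filters out replayed old messages.
#[derive(Default)]
pub struct HeightTracker {
    // committee id -> maximum observed consensus height
    max_heights: HashMap<u64, u64>,
}

impl HeightTracker {
    /// Check freshness against the current state (no update).
    pub fn is_fresh(&self, committee: u64, height: u64, fresh_k: u64) -> bool {
        match self.max_heights.get(&committee) {
            // No history for this committee yet: treat the message as fresh.
            None => true,
            Some(&max) => height + fresh_k >= max,
        }
    }

    /// Record a height, keeping the per-committee maximum.
    pub fn update_max_height(&mut self, committee: u64, height: u64) {
        let entry = self.max_heights.entry(committee).or_insert(height);
        if height > *entry {
            *entry = height;
        }
    }
}
```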
Integrates the operator doppelgänger protection with the message flow:
- Adds DoppelgangerConfig struct to message_receiver for checking messages
- NetworkMessageReceiver checks QBFT messages for doppelgänger detection
- Client initializes doppelgänger service when operator_dg is enabled
- Fatal shutdown triggered via TaskExecutor when twin operator detected
- Spawns background task to listen for shutdown signal
- Tests updated to use correct CommitteeId constructor ([u8; 32])
- Allows clippy::too_many_arguments for NetworkMessageReceiver::new
This implements the core detection and shutdown mechanism specified in sigp#627 (comment)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
anchor/client/src/cli.rs
Outdated
```rust
#[clap(
    long,
    value_name = "HEIGHTS",
    help = "The freshness threshold for detecting operator twins. Only messages within \
            this many consensus heights from the maximum observed height are considered \
            fresh evidence of a twin. This prevents false positives from replayed old messages.",
    display_order = 0,
    default_value_t = 3,
    requires = "operator_dg"
)]
pub operator_dg_fresh_k: u64,
```
I do not think this is necessary. Have you tested this and confirmed there to be an issue without this feature?
Could you elaborate more on why you think it's not necessary?
I understand now that this is about preventing replay attacks, not necessarily about this occurring naturally.
This isn't about replay attacks. What makes you think so?
> What makes you think so?
The sentence "This prevents false positives from replayed old messages."
If another node maliciously holds onto one of our messages, they can use it to crash us once we do not have it in the duplicate cache anymore (due to cache timeout or node restart).
> What makes you think so?
> The sentence "This prevents false positives from replayed old messages."
> If another node maliciously holds onto one of our messages, they can use it to crash us once we do not have it in the duplicate cache anymore (due to cache timeout or node restart).
The sentence is misleading; the idea is a protection from a protocol behavior. I thought about this other malicious case, but I don't know a solution to it yet. Not only is our gossipsub seen cache empty, but our message counts in validation as well.
It's still a heavily drafted PR; it is not ready for review.
Follows codebase pattern of handling slot_clock.now() returning None explicitly rather than silently falling back to Epoch 0. The current epoch is now required as a parameter to the service constructor, following the pattern used by other services in the codebase.
Add explicit error handling when slot_clock.now() returns None. If we can't read the current slot, we can't reliably determine the epoch or update the mode, so we skip the doppelgänger check and log a warning.
Replace RwLock with Mutex for DoppelgangerState since all operations are fast (HashMap lookups/updates) and the RwLock complexity isn't justified.
Changes:
- Replace Arc<RwLock<DoppelgangerState>> with Arc<Mutex<DoppelgangerState>>
- Add update_and_check_freshness() method that atomically updates max height and checks freshness in one lock acquisition
- Make update_max_height() and is_fresh() private helper methods
- Simplify check_message() logic by removing drop/re-acquire pattern
This provides a cleaner API surface with better separation of concerns:
- State handles data operations
- Service handles policy decisions
Stale messages during doppelgänger monitoring are expected (network delays, replays) and not actionable. Using debug level reduces noise while keeping the information available for troubleshooting.
The enabled check is already done in lib.rs - we only create the service if operator_dg is true and impostor mode is disabled. No need to store and check an enabled flag in the service itself.
Add high-value tests covering core check_message functionality:
- Twin detection with fresh single-signer messages
- Stale message filtering (beyond fresh_k window)
- Multi-signer aggregate message handling
- Different operator ID filtering
- Monitoring period transitions
- Freshness window boundaries
- Independent committee height tracking
- Height progression scenarios
Total: 14 service tests + 5 state tests = 19 passing tests
…ring wait

Addresses PR review feedback about waiting for the monitoring period to complete before starting validator services.
Changes:
- Extract ~50 lines of initialization code into `initialize_operator_doppelganger()` helper
- Add `watch::Receiver<bool>` to broadcast monitoring status (similar to `is_synced` pattern)
- Add `spawn_monitor_task()` method to encapsulate background epoch monitoring task
- Add explicit `transition_to_active()` method for state transition
- Client now waits for monitoring completion before starting services (matches sync wait pattern)
- Remove `update_mode()` in favor of explicit `set_active()` call
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Improvements:
- Add error logging for watch channel send failures
- Remove unused parameters from DoppelgangerState::new()
- Document hardcoded 12-second sleep duration with note about network flexibility
- Change #[allow(dead_code)] to #[cfg(test)] for test-only methods
- Add #[must_use] annotations to methods returning important values
- Remove unused Epoch import
- Remove redundant test_monitoring_mode_transition test (duplicated by test_no_twin_after_monitoring_period_ends)
All 18 tests passing, no compiler warnings.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
The background monitoring task was using a hardcoded 12-second sleep interval. This change parameterizes the slot duration, deriving it from the network specification (spec.seconds_per_slot) to support different networks with varying slot durations.
Changes:
- Add slot_duration field to OperatorDoppelgangerService struct
- Update new() signature to accept slot_duration parameter
- Update spawn_monitor_task() to use self.slot_duration
- Update all call sites to pass Duration::from_secs(spec.seconds_per_slot)
- Update tests to pass slot_duration from TestingSlotClock
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
…ations

The `update_and_check_freshness()` method had a semantic problem: it updated the height tracker even for stale messages (likely replays), and it checked freshness against an already-updated state. This refactoring separates the operations:
1. Check if the message is fresh (against current state)
2. Only update height tracking if the message is fresh
Benefits:
- Clearer code: each operation does one thing
- More correct: we only track fresh messages, not stale replays
- No confusion about order of operations
- Better separation of concerns
Changes:
- Remove `update_and_check_freshness()` method from DoppelgangerState
- Make `is_fresh()` and `update_max_height()` public
- Update `check_message()` to call methods separately
- Add documentation clarifying usage order
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
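The corrected order of operations (check freshness against the current state first, update tracking only for fresh messages, then test for a single-signer message carrying our own ID) can be sketched as follows; all names and the plain `u64` ids are illustrative assumptions, not the PR's actual API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Sketch of the check order described in the commit message: stale replays
/// never advance the height tracker, and only fresh single-signer messages
/// from our own operator ID count as twin evidence.
pub struct DoppelgangerCheck {
    own_operator_id: u64,
    fresh_k: u64,
    max_heights: Mutex<HashMap<u64, u64>>, // committee -> max observed height
}

impl DoppelgangerCheck {
    pub fn new(own_operator_id: u64, fresh_k: u64) -> Self {
        Self {
            own_operator_id,
            fresh_k,
            max_heights: Mutex::new(HashMap::new()),
        }
    }

    /// Returns true if a fresh single-signer message from our own ID is seen.
    pub fn check_message(&self, signers: &[u64], committee: u64, height: u64) -> bool {
        let mut heights = self.max_heights.lock().unwrap();
        // 1. Check freshness against the *current* state.
        let fresh = heights
            .get(&committee)
            .map_or(true, |&max| height + self.fresh_k >= max);
        if !fresh {
            return false; // stale replay: ignore, and do not update tracking
        }
        // 2. Only fresh messages advance the per-committee max height.
        let entry = heights.entry(committee).or_insert(height);
        if height > *entry {
            *entry = height;
        }
        // 3. A fresh single-signer message with our own ID indicates a twin.
        signers.len() == 1 && signers[0] == self.own_operator_id
    }
}
```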
Removed three low-value tests that don't add meaningful coverage:
1. test_monitoring_state_persists
   - Had misleading comment about message checking
   - Didn't test actual behavior (background task not started)
   - Redundant with test_no_twin_after_monitoring_period_ends
2. test_different_fresh_k_values
   - Only tested that the constructor accepts different K values
   - No verification of behavioral differences
   - Already covered by test_service_creation
3. test_increasing_heights_all_fresh
   - Tested obvious behavior (sequential heights are fresh)
   - Fully covered by test_freshness_window_boundary
   - Redundant sanity check with no additional value
All remaining tests cover meaningful scenarios and edge cases.
Test count: 18 → 15 tests
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
The test_initial_state test is completely redundant with test_mode_transition, which already verifies the initial state as part of testing the state transition.
Before:
- test_initial_state: checks it starts in Monitor mode
- test_mode_transition: checks it starts in Monitor, then transitions
After:
- test_mode_transition: covers both (initial state + transition)
Test count: 15 → 14 tests
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
This comment was marked as outdated.
I'll analyze this and get back to you.
- Replace DoppelgangerMode enum with boolean monitoring flag
- Remove unnecessary #[must_use] attributes on getters
- Remove redundant test_grace_period_initially_active
- Simplify logging by removing defensive operator_id checks
- Remove unused operator_id parameter from start_operator_doppelganger
The "Active" state didn't represent actual behavior, just "not monitoring". A simple boolean is clearer and more maintainable.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
- Remove slot_clock and monitor_end_epoch from service struct
- Replace polling loop with single sleep for total duration (grace_period + monitoring_duration)
- Remove unused watch channel for broadcasting monitoring state
- Simplify new() signature to only require essential parameters
- Rename transition_to_active() to end_monitoring_period()
- Update tests to match simplified service creation
Best practice: for short, predictable timers (~10-15 minutes), a single sleep is more efficient and simpler than polling. The monitoring duration is known at startup, so checking every slot is unnecessary overhead.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
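The single-sleep idea can be sketched with plain std threads (the PR uses a tokio task; `spawn_monitor_task` here is an illustrative stand-in, and durations would be minutes in practice): since the total window is known at startup, one sleep per phase replaces a per-slot polling loop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Sketch: sleep through the grace period, then through the monitoring
/// window, then flip a single flag to end monitoring. No polling needed.
fn spawn_monitor_task(
    grace_period: Duration,
    monitoring_duration: Duration,
    monitoring: Arc<AtomicBool>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // Grace period: old gossip may still be replayed; detection waits.
        thread::sleep(grace_period);
        // Monitoring window: watch for our own operator ID in fresh messages.
        thread::sleep(monitoring_duration);
        // Done: end the monitoring period with one flag flip.
        monitoring.store(false, Ordering::Release);
    })
}
```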
The function was just calling .new() and wrapping in Arc - no logic, validation, or transformation. Inlined at call site for simplicity.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
It has been simplified substantially.
The service was generic over <E: EthSpec, S: SlotClock> but:
- E was only used for E::slots_per_epoch() - now stored as a u64 field
- S was only in PhantomData after removing the slot_clock field
Simplification:
- Changed from OperatorDoppelgangerService<E, S> to OperatorDoppelgangerService
- Added slots_per_epoch: u64 field (passed to new())
- Removed PhantomData and all generic constraints
- Updated NetworkMessageReceiver to drop the E generic
- Simplified all call sites
Benefits:
- Simpler type signatures
- No phantom data noise
- More explicit about what's needed (the slots_per_epoch value, not an entire EthSpec)
- Easier to test (no need for MainnetEthSpec/MinimalEthSpec types)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Replace manual state manipulation in tests with async timer tests that verify spawn_monitor_task() correctly transitions through states via tokio timers.
Changes:
- Make DoppelgangerState::end_grace_period() pub(crate) (internal API)
- Add test helper create_test_executor() for async test infrastructure
- Add async tests using tokio::time::{pause, advance} for deterministic timing:
  - test_twin_detected_after_grace_period_timer: verifies grace period ends via timer
  - test_no_twin_during_grace_period: verifies no detection during grace period
  - test_no_twin_after_monitoring_period_timer: verifies monitoring ends via timer
- Add dev-dependencies: async-channel, tokio test-util feature
Tests now verify actual async timer behavior instead of manually calling state transitions, providing better coverage of the spawned task logic.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Replace all manual state manipulation with async timer tests using a helper function to eliminate direct state access from tests.
Changes:
- Add spawn_and_advance_past_grace_period() helper to reduce test duplication
- Convert test_no_twin_multi_signer_aggregate_message to an async timer test
- Convert test_no_twin_different_operator_id to an async timer test
- Refactor test_twin_detected_after_grace_period_timer to use the helper
- Keep test_service_creation as sync (tests initial state only)
All detection logic tests now verify actual timer behavior via tokio::time::advance(), following Tokio testing best practices. Tests are more maintainable and realistic.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Replace yield loops (0..10 iterations) with single yield_now() calls after time advancement. According to the tokio documentation, tasks waiting on timers complete immediately when time advances, but may not be polled until the next yield point.
Changes:
- spawn_and_advance_past_grace_period(): use a single yield after advance
- test_no_twin_after_monitoring_period_timer(): use a single yield after each advance
Verified stable across multiple test runs. Reduces code duplication while maintaining reliability.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Replace two boolean fields (monitoring, in_grace_period) with a clear three-state enum (GracePeriod → Monitoring → Disabled). This eliminates impossible states and makes the state machine explicit.
Changes:
- Convert DoppelgangerState from a struct with bools to an enum with three variants
- Add PartialEq, Eq derives for better testing
- Document state transitions clearly in type-level documentation
- Update is_monitoring() to use pattern matching
- Update transition methods to use explicit enum assignment
- All existing tests pass without modification
Benefits:
- Impossible states (e.g., monitoring=false, in_grace_period=true) are now prevented by the type system
- State transitions are explicit and clear
- Better compile-time guarantees about state validity
- More maintainable and self-documenting
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
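A sketch of the three-state machine this commit describes (variant names are taken from the commit message; the transition methods are assumptions about the shape of the API):

```rust
/// GracePeriod → Monitoring → Disabled. The enum makes the invalid
/// combination "not monitoring but still in grace period" unrepresentable.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DoppelgangerState {
    /// Waiting for old gossip to expire; detection not yet active.
    GracePeriod,
    /// Actively watching for our own operator ID in fresh messages.
    Monitoring,
    /// Monitoring finished; protection no longer active.
    Disabled,
}

impl DoppelgangerState {
    pub fn is_monitoring(self) -> bool {
        matches!(self, DoppelgangerState::Monitoring)
    }

    /// GracePeriod → Monitoring; other states are left unchanged.
    pub fn end_grace_period(&mut self) {
        if *self == DoppelgangerState::GracePeriod {
            *self = DoppelgangerState::Monitoring;
        }
    }

    /// Monitoring → Disabled; other states are left unchanged.
    pub fn end_monitoring_period(&mut self) {
        if *self == DoppelgangerState::Monitoring {
            *self = DoppelgangerState::Disabled;
        }
    }
}
```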
Force-pushed from e8ed128 to 5d76d33 (Compare)
- Remove state.rs module and keep DoppelgangerState private in service.rs
- Simplify tests to use the public API (is_monitoring) instead of internal state comparisons
- Remove redundant assert_eq checks in favor of behavior-focused assertions
- Fix clippy lint for identity operation in test
Force-pushed from 5d76d33 to 2aa1ffb (Compare)
- Remove slot_clock (unused)
- Move types to dev-dependencies (only used in tests)
Force-pushed from 2f32d96 to 56ef9a8 (Compare)
Claude finished @diegomrsantos's task (View job)

Code Review: Operator Doppelgänger Protection

Comprehensive review of the simplified implementation completed. The refactoring has addressed most of the previous concerns; excellent work on the simplification!

Summary

Overall Assessment: This is a well-designed and significantly improved implementation. The recent simplification (removing epoch tracking, generics, and watch channels) has made the code much cleaner while maintaining safety guarantees. The implementation is production-ready with minor suggestions for polish.

Key Strengths: (…)

Recommendation: ✅ Approve with minor suggestions. The critical issues from previous reviews have been resolved through the simplification.

Architecture & Design ✅

The refactoring successfully addressed @dknopik's "needlessly complex" feedback. Before: ~200 LOC service with generics, watch channels, and an epoch polling loop. This is a textbook example of refactoring for simplicity while preserving correctness.

Code Quality

1. State Management ✅
The state enum is clean and well-documented: GracePeriod → Monitoring → Completed. The documentation explaining the grace period (lines 24-41) is excellent: it clearly explains the gossipsub cache vulnerability and why we need to wait. Minor suggestion: consider adding a state transition diagram in the module-level docs for discoverability.

2. Timer Logic ✅
The new single-sleep approach is much simpler:

```rust
tokio::time::sleep(grace_period).await;
self.state.lock().end_grace_period();
tokio::time::sleep(monitoring_duration).await;
self.end_monitoring_period();
```

This replaces the previous slot-by-slot polling loop, eliminating complexity and potential timing bugs.

3. Detection Logic ✅
Good: separation of detection (…)

Security Considerations

1. Grace Period Calculation ✅

```rust
pub const OPERATOR_DOPPELGANGER_GRACE_PERIOD_SECS: u64 = {
    let cache_window_millis = GOSSIPSUB_HISTORY_LENGTH as u64 * GOSSIPSUB_HEARTBEAT_INTERVAL_MILLIS;
    cache_window_millis / 1000 + 1
};
```

Excellent: automatically derived from gossipsub parameters (4.2s cache + 1s buffer = 5s). This will stay synchronized if the gossipsub parameters change. Consideration: the +1 second buffer is reasonable but not explicitly justified. Consider documenting why 1s is sufficient (network delays, clock skew, etc.). See inline comment.

2. Message Filtering ✅
The single-signer check correctly prevents false positives. Note: the previous concern about partial signatures not being checked has been addressed; both message types are now checked.

3. Shutdown Safety ✅

```rust
let _ = self
    .shutdown_sender
    .lock()
    .try_send(ShutdownReason::Failure("Operator doppelgänger detected"));
```

Good: uses try_send. Minor suggestion: consider logging if the send fails, though in practice this shouldn't happen.

Performance

1. Lock Contention
```rust
/// Default: 5s (cache window is 4.2s)
pub const OPERATOR_DOPPELGANGER_GRACE_PERIOD_SECS: u64 = {
    let cache_window_millis = GOSSIPSUB_HISTORY_LENGTH as u64 * GOSSIPSUB_HEARTBEAT_INTERVAL_MILLIS;
    cache_window_millis / 1000 + 1
};
```
Documentation suggestion: Consider documenting why +1 second buffer is sufficient.
The calculation is clear, but readers might wonder if 1s accounts for:
- Network propagation delays
- Clock skew between nodes
- Processing time variations
Something like:

```rust
// Add 1 second buffer for network delays and clock skew
cache_window_millis / 1000 + 1
```
This helps future maintainers understand the safety margin.
```rust
long,
help = "Enable operator doppelgänger protection. When enabled, the node will monitor \
        for messages signed by its operator ID on startup and shut down if a twin \
        (duplicate operator) is detected. Enabled by default.",
```
Documentation improvement: Consider enhancing the help text to include:

```rust
help = "Enable operator doppelgänger protection. When enabled, the node will monitor \
        for messages signed by its operator ID on startup and shut down if a twin \
        (duplicate operator) is detected. During the monitoring period (~2 epochs + 5s grace), \
        incoming duties will be dropped to ensure safe detection. \
        Use --operator-dg=false to disable. Enabled by default."
```
This clarifies:
- What happens during monitoring (duties dropped)
- How to disable the feature
- Approximate monitoring duration
Helps operators understand the behavior without reading source code.
```rust
ValidatedSSVMessage::PartialSignatureMessages(_) => None,
};
service.check_message(&signed_ssv_message, qbft_msg);
// Drop message during monitoring period - don't process
```
Observability suggestion: Consider adding metrics or periodic logging for messages dropped during monitoring.
This would help operators understand:
- That monitoring is active
- How many duties are being skipped
- When monitoring period ends
Example:

```rust
if service.is_monitoring() {
    // ... existing check_message logic ...
    // Optional: track dropped messages
    metrics::inc_counter(&metrics::DOPPELGANGER_MESSAGES_DROPPED);
    return;
}
```
Not critical, but useful for operational visibility.
Replace Mutex with RwLock for DoppelgangerState to eliminate lock contention in the hot path (message validation). Multiple readers can now check state concurrently without blocking each other.
Benefits:
- Lock-free reads after monitoring completes (entire node lifetime)
- Minimal overhead during the brief monitoring period (2 epochs)
- Cleaner API than an AtomicU8 alternative
- Standard Rust pattern for read-heavy data
Impact: every incoming message checks is_doppelganger(); this optimization removes unnecessary lock contention for >99.9% of node uptime (after the monitoring period ends).
Add message blocking during the operator doppelgänger monitoring period to prevent the node from sending any messages while detecting potential twin operators. This ensures complete silence during the grace period and monitoring.
Changes:
- Add DoppelgangerMonitoring error variant to MessageSender
- Pass the doppelganger service to NetworkMessageSender
- Add monitoring checks to sign_and_send() and send() methods
- Move doppelganger service creation earlier in client initialization
Btw, thinking about it more, doesn't doing it after validation possibly mask detection of some messages? E.g. if we and a doppelganger both send a QBFT propose message, validation will only consider our message as valid, as we only accept one propose per round. So I think we only need to make sure that the signature is valid and that the message is fresh for doppelganger protection.
Issue Addressed
Closes #627
Proposed Changes
Implements Operator Doppelgänger Protection to prevent accidental slashing when multiple instances of Anchor run with the same operator ID. The feature monitors the network during a configurable waiting period to detect if another instance is already active.
Core Implementation
- State Management (operator_doppelganger/state.rs)
- Service Layer (operator_doppelganger/service.rs)
- Integration (lib.rs, message_receiver/manager.rs)
- Grace Period Configuration (network/behaviour.rs): (history_length × heartbeat_interval) / 1000 + 1

Configuration

- --operator-dg: Enable protection (default: true)
- --operator-dg-wait-epochs: Monitoring period in epochs (default: 2)

Grace Period Rationale
After restart, gossipsub seen_cache is cleared but messages remain in peers' mcache for up to 4.2s. Without a grace period, the node would receive its own old messages via IHAVE/IWANT and incorrectly detect a twin. The grace period ensures all old messages expire before detection starts.
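The derivation above can be sketched as a const computation. The gossipsub parameter values below (history length 6, heartbeat 700 ms) are assumptions chosen to match the 4.2 s cache window mentioned in the review thread; the PR derives them from the actual network behaviour config:

```rust
// Assumed gossipsub parameters (illustrative; 6 * 700 ms = 4.2 s cache window).
const GOSSIPSUB_HISTORY_LENGTH: usize = 6;
const GOSSIPSUB_HEARTBEAT_INTERVAL_MILLIS: u64 = 700;

// (history_length × heartbeat_interval) / 1000, plus 1 second of buffer:
// 4200 ms / 1000 + 1 = 5 seconds.
const OPERATOR_DOPPELGANGER_GRACE_PERIOD_SECS: u64 = {
    let cache_window_millis =
        GOSSIPSUB_HISTORY_LENGTH as u64 * GOSSIPSUB_HEARTBEAT_INTERVAL_MILLIS;
    cache_window_millis / 1000 + 1
};
```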
Testing
9 tests covering:
Additional Info
Design Decisions:
Future Considerations: