feat: Operator Doppelgänger Protection #692
base: unstable
Conversation
Add configuration options for operator doppelgänger protection:
- --operator-dg: Enable/disable the feature (default: true)
- --operator-dg-wait-epochs: Epochs to wait in monitor mode (default: 2)
- --operator-dg-fresh-k: Freshness threshold for twin detection (default: 3)
This implements the configuration layer for issue sigp#627, allowing operators to detect if their operator ID is already active elsewhere and shut down to prevent equivocation.
Related to sigp#627
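The three flags above map onto a plain config struct; a minimal sketch, assuming hypothetical names (`OperatorDgConfig` and its fields are illustrative, not the PR's actual types):

```rust
/// Illustrative sketch of the configuration described in the commit message.
/// Field names mirror the CLI flags; defaults match the documented values.
#[derive(Debug, Clone)]
pub struct OperatorDgConfig {
    /// --operator-dg: enable/disable the feature.
    pub enabled: bool,
    /// --operator-dg-wait-epochs: epochs to wait in monitor mode.
    pub wait_epochs: u64,
    /// --operator-dg-fresh-k: freshness threshold for twin detection.
    pub fresh_k: u64,
}

impl Default for OperatorDgConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            wait_epochs: 2,
            fresh_k: 3,
        }
    }
}
```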
Add a new service module for detecting operator doppelgängers:
- State machine with Monitor/Active modes
- Track recent max consensus height per committee
- Freshness threshold (K) to prevent false positives from replays
- Check for single-signer messages with own operator ID
- Comprehensive unit tests for state machine logic
The service detects if the operator's ID appears in fresh SSV messages during monitor mode, indicating another instance is already running.
Part of sigp#627
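The per-committee height tracking and freshness threshold described above can be sketched roughly as follows (type and method names are assumptions based on the commit message, with plain `u64` ids standing in for the real committee and height types):

```rust
use std::collections::HashMap;

/// Minimal sketch of the freshness tracking: a message is "fresh" if its
/// height is within `fresh_k` of the maximum height observed so far for
/// its committee, which filters out replayed old messages.
#[derive(Default)]
pub struct HeightTracker {
    // committee id -> maximum observed consensus height
    max_heights: HashMap<u64, u64>,
}

impl HeightTracker {
    /// Check freshness against the current state (no update).
    pub fn is_fresh(&self, committee: u64, height: u64, fresh_k: u64) -> bool {
        match self.max_heights.get(&committee) {
            // No history for this committee yet: treat the message as fresh.
            None => true,
            Some(&max) => height + fresh_k >= max,
        }
    }

    /// Record a height, keeping the per-committee maximum.
    pub fn update_max_height(&mut self, committee: u64, height: u64) {
        let entry = self.max_heights.entry(committee).or_insert(height);
        if height > *entry {
            *entry = height;
        }
    }
}
```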
Integrates the operator doppelgänger protection with the message flow:
- Adds DoppelgangerConfig struct to message_receiver for checking messages
- NetworkMessageReceiver checks QBFT messages for doppelgänger detection
- Client initializes doppelgänger service when operator_dg is enabled
- Fatal shutdown triggered via TaskExecutor when twin operator detected
- Spawns background task to listen for shutdown signal
- Tests updated to use correct CommitteeId constructor ([u8; 32])
- Allows clippy::too_many_arguments for NetworkMessageReceiver::new
This implements the core detection and shutdown mechanism specified in sigp#627 (comment)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
anchor/client/src/cli.rs
Outdated
```rust
#[clap(
    long,
    value_name = "HEIGHTS",
    help = "The freshness threshold for detecting operator twins. Only messages within \
            this many consensus heights from the maximum observed height are considered \
            fresh evidence of a twin. This prevents false positives from replayed old messages.",
    display_order = 0,
    default_value_t = 3,
    requires = "operator_dg"
)]
pub operator_dg_fresh_k: u64,
```
I do not think this is necessary. Have you tested this and confirmed there to be an issue without this feature?
Could you elaborate more on why you think it's not necessary?
I understand now that this is about preventing replay attacks, not necessarily about this occurring naturally.
This isn't about replay attacks. What makes you think so?
> What makes you think so?
The sentence "This prevents false positives from replayed old messages."
If another node maliciously holds onto one of our messages, they can use it to crash us once we do not have it in the duplicate cache anymore (due to cache timeout or node restart).
> What makes you think so?
> The sentence "This prevents false positives from replayed old messages."
> If another node maliciously holds onto one of our messages, they can use it to crash us once we do not have it in the duplicate cache anymore (due to cache timeout or node restart).
The sentence is misleading; the idea is a protection from a protocol behavior. I thought about this other malicious case, but I don't know a solution to it yet. Not only is our gossipsub seen cache empty, but our message counts in validation as well.
It's still a heavily drafted PR; it is not ready for review.
Follows codebase pattern of handling slot_clock.now() returning None explicitly rather than silently falling back to Epoch 0. The current epoch is now required as a parameter to the service constructor, following the pattern used by other services in the codebase.
Add explicit error handling when slot_clock.now() returns None. If we can't read the current slot, we can't reliably determine the epoch or update the mode, so we skip the doppelgänger check and log a warning.
Replace RwLock with Mutex for DoppelgangerState since all operations are fast (HashMap lookups/updates) and the RwLock complexity isn't justified.
Changes:
- Replace Arc<RwLock<DoppelgangerState>> with Arc<Mutex<DoppelgangerState>>
- Add update_and_check_freshness() method that atomically updates max height and checks freshness in one lock acquisition
- Make update_max_height() and is_fresh() private helper methods
- Simplify check_message() logic by removing drop/re-acquire pattern
This provides a cleaner API surface with better separation of concerns:
- State handles data operations
- Service handles policy decisions
Stale messages during doppelgänger monitoring are expected (network delays, replays) and not actionable. Using debug level reduces noise while keeping the information available for troubleshooting.
The enabled check is already done in lib.rs - we only create the service if operator_dg is true and impostor mode is disabled. No need to store and check an enabled flag in the service itself.
Add high-value tests covering core check_message functionality:
- Twin detection with fresh single-signer messages
- Stale message filtering (beyond fresh_k window)
- Multi-signer aggregate message handling
- Different operator ID filtering
- Monitoring period transitions
- Freshness window boundaries
- Independent committee height tracking
- Height progression scenarios
Total: 14 service tests + 5 state tests = 19 passing tests
…ring wait

Addresses PR review feedback about waiting for the monitoring period to complete before starting validator services.
Changes:
- Extract ~50 lines of initialization code into `initialize_operator_doppelganger()` helper
- Add `watch::Receiver<bool>` to broadcast monitoring status (similar to `is_synced` pattern)
- Add `spawn_monitor_task()` method to encapsulate background epoch monitoring task
- Add explicit `transition_to_active()` method for state transition
- Client now waits for monitoring completion before starting services (matches sync wait pattern)
- Remove `update_mode()` in favor of explicit `set_active()` call
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Improvements:
- Add error logging for watch channel send failures
- Remove unused parameters from DoppelgangerState::new()
- Document hardcoded 12-second sleep duration with note about network flexibility
- Change #[allow(dead_code)] to #[cfg(test)] for test-only methods
- Add #[must_use] annotations to methods returning important values
- Remove unused Epoch import
- Remove redundant test_monitoring_mode_transition test (duplicated by test_no_twin_after_monitoring_period_ends)
All 18 tests passing, no compiler warnings.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
The background monitoring task was using a hardcoded 12-second sleep interval. This change parameterizes the slot duration, deriving it from the network specification (spec.seconds_per_slot) to support different networks with varying slot durations.
Changes:
- Add slot_duration field to OperatorDoppelgangerService struct
- Update new() signature to accept slot_duration parameter
- Update spawn_monitor_task() to use self.slot_duration
- Update all call sites to pass Duration::from_secs(spec.seconds_per_slot)
- Update tests to pass slot_duration from TestingSlotClock
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
…ations

The `update_and_check_freshness()` method had a semantic problem: it updated the height tracker even for stale messages (likely replays), and it checked freshness against an already-updated state. This refactoring separates the operations:
1. Check if the message is fresh (against current state)
2. Only update height tracking if the message is fresh
Benefits:
- Clearer code: each operation does one thing
- More correct: we only track fresh messages, not stale replays
- No confusion about order of operations
- Better separation of concerns
Changes:
- Remove `update_and_check_freshness()` method from DoppelgangerState
- Make `is_fresh()` and `update_max_height()` public
- Update `check_message()` to call methods separately
- Add documentation clarifying usage order
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
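The corrected order of operations (check freshness against the current state first, update tracking only for fresh messages, then test for a single-signer message carrying our own ID) can be sketched as follows; all names and the plain `u64` ids are illustrative assumptions, not the PR's actual API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Sketch of the check order described in the commit message: stale replays
/// never advance the height tracker, and only fresh single-signer messages
/// from our own operator ID count as twin evidence.
pub struct DoppelgangerCheck {
    own_operator_id: u64,
    fresh_k: u64,
    max_heights: Mutex<HashMap<u64, u64>>, // committee -> max observed height
}

impl DoppelgangerCheck {
    pub fn new(own_operator_id: u64, fresh_k: u64) -> Self {
        Self {
            own_operator_id,
            fresh_k,
            max_heights: Mutex::new(HashMap::new()),
        }
    }

    /// Returns true if a fresh single-signer message from our own ID is seen.
    pub fn check_message(&self, signers: &[u64], committee: u64, height: u64) -> bool {
        let mut heights = self.max_heights.lock().unwrap();
        // 1. Check freshness against the *current* state.
        let fresh = heights
            .get(&committee)
            .map_or(true, |&max| height + self.fresh_k >= max);
        if !fresh {
            return false; // stale replay: ignore, and do not update tracking
        }
        // 2. Only fresh messages advance the per-committee max height.
        let entry = heights.entry(committee).or_insert(height);
        if height > *entry {
            *entry = height;
        }
        // 3. A fresh single-signer message with our own ID indicates a twin.
        signers.len() == 1 && signers[0] == self.own_operator_id
    }
}
```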
Removed three low-value tests that don't add meaningful coverage:
1. test_monitoring_state_persists
   - Had misleading comment about message checking
   - Didn't test actual behavior (background task not started)
   - Redundant with test_no_twin_after_monitoring_period_ends
2. test_different_fresh_k_values
   - Only tested that the constructor accepts different K values
   - No verification of behavioral differences
   - Already covered by test_service_creation
3. test_increasing_heights_all_fresh
   - Tested obvious behavior (sequential heights are fresh)
   - Fully covered by test_freshness_window_boundary
   - Redundant sanity check with no additional value
All remaining tests cover meaningful scenarios and edge cases.
Test count: 18 → 15 tests
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
The test_initial_state test is completely redundant with test_mode_transition, which already verifies the initial state as part of testing the state transition.
Before:
- test_initial_state: checks it starts in Monitor mode
- test_mode_transition: checks it starts in Monitor, then transitions
After:
- test_mode_transition: covers both (initial state + transition)
Test count: 15 → 14 tests
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
This comment was marked as outdated.
I'll analyze this and get back to you.
- Replace DoppelgangerMode enum with boolean monitoring flag
- Remove unnecessary #[must_use] attributes on getters
- Remove redundant test_grace_period_initially_active
- Simplify logging by removing defensive operator_id checks
- Remove unused operator_id parameter from start_operator_doppelganger
The "Active" state didn't represent actual behavior, just "not monitoring". A simple boolean is clearer and more maintainable.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
- Remove slot_clock and monitor_end_epoch from service struct
- Replace polling loop with single sleep for total duration (grace_period + monitoring_duration)
- Remove unused watch channel for broadcasting monitoring state
- Simplify new() signature to only require essential parameters
- Rename transition_to_active() to end_monitoring_period()
- Update tests to match simplified service creation
Best practice: for short, predictable timers (~10-15 minutes), a single sleep is more efficient and simpler than polling. The monitoring duration is known at startup, so checking every slot is unnecessary overhead.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
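The single-sleep idea can be sketched with plain std threads (the PR uses a tokio task; `spawn_monitor_task` here is an illustrative stand-in, and durations would be minutes in practice): since the total window is known at startup, one sleep per phase replaces a per-slot polling loop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Sketch: sleep through the grace period, then through the monitoring
/// window, then flip a single flag to end monitoring. No polling needed.
fn spawn_monitor_task(
    grace_period: Duration,
    monitoring_duration: Duration,
    monitoring: Arc<AtomicBool>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // Grace period: old gossip may still be replayed; detection waits.
        thread::sleep(grace_period);
        // Monitoring window: watch for our own operator ID in fresh messages.
        thread::sleep(monitoring_duration);
        // Done: end the monitoring period with one flag flip.
        monitoring.store(false, Ordering::Release);
    })
}
```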
The function was just calling .new() and wrapping in Arc - no logic, validation, or transformation. Inlined at call site for simplicity.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
It has been simplified substantially.
The service was generic over <E: EthSpec, S: SlotClock> but:
- E was only used for E::slots_per_epoch() - now stored as a u64 field
- S was only in PhantomData after removing the slot_clock field
Simplification:
- Changed from OperatorDoppelgangerService<E, S> to OperatorDoppelgangerService
- Added slots_per_epoch: u64 field (passed to new())
- Removed PhantomData and all generic constraints
- Updated NetworkMessageReceiver to drop the E generic
- Simplified all call sites
Benefits:
- Simpler type signatures
- No phantom data noise
- More explicit about what's needed (the slots_per_epoch value, not an entire EthSpec)
- Easier to test (no need for MainnetEthSpec/MinimalEthSpec types)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Replace manual state manipulation in tests with async timer tests that verify spawn_monitor_task() correctly transitions through states via tokio timers.
Changes:
- Make DoppelgangerState::end_grace_period() pub(crate) (internal API)
- Add test helper create_test_executor() for async test infrastructure
- Add async tests using tokio::time::{pause, advance} for deterministic timing:
  - test_twin_detected_after_grace_period_timer: verifies grace period ends via timer
  - test_no_twin_during_grace_period: verifies no detection during grace period
  - test_no_twin_after_monitoring_period_timer: verifies monitoring ends via timer
- Add dev-dependencies: async-channel, tokio test-util feature
Tests now verify actual async timer behavior instead of manually calling state transitions, providing better coverage of the spawned task logic.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Replace all manual state manipulation with async timer tests using a helper function to eliminate direct state access from tests.
Changes:
- Add spawn_and_advance_past_grace_period() helper to reduce test duplication
- Convert test_no_twin_multi_signer_aggregate_message to an async timer test
- Convert test_no_twin_different_operator_id to an async timer test
- Refactor test_twin_detected_after_grace_period_timer to use the helper
- Keep test_service_creation as sync (tests initial state only)
All detection logic tests now verify actual timer behavior via tokio::time::advance(), following Tokio testing best practices. Tests are more maintainable and realistic.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Replace yield loops (0..10 iterations) with single yield_now() calls after time advancement. According to the tokio documentation, tasks waiting on timers complete immediately when time advances, but may not be polled until the next yield point.
Changes:
- spawn_and_advance_past_grace_period(): use a single yield after advance
- test_no_twin_after_monitoring_period_timer(): use a single yield after each advance
Verified stable across multiple test runs. Reduces code duplication while maintaining reliability.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Replace two boolean fields (monitoring, in_grace_period) with a clear three-state enum (GracePeriod → Monitoring → Disabled). This eliminates impossible states and makes the state machine explicit.
Changes:
- Convert DoppelgangerState from a struct with bools to an enum with three variants
- Add PartialEq, Eq derives for better testing
- Document state transitions clearly in type-level documentation
- Update is_monitoring() to use pattern matching
- Update transition methods to use explicit enum assignment
- All existing tests pass without modification
Benefits:
- Impossible states (e.g., monitoring=false, in_grace_period=true) are now prevented by the type system
- State transitions are explicit and clear
- Better compile-time guarantees about state validity
- More maintainable and self-documenting
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
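A sketch of the three-state machine this commit describes (variant names are taken from the commit message; the transition methods are assumptions about the shape of the API):

```rust
/// GracePeriod → Monitoring → Disabled. The enum makes the invalid
/// combination "not monitoring but still in grace period" unrepresentable.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DoppelgangerState {
    /// Waiting for old gossip to expire; detection not yet active.
    GracePeriod,
    /// Actively watching for our own operator ID in fresh messages.
    Monitoring,
    /// Monitoring finished; protection no longer active.
    Disabled,
}

impl DoppelgangerState {
    pub fn is_monitoring(self) -> bool {
        matches!(self, DoppelgangerState::Monitoring)
    }

    /// GracePeriod → Monitoring; other states are left unchanged.
    pub fn end_grace_period(&mut self) {
        if *self == DoppelgangerState::GracePeriod {
            *self = DoppelgangerState::Monitoring;
        }
    }

    /// Monitoring → Disabled; other states are left unchanged.
    pub fn end_monitoring_period(&mut self) {
        if *self == DoppelgangerState::Monitoring {
            *self = DoppelgangerState::Disabled;
        }
    }
}
```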
Force-pushed from e8ed128 to 5d76d33 (Compare)
- Remove state.rs module and keep DoppelgangerState private in service.rs
- Simplify tests to use the public API (is_monitoring) instead of internal state comparisons
- Remove redundant assert_eq checks in favor of behavior-focused assertions
- Fix clippy lint for identity operation in test
Force-pushed from 5d76d33 to 2aa1ffb (Compare)
- Remove slot_clock (unused)
- Move types to dev-dependencies (only used in tests)
Force-pushed from 2f32d96 to 56ef9a8 (Compare)
Claude finished @diegomrsantos's task (View job)

Code Review: Operator Doppelgänger Protection

Comprehensive review of the simplified implementation completed. The refactoring has addressed most of the previous concerns; excellent work on the simplification!

Summary

Overall Assessment: This is a well-designed and significantly improved implementation. The recent simplification (removing epoch tracking, generics, and watch channels) has made the code much cleaner while maintaining safety guarantees. The implementation is production-ready with minor suggestions for polish.

Key Strengths: (…)

Recommendation: ✅ Approve with minor suggestions. The critical issues from previous reviews have been resolved through the simplification.

Architecture & Design ✅

The refactoring successfully addressed @dknopik's "needlessly complex" feedback. Before: ~200 LOC service with generics, watch channels, and an epoch polling loop. This is a textbook example of refactoring for simplicity while preserving correctness.

Code Quality

1. State Management ✅
The state enum is clean and well-documented: GracePeriod → Monitoring → Completed. The documentation explaining the grace period (lines 24-41) is excellent: it clearly explains the gossipsub cache vulnerability and why we need to wait. Minor suggestion: consider adding a state transition diagram in the module-level docs for discoverability.

2. Timer Logic ✅
The new single-sleep approach is much simpler:

```rust
tokio::time::sleep(grace_period).await;
self.state.lock().end_grace_period();
tokio::time::sleep(monitoring_duration).await;
self.end_monitoring_period();
```

This replaces the previous slot-by-slot polling loop, eliminating complexity and potential timing bugs.

3. Detection Logic ✅
Good: separation of detection (…)

Security Considerations

1. Grace Period Calculation ✅

```rust
pub const OPERATOR_DOPPELGANGER_GRACE_PERIOD_SECS: u64 = {
    let cache_window_millis = GOSSIPSUB_HISTORY_LENGTH as u64 * GOSSIPSUB_HEARTBEAT_INTERVAL_MILLIS;
    cache_window_millis / 1000 + 1
};
```

Excellent: automatically derived from gossipsub parameters (4.2s cache + 1s buffer = 5s). This will stay synchronized if the gossipsub parameters change. Consideration: the +1 second buffer is reasonable but not explicitly justified. Consider documenting why 1s is sufficient (network delays, clock skew, etc.). See inline comment.

2. Message Filtering ✅
The single-signer check correctly prevents false positives. Note: the previous concern about partial signatures not being checked has been addressed; both message types are now checked.

3. Shutdown Safety ✅

```rust
let _ = self
    .shutdown_sender
    .lock()
    .try_send(ShutdownReason::Failure("Operator doppelgänger detected"));
```

Good: uses try_send. Minor suggestion: consider logging if the send fails, though in practice this shouldn't happen.

Performance

1. Lock Contention
```rust
/// Default: 5s (cache window is 4.2s)
pub const OPERATOR_DOPPELGANGER_GRACE_PERIOD_SECS: u64 = {
    let cache_window_millis = GOSSIPSUB_HISTORY_LENGTH as u64 * GOSSIPSUB_HEARTBEAT_INTERVAL_MILLIS;
    cache_window_millis / 1000 + 1
};
```
Documentation suggestion: Consider documenting why +1 second buffer is sufficient.
The calculation is clear, but readers might wonder if 1s accounts for:
- Network propagation delays
- Clock skew between nodes
- Processing time variations
Something like:

```rust
// Add 1 second buffer for network delays and clock skew
cache_window_millis / 1000 + 1
```
This helps future maintainers understand the safety margin.
```rust
long,
help = "Enable operator doppelgänger protection. When enabled, the node will monitor \
        for messages signed by its operator ID on startup and shut down if a twin \
        (duplicate operator) is detected. Enabled by default.",
```
Documentation improvement: Consider enhancing the help text to include:

```rust
help = "Enable operator doppelgänger protection. When enabled, the node will monitor \
        for messages signed by its operator ID on startup and shut down if a twin \
        (duplicate operator) is detected. During the monitoring period (~2 epochs + 5s grace), \
        incoming duties will be dropped to ensure safe detection. \
        Use --operator-dg=false to disable. Enabled by default."
```
This clarifies:
- What happens during monitoring (duties dropped)
- How to disable the feature
- Approximate monitoring duration
Helps operators understand the behavior without reading source code.
```rust
ValidatedSSVMessage::PartialSignatureMessages(_) => None,
};
service.check_message(&signed_ssv_message, qbft_msg);
// Drop message during monitoring period - don't process
```
Observability suggestion: Consider adding metrics or periodic logging for messages dropped during monitoring.
This would help operators understand:
- That monitoring is active
- How many duties are being skipped
- When monitoring period ends
Example:

```rust
if service.is_monitoring() {
    // ... existing check_message logic ...
    // Optional: track dropped messages
    metrics::inc_counter(&metrics::DOPPELGANGER_MESSAGES_DROPPED);
    return;
}
```
Not critical, but useful for operational visibility.
Replace Mutex with RwLock for DoppelgangerState to eliminate lock contention in the hot path (message validation). Multiple readers can now check state concurrently without blocking each other.
Benefits:
- Lock-free reads after monitoring completes (entire node lifetime)
- Minimal overhead during the brief monitoring period (2 epochs)
- Cleaner API than an AtomicU8 alternative
- Standard Rust pattern for read-heavy data
Impact: every incoming message checks is_doppelganger(); this optimization removes unnecessary lock contention for >99.9% of node uptime (after the monitoring period ends).
Add message blocking during the operator doppelgänger monitoring period to prevent the node from sending any messages while detecting potential twin operators. This ensures complete silence during the grace period and monitoring.
Changes:
- Add DoppelgangerMonitoring error variant to MessageSender
- Pass the doppelganger service to NetworkMessageSender
- Add monitoring checks to sign_and_send() and send() methods
- Move doppelganger service creation earlier in client initialization
Btw, thinking about it more, doesn't doing it after validation possibly mask detection of some messages? E.g. if we and a doppelganger both send a QBFT propose message, validation will only consider our message as valid, as we only accept one propose per round. So I think we only need to make sure that the signature is valid and that the message is fresh for doppelganger protection.
Issue Addressed
Closes #627
Proposed Changes
Implements Operator Doppelgänger Protection to prevent accidental slashing when multiple instances of Anchor run with the same operator ID. The feature monitors the network during a configurable waiting period to detect if another instance is already active.
Core Implementation
- State Management (operator_doppelganger/state.rs)
- Service Layer (operator_doppelganger/service.rs)
- Integration (lib.rs, message_receiver/manager.rs)
- Grace Period Configuration (network/behaviour.rs): (history_length × heartbeat_interval) / 1000 + 1

Configuration

- --operator-dg: Enable protection (default: true)
- --operator-dg-wait-epochs: Monitoring period in epochs (default: 2)

Grace Period Rationale
After restart, gossipsub seen_cache is cleared but messages remain in peers' mcache for up to 4.2s. Without a grace period, the node would receive its own old messages via IHAVE/IWANT and incorrectly detect a twin. The grace period ensures all old messages expire before detection starts.
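The derivation above can be sketched as a const computation. The gossipsub parameter values below (history length 6, heartbeat 700 ms) are assumptions chosen to match the 4.2 s cache window mentioned in the review thread; the PR derives them from the actual network behaviour config:

```rust
// Assumed gossipsub parameters (illustrative; 6 * 700 ms = 4.2 s cache window).
const GOSSIPSUB_HISTORY_LENGTH: usize = 6;
const GOSSIPSUB_HEARTBEAT_INTERVAL_MILLIS: u64 = 700;

// (history_length × heartbeat_interval) / 1000, plus 1 second of buffer:
// 4200 ms / 1000 + 1 = 5 seconds.
const OPERATOR_DOPPELGANGER_GRACE_PERIOD_SECS: u64 = {
    let cache_window_millis =
        GOSSIPSUB_HISTORY_LENGTH as u64 * GOSSIPSUB_HEARTBEAT_INTERVAL_MILLIS;
    cache_window_millis / 1000 + 1
};
```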
Testing
9 tests covering:
Additional Info
Design Decisions:
Future Considerations: