Skip to content

Conversation

@jingy-li
Copy link
Contributor

Problem Statement

When a Venice server node undergoes a disk swap and stop/restarts ungracefully, stale Helix Customized View states remain in ZooKeeper, causing routers to incorrectly treat the host as ready to serve traffic before it's actually prepared.

Timeline:

  1. Node undergoes disk swap (local storage is empty or inconsistent)
  2. Node restarts ungracefully (Helix state wasn't fully cleaned up prior to shutdown)
  3. [Restarting] Server startup invokes resetAllInstanceCVStates() to clear CV states
    • The reset logic reads partition list from disk via storageService.getStoreAndUserPartitionsMapping()
    • Because disk is empty/swapped, the partition map is empty
  4. No CV states are deleted → stale states remain in ZK
  5. Helix continues to surface stale CV data (e.g., COMPLETED status)
  6. Router treats host as ready and sends traffic
  7. Host is not actually ready → requests fail

Root Cause:
The existing resetAllInstanceCVStates() implementation relies on disk-based partition discovery. getStoreAndUserPartitionsMapping() calls engine.getPersistedPartitionIds(), which only returns partitions currently on disk. When disk state is empty or inconsistent, this returns an empty map, causing CV cleanup to be silently skipped.

Solution

Replace the disk-based partition discovery approach with Helix's bulk delete API (deleteAllResourcesCustomizedStates()), which directly queries ZooKeeper for all resources with CV states, regardless of local disk state.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Jingyan Li and others added 2 commits February 10, 2026 14:46
…ates()

Added comprehensive test coverage for the new deleteAllCustomizedStates()
method:
- Test with helixHybridStoreQuotaEnabled disabled (only OFFLINE_PUSH deleted)
- Test with helixHybridStoreQuotaEnabled enabled (both states deleted)
- Test exception handling for OFFLINE_PUSH state
- Test exception handling for HYBRID_STORE_QUOTA state

This resolves the diff coverage failure for venice-common module.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant