
Conversation

@majisourav99
Contributor

Problem Statement

OfflinePush znodes in ZooKeeper accumulate stale replica statuses over time, leading to unbounded growth of PartitionStatus znodes.
This occurs due to:

  1. Rebalancing: When Helix rebalances partitions across nodes, new instances get assigned but old replica statuses remain in ZK
  2. Node Failures/Replacements: When nodes are replaced or fail, their replica statuses persist even though they're no longer part
    of the partition assignment
  3. Instance Decommissioning: When instances are removed from the cluster, their historical replica statuses accumulate

Root Cause

The PartitionStatus class maintains a Map<String, ReplicaStatus> (see PartitionStatus.java:21) where:

  • New replicas are added via updateReplicaStatus() when instances report their status
  • No cleanup mechanism exists to remove replicas that are no longer assigned to the partition
  • The map grows indefinitely as instances come and go

This can lead to:

  • Large znode sizes: Potentially hitting ZooKeeper's default 1 MB znode size limit (jute.maxbuffer)
  • Performance degradation: Increased serialization/deserialization overhead
  • Memory pressure: Controllers loading large partition statuses into memory
  • Confusing metrics: Stale replica data polluting monitoring dashboards
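
To make the growth mechanism concrete, here is a deliberately simplified stand-in (not the actual Venice PartitionStatus class) whose update path only ever adds entries:

import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for PartitionStatus: status updates only ever add map entries, nothing removes them.
class NaivePartitionStatus {
  private final Map<String, String> replicaStatuses = new LinkedHashMap<>();

  // Mirrors the shape of updateReplicaStatus(): put, never remove.
  void updateReplicaStatus(String instanceId, String status) {
    replicaStatuses.put(instanceId, status);
  }

  public static void main(String[] args) {
    NaivePartitionStatus partition0 = new NaivePartitionStatus();
    // The original assignment reports in.
    partition0.updateReplicaStatus("instance1", "COMPLETED");
    partition0.updateReplicaStatus("instance2", "COMPLETED");
    partition0.updateReplicaStatus("instance3", "COMPLETED");
    // After a rebalance the new replicas report in, but the departed ones are never cleaned up.
    partition0.updateReplicaStatus("instance4", "STARTED");
    partition0.updateReplicaStatus("instance5", "STARTED");
    // Prints five entries even though only three replicas are currently assigned.
    System.out.println(partition0.replicaStatuses.keySet());
  }
}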

Example Scenario

Initial State (3 replicas):
Partition 0: [instance1, instance2, instance3]

After Rebalance:
Partition 0: [instance1, instance4, instance5] ← Current Helix assignment
But PartitionStatus still contains: [instance1, instance2*, instance3*, instance4, instance5]
(* STALE: instance2 and instance3 are no longer assigned, but their statuses remain in ZK)

Solution

Enhanced the existing LeakedPushStatusCleanUpService to periodically clean up stale replica statuses by:

  1. Comparing current Helix assignments with replica statuses stored in ZK
  2. Identifying stale replicas: Replicas in PartitionStatus but NOT in current PartitionAssignment
  3. Pruning stale entries: Creating updated PartitionStatus objects with only current replicas
  4. Preserving status history: Maintaining the status history for active replicas
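
The pruning in steps 2–4 boils down to keeping only the intersection of the replica statuses stored in ZK and the current Helix assignment. A minimal sketch with generic types (the real code operates on Venice's PartitionStatus/ReplicaStatus objects):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class ReplicaStatusPruner {
  /**
   * Returns a copy of the stored replica statuses that keeps only instances present in the
   * current assignment; the kept entries retain whatever status history they already had.
   */
  static Map<String, String> pruneStaleReplicas(
      Map<String, String> storedReplicaStatuses,
      Set<String> currentlyAssignedInstances) {
    Map<String, String> pruned = new HashMap<>();
    for (Map.Entry<String, String> entry: storedReplicaStatuses.entrySet()) {
      if (currentlyAssignedInstances.contains(entry.getKey())) {
        pruned.put(entry.getKey(), entry.getValue());
      }
      // Entries for instances no longer in the assignment are simply dropped.
    }
    return pruned;
  }
}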

Implementation Details

New Method: cleanupStaleReplicaStatuses(String kafkaTopic)

  • Fetches current partition assignment from RoutingDataRepository
  • Loads existing OfflinePushStatus with all partition statuses
  • For each partition:
    • Gets current instance assignments from Helix
    • Identifies stale replica statuses (in ZK but not in Helix)
    • Creates new PartitionStatus with only active replicas
    • Updates ZK via new updatePartitionStatus() method
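
A rough sketch of that flow, with the Venice repositories and accessors replaced by simplified stand-in interfaces (only the cleanupStaleReplicaStatuses(String kafkaTopic) name comes from the PR; everything else here is illustrative):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Stand-ins for the routing data repository and the offline push accessor; the real signatures differ.
interface AssignmentView {
  int getPartitionCount(String kafkaTopic);
  Set<String> getAssignedInstances(String kafkaTopic, int partitionId);   // current Helix assignment
}

interface PushStatusStore {
  Map<String, String> getReplicaStatuses(String kafkaTopic, int partitionId);                // from ZK
  void updatePartitionStatus(String kafkaTopic, int partitionId, Map<String, String> replicaStatuses);
}

final class StaleReplicaCleanup {
  private final AssignmentView assignmentView;
  private final PushStatusStore pushStatusStore;

  StaleReplicaCleanup(AssignmentView assignmentView, PushStatusStore pushStatusStore) {
    this.assignmentView = assignmentView;
    this.pushStatusStore = pushStatusStore;
  }

  void cleanupStaleReplicaStatuses(String kafkaTopic) {
    for (int partitionId = 0; partitionId < assignmentView.getPartitionCount(kafkaTopic); partitionId++) {
      Map<String, String> stored = pushStatusStore.getReplicaStatuses(kafkaTopic, partitionId);
      Set<String> assigned = assignmentView.getAssignedInstances(kafkaTopic, partitionId);

      // Stale replicas: recorded in ZK but absent from the current Helix assignment.
      Set<String> stale = new HashSet<>(stored.keySet());
      stale.removeAll(assigned);
      if (stale.isEmpty()) {
        continue;   // nothing to prune for this partition
      }

      // Keep only the active replicas (and their existing status history), then write back.
      Map<String, String> pruned = new HashMap<>(stored);
      pruned.keySet().retainAll(assigned);
      pushStatusStore.updatePartitionStatus(kafkaTopic, partitionId, pruned);
    }
  }
}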

New Interface Method: OfflinePushAccessor.updatePartitionStatus()

  • Added to support bulk partition status updates
  • Implemented in VeniceOfflinePushMonitorAccessor
  • Uses HelixUtils.update() for atomic ZK updates
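
In outline, the new accessor method looks like the sketch below; the method name and its (topic, partition status) parameters mirror the snippet quoted further down in this conversation, while the types here are simplified stand-ins:

import java.util.Map;

// Simplified stand-in for the Venice PartitionStatus payload (the real class tracks per-replica status history).
final class PartitionStatusSketch {
  final int partitionId;
  final Map<String, String> replicaStatuses;

  PartitionStatusSketch(int partitionId, Map<String, String> replicaStatuses) {
    this.partitionId = partitionId;
    this.replicaStatuses = replicaStatuses;
  }
}

// Sketch of the new accessor contract; the PR implements it in VeniceOfflinePushMonitorAccessor and
// performs the actual write through HelixUtils.update() so the znode is replaced in a single ZK update.
interface OfflinePushAccessorSketch {
  void updatePartitionStatus(String kafkaTopic, PartitionStatusSketch partitionStatus);
}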

Integration with Existing Service

  • Runs as part of the existing cleanup loop in LeakedPushStatusCleanUpService
  • Same configurable sleep interval (controlled by LEAKED_PUSH_STATUS_CLEAN_UP_SERVICE_SLEEP_INTERVAL_MS)
  • Only runs when RoutingDataRepository is available (gracefully handles null)
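
Schematically, the integration is one extra step in the existing background loop (a sketch, not the actual service code; the sleep value corresponds to the config named above):

// Sketch of the cleanup loop; the real LeakedPushStatusCleanUpService has more state and error handling.
final class CleanupLoopSketch implements Runnable {
  private final long sleepIntervalMs;                // LEAKED_PUSH_STATUS_CLEAN_UP_SERVICE_SLEEP_INTERVAL_MS
  private final Runnable leakedPushStatusCleanup;    // existing leaked-push cleanup work
  private final Runnable staleReplicaCleanup;        // new work from this PR; a no-op when routing data is unavailable
  private volatile boolean running = true;

  CleanupLoopSketch(long sleepIntervalMs, Runnable leakedPushStatusCleanup, Runnable staleReplicaCleanup) {
    this.sleepIntervalMs = sleepIntervalMs;
    this.leakedPushStatusCleanup = leakedPushStatusCleanup;
    this.staleReplicaCleanup = staleReplicaCleanup;
  }

  @Override
  public void run() {
    while (running) {
      leakedPushStatusCleanup.run();
      staleReplicaCleanup.run();
      try {
        Thread.sleep(sleepIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  void stop() {
    running = false;
  }
}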

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@majisourav99 majisourav99 changed the title Refs/heads/clean offline pushstatus [controller] Cleanup stale replica statuses in Offlinepush znodes Feb 3, 2026
Contributor

@sushantmane sushantmane left a comment


Thanks. Left a few comments.

// Get current instances assigned to this partition
Partition partition = partitionAssignment.getPartition(partitionId);
if (partition == null) {
  LOGGER.warn("Partition {} not found in partition assignment for topic {}", partitionId, kafkaTopic);

Can we use Utils.getReplicaId to log the topic name and partition id together?


// Get set of currently assigned instance IDs
Set<String> currentInstanceIds =
    partition.getAllInstancesSet().stream().map(Instance::getNodeId).collect(Collectors.toSet());

Can we avoid the streams API? This is the control path, but as a general practice it's still good not to use it.
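
For illustration, a plain-loop version of the quoted snippet, using the same getAllInstancesSet() and getNodeId() accessors, would be:

Set<String> currentInstanceIds = new HashSet<>();
for (Instance instance: partition.getAllInstancesSet()) {
  currentInstanceIds.add(instance.getNodeId());
}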

replicaStatuses.stream().map(ReplicaStatus::getInstanceId).collect(Collectors.toSet());

// Find stale replicas (in push status but not in current assignment)
Set<String> staleInstanceIds = new HashSet<>(existingInstanceIds);

Should we augment it with timestamp-based checking as well, to guard against possible race conditions?
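
One way this could look, purely as an illustration (the grace period and method below are hypothetical, not part of the PR): only prune a replica that is both unassigned and has not reported for a while.

import java.util.Set;
import java.util.concurrent.TimeUnit;

final class StaleReplicaGuard {
  // Hypothetical grace period; a real implementation would make this configurable.
  private static final long STALE_GRACE_PERIOD_MS = TimeUnit.MINUTES.toMillis(30);

  // A replica is only pruned when it is no longer assigned AND has been quiet long enough
  // that a status update racing with the cleanup is unlikely.
  static boolean isSafeToPrune(String instanceId, long lastReportedTimestampMs, Set<String> currentlyAssigned, long nowMs) {
    boolean unassigned = !currentlyAssigned.contains(instanceId);
    boolean quiescent = (nowMs - lastReportedTimestampMs) > STALE_GRACE_PERIOD_MS;
    return unassigned && quiescent;
  }
}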

@Override
public void updatePartitionStatus(String kafkaTopic, PartitionStatus partitionStatus) {
  if (!pushStatusExists(kafkaTopic)) {
    LOGGER.warn("Push status does not exist for topic {}, skipping partition status update", kafkaTopic);

Given this is a method in the accessor, I'd throw a new exception here, like VeniceNoPartitionStatusException, and let the caller decide how to handle it. E.g., the cleanup task can choose to log a warning, but other potential future callers may want different behavior.

partitionId,
clusterName);
HelixUtils.update(partitionStatusAccessor, partitionStatusPath, partitionStatus);
LOGGER.debug(

I'm assuming we want to log this to provide an audit history. In that case, wouldn't it be more useful to log the confirmed update at INFO and the attempt at DEBUG? WDYT?

}

// Update the partition status in ZK with cleaned up replica statuses
offlinePushAccessor.updatePartitionStatus(kafkaTopic, updatedPartitionStatus);

Isn't this vulnerable to race conditions? What happens in the following scenario?
Scenario 1:

  1. Get the partition status from offlinePushAccessor.getOfflinePushStatusAndItsPartitionStatuses(kafkaTopic); let's say partition 0 is [A, B, C, D] and D is stale.
  2. The stale replica data won't change in the underlying partition status, but new instances could have joined and written their updates to ZK from the servers, e.g. [A, B, C, D, E].
  3. Cleanup is performed and we attempt to update via offlinePushAccessor.updatePartitionStatus(kafkaTopic, updatedPartitionStatus), which will overwrite the partition status to [A, B, C]. We just lose E?


@xunyin8 xunyin8 Feb 6, 2026


I think we could explore whether ZK has something like compareAndSet, or some sort of generation id, so that we check and update only if the data hasn't changed since this writer last read it.
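
ZooKeeper does expose exactly this: every znode carries a version in its Stat, and setData() can be made conditional on it. A minimal sketch against the raw ZooKeeper client (serialization and retry policy elided; the prune function is a placeholder):

import java.util.function.UnaryOperator;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

final class VersionedZkUpdate {
  /** Read-modify-write guarded by the znode version, i.e. ZooKeeper's built-in compare-and-set. */
  static boolean pruneWithVersionCheck(ZooKeeper zk, String path, UnaryOperator<byte[]> prune)
      throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    byte[] current = zk.getData(path, false, stat);   // read the data and its version together
    byte[] updated = prune.apply(current);
    try {
      // Succeeds only if the znode version is still what we read; otherwise BadVersionException.
      zk.setData(path, updated, stat.getVersion());
      return true;
    } catch (KeeperException.BadVersionException e) {
      // A concurrent writer (e.g. a server reporting a new replica status) got in first; re-read and retry.
      return false;
    }
  }
}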

majisourav99 and others added 7 commits February 9, 2026 10:06
LeakedPushStatusCleanUpService

Enhanced LeakedPushStatusCleanUpService to clean up stale replica statuses
from PartitionStatus znodes during rebalancing or node failures. This prevents
unbounded growth of replica statuses in ZK that can lead to large znode sizes.

- Added cleanupStaleReplicaStatuses() method to identify and remove replicas
  that are no longer assigned to partitions according to Helix
- Added updatePartitionStatus() method to OfflinePushAccessor interface
- Implemented updatePartitionStatus() in VeniceOfflinePushMonitorAccessor
- Updated LeakedPushStatusCleanUpService constructor to accept
  RoutingDataRepository

The cleanup runs as part of the existing background task and only removes
replica statuses for instances not currently assigned to the partition.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@majisourav99 majisourav99 force-pushed the cleanOfflinePushstatus branch from 83127ac to 003c3fa on February 9, 2026 18:07