
Conversation

@majisourav99
Contributor

Problem Statement

OfflinePush znodes in ZooKeeper accumulate stale replica statuses over time, leading to unbounded growth of PartitionStatus znodes.
This occurs due to:

  1. Rebalancing: When Helix rebalances partitions across nodes, new instances get assigned but old replica statuses remain in ZK
  2. Node Failures/Replacements: When nodes are replaced or fail, their replica statuses persist even though they're no longer part
    of the partition assignment
  3. Instance Decommissioning: When instances are removed from the cluster, their historical replica statuses accumulate

Root Cause

The PartitionStatus class maintains a Map<String, ReplicaStatus> (see PartitionStatus.java:21) where:

  • New replicas are added via updateReplicaStatus() when instances report their status
  • No cleanup mechanism exists to remove replicas that are no longer assigned to the partition
  • The map grows indefinitely as instances come and go

This can lead to:

  • Large znode sizes: Potentially hitting ZooKeeper's default 1 MB znode size limit (jute.maxbuffer)
  • Performance degradation: Increased serialization/deserialization overhead
  • Memory pressure: Controllers loading large partition statuses into memory
  • Confusing metrics: Stale replica data polluting monitoring dashboards
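
To make the growth mechanism concrete, here is a deliberately simplified stand-in (not the actual Venice PartitionStatus class) whose update path only ever adds entries:

import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for PartitionStatus: status updates only ever add map entries, nothing removes them.
class NaivePartitionStatus {
  private final Map<String, String> replicaStatuses = new LinkedHashMap<>();

  // Mirrors the shape of updateReplicaStatus(): put, never remove.
  void updateReplicaStatus(String instanceId, String status) {
    replicaStatuses.put(instanceId, status);
  }

  public static void main(String[] args) {
    NaivePartitionStatus partition0 = new NaivePartitionStatus();
    // The original assignment reports in.
    partition0.updateReplicaStatus("instance1", "COMPLETED");
    partition0.updateReplicaStatus("instance2", "COMPLETED");
    partition0.updateReplicaStatus("instance3", "COMPLETED");
    // After a rebalance the new replicas report in, but the departed ones are never cleaned up.
    partition0.updateReplicaStatus("instance4", "STARTED");
    partition0.updateReplicaStatus("instance5", "STARTED");
    // Prints five entries even though only three replicas are currently assigned.
    System.out.println(partition0.replicaStatuses.keySet());
  }
}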

Example Scenario

Initial State (3 replicas):
Partition 0: [instance1, instance2, instance3]

After Rebalance:
Partition 0: [instance1, instance4, instance5] ← Current Helix assignment
But PartitionStatus still contains: [instance1, instance2*, instance3*, instance4, instance5]
(* STALE: instance2 and instance3 are no longer assigned, but their statuses remain in ZK)

Solution

Enhanced the existing LeakedPushStatusCleanUpService to periodically clean up stale replica statuses by:

  1. Comparing current Helix assignments with replica statuses stored in ZK
  2. Identifying stale replicas: Replicas in PartitionStatus but NOT in current PartitionAssignment
  3. Pruning stale entries: Creating updated PartitionStatus objects with only current replicas
  4. Preserving status history: Maintaining the status history for active replicas
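
The pruning in steps 2–4 boils down to keeping only the intersection of the replica statuses stored in ZK and the current Helix assignment. A minimal sketch with generic types (the real code operates on Venice's PartitionStatus/ReplicaStatus objects):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class ReplicaStatusPruner {
  /**
   * Returns a copy of the stored replica statuses that keeps only instances present in the
   * current assignment; the kept entries retain whatever status history they already had.
   */
  static Map<String, String> pruneStaleReplicas(
      Map<String, String> storedReplicaStatuses,
      Set<String> currentlyAssignedInstances) {
    Map<String, String> pruned = new HashMap<>();
    for (Map.Entry<String, String> entry: storedReplicaStatuses.entrySet()) {
      if (currentlyAssignedInstances.contains(entry.getKey())) {
        pruned.put(entry.getKey(), entry.getValue());
      }
      // Entries for instances no longer in the assignment are simply dropped.
    }
    return pruned;
  }
}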

Implementation Details

New Method: cleanupStaleReplicaStatuses(String kafkaTopic)

  • Fetches current partition assignment from RoutingDataRepository
  • Loads existing OfflinePushStatus with all partition statuses
  • For each partition:
    • Gets current instance assignments from Helix
    • Identifies stale replica statuses (in ZK but not in Helix)
    • Creates new PartitionStatus with only active replicas
    • Updates ZK via new updatePartitionStatus() method
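
A rough sketch of that flow, with the Venice repositories and accessors replaced by simplified stand-in interfaces (only the cleanupStaleReplicaStatuses(String kafkaTopic) name comes from the PR; everything else here is illustrative):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Stand-ins for the routing data repository and the offline push accessor; the real signatures differ.
interface AssignmentView {
  int getPartitionCount(String kafkaTopic);
  Set<String> getAssignedInstances(String kafkaTopic, int partitionId);   // current Helix assignment
}

interface PushStatusStore {
  Map<String, String> getReplicaStatuses(String kafkaTopic, int partitionId);                // from ZK
  void updatePartitionStatus(String kafkaTopic, int partitionId, Map<String, String> replicaStatuses);
}

final class StaleReplicaCleanup {
  private final AssignmentView assignmentView;
  private final PushStatusStore pushStatusStore;

  StaleReplicaCleanup(AssignmentView assignmentView, PushStatusStore pushStatusStore) {
    this.assignmentView = assignmentView;
    this.pushStatusStore = pushStatusStore;
  }

  void cleanupStaleReplicaStatuses(String kafkaTopic) {
    for (int partitionId = 0; partitionId < assignmentView.getPartitionCount(kafkaTopic); partitionId++) {
      Map<String, String> stored = pushStatusStore.getReplicaStatuses(kafkaTopic, partitionId);
      Set<String> assigned = assignmentView.getAssignedInstances(kafkaTopic, partitionId);

      // Stale replicas: recorded in ZK but absent from the current Helix assignment.
      Set<String> stale = new HashSet<>(stored.keySet());
      stale.removeAll(assigned);
      if (stale.isEmpty()) {
        continue;   // nothing to prune for this partition
      }

      // Keep only the active replicas (and their existing status history), then write back.
      Map<String, String> pruned = new HashMap<>(stored);
      pruned.keySet().retainAll(assigned);
      pushStatusStore.updatePartitionStatus(kafkaTopic, partitionId, pruned);
    }
  }
}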

New Interface Method: OfflinePushAccessor.updatePartitionStatus()

  • Added to support bulk partition status updates
  • Implemented in VeniceOfflinePushMonitorAccessor
  • Uses HelixUtils.update() for atomic ZK updates
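
In outline, the new accessor method looks like the sketch below; the method name and its (topic, partition status) parameters mirror the snippet quoted further down in this conversation, while the types here are simplified stand-ins:

import java.util.Map;

// Simplified stand-in for the Venice PartitionStatus payload (the real class tracks per-replica status history).
final class PartitionStatusSketch {
  final int partitionId;
  final Map<String, String> replicaStatuses;

  PartitionStatusSketch(int partitionId, Map<String, String> replicaStatuses) {
    this.partitionId = partitionId;
    this.replicaStatuses = replicaStatuses;
  }
}

// Sketch of the new accessor contract; the PR implements it in VeniceOfflinePushMonitorAccessor and
// performs the actual write through HelixUtils.update() so the znode is replaced in a single ZK update.
interface OfflinePushAccessorSketch {
  void updatePartitionStatus(String kafkaTopic, PartitionStatusSketch partitionStatus);
}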

Integration with Existing Service

  • Runs as part of the existing cleanup loop in LeakedPushStatusCleanUpService
  • Same configurable sleep interval (controlled by LEAKED_PUSH_STATUS_CLEAN_UP_SERVICE_SLEEP_INTERVAL_MS)
  • Only runs when RoutingDataRepository is available (gracefully handles null)
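
Schematically, the integration is one extra step in the existing background loop (a sketch, not the actual service code; the sleep value corresponds to the config named above):

// Sketch of the cleanup loop; the real LeakedPushStatusCleanUpService has more state and error handling.
final class CleanupLoopSketch implements Runnable {
  private final long sleepIntervalMs;                // LEAKED_PUSH_STATUS_CLEAN_UP_SERVICE_SLEEP_INTERVAL_MS
  private final Runnable leakedPushStatusCleanup;    // existing leaked-push cleanup work
  private final Runnable staleReplicaCleanup;        // new work from this PR; a no-op when routing data is unavailable
  private volatile boolean running = true;

  CleanupLoopSketch(long sleepIntervalMs, Runnable leakedPushStatusCleanup, Runnable staleReplicaCleanup) {
    this.sleepIntervalMs = sleepIntervalMs;
    this.leakedPushStatusCleanup = leakedPushStatusCleanup;
    this.staleReplicaCleanup = staleReplicaCleanup;
  }

  @Override
  public void run() {
    while (running) {
      leakedPushStatusCleanup.run();
      staleReplicaCleanup.run();
      try {
        Thread.sleep(sleepIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  void stop() {
    running = false;
  }
}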

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@majisourav99 majisourav99 changed the title Refs/heads/clean offline pushstatus [controller] Cleanup stale replica statuses in Offlinepush znodes Feb 3, 2026
Contributor

@sushantmane sushantmane left a comment


Thanks. Left a few comments.

// Get current instances assigned to this partition
Partition partition = partitionAssignment.getPartition(partitionId);
if (partition == null) {
  LOGGER.warn("Partition {} not found in partition assignment for topic {}", partitionId, kafkaTopic);

Can we use Utils.getReplicaId to log the topic name and partition id together?


// Get set of currently assigned instance IDs
Set<String> currentInstanceIds =
    partition.getAllInstancesSet().stream().map(Instance::getNodeId).collect(Collectors.toSet());

Can we avoid the streams API? This is the control path, but as a general practice it's still good not to use it.
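
For illustration, a plain-loop version of the quoted snippet, using the same getAllInstancesSet() and getNodeId() accessors, would be:

Set<String> currentInstanceIds = new HashSet<>();
for (Instance instance: partition.getAllInstancesSet()) {
  currentInstanceIds.add(instance.getNodeId());
}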

replicaStatuses.stream().map(ReplicaStatus::getInstanceId).collect(Collectors.toSet());

// Find stale replicas (in push status but not in current assignment)
Set<String> staleInstanceIds = new HashSet<>(existingInstanceIds);

Should we augment it with timestamp-based checking as well, to guard against possible race conditions?
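
One way this could look, purely as an illustration (the grace period and method below are hypothetical, not part of the PR): only prune a replica that is both unassigned and has not reported for a while.

import java.util.Set;
import java.util.concurrent.TimeUnit;

final class StaleReplicaGuard {
  // Hypothetical grace period; a real implementation would make this configurable.
  private static final long STALE_GRACE_PERIOD_MS = TimeUnit.MINUTES.toMillis(30);

  // A replica is only pruned when it is no longer assigned AND has been quiet long enough
  // that a status update racing with the cleanup is unlikely.
  static boolean isSafeToPrune(String instanceId, long lastReportedTimestampMs, Set<String> currentlyAssigned, long nowMs) {
    boolean unassigned = !currentlyAssigned.contains(instanceId);
    boolean quiescent = (nowMs - lastReportedTimestampMs) > STALE_GRACE_PERIOD_MS;
    return unassigned && quiescent;
  }
}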

@Override
public void updatePartitionStatus(String kafkaTopic, PartitionStatus partitionStatus) {
  if (!pushStatusExists(kafkaTopic)) {
    LOGGER.warn("Push status does not exist for topic {}, skipping partition status update", kafkaTopic);

Given this is a method in the accessor, I'd throw a new exception here, like VeniceNoPartitionStatusException, and let the caller decide how to handle it. E.g., the cleanup task can choose to log a warning, but other potential future callers may want different behavior.

partitionId,
clusterName);
HelixUtils.update(partitionStatusAccessor, partitionStatusPath, partitionStatus);
LOGGER.debug(

I'm assuming we want to log this to provide an audit history. In that case, wouldn't it be more useful to log the confirmed update at INFO and the attempt at DEBUG? WDYT?

}

// Update the partition status in ZK with cleaned up replica statuses
offlinePushAccessor.updatePartitionStatus(kafkaTopic, updatedPartitionStatus);

Isn't this vulnerable to race conditions? What happens in the following scenario?
Scenario 1:

  1. Get the partition status from offlinePushAccessor.getOfflinePushStatusAndItsPartitionStatuses(kafkaTopic); let's say partition 0 is [A, B, C, D] and D is stale.
  2. The stale replica data won't change in the underlying partition status, but new instances could have joined and written their updates to ZK from the servers, e.g. [A, B, C, D, E].
  3. Cleanup is performed and we attempt to update via offlinePushAccessor.updatePartitionStatus(kafkaTopic, updatedPartitionStatus), which will overwrite the partition status to [A, B, C]. We just lose E?


@xunyin8 xunyin8 Feb 6, 2026


I think we could explore whether ZK has something like compareAndSet, or some sort of generation id, so that we check and update only if the data hasn't changed since this writer last read it.
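
ZooKeeper does expose exactly this: every znode carries a version in its Stat, and setData() can be made conditional on it. A minimal sketch against the raw ZooKeeper client (serialization and retry policy elided; the prune function is a placeholder):

import java.util.function.UnaryOperator;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

final class VersionedZkUpdate {
  /** Read-modify-write guarded by the znode version, i.e. ZooKeeper's built-in compare-and-set. */
  static boolean pruneWithVersionCheck(ZooKeeper zk, String path, UnaryOperator<byte[]> prune)
      throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    byte[] current = zk.getData(path, false, stat);   // read the data and its version together
    byte[] updated = prune.apply(current);
    try {
      // Succeeds only if the znode version is still what we read; otherwise BadVersionException.
      zk.setData(path, updated, stat.getVersion());
      return true;
    } catch (KeeperException.BadVersionException e) {
      // A concurrent writer (e.g. a server reporting a new replica status) got in first; re-read and retry.
      return false;
    }
  }
}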

majisourav99 and others added 7 commits February 9, 2026 10:06
LeakedPushStatusCleanUpService

Enhanced LeakedPushStatusCleanUpService to clean up stale replica statuses
from PartitionStatus znodes during rebalancing or node failures. This prevents
unbounded growth of replica statuses in ZK that can lead to large znode sizes.

- Added cleanupStaleReplicaStatuses() method to identify and remove replicas
  that are no longer assigned to partitions according to Helix
- Added updatePartitionStatus() method to OfflinePushAccessor interface
- Implemented updatePartitionStatus() in VeniceOfflinePushMonitorAccessor
- Updated LeakedPushStatusCleanUpService constructor to accept
  RoutingDataRepository

The cleanup runs as part of the existing background task and only removes
replica statuses for instances not currently assigned to the partition.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@majisourav99 majisourav99 force-pushed the cleanOfflinePushstatus branch from 83127ac to 003c3fa on February 9, 2026 18:07