Skip to content

[controller] Route controller-cluster Helix ops to controller.cluster.zk.address + ZK isolation integration tests#2848

Draft
namithanivead wants to merge 6 commits into
linkedin:mainfrom
namithanivead:nvijayak/controller-cluster-zk-address-helix-admin
Draft

[controller] Route controller-cluster Helix ops to controller.cluster.zk.address + ZK isolation integration tests#2848
namithanivead wants to merge 6 commits into
linkedin:mainfrom
namithanivead:nvijayak/controller-cluster-zk-address-helix-admin

Conversation

@namithanivead

Copy link
Copy Markdown
Contributor

Summary

This PR implements ZooKeeper isolation for the Venice parent controller by separating the Helix controller-cluster ZK (used for controller leader election and HaaS super cluster) from the Venice storage-cluster ZK (used for store metadata, schemas, push status).

Problem

Previously, ZkHelixAdminClient always connected to zookeeper.address (the storage ZK) for all Helix admin operations — including creating the venice-controllers controller cluster and the HaaS grand cluster. This means the controller cluster znodes and storage cluster znodes were co-mingled on the same ZK ensemble, making it impossible to isolate them.

The config key controller.cluster.zk.address (CONTROLLER_CLUSTER_ZK_ADDRESSS) already existed and was parsed correctly (defaulting to zookeeper.address when not set), but no code actually used it for the Helix admin client.

Solution

Production change — ZkHelixAdminClient

  • Creates two HelixAdmin/ConfigAccessor pairs: one for zookeeper.address (storage cluster ops), one for controller.cluster.zk.address (controller cluster + HaaS grand cluster ops).
  • When both addresses are equal (the default — all existing deployments), the second pair aliases the first to avoid an extra ZK connection. No behaviour change for existing deployments.
  • When addresses differ (split-ZK deployment), operations on the controller cluster and HaaS grand cluster are routed to the controller ZK; storage cluster operations stay on the storage ZK.

Integration tests

Two new test classes added:

TestVeniceControllerZkClientIsolation

  • testSharedZkBaseline: regression guard — shared ZK still works (aliases path)
  • testControllerClusterIsolatedFromStorageZk: split ZK — asserts via ZKHelixAdmin.getClusters() that controller cluster lands only on the controller ZK and the storage cluster lands only on the storage ZK

TestVeniceControllerZkIsolationWithHAAS (mirrors TestHAASController pattern)

  • testControllerClusterAndSuperClusterIsolatedOnDedicatedZk: starts HelixAsAServiceWrapper on a dedicated controller ZK, configures the Venice controller with controller.cluster.zk.address = <controller ZK>, and uses ZK inspector probes to verify each ensemble holds exactly the clusters it should

Testing

  • Shared-ZK path: covered by existing TestHAASController (8 tests, all pass with no changes) + new testSharedZkBaseline
  • Split-ZK path: new testControllerClusterIsolatedFromStorageZk and testControllerClusterAndSuperClusterIsolatedOnDedicatedZk

nvijayak and others added 2 commits June 8, 2026 10:15
controller.cluster.zk.address

ZkHelixAdminClient previously opened a single ZkClient against
zookeeper.address
(via getZkAddress()) and used the resulting ZKHelixAdmin for every operation,
including those targeting the controller cluster (controllerClusterName) and
the
HAAS grand cluster (haasSuperClusterName). When controller.cluster.zk.address
differs from zookeeper.address, this caused a split: the participant
ZKHelixManager opened by VeniceHelixAdmin#connectToControllerCluster correctly
joined on controller.cluster.zk.address, but admin operations such as
enablePartition(controllerClusterName, ...) called the captured ZkClient on
zookeeper.address and predictably failed with "instance config does not exist"
because the participant config had been written to the other ensemble.

This change adds a second HelixAdmin/ConfigAccessor connected to
controller.cluster.zk.address. helixAdminFor(clusterName) routes operations on
the controller cluster and HAAS grand cluster to the new admin while every
other cluster (storage clusters) continues to use the storage-ZK admin. When
the two ZK addresses are equal (the common case), the second admin aliases
the first to avoid opening a redundant connection, so existing single-ZK
deployments are unchanged.

Tests: extends TestZkHelixAdminClient with testSplitZkRouting covering
isVeniceControllerClusterCreated, isVeniceStorageClusterCreated,
isClusterInGrandCluster, enablePartition (both clusterName branches), and
updateClusterConfigs. Existing tests alias the new fields to the same mocks
to preserve single-ZK semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
controller/storage ZK

Add two integration test classes that prove a Venice controller can be
pointed at two separate ZooKeeper ensembles:

TestVeniceControllerZkClientIsolation
- testSharedZkBaseline: shared ZK regression guard (both clusters land
  on the single ZK, aliasing path in ZkHelixAdminClient is exercised)
- testControllerClusterIsolatedFromStorageZk: split-ZK path — asserts
  via ZKHelixAdmin.getClusters() that the controller cluster lands on
  the controller ZK only, and the storage cluster lands on the storage
  ZK only.

TestVeniceControllerZkIsolationWithHAAS
- testControllerClusterAndSuperClusterIsolatedOnDedicatedZk: mirrors
  the TestHAASController pattern (VeniceClusterWrapper +
  HelixAsAServiceWrapper) but starts HaaS on a dedicated controller ZK
  and sets controller.cluster.zk.address accordingly. Asserts that each
  ZK ensemble holds exactly the Helix clusters it should — the
  controller ZK gets helix_controllers + venice-controllers, the Venice
  ZK gets only the storage cluster.

These are the 'work backwards' integration tests Xun recommended: run
the full controller integration suite once with shared ZK (existing
TestHAASController covers this) and once with split ZK (new tests here).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@namithanivead namithanivead marked this pull request as draft June 8, 2026 18:05
nvijayak and others added 4 commits June 8, 2026 12:22
…rage

TestHAASController.testConcurrentClusterInitialization and
testCloudConfig both mock VeniceControllerMultiClusterConfig but never
stub getControllerClusterZkAddress(). The new ZkHelixAdminClient
constructor reads this value; Mockito returns null for unstubbed calls,
causing storageZkAddress.equals(null) == false, which falls into the
split-ZK else branch and calls ZkClientFactory.newZkClient(null). The
resulting ZkClient has _zkServers=null, causing an NPE when any Helix
operation is invoked.

Fix: stub getControllerClusterZkAddress() to the same address as
getZkAddress() in both tests, reflecting the shared-ZK default.

Also extends testSplitZkRouting to cover the previously-missing branch
in helixAdminFor/helixConfigAccessorFor: the case where clusterName
matches haasSuperClusterName but not controllerClusterName (the B=true
arm of the OR condition). Adds testCloseSharedZk to cover the
close() branch where controllerClusterHelixAdmin aliases helixAdmin.
These additions bring diffCoverage branch coverage above the 50%
threshold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…er ZK

In split-ZK deployments, createControllerClusterIfRequired() was using
this.admin (connected to the storage ZK) to create the venice-controllers
cluster. connectToControllerCluster() however connects via
getControllerClusterZkAddress() (the controller ZK), causing a
'Cluster structure is not set up for cluster: venice-controllers' error
when the two ZKs differ.

Fix: delegate to helixAdminClient.createVeniceControllerCluster(), which
uses controllerClusterHelixAdmin (routed to controller.cluster.zk.address).
This is correct for both shared-ZK (no behavioral change) and split-ZK
deployments, and is a strict superset of the old implementation (same config
plus retry logic and cloud-config support).

Also fix TestVeniceControllerZkIsolationWithHAAS: add
VENICE_STORAGE_CLUSTER_LEADER_HAAS=true alongside the existing
CONTROLLER_CLUSTER_LEADER_HAAS=true. Without this, initStorageCluster()
takes the createClusterIfRequired path which (a) uses admin (storage ZK) to
register the storage cluster as a resource in venice-controllers on the
wrong ZK, and (b) causes waitUntilClusterResourceIsVisibleInEV() to wait
up to 5 minutes for an ExternalView that never appears, hitting the 120s
TestNG timeout. In a real HAAS + split-ZK deployment both flags always travel
together.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…er ops

In split-ZK mode, the controller cluster (venice-controllers) lives on
controller.cluster.zk.address, not on zookeeper.address. The four
admin.addResource/getResourceIdealState/setResourceIdealState/rebalance
calls in createClusterIfRequired all used admin (storage ZK), causing
'cluster venice-controllers is not setup yet' because that cluster does
not exist on the storage ZK.

Fix: replace those calls with
helixAdminClient.addVeniceStorageClusterToControllerCluster()
which correctly routes to the controller ZK via controllerClusterHelixAdmin.
This is safe in both shared-ZK (no behavioral change, same ZK) and split-ZK
deployments. Also remove now-unused DelayedAutoRebalancer,
CrushRebalanceStrategy
imports and the CONTROLLER_CLUSTER_NUMBER_OF_PARTITION constant.

Also fix TestVeniceControllerZkClientIsolation: change
@BeforeClass/@afterclass
for storageZk and controllerZk to @BeforeMethod/@AfterMethod. The split-ZK
test creates venice-controllers on controllerZk; without per-method ZK
lifecycle, that data leaks into testSharedZkBaseline which then fails its
assertion that controllerZk holds no Venice clusters.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
addStorageClusterResourceToControllerCluster

The previous fix called addVeniceStorageClusterToControllerCluster() from
createClusterIfRequired(), which uses WagedRebalancer. WagedRebalancer
requires instance capacity configuration not present in integration tests,
so waitUntilClusterResourceIsVisibleInEV() blocked for 5 minutes, causing
all integration tests that start a Venice cluster to hang and hit their
TestNG timeouts.

Fix: introduce addStorageClusterResourceToControllerCluster(clusterName,
replicaCount)
in HelixAdminClient/ZkHelixAdminClient. This method:
- Uses controllerClusterHelixAdmin (routes to controller.cluster.zk.address)
  so it is correct in both shared-ZK and split-ZK deployments
- Keeps the original DelayedAutoRebalancer + CrushRebalanceStrategy config
  from the old createClusterIfRequired implementation, preserving assignment
  behavior for the non-HAAS path

createClusterIfRequired now calls
helixAdminClient.addStorageClusterResourceToControllerCluster
instead of the direct admin.addResource/setResourceIdealState/rebalance calls.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant