[controller] Route controller-cluster Helix ops to controller.cluster.zk.address + ZK isolation integration tests#2848
Draft
namithanivead wants to merge 6 commits into
Conversation
controller.cluster.zk.address ZkHelixAdminClient previously opened a single ZkClient against zookeeper.address (via getZkAddress()) and used the resulting ZKHelixAdmin for every operation, including those targeting the controller cluster (controllerClusterName) and the HAAS grand cluster (haasSuperClusterName). When controller.cluster.zk.address differs from zookeeper.address, this caused a split: the participant ZKHelixManager opened by VeniceHelixAdmin#connectToControllerCluster correctly joined on controller.cluster.zk.address, but admin operations such as enablePartition(controllerClusterName, ...) called the captured ZkClient on zookeeper.address and predictably failed with "instance config does not exist" because the participant config had been written to the other ensemble. This change adds a second HelixAdmin/ConfigAccessor connected to controller.cluster.zk.address. helixAdminFor(clusterName) routes operations on the controller cluster and HAAS grand cluster to the new admin while every other cluster (storage clusters) continues to use the storage-ZK admin. When the two ZK addresses are equal (the common case), the second admin aliases the first to avoid opening a redundant connection, so existing single-ZK deployments are unchanged. Tests: extends TestZkHelixAdminClient with testSplitZkRouting covering isVeniceControllerClusterCreated, isVeniceStorageClusterCreated, isClusterInGrandCluster, enablePartition (both clusterName branches), and updateClusterConfigs. Existing tests alias the new fields to the same mocks to preserve single-ZK semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
controller/storage ZK Add two integration test classes that prove a Venice controller can be pointed at two separate ZooKeeper ensembles: TestVeniceControllerZkClientIsolation - testSharedZkBaseline: shared ZK regression guard (both clusters land on the single ZK, aliasing path in ZkHelixAdminClient is exercised) - testControllerClusterIsolatedFromStorageZk: split-ZK path — asserts via ZKHelixAdmin.getClusters() that the controller cluster lands on the controller ZK only, and the storage cluster lands on the storage ZK only. TestVeniceControllerZkIsolationWithHAAS - testControllerClusterAndSuperClusterIsolatedOnDedicatedZk: mirrors the TestHAASController pattern (VeniceClusterWrapper + HelixAsAServiceWrapper) but starts HaaS on a dedicated controller ZK and sets controller.cluster.zk.address accordingly. Asserts that each ZK ensemble holds exactly the Helix clusters it should — the controller ZK gets helix_controllers + venice-controllers, the Venice ZK gets only the storage cluster. These are the 'work backwards' integration tests Xun recommended: run the full controller integration suite once with shared ZK (existing TestHAASController covers this) and once with split ZK (new tests here). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rage TestHAASController.testConcurrentClusterInitialization and testCloudConfig both mock VeniceControllerMultiClusterConfig but never stub getControllerClusterZkAddress(). The new ZkHelixAdminClient constructor reads this value; Mockito returns null for unstubbed calls, causing storageZkAddress.equals(null) == false, which falls into the split-ZK else branch and calls ZkClientFactory.newZkClient(null). The resulting ZkClient has _zkServers=null, causing an NPE when any Helix operation is invoked. Fix: stub getControllerClusterZkAddress() to the same address as getZkAddress() in both tests, reflecting the shared-ZK default. Also extends testSplitZkRouting to cover the previously-missing branch in helixAdminFor/helixConfigAccessorFor: the case where clusterName matches haasSuperClusterName but not controllerClusterName (the B=true arm of the OR condition). Adds testCloseSharedZk to cover the close() branch where controllerClusterHelixAdmin aliases helixAdmin. These additions bring diffCoverage branch coverage above the 50% threshold. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…er ZK In split-ZK deployments, createControllerClusterIfRequired() was using this.admin (connected to the storage ZK) to create the venice-controllers cluster. connectToControllerCluster() however connects via getControllerClusterZkAddress() (the controller ZK), causing a 'Cluster structure is not set up for cluster: venice-controllers' error when the two ZKs differ. Fix: delegate to helixAdminClient.createVeniceControllerCluster(), which uses controllerClusterHelixAdmin (routed to controller.cluster.zk.address). This is correct for both shared-ZK (no behavioral change) and split-ZK deployments, and is a strict superset of the old implementation (same config plus retry logic and cloud-config support). Also fix TestVeniceControllerZkIsolationWithHAAS: add VENICE_STORAGE_CLUSTER_LEADER_HAAS=true alongside the existing CONTROLLER_CLUSTER_LEADER_HAAS=true. Without this, initStorageCluster() takes the createClusterIfRequired path which (a) uses admin (storage ZK) to register the storage cluster as a resource in venice-controllers on the wrong ZK, and (b) causes waitUntilClusterResourceIsVisibleInEV() to wait up to 5 minutes for an ExternalView that never appears, hitting the 120s TestNG timeout. In a real HAAS + split-ZK deployment both flags always travel together. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…er ops In split-ZK mode, the controller cluster (venice-controllers) lives on controller.cluster.zk.address, not on zookeeper.address. The four admin.addResource/getResourceIdealState/setResourceIdealState/rebalance calls in createClusterIfRequired all used admin (storage ZK), causing 'cluster venice-controllers is not setup yet' because that cluster does not exist on the storage ZK. Fix: replace those calls with helixAdminClient.addVeniceStorageClusterToControllerCluster() which correctly routes to the controller ZK via controllerClusterHelixAdmin. This is safe in both shared-ZK (no behavioral change, same ZK) and split-ZK deployments. Also remove now-unused DelayedAutoRebalancer, CrushRebalanceStrategy imports and the CONTROLLER_CLUSTER_NUMBER_OF_PARTITION constant. Also fix TestVeniceControllerZkClientIsolation: change @BeforeClass/@afterclass for storageZk and controllerZk to @BeforeMethod/@AfterMethod. The split-ZK test creates venice-controllers on controllerZk; without per-method ZK lifecycle, that data leaks into testSharedZkBaseline which then fails its assertion that controllerZk holds no Venice clusters. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
addStorageClusterResourceToControllerCluster The previous fix called addVeniceStorageClusterToControllerCluster() from createClusterIfRequired(), which uses WagedRebalancer. WagedRebalancer requires instance capacity configuration not present in integration tests, so waitUntilClusterResourceIsVisibleInEV() blocked for 5 minutes, causing all integration tests that start a Venice cluster to hang and hit their TestNG timeouts. Fix: introduce addStorageClusterResourceToControllerCluster(clusterName, replicaCount) in HelixAdminClient/ZkHelixAdminClient. This method: - Uses controllerClusterHelixAdmin (routes to controller.cluster.zk.address) so it is correct in both shared-ZK and split-ZK deployments - Keeps the original DelayedAutoRebalancer + CrushRebalanceStrategy config from the old createClusterIfRequired implementation, preserving assignment behavior for the non-HAAS path createClusterIfRequired now calls helixAdminClient.addStorageClusterResourceToControllerCluster instead of the direct admin.addResource/setResourceIdealState/rebalance calls. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR implements ZooKeeper isolation for the Venice parent controller by separating the Helix controller-cluster ZK (used for controller leader election and HaaS super cluster) from the Venice storage-cluster ZK (used for store metadata, schemas, push status).
Problem
Previously,
ZkHelixAdminClientalways connected tozookeeper.address(the storage ZK) for all Helix admin operations — including creating thevenice-controllerscontroller cluster and the HaaS grand cluster. This means the controller cluster znodes and storage cluster znodes were co-mingled on the same ZK ensemble, making it impossible to isolate them.The config key
controller.cluster.zk.address(CONTROLLER_CLUSTER_ZK_ADDRESSS) already existed and was parsed correctly (defaulting tozookeeper.addresswhen not set), but no code actually used it for the Helix admin client.Solution
Production change —
ZkHelixAdminClientHelixAdmin/ConfigAccessorpairs: one forzookeeper.address(storage cluster ops), one forcontroller.cluster.zk.address(controller cluster + HaaS grand cluster ops).Integration tests
Two new test classes added:
TestVeniceControllerZkClientIsolationtestSharedZkBaseline: regression guard — shared ZK still works (aliases path)testControllerClusterIsolatedFromStorageZk: split ZK — asserts viaZKHelixAdmin.getClusters()that controller cluster lands only on the controller ZK and the storage cluster lands only on the storage ZKTestVeniceControllerZkIsolationWithHAAS(mirrorsTestHAASControllerpattern)testControllerClusterAndSuperClusterIsolatedOnDedicatedZk: startsHelixAsAServiceWrapperon a dedicated controller ZK, configures the Venice controller withcontroller.cluster.zk.address = <controller ZK>, and uses ZK inspector probes to verify each ensemble holds exactly the clusters it shouldTesting
TestHAASController(8 tests, all pass with no changes) + newtestSharedZkBaselinetestControllerClusterIsolatedFromStorageZkandtestControllerClusterAndSuperClusterIsolatedOnDedicatedZk