Mle-28303 Dynamic Host Scaling Feature Implementation #168

Merged
pengzhouml merged 45 commits into develop from feature/MLE-28303-dynamic-host on Jun 3, 2026

Conversation

@pengzhouml (Collaborator) commented on May 18, 2026

This pull request introduces significant improvements to the E2E testing pipeline and Makefile, with a focus on supporting new E2E test scopes, better handling of local images for Minikube, and stricter validation for dynamic group configurations. It also updates the default version to 1.3.0 and enhances developer experience with more flexible and robust Makefile targets.

E2E Pipeline and Testing Enhancements:

  • Added support for selecting E2E test scopes (cluster, dynamic-host, volume-resize) via the E2E_SCOPE parameter in the Jenkins pipeline, including validation and error handling for unsupported combinations (e.g., restricting non-cluster scopes when running on EKS). Istio and Helm namespace-scoped tests are now only run for the cluster scope. (Jenkinsfile)

  • Introduced new Makefile targets for focused E2E tests:

    • e2e-test-dynamic-host and e2e-test-dynamic-host-local for dynamic-host lifecycle tests, with logic to build/load local images for Minikube contexts.
    • e2e-test-volume-resize-local for volume-resize tests with local image and Minikube storage class setup.
    • Improved all E2E targets to load images into Minikube when appropriate, reducing remote pull failures and making local development easier. (Makefile)
  • Increased E2E test timeouts to 60 minutes for most targets and made timeouts configurable via E2E_TEST_TIMEOUT. (Makefile)

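The scope validation described above lives in the Jenkinsfile, which is not shown here. As a rough illustration only, the rule "non-cluster scopes are rejected on EKS" could be sketched in Go as follows. The function name `validateE2EScope` and the platform strings `eks` and `minikube` are assumptions made for this example; only the three scope names come from the description.

```go
package main

import "fmt"

// knownScopes lists the scope names given in the PR description.
var knownScopes = map[string]bool{
	"cluster":       true,
	"dynamic-host":  true,
	"volume-resize": true,
}

// validateE2EScope is a hypothetical helper, not code from the PR.
// It rejects unknown scopes and, on EKS, any scope other than "cluster".
func validateE2EScope(scope, platform string) error {
	if !knownScopes[scope] {
		return fmt.Errorf("unsupported E2E_SCOPE %q", scope)
	}
	if platform == "eks" && scope != "cluster" {
		return fmt.Errorf("E2E_SCOPE %q is not supported on EKS; use \"cluster\"", scope)
	}
	return nil
}

func main() {
	cases := []struct{ scope, platform string }{
		{"cluster", "eks"},
		{"dynamic-host", "minikube"},
		{"volume-resize", "eks"},
		{"smoke", "minikube"},
	}
	for _, c := range cases {
		fmt.Printf("%s on %s: %v\n", c.scope, c.platform, validateE2EScope(c.scope, c.platform))
	}
}
```

Failing fast on an unsupported combination keeps a long E2E run from starting with a scope it cannot execute.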
Image and Build Improvements:

  • Updated default operator image version to 1.3.0 and added a LOCAL_E2E_IMG variable for local E2E testing. (Makefile)
  • Changed Docker build command to use --load for compatibility with local image loading. (Makefile)

Kustomize and Tooling:

  • Enhanced the kustomize Makefile target to respect externally set paths and provide better error messaging if the binary is not executable. (Makefile)

API Validation and CRD Changes:

  • Added a DynamicGroupConfig struct with strict ISO-8601 duration validation for the tokenDuration field. (api/v1/common_types.go)
  • Added CRD-level validation to ensure dynamic hosts require image tag latest or MarkLogic major version 12+. Also, added a max length validation for the image field. (api/v1/marklogiccluster_types.go)

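The two validation rules above are enforced at the CRD level in the PR, and the actual patterns are not shown in this conversation. The following Go sketch only illustrates the kind of checks involved. Both function names are hypothetical, the duration pattern covers a common subset of ISO-8601 (no week designator), and treating untagged images as invalid is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// durationRe matches a subset of ISO-8601 durations such as PT15M or P1DT2H.
// Go's regexp package has no lookahead, so empty forms are rejected separately.
var durationRe = regexp.MustCompile(`^P(?:\d+Y)?(?:\d+M)?(?:\d+D)?(?:T(?:\d+H)?(?:\d+M)?(?:\d+(?:\.\d+)?S)?)?$`)

// validTokenDuration is a hypothetical stand-in for the tokenDuration check.
func validTokenDuration(s string) bool {
	if !durationRe.MatchString(s) {
		return false
	}
	// Reject "P" alone and a dangling "T" with no time component.
	return len(s) > 1 && !strings.HasSuffix(s, "T")
}

// imageAllowsDynamicHosts is a hypothetical stand-in for the image rule:
// the tag must be "latest" or start with a major version of 12 or higher.
func imageAllowsDynamicHosts(image string) bool {
	slash := strings.LastIndex(image, "/")
	colon := strings.LastIndex(image, ":")
	if colon <= slash {
		return false // no tag (a colon before the last slash is a registry port)
	}
	tag := image[colon+1:]
	if tag == "latest" {
		return true
	}
	end := 0
	for end < len(tag) && tag[end] >= '0' && tag[end] <= '9' {
		end++
	}
	if end == 0 {
		return false
	}
	major, err := strconv.Atoi(tag[:end])
	return err == nil && major >= 12
}

func main() {
	for _, d := range []string{"PT15M", "P1DT2H", "P", "15m"} {
		fmt.Printf("duration %q valid: %v\n", d, validTokenDuration(d))
	}
	for _, img := range []string{"example.com/marklogic:12.0.1", "example.com/marklogic:11.3.1", "marklogic:latest"} {
		fmt.Printf("image %q allowed: %v\n", img, imageAllowsDynamicHosts(img))
	}
}
```

The image names in `main` are placeholders. Digest references (`@sha256:...`) are not handled by this sketch.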
Developer Experience:

  • Added a kill-envtest Makefile target to clean up stale test processes. (Makefile)

These changes collectively improve the reliability, flexibility, and developer usability of the E2E testing and build workflow, while enforcing important validation rules for dynamic group configurations and image selection.

Copilot AI review requested due to automatic review settings May 18, 2026 16:45
@pengzhouml changed the title from "Mle-28303 dynamic host" to "Mle-28303 Dynamic Host Scaling Feature Implementation" on May 18, 2026

Copilot AI left a comment

Pull request overview

Adds dynamic MarkLogic host pool support to the operator, including API fields, CRD/status schema, controller reconciliation, management API client logic, dynamic pod startup behavior, and a functional spec.

Changes:

  • Adds dynamic group API fields/status and generated CRD/deepcopy updates.
  • Adds dynamic host lifecycle reconciliation for group configuration, token join, scale-down cleanup, restart recovery, and finalizers.
  • Updates StatefulSet/Service generation, startup scripts, controller watches, and tests/spec docs for dynamic groups.

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| pkg/mlmanage/client.go | Adds MarkLogic Management API client for dynamic host operations. |
| pkg/k8sutil/statefulset.go | Adds dynamic labels, env var, readiness probe, and scale-down delay behavior. |
| pkg/k8sutil/service.go | Selects dynamic pods via component labels. |
| pkg/k8sutil/secret.go | Creates shared manage-admin credentials for dynamic groups. |
| pkg/k8sutil/scripts/cluster-init-wrapper.sh | Skips static init/join in dynamic mode. |
| pkg/k8sutil/scripts/cluster-config.sh | Adds dynamic-mode guard. |
| pkg/k8sutil/marklogicServer.go | Propagates dynamic config/defaults to child MarklogicGroup resources. |
| pkg/k8sutil/handler.go | Invokes dynamic reconciliation for dynamic groups. |
| pkg/k8sutil/dynamic_reconcile.go | Implements dynamic host lifecycle reconciliation. |
| pkg/k8sutil/common.go | Adds dynamic/static component label helpers. |
| internal/controller/marklogicgroup_controller.go | Adds pod update handling and pod ownership watch. |
| internal/controller/marklogiccluster_controller_test.go | Adds dynamic group propagation tests. |
| docs/spec/Dynamic Host.md | Adds functional specification for dynamic host support. |
| config/crd/bases/marklogic.progress.com_marklogicgroups.yaml | Adds dynamic spec/status schema for MarklogicGroup. |
| config/crd/bases/marklogic.progress.com_marklogicclusters.yaml | Adds dynamic fields/validation for MarklogicCluster groups. |
| api/v1/zz_generated.deepcopy.go | Adds deepcopy support for dynamic structs/fields. |
| api/v1/marklogicgroup_types.go | Adds dynamic fields and status structs to MarklogicGroup API. |
| api/v1/marklogiccluster_types.go | Adds dynamic fields to cluster group entries. |
| api/v1/common_types.go | Adds dynamic group configuration type. |
Files not reviewed (1)
  • api/v1/zz_generated.deepcopy.go: Language not supported

Comment thread pkg/k8sutil/dynamic_reconcile.go Outdated
Comment thread pkg/k8sutil/dynamic_reconcile.go Outdated
Comment thread pkg/mlmanage/client.go Outdated
Comment thread pkg/k8sutil/dynamic_reconcile.go Outdated
Comment thread pkg/k8sutil/dynamic_reconcile.go
Comment thread api/v1/marklogicgroup_types.go
Comment thread pkg/k8sutil/dynamic_reconcile.go
Comment thread pkg/k8sutil/dynamic_reconcile.go Outdated
Comment thread api/v1/common_types.go
Comment thread pkg/k8sutil/dynamic_reconcile.go Outdated
@vitalykorolev (Collaborator) commented:

@pengzhouml please respond to Copilot's comments and mark conversations as Resolved.

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 22 changed files in this pull request and generated 5 comments.

Files not reviewed (1)
  • api/v1/zz_generated.deepcopy.go: Language not supported
Comments suppressed due to low confidence (1)

pkg/k8sutil/statefulset.go:115

  • patchDiff.String() is logged before checking err from patch.DefaultPatchMaker.Calculate. If Calculate fails, patchDiff may be unusable and calling methods on it can panic or log misleading output. Please check/return on err before using patchDiff (including after the second Calculate call in the dynamic scale-down delay branch).
	patchDiff, err := patch.DefaultPatchMaker.Calculate(currentSts, statefulSetDef,
		patch.IgnoreStatusFields(),
		patch.IgnoreVolumeClaimTemplateTypeMetaAndStatus(),
		patch.IgnoreField("kind"))
	if shouldDelayDynamicEmptyDirScaleDown(cr, currentSts) {
		statefulSetDef.Spec.Replicas = currentSts.Spec.Replicas
		patchDiff, err = patch.DefaultPatchMaker.Calculate(currentSts, statefulSetDef,
			patch.IgnoreStatusFields(),
			patch.IgnoreVolumeClaimTemplateTypeMetaAndStatus(),
			patch.IgnoreField("kind"))
	}
	logger.Info("Patch Diff:", "Diff", patchDiff.String())
	logger.Info("statefulSetDef Spec:", "Spec", statefulSetDef.Spec.Replicas)
	if err != nil {
		logger.Error(err, "Error calculating patch")
		return result.Error(err).Output()
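
One way to address the comment above is to check `err` immediately after each calculation and only then touch the result. The sketch below is self-contained and does not use the real patch library: `patchResult`, `calcFunc`, `okCalc`, `failingCalc`, and `calculateDiff` are hypothetical stand-ins chosen for illustration, and the actual fix merged in the PR may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// patchResult is a stand-in for the result type of the real patch library.
type patchResult struct{ diff string }

// String dereferences p, so calling it on a nil result panics.
func (p *patchResult) String() string { return p.diff }

// calcFunc stands in for a patch calculation at a given replica count.
type calcFunc func(replicas int32) (*patchResult, error)

func okCalc(replicas int32) (*patchResult, error) {
	return &patchResult{diff: fmt.Sprintf("replicas=%d", replicas)}, nil
}

func failingCalc(int32) (*patchResult, error) {
	return nil, errors.New("simulated calculation failure")
}

// calculateDiff checks err after every calculation, including the second one
// in the delayed scale-down branch, before the result is used or logged.
func calculateDiff(calc calcFunc, desired, current int32, delayScaleDown bool) (string, error) {
	patchDiff, err := calc(desired)
	if err != nil {
		return "", fmt.Errorf("calculating patch: %w", err)
	}
	if delayScaleDown {
		// Hold replicas at the current value and recompute the diff.
		patchDiff, err = calc(current)
		if err != nil {
			return "", fmt.Errorf("calculating delayed scale-down patch: %w", err)
		}
	}
	return patchDiff.String(), nil
}

func main() {
	diff, err := calculateDiff(okCalc, 1, 3, true)
	fmt.Println(diff, err)

	_, err = calculateDiff(failingCalc, 1, 3, false)
	fmt.Println("error returned instead of a panic:", err)
}
```

With this ordering, a failed calculation returns an error before any method is called on a possibly nil result.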

Comment thread pkg/mlmanage/client.go Outdated
Comment thread config/crd/bases/marklogic.progress.com_marklogicgroups.yaml
Comment thread config/crd/bases/marklogic.progress.com_marklogicclusters.yaml
Comment thread pkg/k8sutil/dynamic_reconcile.go Outdated
Comment thread internal/controller/marklogicgroup_controller.go
Peng Zhou and others added 13 commits May 26, 2026 23:11
Introduce the initial Dynamic Host scaffolding without implementing
Management API join/remove workflows.

- add isDynamic and dynamic config to cluster/group APIs
- add DynamicGroup status types and deepcopy support
- propagate dynamic fields from MarklogicCluster to MarklogicGroup
- default dynamic groups to RollingUpdate and non-persistent datadir
- use dynamic-host selector/labels for dynamic StatefulSets and Services
- inject MARKLOGIC_DYNAMIC_HOST into dynamic pods
- switch dynamic readiness to TCP on port 8001
- guard shell startup so dynamic pods skip static bootstrap/join logic
- add focused controller tests for milestone 1 behavior
- remove invalid defaulting of dynamic config on static groups
- keep omitted dynamic persistence unset instead of creating invalid spec
- skip bootstrap network gating for dynamic pod startup
- regenerate CRDs after validation tag updates
- tighten controller tests for readiness defaults and unique group names
- verify controller suite passes with envtest

Introduce controller-side dynamic host bootstrap and configuration
without implementing token join/remove flows.

- add Management API client plumbing for dynamic host operations
- create operator-managed manage-admin credentials for dynamic reconcile
- bootstrap and reconcile the MarkLogic manage-admin user
- add dynamic reconcile branch for bootstrap readiness and version checks
- ensure dynamic group creation and one-time dynamic host configuration
- record dynamic configuration state in MarklogicGroup status
- extend envtest coverage with fake-client based controller tests

Implement controller-driven dynamic host scale-up joins without
introducing remove or restart-recovery behavior yet.

- add token request and join flow to dynamic reconcile
- join locally ready dynamic pods sequentially
- verify MarkLogic membership before marking hosts joined
- record per-host join state and retry progress in dynamic status
- preserve retry-budget accounting across transient join failures
- tighten fake management client behavior for transient token retries
- extend envtest coverage for successful, degraded, and exhausted-retry joins

Implement storage-aware dynamic host cleanup for scale-down,
scale-to-zero, and group deletion.

- add dynamic-host remove support to the management client
- add pod and group finalizers for dynamic cleanup
- remove EmptyDir-backed hosts before allowing pod deletion
- retain PVC-backed hosts during ordinary scale-down
- clean up dynamic groups on deletion and scale-to-zero
- preserve cleanup and failure state across reconciles
- harden fake client host tracking for multi-group controller tests
- extend envtest coverage for scale-down and cleanup behavior

…c hosts

- detect restart membership loss when pods are locally ready but absent from MarkLogic group membership
- add restart-recovery flow with explicit host states (rejoin-pending, rejoining, rejoined) and ClusterRestartDetected reason
- support PVC-backed restart recovery by cleaning retained state before rejoin
- add retry-budget handling for restart cleanup and restart rejoin failures
- preserve rejoined host state in dynamic status reconstruction
- add controller envtests for EmptyDir rejoin, restart status visibility, PVC cleanup-before-rejoin, partial recovery, and bootstrap-unavailable recovery paths
- stabilize restart-focused tests with deterministic reconcile triggering and robust token-call assertions
- make restart-recovery specs deterministic by explicitly triggering reconciles after fake backend mutations
- replace brittle transient status checks with durable recovery assertions
- tighten token-call verification to host-specific counts to avoid cross-spec noise
- relax final state expectations to accept stable joined or rejoined outcomes where timing can vary

…observability

- replace steady-state phase configured with idle across reconcile flow/tests
- add dynamic status timestamps: lastTransitionTime and host lastUpdated
- add PodStartupTimeout detection for pods that never become locally ready
- set degraded reason to PodStartupTimeout when startup timeout is hit
- emit dynamic lifecycle transition events (normal/warning)
- add/update DynamicHostsReady condition lifecycle
- update CRD schema and deepcopy generation for new status fields
- expand envtests for timeout, condition state, and event assertions

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

fix(dynamic): align status.dynamic.phase enum casing with API contract

Use Pending, Reconciling, Deleting, Degraded, Failed, Idle for dynamic phases
Update controller tests to assert the new phase values
Verified with focused k8sutil and controller TestAPIs test runs
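
The phase values named in the commit message above could be modeled as a typed string with a validity check. This is a sketch under assumptions: the type name `DynamicPhase`, the constant names, and the `IsValid` method are hypothetical, and the PR's actual definitions in api/v1 are not shown in this conversation.

```go
package main

import "fmt"

// DynamicPhase is a hypothetical type name for status.dynamic.phase.
// In a kubebuilder project the same set is typically also enforced in the CRD
// schema with a marker such as:
// +kubebuilder:validation:Enum=Pending;Reconciling;Deleting;Degraded;Failed;Idle
type DynamicPhase string

const (
	DynamicPhasePending     DynamicPhase = "Pending"
	DynamicPhaseReconciling DynamicPhase = "Reconciling"
	DynamicPhaseDeleting    DynamicPhase = "Deleting"
	DynamicPhaseDegraded    DynamicPhase = "Degraded"
	DynamicPhaseFailed      DynamicPhase = "Failed"
	DynamicPhaseIdle        DynamicPhase = "Idle"
)

// IsValid reports whether p is one of the capitalized phase values.
// The comparison is case-sensitive, so "idle" is not accepted.
func (p DynamicPhase) IsValid() bool {
	switch p {
	case DynamicPhasePending, DynamicPhaseReconciling, DynamicPhaseDeleting,
		DynamicPhaseDegraded, DynamicPhaseFailed, DynamicPhaseIdle:
		return true
	}
	return false
}

func main() {
	for _, p := range []DynamicPhase{"Idle", "idle", "Reconciling", "configured"} {
		fmt.Printf("%q valid: %v\n", p, p.IsValid())
	}
}
```

Using named constants in the controller and tests keeps the casing in one place, which is the kind of mismatch this commit fixes.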

@rwinieski rwinieski left a comment

Pending Jenkins to be green.

Otherwise the code looks good to me.

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

rwinieski previously approved these changes on Jun 1, 2026
@pengzhouml requested a review from rwinieski on June 2, 2026
@pengzhouml merged commit e4870e4 into develop on Jun 3, 2026
4 checks passed
5 participants