⚠️ [Warm Replicas] Implement warm replica support for controllers. #3192

Draft: wants to merge 5 commits into base: main

godwinpang
Contributor

This change implements the warm replicas proposal from #3121.
It adds a NeedWarmUp option that lets controllers opt in to starting as warm replicas.

Note for reviewers: This draft PR is feature complete with tests, but its main purpose is to confirm that the approach is on the right track. Some of the naming and comments are a bit inconsistent; I will address them in a follow-up cleanup tomorrow.

Builds upon #3190.
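
As a rough illustration of the opt-in described above, here is a minimal, self-contained sketch. The `Options` struct, the `*bool` type of `NeedWarmUp`, and the `needsWarmUp` helper are assumptions made for this example; they are not taken from the PR's diff and may differ from what the PR actually adds.

```go
package main

import "fmt"

// Options is a hypothetical, trimmed-down stand-in for controller options.
// Only the field name NeedWarmUp comes from the PR description; its type
// is an assumption.
type Options struct {
	// NeedWarmUp, when set to true, asks for the controller to be
	// started as a warm replica. A nil value means "not configured".
	NeedWarmUp *bool
}

// needsWarmUp treats an unset option as "no warmup", so controllers that
// do not opt in keep their existing behavior.
func needsWarmUp(o Options) bool {
	return o.NeedWarmUp != nil && *o.NeedWarmUp
}

func main() {
	enabled := true
	fmt.Println(needsWarmUp(Options{}))                     // false
	fmt.Println(needsWarmUp(Options{NeedWarmUp: &enabled})) // true
}
```

Using a pointer here is one way to distinguish "unset" from "explicitly false"; a plain bool would work as well if that distinction is not needed.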

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: godwinpang
Once this PR has been reviewed and has the lgtm label, please assign sbueringer for approval. For more information see the Code Review Process.


@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 9, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 9, 2025
@k8s-ci-robot
Contributor

Hi @godwinpang. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

@godwinpang godwinpang marked this pull request as draft April 9, 2025 07:12
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 9, 2025
@sbueringer
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2025
@godwinpang
Contributor Author

/retest

@@ -439,6 +439,11 @@ func (cm *controllerManager) Start(ctx context.Context) (err error) {
		return fmt.Errorf("failed to start other runnables: %w", err)
	}

	// Start and wait for sources to start.
	if err := cm.runnables.Warmup.Start(cm.internalCtx); err != nil {
Member

I think this should come right after the caches, and the comment shouldn't make assumptions about what the Warmup does internally.

The other issue: this needs to block until the Warmup has terminated. Otherwise we may end up starting the controller before the sources are started, because the check we added only verifies that we began starting the sources, not that we finished doing so.
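
The difference between "began starting" and "finished starting" can be sketched as follows. This is a minimal illustration, not the PR's implementation: the `warmup` type, its fields, and the `Ready`/`Wait` methods are hypothetical names invented for this example.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// warmup is a hypothetical stand-in for a group of warmup runnables.
type warmup struct {
	started atomic.Bool   // set as soon as Start is entered
	entered chan struct{} // closed when Start is entered
	release chan struct{} // closed by the caller to let the "sources" finish
	done    chan struct{} // closed once the sources are ready
}

func newWarmup() *warmup {
	return &warmup{
		entered: make(chan struct{}),
		release: make(chan struct{}),
		done:    make(chan struct{}),
	}
}

// Start records that startup began, then simulates slow source startup
// by waiting for release before closing done.
func (w *warmup) Start(ctx context.Context) error {
	w.started.Store(true)
	close(w.entered)
	select {
	case <-w.release:
		close(w.done)
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Ready reports whether the sources have finished starting.
func (w *warmup) Ready() bool {
	select {
	case <-w.done:
		return true
	default:
		return false
	}
}

// Wait blocks until the sources are ready or ctx is cancelled.
func (w *warmup) Wait(ctx context.Context) error {
	select {
	case <-w.done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo returns the started flag and readiness while Start is still in
// progress, and readiness after Wait has returned.
func demo() (startedEarly, readyEarly, readyAfterWait bool) {
	ctx := context.Background()
	w := newWarmup()
	go func() { _ = w.Start(ctx) }()

	<-w.entered // Start has begun, but the sources are not ready yet
	startedEarly = w.started.Load()
	readyEarly = w.Ready()

	close(w.release) // let the simulated sources finish
	_ = w.Wait(ctx)
	readyAfterWait = w.Ready()
	return
}

func main() {
	fmt.Println(demo()) // true false true
}
```

A check on `started` alone would let the controller proceed while `Ready` is still false; blocking on `Wait` closes that gap.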

Contributor Author

I will move it to right after the caches, but I don't fully understand the significance of having it after vs. before the non-leader-election runnables; mind explaining a bit?

Contributor Author

@godwinpang godwinpang Apr 14, 2025

> This needs to block until the Warmup has terminated. Otherwise we may end up starting the controller before the sources are started, because the check we added only verifies that we began starting the sources, not that we finished doing so.

Is this a big problem? Replicas only start the controller after they win leader election, so I don't see an issue in the leader election failover case. Are you saying that in non-leader-election cases, warmup should completely block controller startup?

@@ -314,6 +314,12 @@ type LeaderElectionRunnable interface {
	NeedLeaderElection() bool
}

// WarmupRunnable knows if a Runnable should be a warmup runnable.
type WarmupRunnable interface {
	// Warmup returns true if the Runnable should be run as warmup.
Member

This has no bool return

@@ -314,6 +314,12 @@ type LeaderElectionRunnable interface {
	NeedLeaderElection() bool
}

// WarmupRunnable knows if a Runnable should be a warmup runnable.
type WarmupRunnable interface {
Member

Please explain its purpose in the godoc.

@godwinpang
Contributor Author

/retest

@godwinpang
Contributor Author

/retest

@godwinpang
Contributor Author

/retest

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
4 participants