
retry on conflict when syncing work for binding controllers#7121

Open
AnupamSingh2004 wants to merge 1 commit into karmada-io:master from AnupamSingh2004:retry-on-conflict-binding-controllers

Conversation

@AnupamSingh2004

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR mitigates the issue where the binding_sync_work_duration_seconds metric records Kubernetes conflict errors (HTTP 409) as failures, causing false SLO violations and misleading availability metrics.

Problem:

  • Conflict errors are expected, retriable errors from Kubernetes optimistic concurrency control
  • Currently, every conflict error is recorded as result="error" in the metric
  • This causes false alerts and misleading dashboards (e.g., showing 85% availability when actual is >99%)
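To illustrate why these conflicts are expected rather than real failures, here is a minimal sketch of optimistic concurrency control. The `store`, `object`, and `errConflict` names are hypothetical stand-ins for the API server, a resource, and an HTTP 409 response; none of them come from the Karmada code base.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a hypothetical stand-in for the HTTP 409 Conflict returned
// when a write carries a stale resource version.
var errConflict = errors.New("conflict: object has been modified")

// object is a simplified resource carrying a version counter.
type object struct {
	ResourceVersion int
	Spec            string
}

// store is an in-memory stand-in for the API server (assumption for this sketch).
type store struct {
	current object
}

// Get returns a copy of the stored object.
func (s *store) Get() object { return s.current }

// Update accepts the write only if the caller saw the latest version.
func (s *store) Update(o object) error {
	if o.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

func main() {
	s := &store{current: object{ResourceVersion: 1, Spec: "a"}}

	// Two writers read the same version.
	first, second := s.Get(), s.Get()

	first.Spec = "b"
	fmt.Println("first writer:", s.Update(first))

	// The second writer still holds the old version, so its write is rejected.
	second.Spec = "c"
	err := s.Update(second)
	fmt.Println("second writer conflict:", errors.Is(err, errConflict))

	// Re-reading and reapplying the change resolves the conflict.
	fresh := s.Get()
	fresh.Spec = "c"
	fmt.Println("second writer after re-read:", s.Update(fresh))
}
```

Nothing is broken in this scenario: the second writer simply lost a race and succeeds on the next attempt, which is why counting the intermediate 409 as a failure skews the metric.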

Solution:
By wrapping ensureWork() with retry.RetryOnConflict(), conflict errors are automatically retried and only the final outcome is recorded in the metric:

  • If retries succeed → result="success"
  • If retries are exhausted, or a non-conflict error occurs → result="error"
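The two outcomes above can be sketched with a self-contained example. Note that `retryOnConflict` here is a simplified, hypothetical stand-in written for illustration (no backoff, fixed attempt count); it is not the client-go implementation, and `resultLabel`, `flaky`, `errConflict`, and `errOther` are likewise assumed names.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a hypothetical stand-in for a Kubernetes 409 Conflict error.
var errConflict = errors.New("conflict")

// errOther is a hypothetical non-retriable error.
var errOther = errors.New("boom")

// retryOnConflict is a simplified stand-in for retry.RetryOnConflict:
// it re-runs fn while fn returns a conflict, up to maxAttempts times.
// Unlike the real helper, it does not sleep between attempts.
func retryOnConflict(maxAttempts int, fn func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		err = fn()
		if !errors.Is(err, errConflict) {
			return err // success (nil) or a non-conflict error
		}
	}
	return err // still conflicting after the last attempt
}

// resultLabel mirrors the success/error label described in this PR.
func resultLabel(err error) string {
	if err != nil {
		return "error"
	}
	return "success"
}

// flaky returns a function that conflicts n times and then succeeds.
func flaky(n int) func() error {
	return func() error {
		if n > 0 {
			n--
			return errConflict
		}
		return nil
	}
}

func main() {
	// Two transient conflicts, then success: a single "success" observation.
	fmt.Println(resultLabel(retryOnConflict(5, flaky(2))))
	// Conflicts outlast the retry budget: a single "error" observation.
	fmt.Println(resultLabel(retryOnConflict(5, flaky(10))))
	// A non-conflict error is returned immediately without retrying.
	fmt.Println(resultLabel(retryOnConflict(5, func() error { return errOther })))
}
```

The point of the sketch is that the label is computed once, from the error returned by the whole retry loop, rather than once per attempt.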

This is the same approach used in PR #7106 for the execution-controller (work_sync_workload_duration_seconds metric).

Files changed:

| File | Change |
| --- | --- |
| pkg/controllers/binding/binding_controller.go | Wrap ensureWork with retry.RetryOnConflict |
| pkg/controllers/binding/cluster_resource_binding_controller.go | Wrap ensureWork with retry.RetryOnConflict |

Which issue(s) this PR fixes:

Part of #7111

Special notes for your reviewer:

This is a follow-up to #7106 which implemented the same pattern for the execution-controller. As requested by @jabellard in the issue discussion, the same fix is needed for the binding controllers.

The pattern follows exactly what was suggested:

start := time.Now()
err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
    return ensureWork(ctx, c.Client, c.ResourceInterpreter, workload, c.OverrideManager, binding, apiextensionsv1.NamespaceScoped)
})
metrics.ObserveSyncWorkLatency(err, start)
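
Because `start` is taken before the retry loop, the observed duration now includes any backoff sleeps between attempts. The rough bound below is a sketch only: the `backoff` type and `maxSleep` helper are hypothetical, and the values in `defaultRetry` (5 steps, 10ms base, factor 1.0, jitter 0.1) are what I understand `retry.DefaultRetry` to be, so please verify them against the client-go version in use.

```go
package main

import (
	"fmt"
	"time"
)

// backoff mirrors the wait.Backoff fields that matter for this estimate.
type backoff struct {
	Steps    int
	Duration time.Duration
	Factor   float64
	Jitter   float64
}

// defaultRetry holds values assumed to match retry.DefaultRetry.
var defaultRetry = backoff{Steps: 5, Duration: 10 * time.Millisecond, Factor: 1.0, Jitter: 0.1}

// maxSleep returns an upper bound on the total time spent sleeping between
// attempts when every attempt conflicts. It assumes one sleep between each
// pair of consecutive attempts and that jitter adds at most Jitter*Duration.
func maxSleep(b backoff) time.Duration {
	var total time.Duration
	d := b.Duration
	for i := 0; i < b.Steps-1; i++ {
		total += d + time.Duration(float64(d)*b.Jitter)
		d = time.Duration(float64(d) * b.Factor)
	}
	return total
}

func main() {
	fmt.Println("upper bound on added sleep:", maxSleep(defaultRetry))
}
```

Under these assumptions the added sleep time is small next to typical histogram buckets, but the time spent in each repeated ensureWork call is not included in this bound and is added on top.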

cc @jabellard @RainbowMango

Does this PR introduce a user-facing change?:

`Instrumentation`: The metric `binding_sync_work_duration_seconds` no longer counts retriable Kubernetes 409 conflicts as errors, improving availability accuracy and reducing false alerts.

Signed-off-by: AnupamSingh2004 <sanupam2004@gmail.com>
@karmada-bot karmada-bot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Jan 20, 2026
@karmada-bot
Contributor

Welcome @AnupamSingh2004! It looks like this is your first PR to karmada-io/karmada 🎉

@karmada-bot karmada-bot requested a review from Garrybest January 20, 2026 05:44
@karmada-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign whitewindmills for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot requested a review from seanlaii January 20, 2026 05:44
@gemini-code-assist

Summary of Changes

Hello @AnupamSingh2004, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines how Kubernetes conflict errors are handled within the binding controllers. By integrating a retry mechanism for optimistic concurrency conflicts, the system will now correctly process these transient errors, leading to more accurate binding_sync_work_duration_seconds metrics. This prevents false alarms and provides a clearer picture of system availability, aligning with best practices for robust Kubernetes controller development.

Highlights

  • Metric Accuracy Improvement: The binding_sync_work_duration_seconds metric will now accurately reflect the success/failure rate by no longer counting Kubernetes 409 conflict errors as immediate failures, preventing false SLO violations and misleading availability metrics.
  • Conflict Handling Implementation: The ensureWork() function calls within the binding controllers are now wrapped with retry.RetryOnConflict(), enabling automatic retries for operations that encounter Kubernetes optimistic concurrency conflicts.
  • Consistency with Existing Patterns: This change aligns with the conflict retry logic previously introduced in PR #7106 ("retry on conflict when syncing workload to member clusters") for the execution-controller, ensuring a consistent and robust approach across related components.

@karmada-bot karmada-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 20, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses the issue of Kubernetes conflict errors being incorrectly reported as failures in the binding_sync_work_duration_seconds metric. By wrapping the ensureWork call with retry.RetryOnConflict in both binding_controller.go and cluster_resource_binding_controller.go, the change ensures that conflicts are retried and only the final outcome is recorded, leading to more accurate availability metrics. The implementation is clean, follows a standard pattern for handling optimistic concurrency conflicts in Kubernetes controllers, and is consistent with similar changes made elsewhere in the project. The changes look good and I have no further comments.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.55%. Comparing base (8f8d39a) to head (8406814).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7121   +/-   ##
=======================================
  Coverage   46.54%   46.55%           
=======================================
  Files         700      700           
  Lines       48128    48132    +4     
=======================================
+ Hits        22403    22409    +6     
+ Misses      24040    24039    -1     
+ Partials     1685     1684    -1     
Flag Coverage Δ
unittests 46.55% <100.00%> (+<0.01%) ⬆️

Member

@RainbowMango RainbowMango left a comment

/assign

@jabellard
Member

/assign

@jabellard
Member

jabellard commented Jan 20, 2026

Generally looks good to me.

/lgtm

/cc @RainbowMango

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 20, 2026
@karmada-bot
Contributor

@jabellard: GitHub didn't allow me to request PR reviews from the following users: for, another, look.

Note that only karmada-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

Generally looks good to me.

/lgtm

/cc @RainbowMango for another look

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Member

@RainbowMango RainbowMango left a comment

/hold
wait for confirmation from #7111 (comment)

@karmada-bot karmada-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 21, 2026

Labels

  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
  • lgtm: Indicates that a PR is ready to be merged.
  • size/S: Denotes a PR that changes 10-29 lines, ignoring generated files.

5 participants