Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 reconciler/apibinding: make group resources of APIBindings + CRD race-free #3251

Merged
merged 3 commits into from
Feb 23, 2025

Conversation

sttts
Copy link
Member

@sttts sttts commented Jan 16, 2025

Summary

The conflict checker in the APIBinding reconciler computed a list of CRDs, both real and bound ones. It took the APIExport as reference instead of the bindings directly. This seems wrong. If a schema is removed from an export, or the whole export is deleted, the binding still exists, but it didn't contribute the conflict checker anymore.

This PR does the following:

  1. make the conflict checker use the APIBinding.status.boundResources as reference.
  2. store group resource bindings in an apis.kcp.io/resource-bindings annotation on the LogicalCluster. It records the APIBinding or whether it is a CRD. This annotation (a JSON map) is updated BEFORE (a) a CRD is created (admission) or (b) an APIBinding bind the group resource:
    internal.apis.kcp.io/resource-bindings: '{"partitions.topology.kcp.io":{"n":"topology.kcp.io"},"partitionsets.topology.kcp.io":{"n":"topology.kcp.io"},"shards.core.kcp.io":{"n":"shards.core.kcp.io"},"workspaces.tenancy.kcp.io":{"n":"tenancy.kcp.io"},"workspacetypes.tenancy.kcp.io":{"n":"tenancy.kcp.io"}}'

The tricky case is that we can only update the annotation during CRD creation via admission. And creation can fail. Then we have a dangling group resource reservation. This PR adds an expery timestamp to the group resource entries for CRD which matches the maximum request duration of 30s. For that time an APIBinding of the same group resource would fail (delayed). Don't know a better way to solve that.

Downside: CRD creation implies update of the LogicalCluster, with potential for conflicts. This might make CRD creation slower. So if you install 1000 CRDs, this PR will have an impact.

Related issue(s)

Fixes #3252

Release Notes

Fix critical race condition between APIBindings and CRDs potentially allowing the same resource to be bound by multiple bindings or CRDs, leading to data loss or inconsistent state.

@kcp-ci-bot kcp-ci-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the DCO. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 16, 2025
@sttts sttts force-pushed the sttts-multiple-identities branch 3 times, most recently from 90885fe to e338b59 Compare January 18, 2025 20:13
@kcp-ci-bot kcp-ci-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 18, 2025
@sttts sttts changed the title WIP: 🐛 reconciler/apibinding: check conflict via bound APIs in bindings, not exports WIP: 🐛 reconciler/apibinding: make group resources of APIBindings + CRD race-free Jan 18, 2025
@sttts sttts force-pushed the sttts-multiple-identities branch 4 times, most recently from 93f12c1 to 8ba5890 Compare January 20, 2025 10:04
@sttts sttts changed the title WIP: 🐛 reconciler/apibinding: make group resources of APIBindings + CRD race-free 🐛 reconciler/apibinding: make group resources of APIBindings + CRD race-free Jan 20, 2025
@kcp-ci-bot kcp-ci-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 20, 2025
@embik
Copy link
Member

embik commented Jan 20, 2025

/retest

@sttts sttts force-pushed the sttts-multiple-identities branch 4 times, most recently from 3f47600 to facb4eb Compare January 20, 2025 13:24
@kcp-ci-bot kcp-ci-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 20, 2025
@mjudeikis
Copy link
Contributor

/build-image

@kcp-ci-bot kcp-ci-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2025
@embik embik added this to the v0.27.0 milestone Jan 31, 2025
@yastij
Copy link

yastij commented Feb 6, 2025

/cc

@kcp-ci-bot kcp-ci-bot requested a review from yastij February 6, 2025 10:26
Copy link
Contributor

@mjudeikis mjudeikis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks ok. Just nits

@@ -78,35 +97,60 @@ func (p *crdNoOverlappingGVRAdmission) Validate(ctx context.Context, a admission
if a.GetKind().GroupKind() != apiextensions.Kind("CustomResourceDefinition") {
return nil
}
cluster, err := request.ClusterNameFrom(ctx)
if a.GetOperation() != admission.Create {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to care about server side patch to create?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should also be a create, no?

Looking on the possible values:

	Create  Operation = "CREATE"
	Update  Operation = "UPDATE"
	Delete  Operation = "DELETE"
	Connect Operation = "CONNECT"

@embik
Copy link
Member

embik commented Feb 10, 2025

Question: Couldn't we use leases.coordination.k8s.io objects for the locks we need? That would eliminate the potential conflicts on the LogicalCluster object. 🤔

@sttts
Copy link
Member Author

sttts commented Feb 23, 2025

Question: Couldn't we use leases.coordination.k8s.io objects for the locks we need? That would eliminate the potential conflicts on the LogicalCluster object. 🤔

Don't we have that already for all of our controllers?

Signed-off-by: Dr. Stefan Schimanski <[email protected]>
@sttts sttts force-pushed the sttts-multiple-identities branch from facb4eb to c132269 Compare February 23, 2025 15:27
@kcp-ci-bot kcp-ci-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 23, 2025
@mjudeikis
Copy link
Contributor

/lgtm
/approve

Feeling lucky today :)

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 23, 2025
@kcp-ci-bot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 30533bd26f109dab9b0de99549fe3a3d28dbc860

@kcp-ci-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mjudeikis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kcp-ci-bot kcp-ci-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 23, 2025
@kcp-ci-bot kcp-ci-bot merged commit c14535a into kcp-dev:main Feb 23, 2025
16 checks passed
@embik
Copy link
Member

embik commented Feb 24, 2025

/kind bug

@kcp-ci-bot kcp-ci-bot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has signed the DCO. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: conflicting bound CRDs can exist in the same Workspace, resulting in non-deterministic behavior
5 participants