Description
Bug Report
What happened:
I have identified what I believe to be a possible race condition in COSI's API/proto spec that allows a backend resource to be orphaned without cleanup.
What you expected to happen:
Backend resource orphans should be minimized as much as possible.
How to reproduce this bug (as minimally and precisely as possible):
Imagine this case:
- A new Bucket is created
- The `DriverCreateBucket` gRPC call is made, and it returns successfully with a `bucket_id`
- When COSI records the `bucket_id` into `Bucket.status`, a temporary kube API error occurs
- Simultaneously, the Bucket resource is deleted
- The next time COSI reconciles the Bucket, it has a deletion timestamp, and the `bucket_id` is not recorded on the Bucket. Because there is no `bucket_id`, `DriverDeleteBucket` cannot be called, but the Bucket resource can be cleaned up successfully, with COSI believing the previous provision to have been unsuccessful.
I'm quite certain that this corner case is possible; however, it seems like it would be extremely rare in production.
Anything else relevant for this bug report?:
I looked at the analogous CSI spec to see if it has notes or different info tracking.
From a high-level and naive view, it appears that CSI also has the potential to encounter this same race condition. However, this assumes the Kubernetes PV reconciler does not handle this case with special internal logic. It's hard for me to imagine that sig-storage hasn't identified this and mitigated it or at least discussed it.
I have 3 follow-up questions to help manage this in COSI:
- Is this corner case infrequent enough that COSI need not prioritize its resolution?
- How does Kubernetes/CSI mitigate this corner case?
- Is it reasonable to update COSI's KEP to use the `bucket_name` idempotency key used for `DriverCreateBucket`, instead of `bucket_id`, when issuing `DriverDeleteBucket`, to avoid this race condition?
Resources and logs to submit:
N/A
Copy all relevant COSI resources here in yaml format:
# BucketClass
# BucketAccessClass
# BucketClaim
# Bucket
# BucketAccess

# Copy COSI controller pod logs here
# Copy COSI sidecar logs here for the relevant driver
Environment:
- Kubernetes version (use `kubectl version`), please list client and server:
- Sidecar version (provide the release tag or commit hash):
- Provisioner name and version (provide the release tag or commit hash):
- Cloud provider or hardware configuration:
- OS (e.g. `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others: