
gRPC resource provision-deprovision race condition #227

@BlaineEXE

Description


Bug Report

What happened:

I have identified what I believe to be a possible race condition in COSI's API/proto spec that can leave a backend resource orphaned without cleanup.

What you expected to happen:

Backend resource orphans should be minimized as much as possible.

How to reproduce this bug (as minimally and precisely as possible):

Imagine this case:

  1. A new Bucket is created
  2. The DriverCreateBucket gRPC call is made, and it returns successfully, with a bucket_id
  3. When COSI records the bucket_id into Bucket.status, a temporary kube API error occurs
  4. Simultaneously, the Bucket resource is deleted
  5. The next time COSI reconciles the Bucket, the resource has a deletion timestamp, but no bucket_id is recorded on it. Because there is no bucket_id, DriverDeleteBucket cannot be called. The Bucket resource is nevertheless cleaned up successfully, with COSI believing the previous provision was unsuccessful, leaving the backend bucket orphaned.

I'm quite certain that this corner case is possible; however, it seems like it would be extremely rare in production.
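The sequence above can be sketched with a simplified, hypothetical model of the controller flow. The types and function names here (Bucket, driverCreateBucket, recordBucketID) are illustrative stand-ins, not the actual COSI code:

```go
package main

import (
	"errors"
	"fmt"
)

// Bucket is a toy stand-in for the Bucket custom resource.
type Bucket struct {
	Name              string
	DeletionTimestamp bool   // set when the Bucket resource is deleted
	BucketID          string // recorded in Bucket.status after provisioning
}

// driverCreateBucket stands in for the DriverCreateBucket gRPC call;
// after it returns, the backend resource exists.
func driverCreateBucket(name string) string {
	return "backend-id-for-" + name
}

// recordBucketID stands in for the kube API status update, which can
// fail transiently (step 3 above).
func recordBucketID(b *Bucket, id string, apiHealthy bool) error {
	if !apiHealthy {
		return errors.New("temporary kube API error")
	}
	b.BucketID = id
	return nil
}

func main() {
	b := &Bucket{Name: "my-bucket"}

	// Step 2: provisioning succeeds on the backend.
	id := driverCreateBucket(b.Name)

	// Step 3: recording bucket_id into Bucket.status fails transiently.
	if err := recordBucketID(b, id, false); err != nil {
		fmt.Println("status update failed:", err)
	}

	// Step 4: the Bucket resource is deleted concurrently.
	b.DeletionTimestamp = true

	// Step 5: on the next reconcile there is no bucket_id, so
	// DriverDeleteBucket cannot be called; the backend bucket is orphaned.
	if b.DeletionTimestamp && b.BucketID == "" {
		fmt.Println("orphaned backend resource:", id)
	}
}
```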

Anything else relevant for this bug report?:

I looked at the analogous CSI spec to see if it has notes or different info tracking.

From a high-level and naive view, it appears that CSI also has the potential to encounter this same race condition. However, this assumes the Kubernetes PV reconciler does not handle this case with special internal logic. It's hard for me to imagine that sig-storage hasn't identified this and mitigated it or at least discussed it.

I have 3 follow-up questions to help manage this in COSI:

  1. Is this corner case infrequent enough that COSI need not prioritize its resolution?
  2. How does Kubernetes/CSI mitigate this corner case?
  3. Is it reasonable to update COSI's KEP so that DriverDeleteBucket is keyed on the bucket_name idempotency key used for DriverCreateBucket, instead of bucket_id, to avoid this race condition?
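To illustrate question 3, here is a hedged sketch assuming a hypothetical driver backend keyed by bucket_name (the driver type and method names are invented for illustration, not the COSI API). Because the create key is supplied by COSI rather than returned by the driver, a name-keyed delete can run even when bucket_id was never recorded, and is naturally idempotent:

```go
package main

import "fmt"

// driver models a hypothetical backend that indexes resources by the
// bucket_name idempotency key COSI already passes to DriverCreateBucket.
type driver struct {
	backend map[string]string // bucket_name -> backend resource id
}

func (d *driver) createBucket(name string) string {
	id := "backend-id-for-" + name
	d.backend[name] = id
	return id
}

// deleteBucketByName needs no bucket_id, so it closes the orphan window
// described above; deleting an absent name is a no-op (idempotent).
func (d *driver) deleteBucketByName(name string) {
	delete(d.backend, name)
}

func main() {
	d := &driver{backend: map[string]string{}}
	d.createBucket("my-bucket")

	// Even if bucket_id was lost before being recorded in Bucket.status,
	// deletion by the original bucket_name still cleans up the backend.
	d.deleteBucketByName("my-bucket")
	fmt.Println("remaining backend buckets:", len(d.backend))
}
```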

Resources and logs to submit:
N/A

Copy all relevant COSI resources here in yaml format:

# BucketClass
# BucketAccessClass
# BucketClaim
# Bucket
# BucketAccess
# Copy COSI controller pod logs here
# Copy COSI sidecar logs here for the relevant driver

Environment:

  • Kubernetes version (use kubectl version), please list client and server:
  • Sidecar version (provide the release tag or commit hash):
  • Provisioner name and version (provide the release tag or commit hash):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

Metadata

    Labels

    api/v1alpha2: Issue reported against or feature request for v1alpha2 API
    kind/bug: Categorizes issue or PR as related to a bug.
    priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
