
gRPC resource provision-deprovision race condition #227

@BlaineEXE

Description


Bug Report

What happened:

I have identified what I believe to be a possible race condition in COSI's API/proto spec that can leave a backend resource orphaned without cleanup.

What you expected to happen:

Backend resource orphans should be minimized as much as possible.

How to reproduce this bug (as minimally and precisely as possible):

Imagine this case:

  1. A new Bucket is created
  2. The DriverCreateBucket gRPC call is made, and it returns successfully, with a bucket_id
  3. When COSI records the bucket_id into Bucket.status, a temporary kube API error occurs
  4. Simultaneously, the Bucket resource is deleted
  5. The next time COSI reconciles the Bucket, the resource has a deletion timestamp, but no bucket_id is recorded on it. Because there is no bucket_id, DriverDeleteBucket cannot be called. The Bucket resource is nevertheless cleaned up successfully, with COSI believing the previous provision was unsuccessful, leaving the backend bucket orphaned.

I'm quite certain that this corner case is possible; however, it seems like it would be extremely rare in production.
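The sequence above can be sketched with a simplified, hypothetical model of the controller flow. The types and function names here (Bucket, driverCreateBucket, recordBucketID) are illustrative stand-ins, not the actual COSI code:

```go
package main

import (
	"errors"
	"fmt"
)

// Bucket is a toy stand-in for the Bucket custom resource.
type Bucket struct {
	Name              string
	DeletionTimestamp bool   // set when the Bucket resource is deleted
	BucketID          string // recorded in Bucket.status after provisioning
}

// driverCreateBucket stands in for the DriverCreateBucket gRPC call;
// after it returns, the backend resource exists.
func driverCreateBucket(name string) string {
	return "backend-id-for-" + name
}

// recordBucketID stands in for the kube API status update, which can
// fail transiently (step 3 above).
func recordBucketID(b *Bucket, id string, apiHealthy bool) error {
	if !apiHealthy {
		return errors.New("temporary kube API error")
	}
	b.BucketID = id
	return nil
}

func main() {
	b := &Bucket{Name: "my-bucket"}

	// Step 2: provisioning succeeds on the backend.
	id := driverCreateBucket(b.Name)

	// Step 3: recording bucket_id into Bucket.status fails transiently.
	if err := recordBucketID(b, id, false); err != nil {
		fmt.Println("status update failed:", err)
	}

	// Step 4: the Bucket resource is deleted concurrently.
	b.DeletionTimestamp = true

	// Step 5: on the next reconcile there is no bucket_id, so
	// DriverDeleteBucket cannot be called; the backend bucket is orphaned.
	if b.DeletionTimestamp && b.BucketID == "" {
		fmt.Println("orphaned backend resource:", id)
	}
}
```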

Anything else relevant for this bug report?:

I looked at the analogous CSI spec to see if it has notes or different info tracking.

From a high-level and naive view, it appears that CSI also has the potential to encounter this same race condition. However, this assumes the Kubernetes PV reconciler does not handle this case with special internal logic. It's hard for me to imagine that sig-storage hasn't identified this and mitigated it or at least discussed it.

I have 3 follow-up questions to help manage this in COSI:

  1. Is this corner case infrequent enough that COSI need not prioritize its resolution?
  2. How does Kubernetes/CSI mitigate this corner case?
  3. Is it reasonable to update COSI's KEP so that DriverDeleteBucket is keyed on the bucket_name idempotency key used for DriverCreateBucket, instead of bucket_id, to avoid this race condition?
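To illustrate question 3, here is a hedged sketch assuming a hypothetical driver backend keyed by bucket_name (the driver type and method names are invented for illustration, not the COSI API). Because the create key is supplied by COSI rather than returned by the driver, a name-keyed delete can run even when bucket_id was never recorded, and is naturally idempotent:

```go
package main

import "fmt"

// driver models a hypothetical backend that indexes resources by the
// bucket_name idempotency key COSI already passes to DriverCreateBucket.
type driver struct {
	backend map[string]string // bucket_name -> backend resource id
}

func (d *driver) createBucket(name string) string {
	id := "backend-id-for-" + name
	d.backend[name] = id
	return id
}

// deleteBucketByName needs no bucket_id, so it closes the orphan window
// described above; deleting an absent name is a no-op (idempotent).
func (d *driver) deleteBucketByName(name string) {
	delete(d.backend, name)
}

func main() {
	d := &driver{backend: map[string]string{}}
	d.createBucket("my-bucket")

	// Even if bucket_id was lost before being recorded in Bucket.status,
	// deletion by the original bucket_name still cleans up the backend.
	d.deleteBucketByName("my-bucket")
	fmt.Println("remaining backend buckets:", len(d.backend))
}
```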

Resources and logs to submit:
N/A

Copy all relevant COSI resources here in yaml format:

# BucketClass
# BucketAccessClass
# BucketClaim
# Bucket
# BucketAccess
# Copy COSI controller pod logs here
# Copy COSI sidecar logs here for the relevant driver

Environment:

  • Kubernetes version (use kubectl version), please list client and server:
  • Sidecar version (provide the release tag or commit hash):
  • Provisioner name and version (provide the release tag or commit hash):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

Metadata

    Labels

    api/v1alpha2: Issue reported against or feature request for v1alpha2 API
    kind/bug: Categorizes issue or PR as related to a bug.
    priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
