Workspace pod may stall during rollout #801

Open
EronWright opened this issue Feb 4, 2025 · 3 comments · May be fixed by #802
Labels
kind/bug: Some behavior is incorrect or out of spec
needs-triage: Needs attention from the triage team

Comments

@EronWright
Contributor

EronWright commented Feb 4, 2025

What happened?

When the workspace spec is undeployable, e.g. due to an invalid docker image, the rollout fails. Unfortunately, attempts to update the spec aren't effective at unblocking the system, due to a limitation of StatefulSet:

https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback

One must manually delete the workspace pod to unblock the system.

See also:

Example

Lots of ways to trigger this:

  • an invalid pod specification (e.g. invalid image)
  • unable to fetch the program source (e.g. from git), leading to pod startup failure

Output of pulumi about

PKO v2.0.0-beta.3

Additional context

Let's keep in mind some requirements.

  1. The system should respect the pod termination grace period when updating or deleting the pod, to gracefully cancel any in-flight Pulumi operation.
  2. The system should preserve the persistent volume(s) of the workspace pod, in case the user makes use of PVs.
  3. The system should wipe the temporary directory within the pod.

A possible solution may be to use the "parallel" pod management strategy:
https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#parallel-pod-management
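
For illustration, the requirements above and the proposed strategy map onto StatefulSet fields roughly as follows. This is a minimal sketch under stated assumptions, not the operator's actual manifest: the names, image, mount paths, sizes, and grace period are hypothetical placeholders.

```yaml
# Hypothetical workspace StatefulSet; every name and value is a placeholder.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-workspace
spec:
  serviceName: example-workspace
  replicas: 1
  # Proposed change: the default policy is OrderedReady.
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: example-workspace
  template:
    metadata:
      labels:
        app: example-workspace
    spec:
      # Requirement 1: time the container gets to shut down after SIGTERM,
      # so an in-flight operation can be cancelled gracefully.
      terminationGracePeriodSeconds: 600
      containers:
        - name: workspace
          image: pulumi/pulumi:latest   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data
            - name: tmp
              mountPath: /tmp
      volumes:
        # Requirement 3: an emptyDir is deleted together with the pod,
        # so a replacement pod starts with an empty temporary directory.
        - name: tmp
          emptyDir: {}
  # Requirement 2: claims created from these templates are retained by
  # default when the pod is deleted and recreated.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```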

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@EronWright
Contributor Author

A k3d cluster configuration for testing the behavior of Parallel with the MaxUnavailableStatefulSet feature gate.

apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: issue-801
options:
  k3s:
    extraArgs:
      - arg: "--kube-apiserver-arg=feature-gates=MaxUnavailableStatefulSet=true"
        nodeFilters:
          - server:*
      - arg: "--kube-scheduler-arg=feature-gates=MaxUnavailableStatefulSet=true"
        nodeFilters:
          - server:*
      - arg: "--kubelet-arg=feature-gates=MaxUnavailableStatefulSet=true"
        nodeFilters:
          - agent:*

@EronWright
Contributor Author

I experimented with Parallel strategy, and it does seem effective even when the MaxUnavailableStatefulSet feature gate is enabled.

I also experimented with the OnDelete strategy, and found it difficult to implement for the following reasons:

  1. To delete the pod, one must check its revision hash (the controller-revision-hash label) to know whether it is the old pod or the new pod.
  2. The workspace controller uses a cache to get pods, so it would need a watch on Pod to reliably observe (and delete) the old pod.
  3. When one makes an update to a StatefulSet, one doesn't immediately know whether a new revision was triggered; one must either watch for status changes or second-guess the StatefulSet controller.
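
To make points 1 and 3 concrete, here is a sketch of the fields a controller would need to compare under OnDelete. The object names and hash values are made up for illustration; the field names are the standard StatefulSet status fields and the controller-revision-hash pod label.

```yaml
# Excerpt of a StatefulSet using OnDelete (names and hashes are hypothetical).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-workspace
  generation: 4
spec:
  updateStrategy:
    type: OnDelete
status:
  # If observedGeneration is behind metadata.generation, the StatefulSet
  # controller has not yet processed the latest spec change.
  observedGeneration: 4
  currentRevision: example-workspace-5d4f8c7b9
  # Differs from currentRevision once a new revision has been created.
  updateRevision: example-workspace-7c9b6d5f4
---
# Excerpt of the existing pod. Its label still names the old revision, so
# under OnDelete it stays in place until something deletes it.
apiVersion: v1
kind: Pod
metadata:
  name: example-workspace-0
  labels:
    controller-revision-hash: example-workspace-5d4f8c7b9
```

In this sketch the pod counts as stale because its label does not match status.updateRevision, which is the comparison the workspace controller would have to make before deleting it.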
