Bug: Timeouts when loading up jupyterhub singleuser images #3686

Open
vevetron opened this issue Feb 10, 2025 · 9 comments · Fixed by #3687

@vevetron (Contributor):

Describe the bug
Some users in some Kubernetes zones are having trouble spawning their pods; the spawn times out after a long wait. It happens most often to @amandaha8.

I don't think it's related to spinning up a new node; it seems to be related to taking too long to attach the user's data volume.

To Reproduce
It only happens sometimes and then goes away on its own.

Expected behavior
Spawns GREAAATT

@vevetron (Contributor, author):

Next time this bug happens, follow this thread:
https://discourse.jupyter.org/t/spawn-failed-timeout-even-when-start-timeout-is-set-to-3600-seconds/8098/7

and enable debug logging:

debug:
  enabled: true

For now, try increasing the timeouts.
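A minimal sketch of what that could look like with helm (the release name "jupyterhub", namespace "jhub", and the 3600-second value are placeholders, not our actual settings):

```sh
# Sketch only: enable z2jh debug logging and raise the single-user spawn timeout.
helm upgrade jupyterhub jupyterhub/jupyterhub \
  --namespace jhub \
  --reuse-values \
  --set debug.enabled=true \
  --set singleuser.startTimeout=3600
```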

@vevetron (Contributor, author):

[screenshot attached]

@vevetron (Contributor, author) commented Feb 10, 2025:

[screenshot attached]

This might be related:

error killing pod: [failed to "KillContainer" for "notebook" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "7189b427-3927-4237-8a63-7b111abe853f" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

@vevetron (Contributor, author):

Try these debug commands if things fail again:
https://z2jh.jupyter.org/en/stable/administrator/debug.html
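Roughly the loop I'd expect to run from that page (the namespace and pod names below are guesses for our cluster, not copied from it):

```sh
# Sketch: the basic z2jh debugging commands.
kubectl get pods -n jhub                                          # is the user's jupyter-<username> pod Pending or Terminating?
kubectl describe pod jupyter-amandaha8 -n jhub                    # Events at the bottom show scheduling / volume-attach failures
kubectl logs deploy/hub -n jhub | tail -n 100                     # the hub / KubeSpawner view of the spawn attempt
kubectl get events -n jhub --sort-by=.lastTimestamp | tail -n 30
```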

@vevetron (Contributor, author) commented Feb 11, 2025:

I turned on debugging and then tried to start Amanda's pod. There are two new error messages, perhaps because I turned on debugging, or perhaps because I increased the start timeout to a much larger value.

[Warning] Multi-Attach error for volume "pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2" Volume is already exclusively attached to one node and can't be attached to another
[Warning] error killing pod: [failed to "KillContainer" for "notebook" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "7189b427-3927-4237-8a63-7b111abe853f" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

(I'm not certain I turned on debugging correctly, but it didn't throw any errors.)

Notes from @themightychris:

This is the key thing:
Volume is already exclusively attached to one node and can't be attached to another
Each analyst env is bound to a persistent volume. Those volumes can't be attached to more than one pod at a time or across zones, so that indicates there's already a pod trying to run with Amanda's volume.
When I was debugging by manually spinning up envs, I'd often comment out the volume mount. That isolates you from the volume mounting, which might be good or bad, since that's a major source of the instance spin-up challenges.
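As a follow-up for next time, a rough sketch of how we might check what is still holding a user's volume before retrying the spawn (the PV name is the one from the Multi-Attach error above; the namespace and pod names are assumptions):

```sh
# Sketch: find what still holds the volume.
kubectl get pv pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2                          # status + bound claim
kubectl get volumeattachments | grep pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2    # which node has it attached
kubectl get pods -n jhub -o wide | grep amandaha8                                # is an old jupyter-* pod still Terminating?
kubectl delete pod <stuck-pod-name> -n jhub --grace-period=0 --force             # last resort if the old pod won't die
```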

@vevetron (Contributor, author) commented Feb 11, 2025:

One idea: issues with containerd. I remember we had serious log problems with containerd before:

containerd/containerd#8847
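If we want to rule containerd out next time, something like this might help (assumes "kubectl debug node" is allowed in our cluster and that journalctl/crictl exist on the GKE node image; "<affected-node>" is a placeholder):

```sh
# Sketch: peek at containerd on the affected node.
kubectl describe node <affected-node>                              # node conditions + recent events
kubectl debug node/<affected-node> -it --image=busybox
# inside the debug pod:
chroot /host
journalctl -u containerd --since "2 hours ago" | grep -i deadline  # look for the DeadlineExceeded kills
crictl ps -a | grep notebook                                       # containers containerd still thinks exist
```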

@vevetron (Contributor, author):

[screenshot attached]

This bug pops up a lot for @amandaha8.

@lottspot (Contributor):

https://cloudlogging.app.goo.gl/xYoHuNueYvuemQcNA

[screenshot attached]

The pod logs are totally spammed with these errors. The notebook container seems to spend most of its time in a state which cannot be force terminated, which is undoubtedly the source of the symptom. Why it is unkillable is less clear at the moment.

When the notebook container does finally restart itself, it throws tons of errors from certain extensions which fail to link. At least some of the errors are related to the runtime being incompatible, which makes me wonder if there is an upgrade we can take which may also help with this issue.
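To narrow down the unkillable state, something like this might show what kubelet thinks the notebook container is doing (the pod and namespace names are placeholders; the container name "notebook" comes from the KillContainer errors above):

```sh
# Sketch: inspect the stuck notebook container's reported state.
kubectl get pod jupyter-amandaha8 -n jhub \
  -o jsonpath='{.status.containerStatuses[?(@.name=="notebook")].state}'
kubectl get pod jupyter-amandaha8 -n jhub \
  -o jsonpath='{.status.containerStatuses[?(@.name=="notebook")].lastState.terminated}'
kubectl describe pod jupyter-amandaha8 -n jhub      # Events section repeats the KillContainer failures
```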

@vevetron (Contributor, author):

I deleted amanda8's PVC. Her PV (pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2) should have disappeared, but it stayed in a "Released" state.

The "error killing pod" messages continued.

I started to see error messages like:

apiVersion: v1
kind: Event
metadata:
  name: pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2.1823f7345927fe19
  namespace: default
  uid: c82dc6a6-ef7a-43e9-90ee-3faa13665cd7
  resourceVersion: '8494020'
  creationTimestamp: '2025-02-14T04:01:25Z'
  selfLink: >-
    /api/v1/namespaces/default/events/pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2.1823f7345927fe19
involvedObject:
  kind: PersistentVolume
  name: pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2
  uid: eec8d300-cafc-453e-88c5-583be96b20f1
  apiVersion: v1
  resourceVersion: '1227599054'
reason: VolumeFailedDelete
message: >-
  persistentvolume pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2 is still attached to node
  gke-data-infra-apps-jupyterhub-users-6aa76dbb-b03b
source:
  component: >-
    pd.csi.storage.gke.io_gke-0fe1e974c7ae431b8906-4d25-25c2-vm_7faf716d-bff7-44d2-bfbd-7ea6b15eb27d
firstTimestamp: '2025-02-14T04:01:25Z'
lastTimestamp: '2025-02-14T04:02:28Z'
count: 7
type: Warning
eventTime: null
reportingComponent: >-
  pd.csi.storage.gke.io_gke-0fe1e974c7ae431b8906-4d25-25c2-vm_7faf716d-bff7-44d2-bfbd-7ea6b15eb27d
reportingInstance: ''

All the issues are with node gke-data-infra-apps-jupyterhub-users-6aa76dbb-b03b?

After a while, I started amanda8's profile in JupyterHub; it attached a new volume, and it attached quite quickly.

The KillContainerError messages haven't stopped.

I was easily able to delete amanda8's newest PVC, and the related PV disappeared as well.
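For next time, a rough sketch of how to check whether a Released PV is still attached to a node. The delete-volumeattachment step forces the CSI driver to re-detach; it is my assumption about a last resort, not something I ran here.

```sh
# Sketch: inspect a Released PV that refuses to go away.
kubectl get pv pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2 -o yaml | grep -A5 finalizers
kubectl get volumeattachments | grep pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2
# last resort, only if nothing is actually using the disk:
kubectl delete volumeattachment <attachment-name-from-previous-command>
```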
