Bug: Timeouts when loading up jupyterhub singleuser images #3686

Open
vevetron opened this issue Feb 10, 2025 · 9 comments · Fixed by #3687

@vevetron (Contributor):

Describe the bug
Some users in some Kubernetes zones are having trouble spawning their pods; the spawn times out after a long wait. It happens most often to @amandaha8.

I don't think it's related to spinning up a new node; it seems to be related to taking too long to attach the user's data volume.

To Reproduce
It only happens sometimes and then goes away on its own.

Expected behavior
Spawns GREAAATT

@vevetron (Contributor, author):

Next time this bug happens, follow this thread:
https://discourse.jupyter.org/t/spawn-failed-timeout-even-when-start-timeout-is-set-to-3600-seconds/8098/7

and enable debug logging:

debug:
  enabled: true

For now, try increasing the timeouts.
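A minimal sketch of what that could look like with helm (the release name "jupyterhub", namespace "jhub", and the 3600-second value are placeholders, not our actual settings):

```sh
# Sketch only: enable z2jh debug logging and raise the single-user spawn timeout.
helm upgrade jupyterhub jupyterhub/jupyterhub \
  --namespace jhub \
  --reuse-values \
  --set debug.enabled=true \
  --set singleuser.startTimeout=3600
```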

@vevetron (Contributor, author):

[screenshot attached]

@vevetron (Contributor, author) commented Feb 10, 2025:

[screenshot attached]

This might be related:

error killing pod: [failed to "KillContainer" for "notebook" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "7189b427-3927-4237-8a63-7b111abe853f" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

@vevetron (Contributor, author):

Try these debug commands if things fail again:
https://z2jh.jupyter.org/en/stable/administrator/debug.html
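Roughly the loop I'd expect to run from that page (the namespace and pod names below are guesses for our cluster, not copied from it):

```sh
# Sketch: the basic z2jh debugging commands.
kubectl get pods -n jhub                                          # is the user's jupyter-<username> pod Pending or Terminating?
kubectl describe pod jupyter-amandaha8 -n jhub                    # Events at the bottom show scheduling / volume-attach failures
kubectl logs deploy/hub -n jhub | tail -n 100                     # the hub / KubeSpawner view of the spawn attempt
kubectl get events -n jhub --sort-by=.lastTimestamp | tail -n 30
```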

@vevetron (Contributor, author) commented Feb 11, 2025:

I turned on debugging and then tried to start Amanda's pod. There are two new error messages, perhaps because I turned on debugging, or perhaps because I increased the start timeout to a much larger value.

[Warning] Multi-Attach error for volume "pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2" Volume is already exclusively attached to one node and can't be attached to another
[Warning] error killing pod: [failed to "KillContainer" for "notebook" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "7189b427-3927-4237-8a63-7b111abe853f" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

(I'm not certain I turned on debugging correctly, but it didn't throw any errors.)

Notes from @themightychris:

This is the key thing:
Volume is already exclusively attached to one node and can't be attached to another
Each analyst env is bound to a persistent volume. Those volumes can't be attached to more than one pod at a time or across zones, so that indicates there's already a pod trying to run with Amanda's volume.
When I was debugging by manually spinning up envs, I'd often comment out the volume mount. That isolates you from the volume mounting, which might be good or bad, since that's a major source of the instance spin-up challenges.
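As a follow-up for next time, a rough sketch of how we might check what is still holding a user's volume before retrying the spawn (the PV name is the one from the Multi-Attach error above; the namespace and pod names are assumptions):

```sh
# Sketch: find what still holds the volume.
kubectl get pv pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2                          # status + bound claim
kubectl get volumeattachments | grep pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2    # which node has it attached
kubectl get pods -n jhub -o wide | grep amandaha8                                # is an old jupyter-* pod still Terminating?
kubectl delete pod <stuck-pod-name> -n jhub --grace-period=0 --force             # last resort if the old pod won't die
```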

@vevetron (Contributor, author) commented Feb 11, 2025:

One idea: issues with containerd. I remember we had serious log problems with containerd before:

containerd/containerd#8847
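If we want to rule containerd out next time, something like this might help (assumes "kubectl debug node" is allowed in our cluster and that journalctl/crictl exist on the GKE node image; "<affected-node>" is a placeholder):

```sh
# Sketch: peek at containerd on the affected node.
kubectl describe node <affected-node>                              # node conditions + recent events
kubectl debug node/<affected-node> -it --image=busybox
# inside the debug pod:
chroot /host
journalctl -u containerd --since "2 hours ago" | grep -i deadline  # look for the DeadlineExceeded kills
crictl ps -a | grep notebook                                       # containers containerd still thinks exist
```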

@vevetron (Contributor, author):

[screenshot attached]

This bug pops up a lot for @amandaha8.

@lottspot (Contributor):

https://cloudlogging.app.goo.gl/xYoHuNueYvuemQcNA

[screenshot attached]

The pod logs are totally spammed with these errors. The notebook container seems to spend most of its time in a state which cannot be force terminated, which is undoubtedly the source of the symptom. Why it is unkillable is less clear at the moment.

When the notebook container does finally restart itself, it throws tons of errors from certain extensions which fail to link. At least some of the errors are related to the runtime being incompatible, which makes me wonder if there is an upgrade we can take which may also help with this issue.
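To narrow down the unkillable state, something like this might show what kubelet thinks the notebook container is doing (the pod and namespace names are placeholders; the container name "notebook" comes from the KillContainer errors above):

```sh
# Sketch: inspect the stuck notebook container's reported state.
kubectl get pod jupyter-amandaha8 -n jhub \
  -o jsonpath='{.status.containerStatuses[?(@.name=="notebook")].state}'
kubectl get pod jupyter-amandaha8 -n jhub \
  -o jsonpath='{.status.containerStatuses[?(@.name=="notebook")].lastState.terminated}'
kubectl describe pod jupyter-amandaha8 -n jhub      # Events section repeats the KillContainer failures
```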

@vevetron (Contributor, author):

I deleted amanda8's PVC. Her PV (pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2) should have disappeared, but it stayed in a "Released" state.

The "error killing pod" messages continued.

I started to see error messages like:

apiVersion: v1
kind: Event
metadata:
  name: pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2.1823f7345927fe19
  namespace: default
  uid: c82dc6a6-ef7a-43e9-90ee-3faa13665cd7
  resourceVersion: '8494020'
  creationTimestamp: '2025-02-14T04:01:25Z'
  selfLink: >-
    /api/v1/namespaces/default/events/pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2.1823f7345927fe19
involvedObject:
  kind: PersistentVolume
  name: pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2
  uid: eec8d300-cafc-453e-88c5-583be96b20f1
  apiVersion: v1
  resourceVersion: '1227599054'
reason: VolumeFailedDelete
message: >-
  persistentvolume pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2 is still attached to node
  gke-data-infra-apps-jupyterhub-users-6aa76dbb-b03b
source:
  component: >-
    pd.csi.storage.gke.io_gke-0fe1e974c7ae431b8906-4d25-25c2-vm_7faf716d-bff7-44d2-bfbd-7ea6b15eb27d
firstTimestamp: '2025-02-14T04:01:25Z'
lastTimestamp: '2025-02-14T04:02:28Z'
count: 7
type: Warning
eventTime: null
reportingComponent: >-
  pd.csi.storage.gke.io_gke-0fe1e974c7ae431b8906-4d25-25c2-vm_7faf716d-bff7-44d2-bfbd-7ea6b15eb27d
reportingInstance: ''

All the issues are with node gke-data-infra-apps-jupyterhub-users-6aa76dbb-b03b?

After a while, I started amanda8's profile in JupyterHub; it attached a new volume, and it attached quite quickly.

The KillContainerError messages haven't stopped.

I was easily able to delete amanda8's newest PVC, and the related PV disappeared as well.
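For next time, a rough sketch of how to check whether a Released PV is still attached to a node. The delete-volumeattachment step forces the CSI driver to re-detach; it is my assumption about a last resort, not something I ran here.

```sh
# Sketch: inspect a Released PV that refuses to go away.
kubectl get pv pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2 -o yaml | grep -A5 finalizers
kubectl get volumeattachments | grep pvc-537c2c55-3c90-44db-a36d-6bd4a4b0ebc2
# last resort, only if nothing is actually using the disk:
kubectl delete volumeattachment <attachment-name-from-previous-command>
```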
