
Worker dies because timeout is not respected #13060

Closed
MattDelac opened this issue May 5, 2023 · 13 comments

Labels: bug (Something isn't working)

Comments

@MattDelac commented May 5, 2023

Expectation / Proposal

Original conversation
The worker dies because some tasks run longer than the configured timeout.

Traceback / Example

RuntimeError: Timed out after 602.8259875774384s while waiting for Cloud Run Job execution to complete. Your job may still be running on GCP.
An error occured while monitoring flow run 'cdbc0be6-c964-45b5-ba1c-fce2d4e36f17'. The flow run will not be marked as failed, but an issue may have occurred.

> This is a separate issue; please open a question in the prefect-gcp repository if you want to discuss it further. It looks like your flow is running longer than the default timeout. See that piece of code.
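
For reference, the ~600 s in the traceback appears to line up with the CloudRun infrastructure block's timeout field (600 s by default, if I recall the prefect-gcp defaults correctly). A minimal sketch of raising it when creating the block; the block name, image, and region below are placeholders, not values from this issue:

from prefect_gcp.cloud_run import CloudRunJob

# Sketch: raise the monitoring timeout above the 600 s default so long flows
# are not abandoned while the Cloud Run Job execution is still in progress.
cloud_run_job = CloudRunJob(
    image="us-docker.pkg.dev/my-project/my-repo/my-flow:latest",  # placeholder image
    region="us-central1",                                         # placeholder region
    timeout=3600,  # seconds to wait for the Cloud Run Job execution to complete
)
cloud_run_job.save("my-cloud-run-block", overwrite=True)  # hypothetical block name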

@desertaxle (Member)

Thanks for submitting an issue @MattDelac! Do you have an example setup that we can use to reproduce this issue? In particular, sharing how your work pool is configured and the command that you use to start your worker would be helpful.

@MattDelac (Author)

The work pool is just a Prefect agent:

(screenshot of the work pool configuration)

And this is my startup script used on a Compute Engine VM:

# Install Python and pip
apt-get update -qy
apt-get install -y python3 python3-pip

# Install the Google Cloud Ops Agent for VM monitoring and logging
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
bash add-google-cloud-ops-agent-repo.sh --also-install

# Install Prefect 2.10.x and the GCP collection
python3 -m pip install --upgrade pip wheel
pip install "prefect==2.10.*" "prefect-gcp"

# Authenticate with Prefect Cloud and start the agent
prefect cloud login --key ${prefect_auth_key} --workspace mdelacourmedelysfr/medelys
PREFECT_API_ENABLE_HTTP2=false PREFECT_LOGGING_LEVEL=DEBUG prefect agent start --pool default-agent-pool --work-queue medelys-default

@zanieb (Contributor) commented May 5, 2023

The worker does not seem to die in that traceback — it just logs the error. Can you include more logs indicating that the worker is dead?

@MattDelac (Author)

> The worker does not seem to die in that traceback — it just logs the error. Can you include more logs indicating that the worker is dead?

As I posted in #7442 (comment), the worker is "waiting" on its side, but Prefect Cloud says that the worker is unhealthy 🤷

@MattDelac (Author)

> The worker does not seem to die in that traceback — it just logs the error. Can you include more logs indicating that the worker is dead?

And you're right, the worker might not die per se, but Prefect Cloud thinks it became unhealthy for reasons I cannot figure out.

@desertaxle (Member)

Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

@MattDelac (Author)

> Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

(screenshot of the CloudRun block configuration)

Flows are getting piled up and marked as "late".

@MattDelac (Author)

> Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

(additional screenshot of the CloudRun block configuration)

@MattDelac (Author)

And looking at the logs, the agent just waits.

(screenshot of the agent logs)

Neither restarting nor recreating the VM fixes it (or it fixes it once every 10 times 🤷).

@MattDelac (Author) commented May 5, 2023

> Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

@desertaxle Is there a way to share a JSON config or something nicer and more verbose?
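
One option that may be nicer than a screenshot is dumping the saved block as JSON. A sketch, assuming the prefect-gcp CloudRunJob block saved under a placeholder name (blocks are pydantic models, so .json() should work; redact any credentials before sharing):

from prefect_gcp.cloud_run import CloudRunJob

# Sketch: load the saved block and print its configuration as JSON.
# "my-cloud-run-block" is a placeholder name; strip keys/credentials before posting.
block = CloudRunJob.load("my-cloud-run-block")
print(block.json(indent=2))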

@MattDelac (Author)

OK @desertaxle, the problem is not that the agent dies; here is the behavior I see:

  • My work queue has a limit of 10 concurrent jobs
  • Jobs get started and some run longer than the timeout, so they are never killed on the Cloud Run side. For some reason they are no longer actually running in Cloud Run either, but they are kept "running" on Prefect Cloud
  • Up to 10 jobs are always "running" artificially
  • Prefect Cloud marks my agent as unhealthy and stops processing jobs
  • When I manually kill the running jobs, 10 more from the "late" ones start

So yeah, the real fix here is to ensure that the timeout is respected, and maybe to have Prefect Cloud check whether the jobs are still running once an hour, for example. That might help Prefect Cloud clean up its internal state of "running" jobs.
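
As an interim workaround while the timeout is not enforced, the stuck runs could be force-cancelled through the Prefect client so the work queue slots free up. This is a sketch against the Prefect 2.x client API; the filter classes and force-cancel semantics are assumptions to verify against your Prefect version:

import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Cancelled

async def cancel_stuck_runs():
    async with get_client() as client:
        # Find every flow run Prefect Cloud still believes is Running.
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                )
            )
        )
        for run in runs:
            # Force the state to Cancelled so the concurrency slot is released.
            await client.set_flow_run_state(run.id, state=Cancelled(), force=True)

asyncio.run(cancel_stuck_runs())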

@MattDelac (Author) commented May 7, 2023

Also, Prefect Cloud cannot keep track of the jobs properly ...
This is really weird.

(screenshots: Prefect Cloud still lists the flow runs as running, while GCP shows no running jobs)

I don't have any job running when checking in GCP. Only Prefect Cloud thinks that the jobs are still running.

desertaxle added the bug (Something isn't working) label on May 30, 2023
desertaxle transferred this issue from PrefectHQ/prefect-gcp on Apr 26, 2024
@cicdw (Member) commented Jan 31, 2025

Given that this issue is related to agents, which have been deprecated and removed, I'm going to close as "not planned"; if this problem persists with Cloud Run Workers, please open a new issue and we will look into it!

cicdw closed this as not planned on Jan 31, 2025