
Worker dies because timeout is not respected #13060

Closed
MattDelac opened this issue May 5, 2023 · 13 comments

Labels: bug (Something isn't working)

Comments

@MattDelac commented May 5, 2023

Expectation / Proposal

Original conversation
The worker dies because some tasks run longer than the configured timeout.

Traceback / Example

RuntimeError: Timed out after 602.8259875774384s while waiting for Cloud Run Job execution to complete. Your job may still be running on GCP.
An error occured while monitoring flow run 'cdbc0be6-c964-45b5-ba1c-fce2d4e36f17'. The flow run will not be marked as failed, but an issue may have occurred.

> This is a separate issue; please open a question in the prefect-gcp repository if you want to discuss it further. It looks like your flow is running longer than the default timeout. See that piece of code.
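
For reference, the ~600 s in the traceback appears to line up with the CloudRun infrastructure block's timeout field (600 s by default, if I recall the prefect-gcp defaults correctly). A minimal sketch of raising it when creating the block; the block name, image, and region below are placeholders, not values from this issue:

from prefect_gcp.cloud_run import CloudRunJob

# Sketch: raise the monitoring timeout above the 600 s default so long flows
# are not abandoned while the Cloud Run Job execution is still in progress.
cloud_run_job = CloudRunJob(
    image="us-docker.pkg.dev/my-project/my-repo/my-flow:latest",  # placeholder image
    region="us-central1",                                         # placeholder region
    timeout=3600,  # seconds to wait for the Cloud Run Job execution to complete
)
cloud_run_job.save("my-cloud-run-block", overwrite=True)  # hypothetical block name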

@desertaxle (Member)

Thanks for submitting an issue @MattDelac! Do you have an example setup that we can use to reproduce this issue? In particular, sharing how your work pool is configured and the command that you use to start your worker would be helpful.

@MattDelac (Author)

The work pool is just a Prefect agent:

(screenshot of the work pool configuration)

And this is my startup script used on a Compute Engine VM:

# Install Python and pip
apt-get update -qy
apt-get install -y python3 python3-pip

# Install the Google Cloud Ops Agent for VM monitoring and logging
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
bash add-google-cloud-ops-agent-repo.sh --also-install

# Install Prefect 2.10.x and the GCP collection
python3 -m pip install --upgrade pip wheel
pip install "prefect==2.10.*" "prefect-gcp"

# Authenticate with Prefect Cloud and start the agent
prefect cloud login --key ${prefect_auth_key} --workspace mdelacourmedelysfr/medelys
PREFECT_API_ENABLE_HTTP2=false PREFECT_LOGGING_LEVEL=DEBUG prefect agent start --pool default-agent-pool --work-queue medelys-default

@zanieb (Contributor) commented May 5, 2023

The worker does not seem to die in that traceback — it just logs the error. Can you include more logs indicating that the worker is dead?

@MattDelac (Author)

> The worker does not seem to die in that traceback — it just logs the error. Can you include more logs indicating that the worker is dead?

As I posted in #7442 (comment), the worker is "waiting" on its side, but Prefect Cloud says that the worker is unhealthy 🤷

@MattDelac (Author)

> The worker does not seem to die in that traceback — it just logs the error. Can you include more logs indicating that the worker is dead?

And you're right, the worker might not die per se, but Prefect Cloud thinks it became unhealthy for reasons I cannot figure out.

@desertaxle (Member)

Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

@MattDelac (Author)

> Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

(screenshot of the CloudRun block configuration)

Flows are getting piled up and marked as "late".

@MattDelac (Author)

> Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

(additional screenshot of the CloudRun block configuration)

@MattDelac (Author)

And looking at the logs, the agent just waits.

(screenshot of the agent logs)

Neither restarting nor recreating the VM fixes it (or it fixes it once every 10 times 🤷).

@MattDelac (Author) commented May 5, 2023

> Gotcha, looks like you're using an agent with a CloudRun infrastructure block. Could you share your CloudRun block configuration? Also, can your agent continue picking up flow runs after this error, or does it stop picking up flow runs?

@desertaxle Is there a way to share a JSON config or something nicer and more verbose?
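
One option that may be nicer than a screenshot is dumping the saved block as JSON. A sketch, assuming the prefect-gcp CloudRunJob block saved under a placeholder name (blocks are pydantic models, so .json() should work; redact any credentials before sharing):

from prefect_gcp.cloud_run import CloudRunJob

# Sketch: load the saved block and print its configuration as JSON.
# "my-cloud-run-block" is a placeholder name; strip keys/credentials before posting.
block = CloudRunJob.load("my-cloud-run-block")
print(block.json(indent=2))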

@MattDelac (Author)

OK @desertaxle, the problem is not that the agent dies; here is the behavior I see:

  • My work queue has a limit of 10 concurrent jobs
  • Jobs get started and some run longer than the timeout, so they are never killed on the Cloud Run side. For some reason they are no longer actually running in Cloud Run either, but they are kept "running" on Prefect Cloud
  • Up to 10 jobs are always "running" artificially
  • Prefect Cloud marks my agent as unhealthy and stops processing jobs
  • When I manually kill the running jobs, 10 more from the "late" ones start

So yeah, the real fix here is to ensure that the timeout is respected, and maybe to have Prefect Cloud check whether the jobs are still running once an hour, for example. That might help Prefect Cloud clean up its internal state of "running" jobs.
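
As an interim workaround while the timeout is not enforced, the stuck runs could be force-cancelled through the Prefect client so the work queue slots free up. This is a sketch against the Prefect 2.x client API; the filter classes and force-cancel semantics are assumptions to verify against your Prefect version:

import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Cancelled

async def cancel_stuck_runs():
    async with get_client() as client:
        # Find every flow run Prefect Cloud still believes is Running.
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                )
            )
        )
        for run in runs:
            # Force the state to Cancelled so the concurrency slot is released.
            await client.set_flow_run_state(run.id, state=Cancelled(), force=True)

asyncio.run(cancel_stuck_runs())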

@MattDelac (Author) commented May 7, 2023

Also, Prefect Cloud cannot keep track of the jobs properly ...
This is really weird.

(screenshots: Prefect Cloud still lists the flow runs as running, while GCP shows no running jobs)

I don't have any job running when checking in GCP. Only Prefect Cloud thinks that the jobs are still running.

desertaxle added the bug (Something isn't working) label on May 30, 2023
desertaxle transferred this issue from PrefectHQ/prefect-gcp on Apr 26, 2024
@cicdw (Member) commented Jan 31, 2025

Given that this issue is related to agents, which have been deprecated and removed, I'm going to close as "not planned"; if this problem persists with Cloud Run Workers, please open a new issue and we will look into it!

cicdw closed this as not planned on Jan 31, 2025