
[Bug] Sticky execution after Worker shutdown causes "Workflow Task Timed Out" #783

Open
gonced8 opened this issue Mar 5, 2025 · 1 comment
Labels
bug Something isn't working

Comments

gonced8 commented Mar 5, 2025

What are you really trying to do?

Restarting workers, even with graceful shutdown, causes Workflow Tasks to time out due to sticky execution.

Describe the bug

When a Workflow is executing and the Workflow Worker restarts, the next Workflow Task is still scheduled to that worker because of sticky execution, which causes a "Workflow Task Timed Out". The expected behavior would be for the Worker to inform the Temporal server of its shutdown and disable stickiness, so that the next Workflow Task is scheduled to the original task queue.

Moreover, the Workflow Worker should inform the server as soon as its shutdown process starts, so that no new Workflow Tasks are assigned to it during the graceful shutdown; since the worker is no longer polling, those Workflow Tasks would otherwise time out.
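
For reference, the length of this penalty appears to be governed by the worker's sticky schedule-to-start timeout: lowering it shortens the delay but does not remove the "Workflow Task Timed Out" event. A minimal sketch, assuming the sticky_queue_schedule_to_start_timeout option of temporalio.worker.Worker is the relevant knob:

from datetime import timedelta

from temporalio.client import Client
from temporalio.worker import Worker


async def impatient_worker():
    # Same setup as the reproduction below; SimpleWorkflow is defined there.
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="my-task-queue",
        workflows=[SimpleWorkflow],
        # Assumption: a shorter sticky schedule-to-start timeout (the SDK default
        # appears to be 10s) makes the server fall back to the original task queue
        # sooner after the sticky worker disappears; the timeout event still occurs.
        sticky_queue_schedule_to_start_timeout=timedelta(seconds=1),
    )
    await worker.run()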

Below is a minimal example that reproduces this bug. The Workflow sleeps for 10 seconds and then returns. The Workflow Worker starts, runs for 5 seconds, and shuts down. The Workflow Task after the sleep should be assigned to the original task queue, since the Workflow Worker is no longer running.

Here you can see a screenshot of the current behavior. There is a "Workflow Task Timed Out" because of sticky execution.
[Screenshot: workflow history showing a "Workflow Task Timed Out" event]

With sticky execution disabled (max_cached_workflows=0), the Workflow Task does not time out, as expected.
[Screenshot: workflow history with sticky execution disabled; no timeout]
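
For completeness, the workaround shown in the screenshot above is just constructing the worker with the workflow cache disabled. A sketch (note that max_cached_workflows=0 also gives up workflow caching, so this is a workaround rather than a fix):

async with Worker(
    client,  # same client and task queue as in the reproduction below
    task_queue="my-task-queue",
    workflows=[SimpleWorkflow],
    max_cached_workflows=0,  # disables sticky execution; no "Workflow Task Timed Out"
) as worker:
    await asyncio.sleep(5)  # run briefly, then shut down as in the reproduction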

Minimal Reproduction

import asyncio
import multiprocessing
from datetime import timedelta

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class SimpleWorkflow:
    @workflow.run
    async def run(self) -> None:
        workflow.logger.info("Running SimpleWorkflow")
        workflow.logger.info("Sleeping for 10 seconds")
        await workflow.sleep(timedelta(seconds=10))
        workflow.logger.info("Done sleeping")
        return


async def worker():
    # Create client connected to server at the given address
    client = await Client.connect("localhost:7233")

    # Run the worker
    async with Worker(
        client,
        task_queue="my-task-queue",
        workflows=[SimpleWorkflow],
        graceful_shutdown_timeout=timedelta(seconds=3),
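        # Uncommenting the next line disables sticky execution, and the timeout no longer occurs.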
        # max_cached_workflows=0,
    ) as worker:
        print("Starting worker")
        # Run for 5 seconds
        await asyncio.sleep(5)
        print("Stopping worker")

    print("Stopping Temporal client")


async def client():
    # Create client connected to server at the given address
    client = await Client.connect("localhost:7233")

    # Execute a workflow
    print("Executing SimpleWorkflow")
    await client.execute_workflow(
        SimpleWorkflow.run,
        id="test-id",
        task_queue="my-task-queue",
        run_timeout=timedelta(seconds=25),
    )
    print(f"Finished SimpleWorkflow")


def start_client():
    asyncio.run(client())


if __name__ == "__main__":
    # Start client in different process
    multiprocessing.Process(target=start_client).start()

    # Start worker
    asyncio.run(worker())
    print("Exited worker")

Environment/Versions

  • OS and processor: M2 Mac; the same error also occurs with an AMD-based image
  • Temporal Version: CLI 1.3.0 (Server 1.27.1, UI 2.36.0) and sdk-python 1.10.0
  • Are you using Docker or Kubernetes or building Temporal from source? Tried with the Temporal CLI, and also with Docker and Kubernetes.

Additional context

Discussed this issue with @cretz during the Replay conference (thank you!).


cretz commented Mar 7, 2025

Thanks for opening this! We will investigate. This penalty has always existed, but we have a new-ish ShutdownWorker call and something may be amiss here.
