Restarting workers, even with graceful shutdown, causes Workflow Tasks to time out due to sticky execution.
Describe the bug
When a Workflow is executing and the Workflow Worker restarts, the next Workflow Task is still scheduled to that worker because of execution stickiness, which causes a "Workflow Task Timed Out". I would expect the Worker to inform the Temporal server of its shutdown and disable stickiness, so that the next Workflow Task is scheduled to the original task queue.
Moreover, the Workflow Worker should inform the server as soon as the shutdown process starts, so that no new Workflow Tasks are assigned to it during its graceful shutdown; the worker is no longer polling at that point, so any such tasks would time out as well.
Below is a minimal example that reproduces this bug. The Workflow sleeps for 10 seconds and then returns. The Workflow Worker starts, runs for 5 seconds, and shuts down. The Workflow Task after the sleep should then be assigned to the original task queue, since the Workflow Worker is no longer running.
A screenshot of the current behavior shows a "Workflow Task Timed Out" caused by sticky execution.
With sticky execution disabled (max_cached_workflows=0), the Workflow Task does not time out, as expected.
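For reference, here is a minimal sketch of that workaround: the Worker is constructed exactly as in the reproduction below (same client, task queue, and workflow class), except that max_cached_workflows is set to 0 so no sticky task queue is used.

# Sketch of the workaround, reusing the imports and definitions from the
# reproduction below. The only change is disabling the workflow cache, so
# Workflow Tasks stay on the original task queue instead of a sticky queue.
worker = Worker(
    client,
    task_queue="my-task-queue",
    workflows=[SimpleWorkflow],
    graceful_shutdown_timeout=timedelta(seconds=3),
    max_cached_workflows=0,  # disables sticky execution
)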
Minimal Reproduction
import asyncio
import multiprocessing
from datetime import timedelta

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class SimpleWorkflow:
    @workflow.run
    async def run(self) -> None:
        workflow.logger.info("Running SimpleWorkflow")
        workflow.logger.info("Sleeping for 10 seconds")
        await workflow.sleep(timedelta(seconds=10))
        workflow.logger.info("Done sleeping")
        return


async def worker():
    # Create client connected to server at the given address
    client = await Client.connect("localhost:7233")

    # Run the worker
    async with Worker(
        client,
        task_queue="my-task-queue",
        workflows=[SimpleWorkflow],
        graceful_shutdown_timeout=timedelta(seconds=3),
        # max_cached_workflows=0,
    ) as worker:
        print("Starting worker")
        # Run for 5 seconds
        await asyncio.sleep(5)
        print("Stopping worker")
    print("Stopping Temporal client")


async def client():
    # Create client connected to server at the given address
    client = await Client.connect("localhost:7233")

    # Execute a workflow
    print("Executing SimpleWorkflow")
    await client.execute_workflow(
        SimpleWorkflow.run,
        id="test-id",
        task_queue="my-task-queue",
        run_timeout=timedelta(seconds=25),
    )
    print("Finished SimpleWorkflow")


def start_client():
    asyncio.run(client())


if __name__ == "__main__":
    # Start client in different process
    multiprocessing.Process(target=start_client).start()

    # Start worker
    asyncio.run(worker())
    print("Exited worker")
Environment/Versions
OS and processor: M2 Mac, but the same error happens with an AMD-based image.
Additional context
Discussed this issue with @cretz during the Replay conference (thank you!).

Thanks for opening this! We will investigate. This penalty has always existed, but we have a new-ish ShutdownWorker call and something may be amiss here.