Prevent Concurrent TaskInstance Tries in Scheduler HA #59234

pdellarciprete · 2025-12-09T10:40:17Z

Fix Concurrency Issue: Prevent Concurrent TaskInstance Tries in Scheduler HA

Related issue: #57618

This PR aims to fix a sporadic but critical race condition in the HA scheduler setup that could lead to concurrent execution of the same Task Instance (TI) with sequential try numbers (e.g., Try #1 and Try #2 running simultaneously).

🐞 Problem Description

In a multi-scheduler environment, if two schedulers detected that a TI was ready to run (state None) at nearly the exact same moment, both would attempt to push the task into the SCHEDULED state.

The race failure occurred due to the following two factors in the DagRun.schedule_tis() logic:

Permissive WHERE Clause: The UPDATE query only used TI.id.in_(id_chunk), lacking a final state check.
Flawed try_number Logic: The try_number was advanced (TI.try_number + 1) if the state was not UP_FOR_RESCHEDULE.

When Scheduler B lost the race to Scheduler A:

Scheduler A committed the state change to SCHEDULED (Try 1).
Scheduler B's transaction immediately followed. Since the WHERE clause matched the ID, and the scheduled_dttm field changed, the update was successful (rowcount=1).
Scheduler B's permissive CASE logic saw the new SCHEDULED state and incorrectly determined it should be advanced to try_number=2, thus corrupting the record and enabling a concurrent run.
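
The sequence above can be reproduced in miniature. What follows is a minimal sketch, not the code from DagRun.schedule_tis(): it uses Python's built-in sqlite3 instead of Airflow's SQLAlchemy models, and the table layout, the helper names make_db and permissive_schedule, and the lowercase state strings are all hypothetical stand-ins. The CASE expression is only modelled on the description (advance try_number unless the state is UP_FOR_RESCHEDULE), and the two calls run back to back rather than from two real scheduler processes.

```python
import sqlite3


def make_db():
    """Create an in-memory table with one unscheduled task instance (state NULL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance ("
        "id TEXT PRIMARY KEY, state TEXT, try_number INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO task_instance VALUES ('ti-1', NULL, 0)")
    conn.commit()
    return conn


def permissive_schedule(conn, ti_id):
    """Hypothetical pre-fix update: matches on id only, with no state check."""
    cur = conn.execute(
        """
        UPDATE task_instance
        SET try_number = CASE
                WHEN state = 'up_for_reschedule' THEN try_number
                ELSE try_number + 1
            END,
            state = 'scheduled'
        WHERE id = ?
        """,
        (ti_id,),
    )
    conn.commit()
    return cur.rowcount


conn = make_db()
rowcount_a = permissive_schedule(conn, "ti-1")  # Scheduler A wins the race
rowcount_b = permissive_schedule(conn, "ti-1")  # Scheduler B follows immediately
state, try_number = conn.execute(
    "SELECT state, try_number FROM task_instance WHERE id = 'ti-1'"
).fetchone()
print(rowcount_a, rowcount_b, state, try_number)  # prints: 1 1 scheduled 2
```

Both updates report one affected row, and the second one advances try_number again even though the row was already scheduled, which is the corruption described above.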

✅ Solution: Enforce Optimistic Concurrency Control

This PR enforces Optimistic Concurrency Control (OCC) by adding a restrictive WHERE clause to the atomic update operation.

The added condition is: .where(TI.state.in_(SCHEDULEABLE_STATES))

How this fixes the race:

Atomic Claim: When Scheduler A runs the update, the WHERE clause holds true, and the update succeeds (rowcount=1).
Graceful Concession: When Scheduler B runs its update immediately after, the TI's state is already SCHEDULED (which is not in SCHEDULEABLE_STATES).
The WHERE clause fails to find a match, the query returns rowcount=0, and Scheduler B correctly and safely concedes the scheduling claim without touching the TI record.

This ensures that the try_number can no longer be corrupted during the None → SCHEDULED transition, resolving the concurrent task execution bug.
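
The guarded update can be sketched the same way. This is an illustration under stated assumptions rather than the PR's actual diff: it uses Python's built-in sqlite3 instead of SQLAlchemy, and the table layout, the helper names make_db and guarded_schedule, the lowercase state strings, and the contents of SCHEDULABLE_NON_NULL are hypothetical stand-ins for TI and SCHEDULEABLE_STATES. The sketch tests for NULL explicitly because a plain SQL IN never matches NULL; how the real SQLAlchemy expression treats the None state is not covered by the description.

```python
import sqlite3

# Hypothetical stand-in for the non-None members of SCHEDULEABLE_STATES.
SCHEDULABLE_NON_NULL = ("up_for_retry", "up_for_reschedule")


def make_db():
    """Create an in-memory table with one unscheduled task instance (state NULL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance ("
        "id TEXT PRIMARY KEY, state TEXT, try_number INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO task_instance VALUES ('ti-1', NULL, 0)")
    conn.commit()
    return conn


def guarded_schedule(conn, ti_id):
    """Claim the row only if it is still in a schedulable state."""
    placeholders = ", ".join("?" for _ in SCHEDULABLE_NON_NULL)
    cur = conn.execute(
        f"""
        UPDATE task_instance
        SET try_number = CASE
                WHEN state = 'up_for_reschedule' THEN try_number
                ELSE try_number + 1
            END,
            state = 'scheduled'
        WHERE id = ?
          -- NULL needs its own test: "state IN (...)" is never true for NULL
          AND (state IS NULL OR state IN ({placeholders}))
        """,
        (ti_id, *SCHEDULABLE_NON_NULL),
    )
    conn.commit()
    return cur.rowcount


conn = make_db()
rowcount_a = guarded_schedule(conn, "ti-1")  # Scheduler A claims the row
rowcount_b = guarded_schedule(conn, "ti-1")  # Scheduler B finds nothing to claim
state, try_number = conn.execute(
    "SELECT state, try_number FROM task_instance WHERE id = 'ti-1'"
).fetchone()
print(rowcount_a, rowcount_b, state, try_number)  # prints: 1 0 scheduled 1
```

The second call matches no row, so a caller can treat rowcount 0 as "another scheduler already claimed this TI" and skip it, while try_number stays at 1.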

boring-cyborg · 2025-12-09T10:40:23Z

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
Be sure to read the Airflow Coding style.
Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: [email protected]
Slack: https://s.apache.org/airflow-slack

pdellarciprete · 2025-12-09T10:44:15Z

Related to the issue reported in #57618.

explicitly exclude tasks that are already in the SCHEDULED state in…

01cfa26

… the `WHERE` clause

pdellarciprete changed the title ~~explicitly exclude tasks that are already in the SCHEDULED state in…~~ Prevent Concurrent TaskInstance Tries in Scheduler HA Dec 9, 2025
