
Conversation


@pdellarciprete pdellarciprete commented Dec 9, 2025

Fix Concurrency Issue: Prevent Concurrent TaskInstance Tries in Scheduler HA

Related issue: #57618

This PR aims to fix a sporadic but critical race condition in the HA scheduler setup that could lead to concurrent execution of the same Task Instance (TI) with sequential try numbers (e.g., Try #1 and Try #2 running simultaneously).

🐞 Problem Description

In a multi-scheduler environment, if two schedulers detected that a TI was ready to run (state None) at nearly the exact same moment, both would attempt to push the task into the SCHEDULED state.

The race failure occurred due to the following two factors in the DagRun.schedule_tis() logic:

  1. Permissive WHERE Clause: The UPDATE query matched on TI.id.in_(id_chunk) alone, with no check on the TI's current state.
  2. Flawed try_number Logic: The try_number was advanced (TI.try_number + 1) whenever the state was not UP_FOR_RESCHEDULE; a simplified sketch of this update follows below.
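
For concreteness, here is a minimal sketch of the pre-fix claim, reconstructed from this description rather than copied from the Airflow source; the function name, type hints, and import paths are illustrative and may differ by Airflow version:

```python
from sqlalchemy import case, or_, update
from sqlalchemy.orm import Session

from airflow.models.taskinstance import TaskInstance as TI
from airflow.utils import timezone
from airflow.utils.state import TaskInstanceState


def claim_tis_pre_fix(session: Session, id_chunk: list[str]) -> int:
    """Pre-fix claim: matches on ID only and bumps try_number too eagerly (sketch)."""
    result = session.execute(
        update(TI)
        # Permissive WHERE: matches on ID alone, regardless of the TI's current state.
        .where(TI.id.in_(id_chunk))
        .values(
            state=TaskInstanceState.SCHEDULED,
            scheduled_dttm=timezone.utcnow(),
            # Flawed guard: bumps try_number for every state except UP_FOR_RESCHEDULE,
            # including a TI another scheduler has just moved to SCHEDULED.
            try_number=case(
                (
                    or_(TI.state.is_(None), TI.state != TaskInstanceState.UP_FOR_RESCHEDULE),
                    TI.try_number + 1,
                ),
                else_=TI.try_number,
            ),
        )
        .execution_options(synchronize_session=False)
    )
    return result.rowcount
```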

When Scheduler B lost the race to Scheduler A:

  • Scheduler A committed the state change to SCHEDULED (Try 1).
  • Scheduler B's transaction followed immediately. Because the WHERE clause matched on ID alone and the UPDATE still changed the scheduled_dttm field, the statement succeeded (rowcount=1).
  • Scheduler B's permissive CASE logic saw the new SCHEDULED state and still advanced the record to try_number=2, corrupting it and enabling a concurrent run, as the walk-through below illustrates.
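
A plain-Python walk-through of that guard (purely illustrative; the function name is invented and the states are reduced to strings) shows how the counter gets bumped twice:

```python
def old_next_try_number(state: str | None, try_number: int) -> int:
    # The pre-fix guard, reduced to plain Python: bump unless UP_FOR_RESCHEDULE.
    if state is None or state != "up_for_reschedule":
        return try_number + 1
    return try_number


# Scheduler A claims the TI: state None -> SCHEDULED, try_number 0 -> 1.
print(old_next_try_number(None, 0))         # 1
# Scheduler B's update lands a moment later: the state is already "scheduled",
# yet the guard still bumps the counter, producing the concurrent Try #2.
print(old_next_try_number("scheduled", 1))  # 2
```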

✅ Solution: Enforce Optimistic Concurrency Control

This PR enforces Optimistic Concurrency Control (OCC) by adding a restrictive WHERE clause to the atomic update operation.

The added condition is: .where(TI.state.in_(SCHEDULEABLE_STATES))
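
In sketch form (again reconstructed from this description rather than the literal diff; the function name is invented and SCHEDULEABLE_STATES is assumed to contain None and UP_FOR_RESCHEDULE), the claim becomes:

```python
from sqlalchemy import update
from sqlalchemy.orm import Session

from airflow.models.taskinstance import TaskInstance as TI
from airflow.utils.state import TaskInstanceState

# Assumed contents, mirroring the SCHEDULEABLE_STATES constant the PR references.
SCHEDULEABLE_STATES = {None, TaskInstanceState.UP_FOR_RESCHEDULE}


def claim_tis_post_fix(session: Session, id_chunk: list[str]) -> int:
    """Post-fix claim: the extra WHERE turns the update into an OCC check (sketch)."""
    result = session.execute(
        update(TI)
        .where(TI.id.in_(id_chunk))
        # OCC guard added by this PR: only rows still in a schedulable state match.
        .where(TI.state.in_(SCHEDULEABLE_STATES))
        .values(state=TaskInstanceState.SCHEDULED)
        .execution_options(synchronize_session=False)
    )
    # rowcount == 0 means another scheduler already claimed these TIs.
    return result.rowcount
```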

How this fixes the race:

  1. Atomic Claim: When Scheduler A runs the update, the WHERE clause matches, and the update succeeds (rowcount=1).
  2. Graceful Concession: When Scheduler B runs its update immediately afterwards, the TI's state is already SCHEDULED, which is not in SCHEDULEABLE_STATES.
  3. Scheduler B's WHERE clause therefore matches no rows, the query returns rowcount=0, and Scheduler B correctly and safely concedes the scheduling claim without touching the TI record (see the toy demonstration below).
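
As a self-contained toy demonstration of the claim/concede pattern, here is the same idea against an in-memory SQLite table, with the schedulable-state guard spelled out as an explicit predicate (illustrative only, not Airflow code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ti (id TEXT PRIMARY KEY, state TEXT, try_number INTEGER)")
conn.execute("INSERT INTO ti VALUES ('ti-1', NULL, 0)")

CLAIM = """
    UPDATE ti
       SET state = 'scheduled', try_number = try_number + 1
     WHERE id = ?
       AND (state IS NULL OR state = 'up_for_reschedule')
"""

# Scheduler A's claim matches the row: rowcount=1, try_number 0 -> 1.
print(conn.execute(CLAIM, ("ti-1",)).rowcount)                       # 1
# Scheduler B's identical claim arrives next: the state is already 'scheduled',
# so the guard matches nothing and B concedes without touching the row.
print(conn.execute(CLAIM, ("ti-1",)).rowcount)                       # 0
print(conn.execute("SELECT state, try_number FROM ti").fetchone())   # ('scheduled', 1)
```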

This ensures that the try_number can no longer be corrupted during the None → SCHEDULED transition, resolving the concurrent task execution bug.


boring-cyborg bot commented Dec 9, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

@pdellarciprete pdellarciprete changed the title explicitly exclude tasks that are already in the SCHEDULED state in… Prevent Concurrent TaskInstance Tries in Scheduler HA Dec 9, 2025
@pdellarciprete
Author

Related to the issue reported here: #57618
