
fix(matching): Limit context deadline when calling RecordTaskStarted #7792

Open
natemort wants to merge 1 commit into cadence-workflow:master from natemort:retry

Conversation


@natemort natemort commented Mar 5, 2026

Calls from matching to history for RecordActivityTaskStarted or RecordDecisionTaskStarted keep the poller occupied while waiting for a response. High latency, especially from a single host or shard, can result in degraded throughput for a TaskList as a disproportionate amount of poller time is spent attempting to start these tasks.

Limit each call to a timeout of 1s, and enforce the expiration interval via a context timeout. Currently ThrottleRetry only prevents additional retries from occurring after the expiration interval. Depending on the incoming context, it's possible for a single attempt to outlast the entire expiration interval. Particularly in the context of matching, we have long pollers with very high context timeouts.

Add new options to ThrottleRetry to add a timeout for each attempt of the operation and to enforce the expiration interval via context timeout.

While it would be ideal to enforce the expiration interval by default, this would be a very risky change to make all at once. There are 68 different retry policies and it's likely that in some cases we have attempts that exceed the expiration interval. We should introduce this behavior gradually, ideally with metrics/logging to flag poorly configured RetryPolicies. Adding these metrics/logging is non-trivial and will be explored separately.

What changed?

  • Limit RecordActivityTaskStarted/RecordDecisionTaskStarted to 1s per attempt and 3s total.

Why?

  • Prevent bad hosts/shards from occupying a significant amount of poller time

How did you test it?

  • Unit/integration tests

Potential risks

  • If all history hosts become significantly degraded, with latency above 1s, we will see two changes in behavior:
    • We would be unable to dispatch tasks
    • We would retry the tasks through matching by writing them to the end of the Tasks queue, creating additional load on Cassandra rather than hanging.

This seems like a reasonable compromise to be more resilient against individual hosts/shards.

Release notes

Documentation Changes


gitar-bot bot commented Mar 5, 2026

Code Review ✅ Approved 1 resolved / 1 finding

Fixes context deadline handling in RecordTaskStarted by updating MultiPhasesRetryPolicy.Expiration() to properly respect NoInterval semantics. No remaining issues found.

✅ 1 resolved
Bug: MultiPhasesRetryPolicy.Expiration() ignores NoInterval semantics

📄 common/backoff/retrypolicy.go:215
NoInterval = 0 means "no expiration limit" — ExponentialRetryPolicy.ComputeNextDelay treats expirationInterval == 0 as unlimited. However, MultiPhasesRetryPolicy.Expiration() simply sums the sub-policy expirations. If any sub-policy has NoInterval (0), the sum won't reflect "unlimited" — it will return the sum of the other policies' expirations, which is incorrect.

If MultiPhasesRetryPolicy is ever used with WithContextExpiration(), this would set a finite context deadline when one of the phases intended to have no expiration limit, potentially cutting retries short.

Not a current issue since MultiPhasesRetryPolicy isn't used with expireContext today, but worth guarding against.

Rules ❌ No requirements met

Repository Rules

GitHub Issue Linking Requirement: Add a GitHub issue link (e.g., #123 or https://github.com/cadence-workflow/cadence/issues/NNN) to the PR description.
PR Description Quality Standards: Add specific copyable test commands to [How did you test it?], and fill in or explicitly mark N/A for [Release notes] and [Documentation Changes] sections.

1 rule not applicable. Show all rules by commenting gitar display:verbose.

