
fix(matching): Limit context deadline when calling RecordTaskStarted #7792

Open
natemort wants to merge 1 commit into cadence-workflow:master from natemort:retry

Conversation


@natemort natemort commented Mar 5, 2026

Calls from matching to history for RecordActivityTaskStarted or RecordDecisionTaskStarted keep the poller occupied while waiting for a response. High latency, especially from a single host or shard, can result in degraded throughput for a TaskList as a disproportionate amount of poller time is spent attempting to start these tasks.

Limit each call to a timeout of 1s, and enforce the expiration interval via a context timeout. Currently ThrottleRetry only prevents additional retries from occurring after the expiration interval. Depending on the incoming context, it's possible for a single attempt to outlast the entire expiration interval. Particularly in the context of matching, we have long pollers with very high context timeouts.

Add new options to ThrottleRetry to add a timeout for each attempt of the operation and to enforce the expiration interval via context timeout.

While it would be ideal to enforce the expiration interval by default, this would be a very risky change to make all at once. There are 68 different retry policies and it's likely that in some cases we have attempts that exceed the expiration interval. We should introduce this behavior gradually, ideally with metrics/logging to flag poorly configured RetryPolicies. Adding these metrics/logging is non-trivial and will be explored separately.

What changed?

  • Limit RecordActivityTaskStarted/RecordDecisionTaskStarted to 1s per attempt and 3s total.

Why?

  • Prevent bad hosts/shards from occupying a significant amount of poller time

How did you test it?

  • Unit/integration tests

Potential risks

  • If all history hosts become significantly degraded, with latency above 1s, we will see two changes in behavior:
    • We would be unable to dispatch tasks
    • We would retry the tasks through matching by writing them to the end of the Tasks queue, creating additional load on Cassandra rather than hanging.

This seems like a reasonable compromise to be more resilient against individual hosts/shards.

Release notes

Documentation Changes


gitar-bot bot commented Mar 5, 2026

Code Review ✅ Approved 1 resolved / 1 finding

Fixes context deadline handling in RecordTaskStarted by updating MultiPhasesRetryPolicy.Expiration() to properly respect NoInterval semantics. No remaining issues found.

✅ 1 resolved
Bug: MultiPhasesRetryPolicy.Expiration() ignores NoInterval semantics

📄 common/backoff/retrypolicy.go:215
NoInterval = 0 means "no expiration limit" — ExponentialRetryPolicy.ComputeNextDelay treats expirationInterval == 0 as unlimited. However, MultiPhasesRetryPolicy.Expiration() simply sums the sub-policy expirations. If any sub-policy has NoInterval (0), the sum won't reflect "unlimited" — it will return the sum of the other policies' expirations, which is incorrect.

If MultiPhasesRetryPolicy is ever used with WithContextExpiration(), this would set a finite context deadline when one of the phases intended to have no expiration limit, potentially cutting retries short.

Not a current issue since MultiPhasesRetryPolicy isn't used with expireContext today, but worth guarding against.

Rules ❌ No requirements met

Repository Rules

GitHub Issue Linking Requirement: Add a GitHub issue link (e.g., #123 or https://github.com/cadence-workflow/cadence/issues/NNN) to the PR description.
PR Description Quality Standards: Add specific copyable test commands to [How did you test it?], and fill in or explicitly mark N/A for [Release notes] and [Documentation Changes] sections.

1 rule not applicable. Show all rules by commenting gitar display:verbose.

