fix(matching): Limit context deadline when calling RecordTaskStarted #7792
natemort wants to merge 1 commit into cadence-workflow:master
Conversation
Calls from matching to history for RecordActivityTaskStarted or RecordDecisionTaskStarted keep the poller occupied while waiting for a response. High latency, especially from a single host or shard, can result in degraded throughput for a TaskList as a disproportionate amount of poller time is spent attempting to start these tasks.
Limit each call to a timeout of 1s, and enforce the expiration interval via a context timeout. Currently ThrottleRetry only prevents additional retries from occurring after the expiration interval. Depending on the incoming context, it's possible for a single attempt to outlast the entire expiration interval. Particularly in the context of matching, we have long pollers with very high context timeouts.
Add new options to ThrottleRetry: one that applies a timeout to each attempt of the operation, and one that enforces the expiration interval via a context timeout.
While it would be ideal to enforce the expiration interval by default, this would be a very risky change to make all at once. There are 68 different retry policies and it's likely that in some cases we have attempts that exceed the expiration interval. We should introduce this behavior gradually, ideally with metrics/logging to flag poorly configured RetryPolicies. Adding these metrics/logging is non-trivial and will be explored separately.
What changed?
Why?
How did you test it?
Potential risks
This seems like a reasonable compromise that makes matching more resilient to latency from individual hosts/shards.
Release notes
Documentation Changes