fix: retry inference when providers unavailable on restart#37
fix: retry inference when providers unavailable on restart#37ex3ndr-bot wants to merge 1 commit intomainfrom
Conversation
📝 WalkthroughWalkthroughIn the pending-history completion flow within the agent engine, a single boolean parameter was modified: Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/daycare/sources/engine/agents/ops/agentLoopRun.ts`:
- Around line 667-670: The retry backoff uses await
delay(INFERENCE_PROVIDER_RETRY_DELAY_MS) which is not abort-aware; replace it
with an abortable wait that listens to the incoming AbortSignal (e.g., create or
use an abortableDelay(ms, signal) helper or await Promise.race([delayPromise,
new Promise((_, rej) => signal.addEventListener('abort', () => rej(new
Error('aborted'))))]) ), so the loop can immediately break when signal.aborted;
update the retry points (the logger.debug call mentioning
inferenceAttempt/INFERENCE_PROVIDER_RETRY_ATTEMPTS and the subsequent await
delay call) and the other identical occurrence to use the abort-aware wait and
handle the abort by exiting the retry loop or throwing a caught AbortError.
- Around line 64-65: The retry loop is off by one: code uses inferenceAttempt <
INFERENCE_PROVIDER_RETRY_ATTEMPTS so with INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3
you get 3 calls but only 2 sleeps (total 1.0s) instead of the intended 1.5s; fix
by increasing INFERENCE_PROVIDER_RETRY_ATTEMPTS to 4 (so you get 3 sleeps *
INFERENCE_PROVIDER_RETRY_DELAY_MS = 1.5s) or alternatively change the loop
condition to use <= INFERENCE_PROVIDER_RETRY_ATTEMPTS; update the constant
INFERENCE_PROVIDER_RETRY_ATTEMPTS (and any related comments) or adjust the loop
that references inferenceAttempt to ensure attempts and sleeps match the
documented retry budget.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bc078907-bc44-4116-9adf-5d000720668a
📒 Files selected for processing (1)
packages/daycare/sources/engine/agents/ops/agentLoopRun.ts
| const INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3; | ||
| const INFERENCE_PROVIDER_RETRY_DELAY_MS = 500; |
There was a problem hiding this comment.
Retry budget is one attempt short.
With INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3 and inferenceAttempt < ..., this makes 3 total calls and only 2 sleeps, so the bounded wait is 1.0s rather than the 1.5s described in the PR. That still leaves a restart window where provider registration can miss the last attempt.
Suggested fix
-const INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3;
+const INFERENCE_PROVIDER_RETRIES = 3;
const INFERENCE_PROVIDER_RETRY_DELAY_MS = 500;
...
- for (let inferenceAttempt = 0; inferenceAttempt < INFERENCE_PROVIDER_RETRY_ATTEMPTS; inferenceAttempt++) {
+ for (let retry = 0; retry <= INFERENCE_PROVIDER_RETRIES; retry++) {
try {
response = await inferenceRouter.complete(context, inferenceSessionId, {
providersOverride: providersForAgent,
signal: abortSignal,
...
});
break; // Success - exit retry loop
} catch (inferenceError) {
if (isInferenceProviderUnavailableError(inferenceError)) {
- const isLastAttempt = inferenceAttempt === INFERENCE_PROVIDER_RETRY_ATTEMPTS - 1;
+ const isLastAttempt = retry === INFERENCE_PROVIDER_RETRIES;
if (isLastAttempt) {
throw inferenceError;
}
logger.debug(
- `retry: No inference provider available, retrying in ${INFERENCE_PROVIDER_RETRY_DELAY_MS}ms (attempt ${inferenceAttempt + 1}/${INFERENCE_PROVIDER_RETRY_ATTEMPTS})`
+ `retry: No inference provider available, retrying in ${INFERENCE_PROVIDER_RETRY_DELAY_MS}ms (retry ${retry + 1}/${INFERENCE_PROVIDER_RETRIES})`
);
await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS);
continue;
}Also applies to: 611-675
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/daycare/sources/engine/agents/ops/agentLoopRun.ts` around lines 64 -
65, The retry loop is off by one: code uses inferenceAttempt <
INFERENCE_PROVIDER_RETRY_ATTEMPTS so with INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3
you get 3 calls but only 2 sleeps (total 1.0s) instead of the intended 1.5s; fix
by increasing INFERENCE_PROVIDER_RETRY_ATTEMPTS to 4 (so you get 3 sleeps *
INFERENCE_PROVIDER_RETRY_DELAY_MS = 1.5s) or alternatively change the loop
condition to use <= INFERENCE_PROVIDER_RETRY_ATTEMPTS; update the constant
INFERENCE_PROVIDER_RETRY_ATTEMPTS (and any related comments) or adjust the loop
that references inferenceAttempt to ensure attempts and sleeps match the
documented retry budget.
| logger.debug( | ||
| `retry: No inference provider available, retrying in ${INFERENCE_PROVIDER_RETRY_DELAY_MS}ms (attempt ${inferenceAttempt + 1}/${INFERENCE_PROVIDER_RETRY_ATTEMPTS})` | ||
| ); | ||
| await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS); |
There was a problem hiding this comment.
Make the retry sleep abort-aware.
If cancellation happens during the backoff, await delay(...) holds the loop open until the timer fires. That can keep the typing indicator and shutdown path lagging for the full 500ms on every pending retry.
Suggested fix
- await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS);
+ await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS, abortSignal);
continue;
...
-function delay(ms: number): Promise<void> {
- return new Promise((resolve) => setTimeout(resolve, ms));
+function delay(ms: number, signal?: AbortSignal): Promise<void> {
+ if (signal?.aborted) {
+ return Promise.reject(abortErrorBuild());
+ }
+ return new Promise((resolve, reject) => {
+ const onAbort = () => {
+ clearTimeout(timer);
+ reject(abortErrorBuild());
+ };
+ const timer = setTimeout(() => {
+ signal?.removeEventListener("abort", onAbort);
+ resolve();
+ }, ms);
+ signal?.addEventListener("abort", onAbort, { once: true });
+ });
}Also applies to: 1590-1592
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/daycare/sources/engine/agents/ops/agentLoopRun.ts` around lines 667
- 670, The retry backoff uses await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS)
which is not abort-aware; replace it with an abortable wait that listens to the
incoming AbortSignal (e.g., create or use an abortableDelay(ms, signal) helper
or await Promise.race([delayPromise, new Promise((_, rej) =>
signal.addEventListener('abort', () => rej(new Error('aborted'))))]) ), so the
loop can immediately break when signal.aborted; update the retry points (the
logger.debug call mentioning inferenceAttempt/INFERENCE_PROVIDER_RETRY_ATTEMPTS
and the subsequent await delay call) and the other identical occurrence to use
the abort-aware wait and handle the abort by exiting the retry loop or throwing
a caught AbortError.
When an agent is restored after server restart, it should only finish in-flight tool calls — NOT start a new inference cycle. The next user message will naturally trigger inference when providers are guaranteed to be ready. Changed completePendingToolCalls() to use stopAfterPendingPhase: true, so agents complete their pending Python execution and then wait for the next user interaction instead of immediately triggering inference.
da0982b to
308f47b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/daycare/sources/engine/agents/agent.ts`:
- Around line 1378-1379: The current change sets agentLoopRun into restoreOnly
(stopAfterPendingPhase: true) causing restore to finish after
completePendingToolCalls() and never resume inference; revert or remove
stopAfterPendingPhase/restoreOnly so agentLoopRun continues into the inference
phase (preserve initialPhase) and instead implement a bounded retry around
provider-availability failures inside agentLoopRun (or the provider-check code
paths hit at lines referenced in ops/agentLoopRun.ts) so the loop can survive
provider-registration races without short-circuiting the whole run.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e521a8ad-622e-4e63-a007-2a0a6fbfe0eb
📒 Files selected for processing (1)
packages/daycare/sources/engine/agents/agent.ts
| initialPhase, | ||
| stopAfterPendingPhase: false | ||
| stopAfterPendingPhase: true |
There was a problem hiding this comment.
Don't stop restore before the resumed inference runs.
Line 1379 changes agentLoopRun into restoreOnly mode. In packages/daycare/sources/engine/agents/ops/agentLoopRun.ts:174-180 and :557-561, that means the loop exits with done right after the pending phase instead of continuing into the inference phase. Since completePendingToolCalls() is the only restore-time execution here, interrupted turns will now finish replaying tool history but never resume the assistant response.
If the goal is to survive provider registration races on restart, keep this path continuing into inference and add the bounded retry around the provider-availability failure instead.
Suggested fix
- stopAfterPendingPhase: true
+ stopAfterPendingPhase: false📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| initialPhase, | |
| stopAfterPendingPhase: false | |
| stopAfterPendingPhase: true | |
| initialPhase, | |
| stopAfterPendingPhase: false |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/daycare/sources/engine/agents/agent.ts` around lines 1378 - 1379,
The current change sets agentLoopRun into restoreOnly (stopAfterPendingPhase:
true) causing restore to finish after completePendingToolCalls() and never
resume inference; revert or remove stopAfterPendingPhase/restoreOnly so
agentLoopRun continues into the inference phase (preserve initialPhase) and
instead implement a bounded retry around provider-availability failures inside
agentLoopRun (or the provider-check code paths hit at lines referenced in
ops/agentLoopRun.ts) so the loop can survive provider-registration races without
short-circuiting the whole run.
Fix: Stop inference after pending tool calls on restore
Problem
When a server restarts while an agent has in-flight tool calls,
completePendingToolCalls()inagent.tscallsagentLoopRunwithstopAfterPendingPhase: false. This means after finishing the pending tool calls, it immediately starts a new inference cycle — but providers may not be fully initialized yet (they reload asynchronously during startup). This causesInference error: All providers are not availablecrashes.Root Cause
In
agent.tsline 1379,completePendingToolCallspassesstopAfterPendingPhase: falsetoagentLoopRun. Thefalsevalue tells the loop to continue into inference after restoring tool calls, rather than stopping.Fix
Single line change:
When an agent is restored after server restart, it now only finishes in-flight tool calls and stops. The next user message (or scheduled trigger) will naturally start inference when providers are guaranteed to be ready.
Why not retry logic?
An earlier approach added retry/backoff in
agentLoopRun.tsfor provider unavailability. That treated the symptom, not the cause. The real issue is that post-restore agents shouldn't eagerly start new inference at all — they should just finish what was interrupted and wait. This is simpler, more correct, and doesn't add complexity to the inference loop.Summary by CodeRabbit