
fix: retry inference when providers unavailable on restart #37

Open

ex3ndr-bot wants to merge 1 commit into main from fix/inference-restart-error

Conversation

ex3ndr-bot (Collaborator) commented Mar 11, 2026

Fix: Stop inference after pending tool calls on restore

Problem

When a server restarts while an agent has in-flight tool calls, completePendingToolCalls() in agent.ts calls agentLoopRun with stopAfterPendingPhase: false. This means that after finishing the pending tool calls, the loop immediately starts a new inference cycle, but providers may not be fully initialized yet (they reload asynchronously during startup). This causes "Inference error: All providers are not available" crashes.

Root Cause

In agent.ts line 1379, completePendingToolCalls passes stopAfterPendingPhase: false to agentLoopRun. The false value tells the loop to continue into inference after restoring tool calls, rather than stopping.

Fix

Single line change:

- stopAfterPendingPhase: false
+ stopAfterPendingPhase: true

When an agent is restored after server restart, it now only finishes in-flight tool calls and stops. The next user message (or scheduled trigger) will naturally start inference when providers are guaranteed to be ready.
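As a rough sketch of the control flow (a standalone model for illustration; the real agentLoopRun in ops/agentLoopRun.ts is far more involved, and the phase names here are invented):

```typescript
// Illustrative model of the agent loop's phase gating. With
// stopAfterPendingPhase: true, a restored agent replays its in-flight
// tool calls and then stops instead of starting fresh inference.
type LoopOptions = { stopAfterPendingPhase: boolean };

function agentLoopRunSketch(opts: LoopOptions, log: string[]): void {
    // Phase 1: finish tool calls that were in flight before the restart.
    log.push("pending-phase");
    if (opts.stopAfterPendingPhase) {
        // Restored agents stop here; inference starts on the next user
        // message, once providers have finished their async reload.
        return;
    }
    // Phase 2: start a new inference cycle (this is what crashed on
    // restart when no provider had registered yet).
    log.push("inference-phase");
}

const restoredLog: string[] = [];
agentLoopRunSketch({ stopAfterPendingPhase: true }, restoredLog);
console.log(restoredLog.join(",")); // pending-phase

const normalLog: string[] = [];
agentLoopRunSketch({ stopAfterPendingPhase: false }, normalLog);
console.log(normalLog.join(",")); // pending-phase,inference-phase
```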

Why not retry logic?

An earlier approach added retry/backoff in agentLoopRun.ts for provider unavailability. That treated the symptom, not the cause. The real issue is that post-restore agents shouldn't eagerly start new inference at all — they should just finish what was interrupted and wait. This is simpler, more correct, and doesn't add complexity to the inference loop.

Summary by CodeRabbit

  • Bug Fixes
    • Corrected the handling of pending phase completion to ensure the process halts at the appropriate point in the workflow rather than continuing to subsequent phases unexpectedly.


coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

In the pending-history completion flow within the agent engine, a single boolean parameter was modified: stopAfterPendingPhase was changed from false to true when calling agentLoopRun. This alters the loop's control flow to halt execution after the pending phase completes instead of proceeding to subsequent phases.

Changes

Cohort: Agent Loop Control Flow
File(s): packages/daycare/sources/engine/agents/agent.ts
Summary: Modified the stopAfterPendingPhase parameter from false to true in the pending-history completion flow to halt loop execution after pending-phase processing.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A pending phase now knows when to rest,
No longer chasing phases with zest,
One little flag, a boolean change,
Stops the agent from its range,
Simpler flows, oh how quite blessed! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check (⚠️ Warning): The PR title describes retry logic for provider unavailability, but the actual change only modifies stopAfterPendingPhase from false to true in agent loop completion. Resolution: update the title to accurately reflect the change, such as 'fix: stop agent loop after pending phase completion', or revert the change if retry logic is the intended fix.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 40.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/daycare/sources/engine/agents/ops/agentLoopRun.ts`:
- Around line 667-670: The retry backoff uses await
delay(INFERENCE_PROVIDER_RETRY_DELAY_MS) which is not abort-aware; replace it
with an abortable wait that listens to the incoming AbortSignal (e.g., create or
use an abortableDelay(ms, signal) helper or await Promise.race([delayPromise,
new Promise((_, rej) => signal.addEventListener('abort', () => rej(new
Error('aborted'))))]) ), so the loop can immediately break when signal.aborted;
update the retry points (the logger.debug call mentioning
inferenceAttempt/INFERENCE_PROVIDER_RETRY_ATTEMPTS and the subsequent await
delay call) and the other identical occurrence to use the abort-aware wait and
handle the abort by exiting the retry loop or throwing a caught AbortError.
- Around line 64-65: The retry loop is off by one: code uses inferenceAttempt <
INFERENCE_PROVIDER_RETRY_ATTEMPTS so with INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3
you get 3 calls but only 2 sleeps (total 1.0s) instead of the intended 1.5s; fix
by increasing INFERENCE_PROVIDER_RETRY_ATTEMPTS to 4 (so you get 3 sleeps *
INFERENCE_PROVIDER_RETRY_DELAY_MS = 1.5s) or alternatively change the loop
condition to use <= INFERENCE_PROVIDER_RETRY_ATTEMPTS; update the constant
INFERENCE_PROVIDER_RETRY_ATTEMPTS (and any related comments) or adjust the loop
that references inferenceAttempt to ensure attempts and sleeps match the
documented retry budget.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bc078907-bc44-4116-9adf-5d000720668a

📥 Commits

Reviewing files that changed from the base of the PR and between 2e86238 and da0982b.

📒 Files selected for processing (1)
  • packages/daycare/sources/engine/agents/ops/agentLoopRun.ts

Comment on lines +64 to +65
const INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3;
const INFERENCE_PROVIDER_RETRY_DELAY_MS = 500;

⚠️ Potential issue | 🟡 Minor

Retry budget is one attempt short.

With INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3 and inferenceAttempt < ..., this makes 3 total calls and only 2 sleeps, so the bounded wait is 1.0s rather than the 1.5s described in the PR. That still leaves a restart window where provider registration can miss the last attempt.

Suggested fix
-const INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3;
+const INFERENCE_PROVIDER_RETRIES = 3;
 const INFERENCE_PROVIDER_RETRY_DELAY_MS = 500;
...
-                    for (let inferenceAttempt = 0; inferenceAttempt < INFERENCE_PROVIDER_RETRY_ATTEMPTS; inferenceAttempt++) {
+                    for (let retry = 0; retry <= INFERENCE_PROVIDER_RETRIES; retry++) {
                         try {
                             response = await inferenceRouter.complete(context, inferenceSessionId, {
                                 providersOverride: providersForAgent,
                                 signal: abortSignal,
...
                             });
                             break; // Success - exit retry loop
                         } catch (inferenceError) {
                             if (isInferenceProviderUnavailableError(inferenceError)) {
-                                const isLastAttempt = inferenceAttempt === INFERENCE_PROVIDER_RETRY_ATTEMPTS - 1;
+                                const isLastAttempt = retry === INFERENCE_PROVIDER_RETRIES;
                                 if (isLastAttempt) {
                                     throw inferenceError;
                                 }
                                 logger.debug(
-                                    `retry: No inference provider available, retrying in ${INFERENCE_PROVIDER_RETRY_DELAY_MS}ms (attempt ${inferenceAttempt + 1}/${INFERENCE_PROVIDER_RETRY_ATTEMPTS})`
+                                    `retry: No inference provider available, retrying in ${INFERENCE_PROVIDER_RETRY_DELAY_MS}ms (retry ${retry + 1}/${INFERENCE_PROVIDER_RETRIES})`
                                 );
                                 await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS);
                                 continue;
                             }

Also applies to: 611-675
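The attempts-vs-sleeps arithmetic above can be checked with a small counter (a standalone sketch, not the daycare code):

```typescript
// For a `i < attempts` retry loop where the backoff sleep happens only
// after a failed, non-final attempt, count calls and sleeps.
function retryBudget(attempts: number): { calls: number; sleeps: number } {
    let calls = 0;
    let sleeps = 0;
    for (let i = 0; i < attempts; i++) {
        calls++; // every iteration issues one inference call
        const isLastAttempt = i === attempts - 1;
        if (!isLastAttempt) {
            sleeps++; // backoff only runs between attempts
        }
    }
    return { calls, sleeps };
}

// With attempts = 3 and a 500 ms delay: 3 calls but only 2 sleeps,
// i.e. a 1.0 s bounded wait rather than the 1.5 s described in the PR.
console.log(retryBudget(3)); // { calls: 3, sleeps: 2 }
```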


Comment on lines +667 to +670
logger.debug(
`retry: No inference provider available, retrying in ${INFERENCE_PROVIDER_RETRY_DELAY_MS}ms (attempt ${inferenceAttempt + 1}/${INFERENCE_PROVIDER_RETRY_ATTEMPTS})`
);
await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS);

⚠️ Potential issue | 🟡 Minor

Make the retry sleep abort-aware.

If cancellation happens during the backoff, await delay(...) holds the loop open until the timer fires. That can keep the typing indicator and shutdown path lagging for the full 500ms on every pending retry.

Suggested fix
-                                await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS);
+                                await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS, abortSignal);
                                 continue;
...
-function delay(ms: number): Promise<void> {
-    return new Promise((resolve) => setTimeout(resolve, ms));
+function delay(ms: number, signal?: AbortSignal): Promise<void> {
+    if (signal?.aborted) {
+        return Promise.reject(abortErrorBuild());
+    }
+    return new Promise((resolve, reject) => {
+        const onAbort = () => {
+            clearTimeout(timer);
+            reject(abortErrorBuild());
+        };
+        const timer = setTimeout(() => {
+            signal?.removeEventListener("abort", onAbort);
+            resolve();
+        }, ms);
+        signal?.addEventListener("abort", onAbort, { once: true });
+    });
 }

Also applies to: 1590-1592
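A standalone, runnable version of such an abort-aware delay (the abortErrorBuild helper in the suggestion above is project-specific; a plain Error is used here, and the demo names are illustrative):

```typescript
// Abort-aware delay: resolves after `ms`, or rejects as soon as the
// signal fires, instead of holding a retry loop open for the full timer.
function abortableDelay(ms: number, signal?: AbortSignal): Promise<void> {
    if (signal?.aborted) {
        return Promise.reject(new Error("aborted"));
    }
    return new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
            signal?.removeEventListener("abort", onAbort);
            resolve();
        }, ms);
        function onAbort(): void {
            clearTimeout(timer);
            reject(new Error("aborted"));
        }
        signal?.addEventListener("abort", onAbort, { once: true });
    });
}

// Usage: cancel mid-backoff and observe the early rejection.
async function demo(): Promise<string> {
    const controller = new AbortController();
    setTimeout(() => controller.abort(), 50); // abort well before 500 ms
    try {
        await abortableDelay(500, controller.signal);
        return "completed";
    } catch {
        return "aborted";
    }
}
demo().then((r) => console.log(r)); // aborted
```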


When an agent is restored after server restart, it should only finish
in-flight tool calls — NOT start a new inference cycle. The next user
message will naturally trigger inference when providers are guaranteed
to be ready.

Changed completePendingToolCalls() to use stopAfterPendingPhase: true,
so agents complete their pending Python execution and then wait for
the next user interaction instead of immediately triggering inference.
ex3ndr-bot force-pushed the fix/inference-restart-error branch from da0982b to 308f47b on Mar 11, 2026 at 21:40

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/daycare/sources/engine/agents/agent.ts`:
- Around line 1378-1379: The current change sets agentLoopRun into restoreOnly
(stopAfterPendingPhase: true) causing restore to finish after
completePendingToolCalls() and never resume inference; revert or remove
stopAfterPendingPhase/restoreOnly so agentLoopRun continues into the inference
phase (preserve initialPhase) and instead implement a bounded retry around
provider-availability failures inside agentLoopRun (or the provider-check code
paths hit at lines referenced in ops/agentLoopRun.ts) so the loop can survive
provider-registration races without short-circuiting the whole run.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e521a8ad-622e-4e63-a007-2a0a6fbfe0eb

📥 Commits

Reviewing files that changed from the base of the PR and between da0982b and 308f47b.

📒 Files selected for processing (1)
  • packages/daycare/sources/engine/agents/agent.ts

Comment on lines 1378 to +1379
     initialPhase,
-    stopAfterPendingPhase: false
+    stopAfterPendingPhase: true

⚠️ Potential issue | 🟠 Major

Don't stop restore before the resumed inference runs.

Line 1379 changes agentLoopRun into restoreOnly mode. In packages/daycare/sources/engine/agents/ops/agentLoopRun.ts:174-180 and :557-561, that means the loop exits with done right after the pending phase instead of continuing into the inference phase. Since completePendingToolCalls() is the only restore-time execution here, interrupted turns will now finish replaying tool history but never resume the assistant response.

If the goal is to survive provider registration races on restart, keep this path continuing into inference and add the bounded retry around the provider-availability failure instead.

Suggested fix
-            stopAfterPendingPhase: true
+            stopAfterPendingPhase: false
