
fix: retry inference when providers unavailable on restart #37

Open

ex3ndr-bot wants to merge 1 commit into main from fix/inference-restart-error

Conversation

ex3ndr-bot (Collaborator) commented Mar 11, 2026

Fix: Stop inference after pending tool calls on restore

Problem

When a server restarts while an agent has in-flight tool calls, completePendingToolCalls() in agent.ts calls agentLoopRun with stopAfterPendingPhase: false. This means that after finishing the pending tool calls, the loop immediately starts a new inference cycle, but providers may not be fully initialized yet (they reload asynchronously during startup). This causes "Inference error: All providers are not available" crashes.

Root Cause

In agent.ts line 1379, completePendingToolCalls passes stopAfterPendingPhase: false to agentLoopRun. The false value tells the loop to continue into inference after restoring tool calls, rather than stopping.

Fix

Single line change:

- stopAfterPendingPhase: false
+ stopAfterPendingPhase: true

When an agent is restored after server restart, it now only finishes in-flight tool calls and stops. The next user message (or scheduled trigger) will naturally start inference when providers are guaranteed to be ready.
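As a rough sketch of the control flow (a standalone model for illustration; the real agentLoopRun in ops/agentLoopRun.ts is far more involved, and the phase names here are invented):

```typescript
// Illustrative model of the agent loop's phase gating. With
// stopAfterPendingPhase: true, a restored agent replays its in-flight
// tool calls and then stops instead of starting fresh inference.
type LoopOptions = { stopAfterPendingPhase: boolean };

function agentLoopRunSketch(opts: LoopOptions, log: string[]): void {
    // Phase 1: finish tool calls that were in flight before the restart.
    log.push("pending-phase");
    if (opts.stopAfterPendingPhase) {
        // Restored agents stop here; inference starts on the next user
        // message, once providers have finished their async reload.
        return;
    }
    // Phase 2: start a new inference cycle (this is what crashed on
    // restart when no provider had registered yet).
    log.push("inference-phase");
}

const restoredLog: string[] = [];
agentLoopRunSketch({ stopAfterPendingPhase: true }, restoredLog);
console.log(restoredLog.join(",")); // pending-phase

const normalLog: string[] = [];
agentLoopRunSketch({ stopAfterPendingPhase: false }, normalLog);
console.log(normalLog.join(",")); // pending-phase,inference-phase
```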

Why not retry logic?

An earlier approach added retry/backoff in agentLoopRun.ts for provider unavailability. That treated the symptom, not the cause. The real issue is that post-restore agents shouldn't eagerly start new inference at all — they should just finish what was interrupted and wait. This is simpler, more correct, and doesn't add complexity to the inference loop.

Summary by CodeRabbit

  • Bug Fixes
    • Corrected the handling of pending phase completion to ensure the process halts at the appropriate point in the workflow rather than continuing to subsequent phases unexpectedly.


coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

In the pending-history completion flow within the agent engine, a single boolean parameter was modified: stopAfterPendingPhase was changed from false to true when calling agentLoopRun. This alters the loop's control flow to halt execution after the pending phase completes instead of proceeding to subsequent phases.

Changes

Cohort: Agent Loop Control Flow
File(s): packages/daycare/sources/engine/agents/agent.ts
Summary: Modified the stopAfterPendingPhase parameter from false to true in the pending-history completion flow to halt loop execution after pending-phase processing.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A pending phase now knows when to rest,
No longer chasing phases with zest,
One little flag, a boolean change,
Stops the agent from its range,
Simpler flows, oh how quite blessed! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check (⚠️ Warning): The PR title describes retry logic for provider unavailability, but the actual change only modifies stopAfterPendingPhase from false to true in agent loop completion. Resolution: update the title to accurately reflect the change, such as 'fix: stop agent loop after pending phase completion', or revert the change if retry logic is the intended fix.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 40.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/daycare/sources/engine/agents/ops/agentLoopRun.ts`:
- Around line 667-670: The retry backoff uses await
delay(INFERENCE_PROVIDER_RETRY_DELAY_MS) which is not abort-aware; replace it
with an abortable wait that listens to the incoming AbortSignal (e.g., create or
use an abortableDelay(ms, signal) helper or await Promise.race([delayPromise,
new Promise((_, rej) => signal.addEventListener('abort', () => rej(new
Error('aborted'))))]) ), so the loop can immediately break when signal.aborted;
update the retry points (the logger.debug call mentioning
inferenceAttempt/INFERENCE_PROVIDER_RETRY_ATTEMPTS and the subsequent await
delay call) and the other identical occurrence to use the abort-aware wait and
handle the abort by exiting the retry loop or throwing a caught AbortError.
- Around line 64-65: The retry loop is off by one: code uses inferenceAttempt <
INFERENCE_PROVIDER_RETRY_ATTEMPTS so with INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3
you get 3 calls but only 2 sleeps (total 1.0s) instead of the intended 1.5s; fix
by increasing INFERENCE_PROVIDER_RETRY_ATTEMPTS to 4 (so you get 3 sleeps *
INFERENCE_PROVIDER_RETRY_DELAY_MS = 1.5s) or alternatively change the loop
condition to use <= INFERENCE_PROVIDER_RETRY_ATTEMPTS; update the constant
INFERENCE_PROVIDER_RETRY_ATTEMPTS (and any related comments) or adjust the loop
that references inferenceAttempt to ensure attempts and sleeps match the
documented retry budget.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bc078907-bc44-4116-9adf-5d000720668a

📥 Commits

Reviewing files that changed from the base of the PR and between 2e86238 and da0982b.

📒 Files selected for processing (1)
  • packages/daycare/sources/engine/agents/ops/agentLoopRun.ts

Comment on lines +64 to +65
const INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3;
const INFERENCE_PROVIDER_RETRY_DELAY_MS = 500;

⚠️ Potential issue | 🟡 Minor

Retry budget is one attempt short.

With INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3 and inferenceAttempt < ..., this makes 3 total calls and only 2 sleeps, so the bounded wait is 1.0s rather than the 1.5s described in the PR. That still leaves a restart window where provider registration can miss the last attempt.

Suggested fix
-const INFERENCE_PROVIDER_RETRY_ATTEMPTS = 3;
+const INFERENCE_PROVIDER_RETRIES = 3;
 const INFERENCE_PROVIDER_RETRY_DELAY_MS = 500;
...
-                    for (let inferenceAttempt = 0; inferenceAttempt < INFERENCE_PROVIDER_RETRY_ATTEMPTS; inferenceAttempt++) {
+                    for (let retry = 0; retry <= INFERENCE_PROVIDER_RETRIES; retry++) {
                         try {
                             response = await inferenceRouter.complete(context, inferenceSessionId, {
                                 providersOverride: providersForAgent,
                                 signal: abortSignal,
...
                             });
                             break; // Success - exit retry loop
                         } catch (inferenceError) {
                             if (isInferenceProviderUnavailableError(inferenceError)) {
-                                const isLastAttempt = inferenceAttempt === INFERENCE_PROVIDER_RETRY_ATTEMPTS - 1;
+                                const isLastAttempt = retry === INFERENCE_PROVIDER_RETRIES;
                                 if (isLastAttempt) {
                                     throw inferenceError;
                                 }
                                 logger.debug(
-                                    `retry: No inference provider available, retrying in ${INFERENCE_PROVIDER_RETRY_DELAY_MS}ms (attempt ${inferenceAttempt + 1}/${INFERENCE_PROVIDER_RETRY_ATTEMPTS})`
+                                    `retry: No inference provider available, retrying in ${INFERENCE_PROVIDER_RETRY_DELAY_MS}ms (retry ${retry + 1}/${INFERENCE_PROVIDER_RETRIES})`
                                 );
                                 await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS);
                                 continue;
                             }

Also applies to: 611-675
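The attempts-vs-sleeps arithmetic above can be checked with a small counter (a standalone sketch, not the daycare code):

```typescript
// For a `i < attempts` retry loop where the backoff sleep happens only
// after a failed, non-final attempt, count calls and sleeps.
function retryBudget(attempts: number): { calls: number; sleeps: number } {
    let calls = 0;
    let sleeps = 0;
    for (let i = 0; i < attempts; i++) {
        calls++; // every iteration issues one inference call
        const isLastAttempt = i === attempts - 1;
        if (!isLastAttempt) {
            sleeps++; // backoff only runs between attempts
        }
    }
    return { calls, sleeps };
}

// With attempts = 3 and a 500 ms delay: 3 calls but only 2 sleeps,
// i.e. a 1.0 s bounded wait rather than the 1.5 s described in the PR.
console.log(retryBudget(3)); // { calls: 3, sleeps: 2 }
```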


Comment on lines +667 to +670
logger.debug(
`retry: No inference provider available, retrying in ${INFERENCE_PROVIDER_RETRY_DELAY_MS}ms (attempt ${inferenceAttempt + 1}/${INFERENCE_PROVIDER_RETRY_ATTEMPTS})`
);
await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS);

⚠️ Potential issue | 🟡 Minor

Make the retry sleep abort-aware.

If cancellation happens during the backoff, await delay(...) holds the loop open until the timer fires. That can keep the typing indicator and shutdown path lagging for the full 500ms on every pending retry.

Suggested fix
-                                await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS);
+                                await delay(INFERENCE_PROVIDER_RETRY_DELAY_MS, abortSignal);
                                 continue;
...
-function delay(ms: number): Promise<void> {
-    return new Promise((resolve) => setTimeout(resolve, ms));
+function delay(ms: number, signal?: AbortSignal): Promise<void> {
+    if (signal?.aborted) {
+        return Promise.reject(abortErrorBuild());
+    }
+    return new Promise((resolve, reject) => {
+        const onAbort = () => {
+            clearTimeout(timer);
+            reject(abortErrorBuild());
+        };
+        const timer = setTimeout(() => {
+            signal?.removeEventListener("abort", onAbort);
+            resolve();
+        }, ms);
+        signal?.addEventListener("abort", onAbort, { once: true });
+    });
 }

Also applies to: 1590-1592
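A standalone, runnable version of such an abort-aware delay (the abortErrorBuild helper in the suggestion above is project-specific; a plain Error is used here, and the demo names are illustrative):

```typescript
// Abort-aware delay: resolves after `ms`, or rejects as soon as the
// signal fires, instead of holding a retry loop open for the full timer.
function abortableDelay(ms: number, signal?: AbortSignal): Promise<void> {
    if (signal?.aborted) {
        return Promise.reject(new Error("aborted"));
    }
    return new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
            signal?.removeEventListener("abort", onAbort);
            resolve();
        }, ms);
        function onAbort(): void {
            clearTimeout(timer);
            reject(new Error("aborted"));
        }
        signal?.addEventListener("abort", onAbort, { once: true });
    });
}

// Usage: cancel mid-backoff and observe the early rejection.
async function demo(): Promise<string> {
    const controller = new AbortController();
    setTimeout(() => controller.abort(), 50); // abort well before 500 ms
    try {
        await abortableDelay(500, controller.signal);
        return "completed";
    } catch {
        return "aborted";
    }
}
demo().then((r) => console.log(r)); // aborted
```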


When an agent is restored after server restart, it should only finish
in-flight tool calls — NOT start a new inference cycle. The next user
message will naturally trigger inference when providers are guaranteed
to be ready.

Changed completePendingToolCalls() to use stopAfterPendingPhase: true,
so agents complete their pending Python execution and then wait for
the next user interaction instead of immediately triggering inference.
ex3ndr-bot force-pushed the fix/inference-restart-error branch from da0982b to 308f47b on Mar 11, 2026 at 21:40

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/daycare/sources/engine/agents/agent.ts`:
- Around line 1378-1379: The current change sets agentLoopRun into restoreOnly
(stopAfterPendingPhase: true) causing restore to finish after
completePendingToolCalls() and never resume inference; revert or remove
stopAfterPendingPhase/restoreOnly so agentLoopRun continues into the inference
phase (preserve initialPhase) and instead implement a bounded retry around
provider-availability failures inside agentLoopRun (or the provider-check code
paths hit at lines referenced in ops/agentLoopRun.ts) so the loop can survive
provider-registration races without short-circuiting the whole run.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e521a8ad-622e-4e63-a007-2a0a6fbfe0eb

📥 Commits

Reviewing files that changed from the base of the PR and between da0982b and 308f47b.

📒 Files selected for processing (1)
  • packages/daycare/sources/engine/agents/agent.ts

Comment on lines 1378 to +1379
     initialPhase,
-    stopAfterPendingPhase: false
+    stopAfterPendingPhase: true

⚠️ Potential issue | 🟠 Major

Don't stop restore before the resumed inference runs.

Line 1379 changes agentLoopRun into restoreOnly mode. In packages/daycare/sources/engine/agents/ops/agentLoopRun.ts:174-180 and :557-561, that means the loop exits with done right after the pending phase instead of continuing into the inference phase. Since completePendingToolCalls() is the only restore-time execution here, interrupted turns will now finish replaying tool history but never resume the assistant response.

If the goal is to survive provider registration races on restart, keep this path continuing into inference and add the bounded retry around the provider-availability failure instead.

Suggested fix
-            stopAfterPendingPhase: true
+            stopAfterPendingPhase: false
