You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
fix: add captureTurn transport watchdog and runTrackedJob hard timeout
Completes the defense-in-depth strategy for stuck jobs. The plugin's
job state machine is single-writer: only runTrackedJob writes terminal
status. But that writer sits behind a chain of async promises
(transport -> captureTurn -> runner), and any link in that chain can
break silently, freezing state.json at status:"running" forever.
Previous commits handled the read side (dead-PID reconciliation with
TOCTOU guards) and UX transparency (idle surfacing, relative log
timestamps). This commit closes the write side with two additional
layers:
Layer 1 — captureTurn transport watchdog (lib/codex.mjs):
state.completion resolves only via turn/completed or an inferred
250ms final_answer timer. If the app-server (direct or via broker)
disconnects before either signal, the promise hangs forever. Race
it against client.exitPromise: on transport close, treat a seen
final_answer as inferred success, otherwise reject with a clear
"app-server disconnected before the turn completed" error. This
covers the captureTurn-hang failure mode reported in the wild where
codex CLI exits cleanly (exit 0) without sending turn/completed and
the companion waits forever.
Layer 2 — runTrackedJob hard timeout (lib/tracked-jobs.mjs):
Even if every inner watchdog fails, Promise.race the runner against
a configurable hard cap (default 15m, overridable via
CODEX_JOB_TIMEOUT_MS env or per-call timeoutMs). On timeout, write
status:"failed" with timedOut:true and a descriptive error. The job
record also carries timeoutAt so /codex:status can render
"hard timeout in 59s" when a job is also flagged staleLog —
telling the user the escape hatch is imminent instead of leaving
them guessing how long to wait.
Combined coverage:
- worker process dies silently -> dead-PID reconciliation
- app-server disconnects mid-turn -> captureTurn watchdog
- runner hangs in any other way -> hard timeout
- user can always see what happened -> UX layer
Refs upstream openai#243, openai#184, openai#222, openai#183, openai#164.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0 commit comments