fix(broker-lifecycle): reject crashed brokers via PID-alive probe
isBrokerEndpointReady() only pings the socket for 150ms. If the broker
process has crashed but its socket file lingers (unix domain) or the
listener drops without the probe noticing, the existing session is
trusted and reused — but every downstream task disconnects mid-turn
because the transport subsystem behind the socket is actually gone.
Add an isPidAlive() check that is consulted before the socket ping is
trusted. If the PID is dead, tear down and respawn.
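
For illustration, a PID-alive probe in Node can be built on signal 0,
which checks for existence without delivering anything. This is a
minimal sketch, not necessarily the isPidAlive() in this commit;
treating EPERM as "alive" is an assumption of the sketch.

```typescript
// Hypothetical sketch of a PID-alive probe; the real isPidAlive() may differ.
function isPidAlive(pid: number): boolean {
  // Reject non-integers and non-positive values: pid 0 and negative pids
  // address process groups, not a single process.
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    // Signal 0 performs the existence/permission check only; nothing is sent.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we may not signal it.
    // Anything else (typically ESRCH) is treated as "not alive".
    return (err as { code?: string }).code === "EPERM";
  }
}
```

Note that a live PID only proves some process holds that number; it says
nothing about whether that process is the broker.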
Safety around the teardown's SIGTERM:
- verifyBrokerPid() cross-checks session.pid against the on-disk
pid-file contents AND the live process command line via `ps`
(POSIX) before returning true.
- Windows intentionally returns false — tasklist exposes image name
but not command line, and matching node.exe alone is too weak to
rule out recycled-PID foreign processes. Windows rotation still
cleans up the socket and pid-file; the detached old broker eventually
exits on its own, since no new client reaches it.
- If verifyBrokerPid() returns false (e.g. stale pid-file, PID gone,
ps lookup fails), killProcess falls back to null — no signal, only
file cleanup, same as trunk behavior.
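
The checks above can be sketched roughly as follows. The names
BrokerSession, Probes, marker and pidToSignal are hypothetical and
exist only for this illustration; the actual signatures in the patch
may differ. The sketch assumes the broker can be recognised by a
substring of its command line.

```typescript
import { execFileSync } from "node:child_process";
import { readFileSync } from "node:fs";

// Hypothetical shapes, introduced for this sketch only.
interface BrokerSession { pid: number; pidFile: string; }
interface Probes {
  readPidFile(path: string): string | null;
  commandLineOf(pid: number): string | null;
}

const defaultProbes: Probes = {
  readPidFile(path) {
    try { return readFileSync(path, "utf8"); } catch { return null; }
  },
  commandLineOf(pid) {
    // Windows: no command-line lookup here, so verification fails closed.
    if (process.platform === "win32") return null;
    try {
      const out = execFileSync("ps", ["-p", String(pid), "-o", "args="], {
        encoding: "utf8",
        stdio: ["ignore", "pipe", "ignore"],
      });
      return out.trim() || null;
    } catch {
      return null; // ps failed or the PID is gone
    }
  },
};

// True only if the pid-file and the live command line both agree.
function verifyBrokerPid(
  session: BrokerSession,
  marker: string, // assumed substring identifying the broker's command line
  probes: Probes = defaultProbes,
): boolean {
  const raw = probes.readPidFile(session.pidFile);
  if (raw === null) return false;
  const text = raw.trim();
  if (!/^\d+$/.test(text)) return false;
  if (Number.parseInt(text, 10) !== session.pid) return false; // stale pid-file
  const cmd = probes.commandLineOf(session.pid);
  return cmd !== null && cmd.includes(marker);
}

// PID to signal, or null meaning "no signal, file cleanup only".
function pidToSignal(
  session: BrokerSession,
  marker: string,
  probes: Probes = defaultProbes,
): number | null {
  return verifyBrokerPid(session, marker, probes) ? session.pid : null;
}
```

Every failure path returns false, so an unverifiable PID is never
signalled.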
Age-based rotation for healthy-but-degrading brokers was considered
and dropped in this revision: rotating a still-serving broker can
interrupt a concurrent client's in-flight turn. A proper fix needs
an active health probe (e.g. lightweight RPC round-trip) or graceful
drain. Out of scope for this PR; filed as a follow-up.