Fix: removeSubscription hangs after Runloop crashes#1712
Conversation
dc5eebf to
40b316d
Compare
a31bd27 to
cb6f6c9
Compare
erikvanoosten
left a comment
There was a problem hiding this comment.
I found the problem that caused partition streams to dead lock. Please see the comments for a solution.
The new unit test fails now though. Do you want to take another look at that?
| private[internal] def offerAndAwaitCommand[E <: Throwable, A](f: Promise[E, A] => RunloopCommand): Task[A] = | ||
| runloopDone.isDone.flatMap { | ||
| case true => runloopDone.await | ||
| case false => | ||
| (Promise | ||
| .make[E, A] | ||
| .flatMap { promise => | ||
| commandQueue.offer(f(promise)) *> promise.await | ||
| }: Task[A]) | ||
| .raceFirst(runloopDone.await) | ||
| } |
There was a problem hiding this comment.
During normal (non-error) shut down, runloop does not end the streams directly, only via the unsubscribe command. Therefore, skipping the command is problematic, the tests fail on it.
| private[internal] def offerAndAwaitCommand[E <: Throwable, A](f: Promise[E, A] => RunloopCommand): Task[A] = | |
| runloopDone.isDone.flatMap { | |
| case true => runloopDone.await | |
| case false => | |
| (Promise | |
| .make[E, A] | |
| .flatMap { promise => | |
| commandQueue.offer(f(promise)) *> promise.await | |
| }: Task[A]) | |
| .raceFirst(runloopDone.await) | |
| } | |
| private[internal] def offerAndAwaitCommand[E <: Throwable, A](f: Promise[E, A] => RunloopCommand): Task[A] = | |
| runloopDone.await.raceFirst { | |
| for { | |
| promise <- Promise.make[E, A] | |
| _ <- commandQueue.offer(f(promise)) | |
| r <- promise.await | |
| } yield r | |
| } |
| // Signal runloopDone FIRST so that removeSubscription calls in partition | ||
| // stream finalizers can return immediately without offering to commandQueue. | ||
| _ <- runloopDone.failCause(cause).ignore |
There was a problem hiding this comment.
With the changes in offerAndAwaitCommand I think this should go last. Either way, with the other changes, the comment is no longer applicable.
cb6f6c9 to
e6b5499
Compare
|
Hi @erikvanoosten, thank you for finding the real root cause — the dataQueue.poll not racing with interruptionPromise was indeed the missing piece. All your suggestions have
Regarding offerAndAwaitCommand — we kept the isDone check because our live tests hang without it, even with your raceFirst-only approach. We believe the reason is that removeSubscription is called from inside ZIO.addFinalizer, which runs uninterruptibly. When runloopDone.await wins the race, raceFirst interrupts the right fiber and waits for its finalizers. But the right fiber is inside an uninterruptible context, so the interrupt signal is masked and promise.await blocks forever — raceFirst never returns. The isDone check avoids this entirely: if the runloop is already done, we never start the right fiber, so there is nothing to interrupt. Regarding the unit test — the RunloopSpec test removeSubscription does not hang after Runloop crashes passes with all your changes applied. The test runs outside a finalizer context so it does not hit the uninterruptible issue. |
|
@acrow Thanks for that analysis. I agree that the problem is that removeSubscription is called from inside ZIO.addFinalizer. I experimented with wrapping the call with // RunloopAccess subscribe
_ <- ZIO.addFinalizer {
ZIO.interruptible {
withRunloopZIO(requireRunning = false)(_.removeSubscription(subscription).ignore) <*
diagnostics.emit(DiagnosticEvent.SubscriptionFinalized)
}
}and that works a lot better, but there are still failing tests. In addition, it doesn't solve the race condition. The race condition can occur because Thinking again, this brings other problems... We are just moving the race condition; how do we make sure the runloop actually processes the queued commands? This might work:
Note: with this the race condition doesn't go away, we just make it less likely to result in a problem. Step 2 should be done last inside The only way to really make the race condition go away is by having 2-way communication between We'll need to do step 2. also at end of the runloop (the second place where we call WDYT? A side note: private def shouldPoll(state: State): UIO[Boolean] =
for {
runloopIsDone <- runloopDone.isDone
pendingCommitCount <- committer.pendingCommitCount
} yield !runloopIsDone &&
state.subscriptionState.isSubscribed &&
(state.pendingRequests.nonEmpty || pendingCommitCount > 0 || state.assignedStreams.isEmpty) |
|
Hi @erikvanoosten, thank you for the detailed analysis and the suggestions. Both changes have been applied in the latest commit:
We tried to write a unit test that specifically exercises the race window (command queued while catchAllCause is running), but the window is too narrow to hit reliably in a unit test without injecting artificial delays into catchAllCause. The existing test removeSubscription does not hang after Runloop crashes still passes and validates the overall no-hang property. Please let us know if you'd like a different test approach. |
|
I finally got some time again (not much though) to look at this PR. Did you run the unit tests? When I run them on my laptop some of the test time out. I tried to debug the issue but I have not found the problem yet. Do the tests ( |
092bbbb to
16f3c8f
Compare
When the Runloop crashes, scope finalizers for active subscriptions call removeSubscription -> offerAndAwaitCommand. The commandQueue may be concurrently shutting down, causing commandQueue.offer to block and leaving removeSubscription hung indefinitely. This in turn blocks flatMapPar from completing its cleanup, so consumeWith never returns. Fix: add runloopDone: Promise[Throwable, Nothing] that is completed (failed with the runloop cause) when the runloop exits. offerAndAwaitCommand checks runloopDone.isDone first — if done, it skips the offer entirely and returns runloopDone.await directly. For the not-yet-done path it races promise.await against runloopDone.await so callers unblock as soon as the runloop exits regardless of which happens first. Also adds dataQueue.offer(Exit.fail) in halt() so partition stream fibers that are between polls detect the failure on their next poll cycle. RunloopAccess: removeSubscription.ignore instead of .orDie since failure is expected when the runloop has already crashed.
16f3c8f to
e2a00a4
Compare
…msBySubscription hang tests
|
Oh smart! You swapped the place where I have some small tweaks standing by. I'll try to find time later today to send those over. |
|
Hi @erikvanoosten, yes we found the issue and have pushed a fix. The root cause was raceFirst in offerAndAwaitCommand. When runloopDone.await wins the race, raceFirst tries to interrupt the losing fiber (promise.await). But We replaced raceFirst with a two-pronged approach:
There is still a narrow race window between commandQueue.offer and the isDone check, but it is effectively closed because runloopDone.failCause is called before drainCommandQueue in catchAllCause — so either drainCommandQueue sees the command, or the isDone check fires. We also added two simulation tests to RunloopSpec that wrap removeSubscription and endStreamsBySubscription in ZIO.uninterruptible, replicating the ZIO.addFinalizer context. Results: all 8 RunloopSpec tests and all 44 ConsumerSpec integration tests pass. |
Problem
When the Runloop crashes (e.g. after TopicAuthorizationException exhausts retries), scope finalizers for active subscriptions call removeSubscription →
offerAndAwaitCommand. At this point the Runloop.make scope finalizer may be concurrently shutting down the commandQueue. This causes commandQueue.offer to
block, leaving removeSubscription hung indefinitely. In turn, flatMapPar waits for the inner fiber to finish, runDrain never returns, and Consumer.consumeWith hangs forever.
Fix
Add runloopDone: Promise[Throwable, Nothing] that is completed (failed with the runloop cause) when the runloop exits. offerAndAwaitCommand checks
runloopDone.isDone first:
Also adds dataQueue.offer(Exit.fail(cause)) in halt() so partition stream fibers that are between polls detect the failure on their next poll cycle.
RunloopAccess: changed removeSubscription.orDie to .ignore since failure is expected when the runloop has already crashed.
Tests
Added RunloopSpec test: removeSubscription does not hang after Runloop crashes — verifies that removeSubscription completes promptly (does not time out) after the runloop crashes.