Transport appears to be deadlocked and not able to send messages when one publisher confirm is lost #1768

@SzymonPobiega

Describe the bug

Description

In rare cases, an outgoing Publish call never completes and the application appears to be waiting “forever” for the publish to be acknowledged/confirmed. This is most visible when publishing via a long-lived running endpoint instance (static IEndpointInstance / IMessageSession) rather than via the IMessageHandlerContext passed into a handler.

This looks like the transport can enter a state where it is waiting on a publisher confirm (or a confirmation-tracking task) that never completes, with no transport-level timeout to fail fast and recycle the channel.

Suspected root cause

  1. The RabbitMQ transport uses a channel with publisher confirms and confirmation tracking enabled.
  2. An outgoing publish operation awaits a task that is expected to complete when the broker confirms the publish.
  3. In rare cases (e.g. packet loss, broker edge case), that confirmation completion may never happen even though the TCP connection remains open.
  4. Because there is no bounded timeout, the Task returned from publish can remain pending indefinitely.
  5. Because the channel is shared/long-lived, a single stuck confirmation can cause long-lived endpoint API calls to hang “forever.”
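The sequence above can be illustrated with a minimal sketch. It is written in Python with asyncio purely for illustration and is not the transport's code: `ConfirmTracker`, `publish`, and `on_ack` are hypothetical names, and the "broker" is simulated by calling `on_ack` for only one of two publishes.

```python
import asyncio


class ConfirmTracker:
    """Hypothetical stand-in for confirmation tracking on a shared channel."""

    def __init__(self):
        self._next_seq = 1
        self._pending = {}  # sequence number -> future awaiting a broker confirm

    def publish(self):
        # Register a future that completes only when a confirm arrives.
        seq = self._next_seq
        self._next_seq += 1
        future = asyncio.get_running_loop().create_future()
        self._pending[seq] = future
        return seq, future

    def on_ack(self, seq):
        # Called when the broker confirms the publish with this sequence number.
        future = self._pending.pop(seq, None)
        if future is not None and not future.done():
            future.set_result(None)


async def demo():
    tracker = ConfirmTracker()
    seq_ok, confirmed = tracker.publish()
    _seq_lost, lost = tracker.publish()

    tracker.on_ack(seq_ok)  # the confirm arrives for the first publish only
    await confirmed         # completes immediately

    # The second confirm is lost. Awaiting `lost` directly would never return,
    # so the demo waits only briefly to show that the future is still pending.
    _done, pending = await asyncio.wait({lost}, timeout=0.1)
    return confirmed.done(), lost in pending


result = asyncio.run(demo())
```

In this sketch nothing ever completes the second future, which mirrors step 4: without a bounded wait, any caller awaiting it stays suspended for as long as the process runs.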

Even if RabbitMQ itself is reliable, over time “rare” edge cases eventually occur (especially when dispatching many messages).

Impact / severity

  • Causes application threads to deadlock/starve waiting on Publish/Send.
  • Service stays online but blocked from doing any work until restarted.
  • This is particularly problematic for our high-throughput systems where “rare indefinite hang” is unacceptable; we need a way to fail fast and let recoverability kick in.

Expected behavior

The transport should throw an exception if a publisher confirm does not arrive within a configured period.
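A minimal sketch of that fail-fast behavior, again in Python with asyncio and not taken from the transport: `await_confirm` and `ConfirmTimeoutError` are hypothetical names, and the timeout values are arbitrary.

```python
import asyncio


class ConfirmTimeoutError(Exception):
    """Hypothetical exception raised when a confirm does not arrive in time."""


async def await_confirm(confirm, timeout_seconds):
    # Bound the wait so a lost confirm surfaces as an error instead of a hang.
    # On timeout, asyncio.wait_for cancels the awaited future before raising.
    try:
        await asyncio.wait_for(confirm, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        raise ConfirmTimeoutError(
            f"no publisher confirm within {timeout_seconds} s"
        ) from None


async def demo():
    loop = asyncio.get_running_loop()

    arriving = loop.create_future()
    loop.call_later(0.01, arriving.set_result, None)  # confirm arrives in time
    await await_confirm(arriving, timeout_seconds=1.0)

    lost = loop.create_future()  # never completed: simulates a lost confirm
    try:
        await await_confirm(lost, timeout_seconds=0.05)
    except ConfirmTimeoutError as error:
        return str(error)
    return "no error"


outcome = asyncio.run(demo())
```

With a bounded wait like this, the caller gets an exception it can handle, for example by recycling the channel and letting recoverability retry the operation.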

Actual behavior

The endpoint appears unresponsive.

Versions

All versions up to and including 8.x. Versions starting from 9.0 are not affected because the transport uses the SDK's async APIs.

Steps to reproduce

Use the RabbitMQ transport with a somewhat flaky connection to the broker so that some publisher confirms go missing.

Relevant log output

Additional Information

Workarounds

Possible solutions

Additional information
