bug(node): misleading ERROR log when outbound dial races inbound auth at startup #177

Description

@kitplummer

Symptom

When two nodes both have each other in PEAT_NODE_PEERS, both dial simultaneously at startup. The first node to try (say node-a) exhausts its 3 outbound retry attempts before the peer's iroh endpoint is ready and logs:

ERROR peat_node: failed to connect to peer: connect_and_authenticate failed after 3 attempts: failed to connect to peer: timed out peer="<id>@host:port"

Immediately after, the peer's simultaneous inbound dial arrives, authenticates via formation key, and the connection succeeds:

INFO peat_node::node: connected to peer peer="<id>"

File delivery proceeds normally. The ERROR is a red herring.

Root cause

connect_and_authenticate makes 3 attempts and then logs at ERROR, regardless of whether the peer is already connected (or about to connect) via an inbound dial. The attempt budget runs out before the peer's endpoint is fully bound, but because PEAT_NODE_PEERS is configured bidirectionally, the connection still lands from the other direction.

Fix

Downgrade the "exhausted retries" log from ERROR to WARN (or DEBUG) when the peer connects via inbound within a short window after the last failed attempt, which implies deferring the log until that window has elapsed. Alternatively, check connected_peers before logging ERROR: if the peer is already authenticated, the dial failure is informational, not an error.
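Both options can be sketched with the standard library only. This is a rough illustration, not the actual peat_node code: `DialFailureLevel`, `dial_failure_level`, and `level_after_grace` are hypothetical names, and `connected_peers` is assumed here to be a set of peer ID strings behind a `Mutex`, which may not match the real type.

```rust
use std::collections::HashSet;
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

/// Severity for the "exhausted retries" message (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DialFailureLevel {
    Warn,
    Error,
}

/// Option 2: check the connected set once, right before logging.
/// If the peer is already connected, the failed dial is only informational.
pub fn dial_failure_level(connected_peers: &HashSet<String>, peer_id: &str) -> DialFailureLevel {
    if connected_peers.contains(peer_id) {
        DialFailureLevel::Warn
    } else {
        DialFailureLevel::Error
    }
}

/// Option 1: defer the decision for a short grace window, polling the
/// connected set so an inbound connection that lands late still downgrades
/// the message. Returns Error only if the peer never shows up in time.
pub fn level_after_grace(
    connected_peers: &Mutex<HashSet<String>>,
    peer_id: &str,
    grace: Duration,
    poll: Duration,
) -> DialFailureLevel {
    let deadline = Instant::now() + grace;
    loop {
        // The lock guard is dropped at the end of this statement,
        // so the lock is not held while sleeping.
        let seen = connected_peers.lock().unwrap().contains(peer_id);
        if seen {
            return DialFailureLevel::Warn;
        }
        if Instant::now() >= deadline {
            return DialFailureLevel::Error;
        }
        thread::sleep(poll);
    }
}
```

The caller would map the returned level onto whatever logging macro the node uses. The polling loop is the simplest form of the grace window; an async implementation would more likely await a notification from the inbound accept path instead of sleeping.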

Workaround

The connection always succeeds in practice with bidirectional PEAT_NODE_PEERS. Users can ignore the ERROR and watch for INFO peat_node::node: connected to peer instead. This is documented in the attachment quickstart README as expected startup noise.
