
feat(node): peer-status heartbeat + log the full sync stack #169

Merged: kitplummer merged 3 commits into main from feat/peer-status-and-logging on Jun 19, 2026

@kitplummer

Collaborator

Why

Cross-network attachment-delivery failures were opaque: the file's distribution document syncs between nodes, but the file never lands on disk, and the logs gave no way to tell where it broke. Two gaps caused that:

  1. The default log filter didn't cover the receive path. It was peat_node=info,peat_mesh=info — which excludes peat_protocol, the crate that owns the attachment send/receive watchers (targeting + blob fetch). A failed delivery produced no logs at all.

  2. Nothing ever reported peer connectivity. There was no way to see which peers a node was connected to, or — crucially — which peers it could target for a distribution.

What

  • Default log filter now covers the whole sync stack: adds peat_protocol=info (attachment watchers — targeting + blob fetch) and iroh=warn (surfaces QUIC dial / connection failures without info-level packet spam). RUST_LOG still overrides the entire default.

  • Periodic peer status heartbeat (every 30s, also fires immediately at startup):

    peer status connected_count=1 known_count=1 connected_peers=[a1b2c3…] known_peers=[a1b2c3…]
    

    Two sets, deliberately distinct:

    • connected_peers — live CRDT-sync connections (from the transport).
    • known_peers — peers this node dialed; the exact set resolve_targets uses for distribution targeting and fetch_blob uses for blob-provider lookup.

    A receiver absent from a sender's known_peers is precisely why a synced attachment document never becomes a delivered file. known_count > connected_count means a dial is failing. Both are now visible at a glance.
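To make the two comparisons above concrete, here is a minimal sketch of how an operator-side tool might diff the two sets from one heartbeat line. The function names and the use of plain strings for short peer ids are assumptions for illustration; nothing here is taken from the peat_node source.

```rust
use std::collections::BTreeSet;

/// Peers this node dialed (known) that have no live sync connection.
/// A non-empty result is the "known > connected" case: a dial is failing.
fn failing_dials(known: &BTreeSet<String>, connected: &BTreeSet<String>) -> Vec<String> {
    known.difference(connected).cloned().collect()
}

/// Peers with a live sync connection that this node never dialed.
/// Assumption of this sketch: such peers connected inbound only, so they are
/// missing from the set that targeting and blob-provider lookup read.
fn connected_but_not_known(known: &BTreeSet<String>, connected: &BTreeSet<String>) -> Vec<String> {
    connected.difference(known).cloned().collect()
}
```

With known = {a1b2c3, d4e5f6} and connected = {a1b2c3, 0a0b0c}, the first function reports d4e5f6 (a dial that is not completing) and the second reports 0a0b0c (a peer that syncs documents but, under the stated assumption, cannot be targeted for a file).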

Context

While building this I confirmed an upstream design limitation worth flagging: attachment bytes are direct-peer-only on both axes — targeting (resolve_targets, all scopes resolve through known_peers()) and fetch (fetch_blob pulls only from the receiver's known_peers). There is no multi-hop / relayed blob delivery today. That's tracked separately (see linked issue); this PR is the observability that makes the current behavior diagnosable, not the fix for it.
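The "both axes" constraint can be modeled in a few lines. This is a sketch under stated assumptions, not the resolve_targets or fetch_blob implementation: `KnownPeers`, `Delivery`, and `predict_delivery` are hypothetical names, and each node's known_peers is represented as a plain set of node ids.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical model: node id -> the peers that node dialed (its known_peers).
type KnownPeers = HashMap<String, HashSet<String>>;

#[derive(Debug, PartialEq)]
enum Delivery {
    Delivered,
    /// The sender never dialed the receiver, so targeting does not select it.
    NotTargeted,
    /// The receiver never dialed the sender, so it has no provider to fetch from.
    NoProvider,
}

/// Predicts the outcome for a direct sender -> receiver attachment,
/// assuming no multi-hop or relayed blob delivery.
fn predict_delivery(known: &KnownPeers, sender: &str, receiver: &str) -> Delivery {
    // True when `from` lists `to` in its known_peers.
    let dialed = |from: &str, to: &str| known.get(from).map_or(false, |peers| peers.contains(to));

    if !dialed(sender, receiver) {
        Delivery::NotTargeted
    } else if !dialed(receiver, sender) {
        Delivery::NoProvider
    } else {
        Delivery::Delivered
    }
}
```

Only the case where each node has dialed the other yields `Delivered`; either one-directional configuration matches the "document syncs but file never lands" symptom described above.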

Verification

  • cargo build, cargo clippy --workspace --all-targets -- -D warnings, cargo fmt -- --check — clean.
  • Booted the node locally and confirmed the heartbeat emits at startup:
    INFO peat_node::node: peer status connected_count=0 known_count=0 connected_peers=[] known_peers=[]
    
    (0/0 here — no peers configured in the smoke run; with peers it lists their short ids.)

Two observability gaps made attachment-delivery failures opaque:

1. The default log filter was peat_node=info,peat_mesh=info — it did NOT
   include peat_protocol, which owns the attachment send/receive watchers
   (targeting + blob fetch). A failed delivery produced no logs at all.
   Default now adds peat_protocol=info and iroh=warn (QUIC dial failures).

2. No line ever reported who a node was connected to. Added a 30s
   peer-status heartbeat logging connected_peers (live transport sync) vs
   known_peers (the dialed set used for distribution targeting and blob
   provider lookup). A receiver absent from a sender's known_peers is the
   root cause of doc-synced-but-file-never-delivered; now it's visible.
@kitplummer

Collaborator Author

Tracking issue for the underlying multi-hop limitation: #170

@peat-bot left a comment


Peat QA Review (SHA: 0e99cf2)

Scope: pure observability change — adds a 30 s peer status heartbeat that logs connected_peers (transport) vs known_peers (blob store / distribution targeting set), and expands the default RUST_LOG filter to cover peat_protocol=info (attachment send/receive watchers) and iroh=warn (QUIC dial failures). No proto, crypto, mesh-pin, watcher-TLS, Helm, or Zarf surface is touched.

[WARNING] .claude/scheduled_tasks.lock committed by accident

.claude/scheduled_tasks.lock is a per-session runtime lockfile from a local tool — it contains a sessionId, PID, and wall-clock timestamps for this developer's machine and has no business in source control. The repo's .gitignore doesn't currently cover .claude/, so this got swept in along with the intended changes.

Action: remove .claude/scheduled_tasks.lock from the PR and add .claude/ (or at minimum .claude/scheduled_tasks.lock) to .gitignore so this doesn't recur. Not a security issue, but it's noise that will keep coming back across future PRs from any contributor running the same tooling.

Notes (no action required)

The peer-status spawn in src/node.rs is idiomatic — Weak<_> upgrade-or-exit mirrors the existing reconnect watchdog, MissedTickBehavior::Skip is the right call for an operator-facing heartbeat, and the first tick() firing immediately matches the PR description's "fires immediately at startup, then on the interval." No test gate triggered: this isn't a new RPC on PeatSidecar, so the tests/grpc_test.rs requirement doesn't apply, and the change has no algorithmic surface for tests/node_test.rs to assert against beyond "the task is spawned." The PR description's cargo build / clippy -D warnings / fmt --check + local boot evidence is appropriate for an observability-only change.
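For readers unfamiliar with the Weak upgrade-or-exit pattern mentioned in this review, the following sketch shows its shape. It deliberately swaps the async interval for a plain std thread with `thread::sleep` so it stays dependency-free, and it does not model MissedTickBehavior::Skip. `Node`, `log_peer_status`, and `spawn_heartbeat` are hypothetical names, not the code in src/node.rs.

```rust
use std::sync::{Arc, Mutex, Weak};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the node; it only records emitted lines.
struct Node {
    lines: Mutex<Vec<String>>,
}

impl Node {
    fn log_peer_status(&self) {
        // Placeholder: the real line lists connected and known peers.
        self.lines.lock().unwrap().push("peer status".to_string());
    }
}

/// Spawns a heartbeat that fires immediately, then once per `interval`,
/// and exits on its own once the last strong reference to the node is dropped.
fn spawn_heartbeat(node: &Arc<Node>, interval: Duration) -> JoinHandle<()> {
    let weak: Weak<Node> = Arc::downgrade(node);
    thread::spawn(move || loop {
        // Upgrade-or-exit: the strong reference lives only for this one tick,
        // so the heartbeat never keeps the node alive by itself.
        match weak.upgrade() {
            Some(node) => node.log_peer_status(),
            None => break,
        }
        thread::sleep(interval);
    })
}
```

Holding only a `Weak` means shutdown needs no explicit cancel signal: dropping the node is enough for the loop to end on its next iteration.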

The multi-host compose already sets PEAT_NODE_PEERS on both sides, but never
explained WHY both are mandatory — so a one-directional config silently breaks
delivery (doc syncs, file never written). Spell out that known_peers is
populated only by outbound dials, that sender targeting and receiver fetch both
read it, and that this is two dials over one iroh connection (not two
connections). Reference the peer-status heartbeat for diagnosing it and link the
upstream tracking issue (#170).

Also widen the example's RUST_LOG from peat_node=info to the full sync stack
(peat_protocol/peat_mesh/iroh) so the targeting/fetch logs and peer-status line
are actually visible — the old value overrode them.
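The override behavior described here (RUST_LOG replaces the default wholesale rather than merging with it) can be illustrated with a small sketch. The filter string is the one quoted in this PR; `effective_filter` and `names_target` are hypothetical helpers, and the target check is a crude string match, not how a real filter parser evaluates directives.

```rust
/// The full sync-stack filter quoted in this PR.
const DEFAULT_FILTER: &str = "peat_node=info,peat_mesh=info,peat_protocol=info,iroh=warn";

/// RUST_LOG, when set, replaces the default entirely; nothing is merged.
fn effective_filter(rust_log: Option<&str>) -> &str {
    rust_log.unwrap_or(DEFAULT_FILTER)
}

/// Crude check for whether a filter string has a directive for `target`.
/// Assumption: directives are comma-separated `target=level` pairs.
fn names_target(filter: &str, target: &str) -> bool {
    filter
        .split(',')
        .any(|directive| directive.trim().split('=').next() == Some(target))
}
```

Under this model, an example that sets RUST_LOG=peat_node=info yields a filter with no peat_protocol directive, which is why the targeting and fetch logs were hidden until the example's value was widened.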

@peat-bot left a comment


Peat QA Review (SHA: e254486)

Incremental review since 0e99cf2.

Delta scope: docs-only — examples/compose/attachments/README.md and docker-compose.multi-host.yml gain a callout explaining that attachment delivery requires a two-way dial (each node listing the other in PEAT_NODE_PEERS), and the example's RUST_LOG is widened from peat_node=info to the full sync-stack filter (peat_node=info,peat_mesh=info,peat_protocol=info,iroh=warn) so the example self-documents on pre-v0.4.6 images. No proto, crypto, mesh-pin, watcher-TLS, Helm, Zarf, or Rust source surface is touched by this delta.

Re-verification of prior findings

[WARNING] .claude/scheduled_tasks.lock committed by accident — still unaddressed

The file is still present at e254486 (git diff origin/main..e254486 --name-only lists it), and .gitignore still contains only /target and .DS_Store — no .claude/ entry. The delta since 0e99cf2 didn't touch either.

Restating from the prior review: this is a per-session runtime lockfile from a local tool (sessionId / PID / wall-clock for one developer's machine) and has no business in source control.

Action: remove .claude/scheduled_tasks.lock from the PR and add .claude/ (or at minimum .claude/scheduled_tasks.lock) to .gitignore so this doesn't recur across future PRs from any contributor running the same tooling.

Notes on the delta (no action required)

The README/compose callouts correctly describe the observed behavior: known_peers is populated only on dial-out, attachment bytes flow direct-peer-only on both axes (resolve_targets over known_peers() for targeting, fetch_blob over known_peers() for provider lookup), so a single-side listing produces the "document syncs but file never lands" symptom. Cross-referencing peat-node#170 keeps the limitation tracked. The example's expanded RUST_LOG matches the new built-in default added in this PR's src/main.rs change (already covered by the prior review) and the parenthetical "v0.4.6+ uses this as the built-in default" framing is accurate and self-contained. Pure documentation — no new test gate or surface-tier review is triggered.

Removes a per-session runtime lockfile from local scheduling tooling that was
accidentally committed (QA finding on #169), and adds a specific .gitignore
entry so it can't recur. Scoped to the lockfile only — the pre-existing,
intentionally-tracked .claude/settings.json is left untouched.
@kitplummer

Collaborator Author

Addressed (fac27a1): removed the accidentally-committed .claude/scheduled_tasks.lock (per-session runtime lockfile) via git rm --cached and added a specific .gitignore entry so it can't recur. Scoped to the lockfile — the pre-existing, intentionally-tracked .claude/settings.json is left untouched.

@peat-bot left a comment


Peat QA Review (SHA: fac27a1)

Incremental review since e254486.

Delta scope: housekeeping only. .claude/scheduled_tasks.lock is removed from the tree and .gitignore gains a specific ignore entry for it (with a one-line comment explaining why). No proto, crypto, mesh-pin, watcher-TLS, Helm, Zarf, or Rust source surface is touched by this delta.

Re-verification of prior findings

Resolved: [WARNING] .claude/scheduled_tasks.lock committed by accident

git diff e254486..fac27a1 shows the file deletion and a new .gitignore stanza covering .claude/scheduled_tasks.lock. The prior review explicitly allowed "(or at minimum .claude/scheduled_tasks.lock)" as the gitignore scope, so the narrower entry is acceptable. The accompanying comment correctly identifies the file as per-session runtime state from local scheduling tooling. The remaining prior-review notes were no-action observations on the docs delta; nothing in this delta affects them.

New findings

No findings.

@kitplummer merged commit 3b93826 into main on Jun 19, 2026
14 checks passed
@kitplummer deleted the feat/peer-status-and-logging branch on June 19, 2026 at 15:11
kitplummer added a commit that referenced this pull request Jun 19, 2026
Observability-only release of #169 (peer-status heartbeat + full sync-stack logging). No dependency change (peat-mesh rc.40) — gives operators diagnostics for two-node container testing, independent of the in-flight RC chain.

Labels

enhancement New feature or request


2 participants