feat(node): peer-status heartbeat + log the full sync stack #169
Two observability gaps made attachment-delivery failures opaque:
1. The default log filter was peat_node=info,peat_mesh=info. It did NOT include peat_protocol, which owns the attachment send/receive watchers (targeting + blob fetch), so a failed delivery produced no logs at all. The default now adds peat_protocol=info and iroh=warn (QUIC dial failures).
2. No line ever reported who a node was connected to. Added a 30s peer-status heartbeat logging connected_peers (live transport sync) vs known_peers (the dialed set used for distribution targeting and blob provider lookup). A receiver absent from a sender's known_peers is the root cause of doc-synced-but-file-never-delivered; now it's visible.
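As a rough illustration of the "default filter, fully overridden by RUST_LOG" behavior described above, here is a minimal std-only sketch. The filter string is the one quoted in this PR; the helper names (effective_filter, filter_from_env) are hypothetical, and treating an empty RUST_LOG as "unset" is an assumption of this sketch, not something the PR states.

```rust
use std::env;

/// Built-in default filter string, as quoted in this PR.
const DEFAULT_FILTER: &str = "peat_node=info,peat_mesh=info,peat_protocol=info,iroh=warn";

/// Hypothetical helper: picks the filter directives to use.
/// A non-empty override replaces the default entirely; nothing is merged.
/// Assumption: an empty or whitespace-only value counts as "not set".
fn effective_filter(rust_log: Option<&str>) -> String {
    match rust_log.map(str::trim) {
        Some(value) if !value.is_empty() => value.to_owned(),
        _ => DEFAULT_FILTER.to_owned(),
    }
}

/// Reads RUST_LOG from the process environment and applies the rule above.
fn filter_from_env() -> String {
    effective_filter(env::var("RUST_LOG").ok().as_deref())
}
```

Because the override is a replacement rather than a merge, setting RUST_LOG=peat_node=info silences peat_protocol again, which is the failure mode the old example configuration had.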
Tracking issue for the underlying multi-hop limitation: #170
peat-bot
left a comment
Peat QA Review (SHA: 0e99cf2)
Scope: pure observability change — adds a 30 s peer status heartbeat that logs connected_peers (transport) vs known_peers (blob store / distribution targeting set), and expands the default RUST_LOG filter to cover peat_protocol=info (attachment send/receive watchers) and iroh=warn (QUIC dial failures). No proto, crypto, mesh-pin, watcher-TLS, Helm, or Zarf surface is touched.
[WARNING] .claude/scheduled_tasks.lock committed by accident
.claude/scheduled_tasks.lock is a per-session runtime lockfile from a local tool — it contains a sessionId, PID, and wall-clock timestamps for this developer's machine and has no business in source control. The repo's .gitignore doesn't currently cover .claude/, so this got swept in along with the intended changes.
Action: remove .claude/scheduled_tasks.lock from the PR and add .claude/ (or at minimum .claude/scheduled_tasks.lock) to .gitignore so this doesn't recur. Not a security issue, but it's noise that will keep coming back across future PRs from any contributor running the same tooling.
Notes (no action required)
The peer-status spawn in src/node.rs is idiomatic — Weak<_> upgrade-or-exit mirrors the existing reconnect watchdog, MissedTickBehavior::Skip is the right call for an operator-facing heartbeat, and the first tick() firing immediately matches the PR description's "fires immediately at startup, then on the interval." No test gate triggered: this isn't a new RPC on PeatSidecar, so the tests/grpc_test.rs requirement doesn't apply, and the change has no algorithmic surface for tests/node_test.rs to assert against beyond "the task is spawned." The PR description's cargo build / clippy -D warnings / fmt --check + local boot evidence is appropriate for an observability-only change.
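For readers unfamiliar with the Weak upgrade-or-exit pattern mentioned in the note above, the following is a simplified sketch of the idea, not the code in src/node.rs. It swaps the async interval for a plain std thread and sleep, so it does not reproduce MissedTickBehavior::Skip. NodeState, peer_status_line, spawn_peer_status, the sink parameter, and the log line layout are all hypothetical.

```rust
use std::sync::{Arc, Mutex, Weak};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the node state the heartbeat reads.
struct NodeState {
    connected_peers: Vec<String>,
    known_peers: Vec<String>,
}

/// Formats one heartbeat line; the field layout is illustrative only.
fn peer_status_line(state: &NodeState) -> String {
    format!(
        "peer status connected_peers={:?} known_peers={:?}",
        state.connected_peers, state.known_peers
    )
}

/// Spawns a heartbeat that holds only a Weak reference to the node.
/// The first line is emitted immediately, then one per `interval`.
/// Once the last Arc is dropped, the next upgrade fails and the thread exits.
fn spawn_peer_status(
    node: Weak<NodeState>,
    interval: Duration,
    sink: Arc<Mutex<Vec<String>>>,
) -> JoinHandle<()> {
    thread::spawn(move || loop {
        // Upgrade-or-exit: the heartbeat must never keep the node alive.
        let Some(strong) = node.upgrade() else { break };
        sink.lock().unwrap().push(peer_status_line(&strong));
        // Release the strong reference before sleeping.
        drop(strong);
        thread::sleep(interval);
    })
}
```

Holding a Weak rather than an Arc means shutdown needs no explicit cancellation signal: dropping the node is enough for the task to wind down on its next wake-up.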
The multi-host compose already sets PEAT_NODE_PEERS on both sides, but never explained WHY both are mandatory — so a one-directional config silently breaks delivery (doc syncs, file never written). Spell out that known_peers is populated only by outbound dials, that sender targeting and receiver fetch both read it, and that this is two dials over one iroh connection (not two connections). Reference the peer-status heartbeat for diagnosing it and link the upstream tracking issue (#170). Also widen the example's RUST_LOG from peat_node=info to the full sync stack (peat_protocol/peat_mesh/iroh) so the targeting/fetch logs and peer-status line are actually visible — the old value overrode them.
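The two-way dial requirement can be captured in a toy model. This is a sketch of the reasoning only, under the assumption stated above that known_peers is filled solely by outbound dials and that both sender targeting and receiver fetch consult it; ToyMesh and its methods are hypothetical and do not correspond to any peat API.

```rust
use std::collections::{HashMap, HashSet};

/// Toy mesh: maps each node name to the set of peers it has dialed.
#[derive(Default)]
struct ToyMesh {
    known_peers: HashMap<String, HashSet<String>>,
}

impl ToyMesh {
    /// Records an outbound dial; only the dialer's known_peers changes.
    fn dial(&mut self, from: &str, to: &str) {
        self.known_peers
            .entry(from.to_owned())
            .or_default()
            .insert(to.to_owned());
    }

    /// True if `node` has `peer` in its known_peers set.
    fn knows(&self, node: &str, peer: &str) -> bool {
        self.known_peers
            .get(node)
            .map_or(false, |peers| peers.contains(peer))
    }

    /// Delivery needs both reads to succeed:
    /// the sender must target the receiver, and the receiver must be able
    /// to look up the sender as a blob provider.
    fn attachment_delivered(&self, sender: &str, receiver: &str) -> bool {
        let sender_targets_receiver = self.knows(sender, receiver);
        let receiver_can_fetch = self.knows(receiver, sender);
        sender_targets_receiver && receiver_can_fetch
    }
}
```

With only node-a dialing node-b, attachment_delivered("node-a", "node-b") is false in this model; it becomes true once node-b also dials node-a, which mirrors listing each node in the other's PEAT_NODE_PEERS.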
peat-bot
left a comment
Peat QA Review (SHA: e254486)
Incremental review since 0e99cf2.
Delta scope: docs-only — examples/compose/attachments/README.md and docker-compose.multi-host.yml gain a callout explaining that attachment delivery requires a two-way dial (each node listing the other in PEAT_NODE_PEERS), and the example's RUST_LOG is widened from peat_node=info to the full sync-stack filter (peat_node=info,peat_mesh=info,peat_protocol=info,iroh=warn) so the example self-documents on pre-v0.4.6 images. No proto, crypto, mesh-pin, watcher-TLS, Helm, Zarf, or Rust source surface is touched by this delta.
Re-verification of prior findings
[WARNING] .claude/scheduled_tasks.lock committed by accident — still unaddressed
The file is still present at e254486 (git diff origin/main..e254486 --name-only lists it), and .gitignore still contains only /target and .DS_Store — no .claude/ entry. The delta since 0e99cf2 didn't touch either.
Restating from the prior review: this is a per-session runtime lockfile from a local tool (sessionId / PID / wall-clock for one developer's machine) and has no business in source control.
Action: remove .claude/scheduled_tasks.lock from the PR and add .claude/ (or at minimum .claude/scheduled_tasks.lock) to .gitignore so this doesn't recur across future PRs from any contributor running the same tooling.
Notes on the delta (no action required)
The README/compose callouts correctly describe the observed behavior: known_peers is populated only on dial-out, attachment bytes flow direct-peer-only on both axes (resolve_targets over known_peers() for targeting, fetch_blob over known_peers() for provider lookup), so a single-side listing produces the "document syncs but file never lands" symptom. Cross-referencing peat-node#170 keeps the limitation tracked. The example's expanded RUST_LOG matches the new built-in default added in this PR's src/main.rs change (already covered by the prior review) and the parenthetical "v0.4.6+ uses this as the built-in default" framing is accurate and self-contained. Pure documentation — no new test gate or surface-tier review is triggered.
Removes a per-session runtime lockfile from local scheduling tooling that was accidentally committed (QA finding on #169), and adds a specific .gitignore entry so it can't recur. Scoped to the lockfile only — the pre-existing, intentionally-tracked .claude/settings.json is left untouched.
Addressed (fac27a1): removed the accidentally-committed
peat-bot
left a comment
Peat QA Review (SHA: fac27a1)
Incremental review since e254486.
Delta scope: housekeeping only. .claude/scheduled_tasks.lock is removed from the tree and .gitignore gains a specific ignore entry for it (with a one-line comment explaining why). No proto, crypto, mesh-pin, watcher-TLS, Helm, Zarf, or Rust source surface is touched by this delta.
Re-verification of prior findings
Resolved: [WARNING] .claude/scheduled_tasks.lock committed by accident
git diff e254486..fac27a1 shows the file deletion and a new .gitignore stanza covering .claude/scheduled_tasks.lock. The prior review explicitly allowed "(or at minimum .claude/scheduled_tasks.lock)" as the gitignore scope, so the narrower entry is acceptable. The accompanying comment correctly identifies the file as per-session runtime state from local scheduling tooling. The remaining prior-review notes were no-action observations on the docs delta; nothing in this delta affects them.
New findings
No findings.
Observability-only release of #169 (peer-status heartbeat + full sync-stack logging). No dependency change (peat-mesh rc.40) — gives operators diagnostics for two-node container testing, independent of the in-flight RC chain.
Why
Cross-network attachment-delivery failures were opaque: the file's distribution document syncs between nodes, but the file never lands on disk, and the logs gave no way to tell where it broke. Two gaps caused that:
1. The default log filter didn't cover the receive path. It was peat_node=info,peat_mesh=info, which excludes peat_protocol, the crate that owns the attachment send/receive watchers (targeting + blob fetch). A failed delivery produced no logs at all.
2. Nothing ever reported peer connectivity. There was no way to see which peers a node was connected to, or, crucially, which peers it could target for a distribution.
What
1. Default log filter now covers the whole sync stack: adds peat_protocol=info (attachment watchers: targeting + blob fetch) and iroh=warn (surfaces QUIC dial / connection failures without info-level packet spam). RUST_LOG still overrides the entire default.
2. Periodic peer status heartbeat (every 30s, also fires immediately at startup), reporting two sets that are deliberately distinct:
   - connected_peers: live CRDT-sync connections (from the transport).
   - known_peers: peers this node dialed; the exact set resolve_targets uses for distribution targeting and fetch_blob uses for blob-provider lookup.
   A receiver absent from a sender's known_peers is precisely why a synced attachment document never becomes a delivered file. known > connected means a dial is failing. Both are now visible at a glance.
Context
While building this I confirmed an upstream design limitation worth flagging: attachment bytes are direct-peer-only on both axes, namely targeting (resolve_targets, all scopes resolve through known_peers()) and fetch (fetch_blob pulls only from the receiver's known_peers). There is no multi-hop / relayed blob delivery today. That's tracked separately (see linked issue); this PR is the observability that makes the current behavior diagnosable, not the fix for it.
Verification
cargo build, cargo clippy --workspace --all-targets -- -D warnings, cargo fmt -- --check: clean.
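As a reading aid for the heartbeat described under What, the two checks an operator would apply to a single sample might be sketched as follows. This is an illustration under assumptions: the Diagnosis enum, diagnose, and peer_set are hypothetical names, and interpreting "known > connected" as "some dialed peer has no live connection" (a set difference rather than a count comparison) is a choice made for this sketch.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Diagnosis {
    /// The receiver is not in the sender's known_peers, so it is never targeted.
    ReceiverNotKnown,
    /// These dialed peers have no live connection, so a dial is failing.
    DialFailing(Vec<String>),
    /// Nothing suspicious in this heartbeat sample.
    Healthy,
}

/// Small convenience for building peer sets from string literals.
fn peer_set(items: &[&str]) -> HashSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}

/// Interprets one heartbeat sample taken on the sender, for one intended receiver.
fn diagnose(connected: &HashSet<String>, known: &HashSet<String>, receiver: &str) -> Diagnosis {
    if !known.contains(receiver) {
        return Diagnosis::ReceiverNotKnown;
    }
    // Peers that were dialed but are not currently connected.
    let mut missing: Vec<String> = known.difference(connected).cloned().collect();
    missing.sort();
    if missing.is_empty() {
        Diagnosis::Healthy
    } else {
        Diagnosis::DialFailing(missing)
    }
}
```

The receiver check comes first because it corresponds to the doc-synced-but-file-never-delivered symptom; a failing dial is only meaningful once the receiver is in the dialed set at all.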