[iris/finelog] Federated global-finelog log relay + authenticated ingress#6871
Conversation
…ress Adds the write side of the federation global-log plane (design §6): each cluster's controller durably forwards its local finelog to one shared **global finelog** under cluster-namespaced keys, and finelog gains a **mandatory authenticated ingress** front. ## Log forwarding A new `finelog.forwarder.LogForwarder` (in finelog, generic) tails a source finelog and re-appends each batch to a target under a caller-supplied key prefix (`/c/<cluster_id>`). Durability leans on the source finelog already being durable (parquet + retention): the forwarder persists only a forward **watermark** in a small JSON state file — not a second copy of the data — and advances it only after an ack'd `PushLogs` (which returns once the global store has durably persisted the batch). A crash or transient egress failure re-forwards the in-flight batch (at-least-once; duplicate lines are bounded to the failure boundary — federation degrades observability, never job correctness). First run seeds the watermark at the source's current cursor, so enabling it ships new logs without backfilling the whole retention window. The iris controller wires this up thinly (`controller/global_finelog.py`): it supplies its local client, a per-cluster delegation credential, and a state path on its persistent dir; the forwarding/namespacing/watermark logic lives entirely in finelog. Inert until `global_finelog` is configured — no forwarder thread, no egress, byte-identical single-cluster behavior. ## Authenticated ingress (new security surface) finelog's server (`lib/finelog/rust/src/server/auth.rs`) now gates every RPC with an ordered stack of typed auth **layers** — `cidr` (admit by transport-peer network) and `jwt` (HS256 bearer verified against per-cluster delegation keys) — that is **default-deny** and **always installed**. An unconfigured server falls back to allow-localhost (loopback only), never open. HS256 verification uses vetted RustCrypto primitives (constant-time HMAC, no `ring`), rejects `alg:none`/algorithm-confusion, and tolerates a small `exp` clock-skew leeway. The `/debug/*` admin routes are gated by the same policy. It is configured by a single `FINELOG_AUTH_POLICY` JSON layer list, which the finelog deploy config, GCP bootstrap script, and k8s manifest all plumb (k8s refuses to inline jwt secrets — those must come through a Secret). Each relaying controller authenticates its pushes with a short-lived HS256 delegation JWT, minted via the same `JwtTokenManager` the control plane uses but signed with a **dedicated** per-cluster key the global store verifies — so the shared store can verify a cluster's tokens without gaining the power to mint control-plane tokens. Because auth is always installed, the three committed local finelog configs (`marin`, `marin-dev`, `cw-us-east-02a`) gain an explicit `cidr` layer (RFC1918 + loopback) so intra-cluster/pod ingest keeps working — reachable from the private network, never the public internet. ## Cost Relaying every cluster's logs to one region is deliberate cross-region **egress** — the trade for a uniform, always-available, cross-cluster-queryable log surface. finelog already batches/compresses and tiers cold segments to object storage, which bounds steady-state cost. The read side (surfacing a federated job's logs through the parent: `job_id`→peer-namespace remap + read-through auth) is deferred to a follow-up (#6862). Part of #6718 — PR4b of the federation rollout (global finelog + relay + authed ingress).
|
Claude finished @rjpower's task in 4m 44s —— View job Code reviewNo issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance. I ran a multi-agent correctness pass over the diff (2 bug/security agents, 1 CLAUDE.md-compliance agent, 1 test-quality agent) and validated each candidate finding against the real source. The security-critical surfaces hold up:
Tests are behavior-focused (real embedded servers, observable outcomes, optional-dep skip guards) with no slop/tautology/call-count assertions. The PR description is compliant — it leads with what changed, and its section headings are informative for a ~1.9k-line PR rather than template scaffold. One minor, non-blocking note (not flagged as a violation): |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cc2cf8d85e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| source=_ALL_KEYS_PREFIX, | ||
| match_scope=logging_pb2.MATCH_SCOPE_PREFIX, |
There was a problem hiding this comment.
Avoid re-forwarding already namespaced relay logs
If the global finelog is also this cluster's configured source log server, or if the source ever contains /c/<cluster> relay keys, tailing every key under / makes the forwarder read its own forwarded rows and push them again as /c/<cluster>/c/<cluster>/... on the next tick. That can create unbounded duplicate/amplified log traffic for the home/global-store cluster; exclude the relay namespace from the source scan or reject source==target wiring.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
🤖 Fixed in 766c86e: the forwarder now skips source entries already under its key prefix, so a source==target (or otherwise shared) store can't loop /c/x/... → /c/x/c/x/.... The watermark still advances past the skipped rows, and the tick's forwarded count now reports pushed (not fetched) entries. Added a regression test (test_forwarder_skips_already_namespaced_keys_no_amplification) that drives the forwarder against a single store as both source and target and asserts no doubly-namespaced rows appear.
Replace the two top-level fields (log_server_config for the local finelog + global_finelog for the relay target) with a single finelog: block — finelog.config names the local store, finelog.relay (optional) points at the shared global finelog and turns on the forwarder. Drop the redundant global_finelog.cluster: the delegation JWT's identity is the cluster's own name (the finelog verifier attributes identity from its configured key label, not the token subject), so it was doubly restated.
Guard against re-forwarding rows this forwarder produced: skip source entries already under the key prefix, so a source==target (or otherwise shared) store can't loop /c/x/user -> /c/x/c/x/user unboundedly. The watermark still advances past skipped rows, and the tick's forwarded count now reports pushed entries, not fetched. Document that the watermark is intentionally local best-effort state (fsync'd for same-host restart, not part of any external checkpoint): a host replacement reseeds at source-max, accepting a bounded gap of un-forwarded logs rather than coupling the relay to the controller's checkpoint. Observability, not correctness.
Remove the finelog.relay pydantic round-trip and the default-is-None assert (they test serialization, not behavior), and trim the auth-layer test to its one useful assertion: the exact FINELOG_AUTH_POLICY JSON the Rust server parses.
Address review findings on the relay path: - Drain the source within a tick instead of shipping one batch per poll interval, so a cluster producing faster than batch_lines/interval can't outrun the forwarder until source retention drops un-forwarded rows. - The self-amplification guard matches the prefix with a trailing "/", so a sibling namespace (/c/x-dev) is no longer false-skipped by /c/x. - Validate that a cluster enabling finelog.relay sets a slash-free top-level name (the /c/<name> log namespace) at config load, not as a controller crash-loop on the remote host.
Collapse the local finelog config and the relay target into one flat finelog: block — config / relay_address / delegation_key / static_token. Setting relay_address turns on the forwarder; empty leaves the log plane single-cluster. Drop the /c/<cluster> key-rewriting from the relay: the forwarder is now a plain durable replicator (same keys, source -> target), and cross-cluster namespacing is deferred — it will be handled on the write side (tasks/controller write the namespaced key directly) or via a finelog cluster column, rather than rewritten mid-relay. Removes cluster_namespace_prefix and the forwarder's key_prefix + self-amplification guard; keeps the backlog drain.
Adds the write side of the federation global-log plane (design §6): each cluster's controller durably forwards its local finelog to one shared global finelog, and finelog gains a mandatory authenticated ingress front.
Log forwarding
A new
finelog.forwarder.LogForwarder(in finelog, generic) tails a source finelog and re-appends each new batch to a target under the same keys. Durability leans on the source finelog already being durable (parquet + retention): the forwarder persists only a forward watermark in a small JSON state file — not a second copy of the data — and advances it only after an ack'dPushLogs(which returns once the global store has durably persisted the batch). A crash or transient egress failure re-forwards the in-flight batch (at-least-once; duplicate lines are bounded to the failure boundary — federation degrades observability, never job correctness). First run seeds the watermark at the source's current cursor, so enabling it ships new logs without backfilling the whole retention window; a busy cluster's backlog drains within a tick rather than one batch per poll interval.The iris controller wires this up thinly (
controller/finelog_relay.py): it supplies its local client, a per-cluster delegation credential, and a state path on its persistent dir; the forwarding/watermark logic lives entirely in finelog. A cluster's finelog is one flat config block —finelog.confignames the local store;finelog.relay_address(optional) points at the shared global one, withdelegation_key/static_tokenfor the credential — and settingrelay_addressis what turns the forwarder on. Inert until then: no forwarder thread, no egress, byte-identical single-cluster behavior.Cross-cluster log namespacing is deferred. Bare per-cluster keys (
/user/<job>/...) collide when many clusters land in one store; rather than have the relay rewrite keys mid-stream, namespacing will be done on the write side (tasks/controller write the namespaced key directly) or via a finelogclustercolumn, in a follow-up. This PR ships the durable relay + auth mechanism; a single cluster relaying to a global store works today.Authenticated ingress (new security surface)
finelog's server (
lib/finelog/rust/src/server/auth.rs) now gates every RPC with an ordered stack of typed auth layers —cidr(admit by transport-peer network) andjwt(HS256 bearer verified against per-cluster delegation keys) — that is default-deny and always installed. An unconfigured server falls back to allow-localhost (loopback only), never open. HS256 verification uses vetted RustCrypto primitives (constant-time HMAC, noring), rejectsalg:none/algorithm-confusion, and tolerates a smallexpclock-skew leeway. The/debug/*admin routes are gated by the same policy. It is configured by a singleFINELOG_AUTH_POLICYJSON layer list, which the finelog deploy config, GCP bootstrap script, and k8s manifest all plumb (k8s refuses to inline jwt secrets — those must come through a Secret).Each relaying controller authenticates its pushes with a short-lived HS256 delegation JWT, minted via the same
JwtTokenManagerthe control plane uses but signed with a dedicated per-cluster key the global store verifies — so the shared store can verify a cluster's tokens without gaining the power to mint control-plane tokens.Because auth is always installed, the three committed local finelog configs (
marin,marin-dev,cw-us-east-02a) gain an explicitcidrlayer (RFC1918 + loopback) so intra-cluster/pod ingest keeps working — reachable from the private network, never the public internet.Cost
Relaying every cluster's logs to one region is deliberate cross-region egress — the trade for a uniform, always-available, cross-cluster-queryable log surface. finelog already batches/compresses and tiers cold segments to object storage, which bounds steady-state cost.
The read side (surfacing a federated job's logs through the parent:
job_id→peer-namespace remap + read-through auth) is deferred to a follow-up (#6862).Part of #6718 — PR4b of the federation rollout (global finelog + relay + authed ingress).