[Design] Iris multi-backend: Model D (federation peers + local backend contract)#6718
Conversation
Proposes multi-backend mode for Iris: one controller (iris.oa.dev) fronting many clusters (GCP TPU pools, multiple CoreWeave/k8s) behind one job namespace and dashboard. A single authoritative root DB owns all state; a per-cluster recoverable-cache "agent" reconciles desired vs. actual over one Poll() wire, with four safety mechanisms (root epoch, monotonic launch-lease, execution self-fence, ack-gated terminal retention + CAS) in place of a distributed database. Design-only, for early feedback. design.md is the 1-pager; research.md holds prior art + the current seam + the codex review log; spec.md pins the proto, config schema, DB migration, file layout, and auth.
|
Claude finished @rjpower's task in 2m 37s —— View job Code review — iris multi-backend designThis is a design-only PR (three Markdown docs, no code), so the usual bug/CLAUDE.md pass doesn't apply in the normal sense. I read all three docs end-to-end, spot-checked the load-bearing code references against PR description: clean — leads with what the change is, links the three artifacts, no Testing/template/checkbox/self-credit boilerplate. (The Code references verified accurate:
Things to nail down before building1.
The CAS on 2. 3. Agent-side gang admission atomicity. These are clarifications, not objections — the core "single authoritative DB + recoverable-cache agent, safety from four bounded mechanisms" structure is sound and the |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3ca478d3df
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| capabilities: ClassVar[frozenset[BackendCapability]] # {CLUSTER_VIEW} — root never places workers | ||
| autoscaler: Autoscaler | None # None; capacity rides over Poll, not Iris autoscaler | ||
|
|
||
| def schedule(self, snapshot: ScheduleInput) -> ScheduleResult: ... |
There was a problem hiding this comment.
Add a backend-placement result before reusing schedule()
When RemoteTaskBackend.schedule() is defined as the root meta-scheduler, the existing ScheduleResult cannot persist the task→backend decision it needs to make: today _commit_schedule_decisions only applies result.assignments via ops.task.assign, whose Assignment is a task→worker pair and is dropped unless the worker_id resolves to a healthy worker. In the multi-backend path this leaves no commit path for jobs.backend_id/tasks.backend_id or for selecting which CLUSTER_VIEW dispatch drain owns the task, so remote-backend tasks would either remain pending or be treated as invalid worker assignments. The contract needs an explicit backend-placement result and commit path rather than the existing task→worker ScheduleResult alone.
Useful? React with 👍 / 👎.
S1 (recoverability): invariant holds; ports need substrate stamping (#6721), service endpoints stay root-authoritative as a lease (#6722). S2 (two-level placement): no meaningful fidelity loss; pinned the CapacitySummary (pruned static + pending_leases, added free-bin maxima, split gang), plus set-membership matcher + --backend-stripping fixes. S3 (fence timing): a dedicated short lease gives ~20-30s re-placement vs ~10min today; k8s needs a lease-sidecar self-fence (activeDeadlineSeconds insufficient). S4 (transport): dial-home Connect works; interactive over Poll + fast-follow is ~0.5x cadence, held stream opt-in.
…tart (#6723) On worker restart an adopted container kept running but its host-port reservations were lost: TaskAttempt.adopt() rebuilt ports={} and never re-marked the PortAllocator, which started empty. The worker could then re-hand those in-use ports (e.g. 30000/30001) to the next scheduled task, causing a bind clash. The port set had no substrate footprint, so it could not be recovered. This change gives it one and restores it on adopt: - Stamp the allocated host ports as an iris.ports Docker label at container create (ContainerConfig.ports). - Recover the ports into DiscoveredContainer.ports during discover_containers(). - Add PortAllocator.reserve() to re-mark a known port set as taken. - Have adopt() restore attempt.ports and reserve() them before any new work is scheduled. Independent of multi-backend, and a prerequisite for the per-cluster agent recoverable-cache port recovery in #6718. Fixes #6721
k8s needs no pod self-fence: the apiserver is a durable, independently-reachable authority and the agent is recoverable, so any live agent reconciles an undesired pod away (today's poll-and-delete), idempotent across agents via attempt_uid pod naming. A brief reroute overlap is benign (fresh attempt -> fresh output path); a lease sidecar survives only as an opt-in hard-fence for external-side-effect jobs. Validate locally with kind, not gated infra. Self-fence is now scoped to worker-daemon (the worker is the sole authority over its own process during a partition). Operator cordon/drain: a sparse root-owned worker_policy overlay pushed down in the Poll desired-state; the agent skips draining workers; k8s reuses native node cordon and a worker may persist the flag locally. Resolved both prior open questions.
Regroup `cluster/backends/` by concern into sibling trees, with no behavior change: - `backends/` — the pure `TaskBackend` implementations (`rpc/`, `k8s/tasks.py`) plus the shared protocol seam (`protocols.py`, `types.py`). - `platforms/` (new) — runtime substrate drivers: `WorkerInfraProvider` impls (gcp/manual workers), cloud-API wrappers (`GcpService`, `CloudK8sService`), worker bootstrap, remote-exec health. admin bring-up/ops: `ControllerProvider` impls, `vm_lifecycle.py`, controller bootstrap, the provider `factory`. Part of the multi-backend design (#6718).
…-renew) (#6728) Make service endpoints a lease instead of task-lifetime-bound state. **Problem.** Endpoints were registered once and lived as long as their task row: liveness was a function of FK CASCADE + attempt freshness, not of whether the registrant is still serving. A task that crashes hard leaves a served-but-dead endpoint until the row is pruned. Spike S1 (#6718) flagged endpoints as the one piece of worker-adjacent state that isn't a pure function of the recovery sources, so it must stay root-authoritative; a lease makes that robust. **Approach.** - Endpoints carry a lease (`endpoints.lease_deadline_ms`). Registration grants a deadline; re-registering with the same `endpoint_id` renews it. A row past its deadline is hidden from reads immediately and swept by the pruner, independent of the FK CASCADE. - A new `EndpointService` owns the leased registry. `EndpointClient` combines the RPC stub with background renewal: `register` registers and renews (at 1/3 of the lease, 10s→1m backoff on a failed renewal); `unregister`/`close` stop renewing and delete, waiting out any in-flight renewal so an explicit delete can't be undone mid-RPC. A crashed task stops renewing and its endpoint expires on its own. - The registrant chooses its lease: `RegisterEndpointRequest.lease_duration`, clamped server-side to `[3m, 72h]`. `EndpointClient` requests 10m; an unset field selects the 72h default. - The legacy `ControllerService.{Register,Unregister,List}Endpoint` RPCs delegate in-process to the same backend (no localhost round-trip or re-auth), so existing clients keep working unchanged — and since they leave `lease_duration` unset, they keep the 72h lease. **Rollout.** Because the lease is client-chosen, a renewing client auto-enrolls in the short (10m) lease as it upgrades, while a client that doesn't renew keeps 72h — no server-side flip that would shorten the lease out from under it. Migration 0031 leaves existing rows NULL (never-expiring) rather than backfilling, so loading the new schema can't retroactively expire a live endpoint. This supersedes #6729 (a server-side drop to ~10m); the task/attempt culling can be removed once the unset-lease (old-client) registration rate hits zero. Part of #6722 --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
…riant, thread-affinity, phasing)
…-scoped store, minimal P1, commit-validation fork)
…6788) A worker-daemon backend held a raw `ControllerDB` (via `BackendRuntime` → `DbWorkerSource`) and exposed `db`/`endpoints`/`worker_attrs` as public properties, so the controller↔backend boundary held only by convention. Replace that with a typed, operation-scoped `BackendWorkerStore`: the backend reads its owned workers, builds its scheduling/reconcile snapshots, resolves a worker address, and reaps its dead workers through a fixed set of named operations instead of a raw DB handle. In-process the store runs those operations as scale-group-scoped reads/writes over the shared controller DB (so the scheduler's task↔worker joins survive); a remote backend will later swap the same operations for RPCs. This is the structural seam the multi-backend "controller of controllers" needs — an in-process backend and a remote Iris become the same thing modulo transport. Faithful wrap, byte-identical (N=1 controller suite green, no expected-value changes): the store's read methods are the verbatim `DbWorkerSource` bodies, `reap_workers` is the verbatim `teardown_dead_workers` algorithm, and the store still closes over the backend's own `WorkerHealthTracker` (assembled in `bind_runtime`). `WorkerSource` is deleted; `BackendRuntime`'s fields are unchanged. Phase 1 of the encapsulated-sub-controller backend contract (design and spec in #6718). `teardown`, registration, worker views, autoscaler persistence, and pruning stay on the controller for later phases. Part of #6718
Spec the end state explicitly: the controller becomes a thin proxy/router and each backend a self-contained sub-controller that owns its workers, attrs, endpoints, liveness, autoscaler, slices, and its tasks' execution state, and commits its own effects. Add the ownership-partition table and the storage model: backends share one ControllerDB instance (locking opaque) until remote; table ownership moves to the backend well before any DB file split, which is deferred entirely to the remote phase. Reorder the phasing accordingly — pull the per-backend pruner split to the front (it unblocks moving the worker_attrs/endpoints projections into the backend), then projection ownership, then the commit-ownership pivot, then published status, autoscaler single-writer, and remote. Soften the per-phase gate from byte-identical to semantic-behavior-preserving. Add service_and_construction_options.md: the codex-reviewed analysis of service topology vs construction ownership that motivates this reframing.
…6795) The controller's background prune loop reached into every backend's health tracker and worker-attributes projection to delete stale DEAD workers. Make each backend garbage-collect its own dead workers instead. `prune_dead_workers` is added to the `BackendWorkerStore` protocol — implemented on `DbBackendWorkerStore`, which already holds the `db`, `health` tracker, and `worker_attrs` it needs — and to the `TaskBackend` protocol, delegated by `RpcTaskBackend` and a no-op on the Kubernetes backend (it tracks no Iris workers). `prune_old_data` now takes the backends collection and sums each backend's own GC, so the controller keeps only the cross-backend prune concerns: terminal jobs, orphan slices, and expired endpoints. This continues the `BackendWorkerStore` ownership transfer (P3): the controller moves toward a thin router while each backend owns its workers, attributes, and liveness. The worker prune still runs on the controller's background prune thread — it touches only worker rows, attributes, and tracker entries, never the autoscaler — and preserves the cutoff semantics, the one-delete-per-transaction-plus-`pause` cadence, the `PruneResult.workers_deleted` count, and the `worker_pruned` audit event. The `prune_old_data` replay golden is unchanged. Design note: `prune_old_data` takes the backends collection (`self._backends.values()`) rather than a list of stores, since the controller holds backends and each backend already encapsulates its store. Part of #6718.
Records the decision: a remote Iris cluster is a federation peer that owns its own DAG and backends (whole jobs are handed off to it), not a TaskBackend. Backends are local execution substrates that share the controller's one DAG. Consequence: the backend contract does not have to be remote-safe -- no per-backend DB-file split, no RemoteBackendWorkerStore. Remaining backend work is local hygiene (Track 1: the WorkerJobService split, a published worker projection, autoscaler single-writer). Remote becomes a separate federation project (Track 2). The controller-side DAG-fold commit pivot is stood down. Job trees are locked to the parent's peer, which is largely self-enforcing: a running task's client targets the controller that launched it, so children spawned on a peer submit back to that peer. Adds delegation_model.md (the decision doc, twice codex-reviewed, with the rejected Model A/B/C alternatives); compacts spec.md to today->end-state and realigns design.md.
…options doc Model D is the decided architecture, so the iris_multi_backend proposal -- the "single root DB + per-cluster poll agent" model, where a remote cluster is a backend reconciling into the root DB -- is superseded and removed. It is preserved as the multi-backend-proposal / multi-backend-control-plane weaver artifacts and in git history. Also drops iris_backend_contract/service_and_construction_options.md: its live conclusion (bootstrap builds the store, no two-phase bind_runtime) is already in spec.md, and its remote-service analysis is superseded by federation. The remaining docs (delegation_model, design, spec, research) are the current Model D plan.
…for federated tasks Federation (Model D, Track 2) hands whole jobs off to a peer cluster. Those rows live in the local jobs/tasks tables so listings render, but the local scheduler must never act on them. Add a child_cluster column (server_default "", parallel to backend_id) to jobs and tasks: "" means locally owned, "<peer>" means handed off to that peer (backend_id then ""). Enforce the exclusion at one source. Define a local_tasks Core selectable (tasks WHERE child_cluster = "") and repoint every control-plane tasks reader to it: routing/scheduling, capacity/preemption, budget/admission, the direct- provider dispatch drain, the reconcile snapshot loaders, the timeout scan, and the worker-bound reconcile rows. Federated rows are structurally invisible to the fold, so a newly written control-plane query is safe by default and must go out of its way (raw tasks) to see one. Two partial indexes (... WHERE child_cluster = "") keep the hot pending/state scans free at read time. The pruner excludes federated jobs — a peer tombstone is their only deletion path. User-facing reads (job/task detail, list) keep reading raw tasks so federated rows still render. Add the federated_jobs / federation_sync_state / federated_tasks join sidecars (job/task state stays in the main rows) and migration 0034 for the upgrade path. Add JobStatus.child_cluster and TaskStatus.child_cluster. No peer can be configured yet, so zero federated rows exist in production and this is behavior-preserving: the replay goldens change only by additive empty sidecars and child_cluster: "" fields. New tests inject synthetic federated rows and assert they are never routed, dispatched, finalized, timed out, counted as local budget/admission spend, or pruned. Part of #6718.
…for federated tasks (#6821) Federation hands whole jobs off to a peer cluster. Those rows live in the local jobs/tasks tables so listings render, but the local scheduler must never act on them. This adds a `child_cluster` column (server_default `""`, parallel to `backend_id`) to `jobs` and `tasks`: `""` means locally owned, `"<peer>"` means handed off to that peer (with `backend_id` then `""`). The exclusion is enforced at one source rather than by convention across ~30 call sites. A `local_tasks` SQLAlchemy Core selectable (`tasks WHERE child_cluster = ''`) is defined once, and every control-plane tasks reader is repointed to it: - routing/scheduling (pending-task projection), capacity/preemption, budget/admission - the direct-provider dispatch drain — the canonical silent break: a federated PENDING/ASSIGNED row would otherwise be promoted and run locally - the reconcile snapshot loaders, the execution-timeout scan, and the worker-bound reconcile rows Federated rows are structurally invisible to the fold, so a newly written control-plane query is safe by default and must go out of its way (raw `tasks`) to see one. Two partial indexes (`… WHERE child_cluster = ''`) keep the hot pending/state scans free at read time — SQLite flattens the subquery and drives off them. User-facing reads (job/task detail, list) keep reading raw `tasks` so federated rows still render. The pruner excludes federated jobs, since a peer tombstone is their only deletion path. Also adds the `federated_jobs` / `federation_sync_state` / `federated_tasks` join sidecars (job/task state stays authoritative in the main rows), migration `0034` for the upgrade path, and `JobStatus.child_cluster` / `TaskStatus.child_cluster`. No peer can be configured yet, so zero federated rows exist and this is behavior-preserving — the replay goldens change only by additive empty sidecars and `child_cluster: ""` fields. New tests inject synthetic federated rows and assert they are never routed, dispatched, finalized, timed out, counted as local budget/admission spend, or pruned, while the parallel local rows still flow through the same readers. Part of #6718.
…#6814) Concrete design for the multi-cluster federation project. PR #6718 decided the architectural cut — **Model D: a remote Iris cluster is not a backend, it is a federation peer** the parent hands whole root jobs to and then passively caches — but deferred the federation project itself as greenfield "Track 2." This is Track 2, grounded against today's code. The organizing idea is ownership. A **backend** is what a controller *drives*: a `TaskBackend` sharing the one DAG, tasks are local rows with `backend_id`, effects folded in memory. A **peer** is what a controller *delegates to*: a full remote Iris with its own DAG/DB/backends/finelog, reached by RPC, its status cached. The backend seam never learns federation exists and federation never touches the DAG fold; the two meet only at the router, which gains a second target kind. The doc traces a user's program end to end through the multi-cluster world and pins every insertion point: - **Tracking (`jobs.child_cluster`).** A new column, sibling of `backend_id`, as the local/federated discriminator — on `jobs` only, because a federated root has no local task rows (Model D forbids mirroring remote tasks). A `federated_jobs` sidecar holds the mutable handle/cache state (`remote_job_id`, idempotency key, cached status, spend, sync cursor), keeping the hot `jobs` table lean. - **Sync.** Submission is a synchronous idempotent handoff backed by a durable handle with retry — not a queue. Status is a parent-driven incremental pull, so the peer stays a vanilla Iris unaware it is federated; terminal handles stop syncing and become the permanent passive cache. - **Logs.** The child controller is the sole reachable ingress to its siloed workers, so workers ship to the child's finelog intra-cluster unchanged, and the parent proxies log/exec queries through the peer controller at read time (generalizing the existing `EndpointProxy` the dashboard already uses) — no cross-region log shipping. - **Visualization.** A `cluster` scope dimension cloned from the shipped `backend_id` dashboard UI (column, detail row, scope selector), plus a `ListPeers`/Clusters overview distinct from `ListBackends` because ownership differs. - **Rollout.** Additive and invisible to single-cluster deployments (`child_cluster` always empty, no peers configured), staged so each PR is inert until the next. Design only — no implementation. Full doc: `.agents/projects/iris_federation/design.md`. Part of #6718.
Add the `iris.cluster.federation` package: a controller's federation peers — remote Iris clusters it may delegate whole jobs to — become configurable and observable, without anything yet executing on them. This is PR2 of the Iris multi-cluster federation rollout, building on the fold-exclusion schema seam from PR1 (#6821). - **`peers:` config** on `IrisClusterConfig` declares identity + reachability + trust, never capabilities (those are dynamic). Trust reuses the rigging `credentials_for` path, so a peer connection presents the same client credentials every cross-cluster client does — no second credential system. - **`FederationPeer`** wraps one authenticated `RemoteClusterClient` per peer. **`FederationManager`** owns the registry and a background capability heartbeat that forwards each peer's backends — the underlying cluster's static topology (kind, advertised devices) and dynamic state (availability, worker/task counts). The heartbeat reuses the peer's existing `ListBackends` read; there is no bespoke capability RPC or invented marker vocabulary. - **`ListPeers` / `PeerSummary`** RPC, kept distinct from `ListBackends` (backends are what a controller runs; peers are what it delegates to). `PeerSummary.backends` carries the forwarded `BackendSummary` list. - A submit-time **`PeerRouter`**, separate from the static meta-scheduler index, that selects local execution for every job — so nothing is handed off. The master derisking lever holds: federation is inert until a `peers:` entry exists. With no peers, no connections are built, the heartbeat never starts, and every path is a no-op, so a single-cluster deployment is byte-identical. Even with a peer configured, submissions still route and dispatch locally. The backend seam never learns federation exists, and federation adds nothing to the DAG fold. Part of #6718
…6835) Turn the federation router's peer arm live and add the first federated job that runs on a remote Iris cluster and reports back. Routing decides at submit time: prefer-local when any local backend is feasible, else match the job's constraints against a peer's live capability heartbeat and hand off to the first that can host it, and an explicit `cluster=<peer>` directive forces that peer (mutually exclusive with a local `backend` pin; the peer must be configured). A job no local backend can host is no longer fatal if a peer can take it. A handed-off job persists as a SENT `federated_jobs` handle (no local tasks) under a deterministic `remote_job_id` that folds this cluster's id into the job-name component, is delivered synchronously to the peer's `LaunchJob` with an explicit `FederationHandoff` field recording the requester and owner principal, and flips to `HANDED_OFF` on ack. A failed delivery is not fatal — the handle persists in `PENDING_HANDOFF` and the sync loop re-drives it under the same id (the peer's KEEP policy dedups), so handoff is exactly-once across a parent crash. Only a root job is handed off, and the `FederationHandoff` field is honored only from a trusted (admin) peer — a non-admin that sets it is denied, so it cannot forge a handoff to run a job as another user. The peer records the received job as a RECEIVED `federated_jobs` row naming the requester (the same table holds both directions, discriminated by `direction`), and writes an append-only `federation_changelog` in the same transaction as each job/task mutation. Each changelog row is stamped with the requester it belongs to, so `FederationSync` reports a requester only its own jobs without a join — and a tombstone survives the job delete. A second background loop pulls each peer once per tick via a new `FederationSync` RPC: the peer returns the jobs changed since the requester's cursor — or its full active set when the cursor is stale — and the parent mirrors job/task state onto its handle, applies tombstones, and advances the cursor in one transaction. A stale cursor (first contact, or below the retained changelog window) drives a full-resync set-replacement, so a job the parent never saw tombstoned is still reclaimed. Cancel bumps a versioned cancel intent and routes an idempotent `TerminateJob(remote_job_id)` to the peer; a peer `NOT_FOUND` (already terminal-and-pruned) satisfies the cancel, a transient failure is re-driven by the sync loop, and a cancel-before-handoff is terminated locally and never delivered. Federation stays inert until a `peers:` entry exists: with no peers, neither loop starts, routing is local, and the changelog is never written, so a single-cluster deployment is byte-identical apart from one new empty table and a `direction` discriminator column on `federated_jobs` (migration 0035). The backend seam is untouched — a peer is a remote cluster the job is handed to, not a backend. Two pieces are deferred to #6843: cross-cluster spend caps for federated roots (a zero-local-task federated root is not yet charged against the user's budget), and pruning the append-only `federation_changelog` (the sync staleness path already handles a pruned changelog, so a pruner needs no protocol change). Part of #6718 — PR3 of the multi-cluster federation rollout (follows #6821 PR1 and #6826 PR2).
…ler (#6846) An on-demand `ProfileTask`/`ExecInContainer` against a federated job cannot be resolved locally: the target task lives in the peer's DAG, so the parent holds no worker or attempt rows for it. Left alone, `_backend_for_id("")` falls back to the representative local backend and the request would run against the wrong cluster. Both handlers now branch before local `task -> worker` resolution: if the target task's root job resolves to a SENT federated handle, the request is forwarded to the owning peer controller, which does its own resolution and returns the result verbatim. The task id is rewritten onto the peer's remote root job — the tilde-joined `JobName` from `encode_remote_job_id` — preserving the task index and any attempt qualifier (new `JobName.with_root_job`). Forwarding goes through a public `FederationManager.proxy_profile`/`proxy_exec` helper, never the private peer dict, mirroring `cancel_federated`. The backend seam never learns federation exists; there is no `RemoteTaskBackend`. The peer stays authoritative: a proxied call to a task it has since moved or finished surfaces the peer's live answer or `NOT_FOUND`, and a tombstoned handle (its rows already gone) resolves `NOT_FOUND` locally with no peer round-trip. The parent remains the trust boundary — the client never reaches the peer. With no peers configured the branch never fires, so a single-cluster deployment is a byte-identical no-op. The tilde convention this proxy rewrites onto is now a first-class, reversible API rather than an ad-hoc string. `JobName.federated_remote_root(cluster_id, root)` builds the `/<user>/<cluster>~<name>` remote root and `split_federated_root` recovers `(cluster_id, root)` losslessly, with the delimiter a single `FEDERATION_DELIMITER` constant that `encode_remote_job_id` delegates to. To keep `cluster~name` unambiguous, `~` is reserved: `launch_job` rejects it in a job's own name unless the submit is a received handoff (whose root name IS the encoded id). Only the leaf name is checked, so a child spawned on a peer under a `~`-bearing federated root still submits. Two placement choices worth flagging for review: - The branch sits *after* the local task read, so the parent's mirror gates existence: a not-yet-synced task returns `NOT_FOUND` rather than proxying blindly. This is symmetric with the tombstone case and avoids a wasted peer round-trip for unknown/invalid task indices; the window where a task is running on the peer but unmirrored is negligible (an unmirrored task is still pending, hence not exec-able). - The parent→peer exec hop mirrors the worker backend's timeout contract: a negative ("no caller limit") timeout is capped at `EXEC_IN_CONTAINER_MAX_TIMEOUT` on the peer, so the proxy deadline clears that cap rather than collapsing to the margin. `get_process_status` is scoped out: it has no per-task target today, and a federated job's workers are peer-siloed, so a task-target process-status API is net-new rather than a proxy branch. Filed as #6845. Part of #6718 — PR4a of the federation rollout (exec/profile proxy; process-status follow-up filed). --------- Co-authored-by: rjpower <rjpower@users.noreply.github.com>
…ress (#6871) Adds the write side of the federation global-log plane (design §6): each cluster's controller durably forwards its local finelog to one shared **global finelog**, and finelog gains a **mandatory authenticated ingress** front. ## Log forwarding A new `finelog.forwarder.LogForwarder` (in finelog, generic) tails a source finelog and re-appends each new batch to a target under the same keys. Durability leans on the source finelog already being durable (parquet + retention): the forwarder persists only a forward **watermark** in a small JSON state file — not a second copy of the data — and advances it only after an ack'd `PushLogs` (which returns once the global store has durably persisted the batch). A crash or transient egress failure re-forwards the in-flight batch (at-least-once; duplicate lines are bounded to the failure boundary — federation degrades observability, never job correctness). First run seeds the watermark at the source's current cursor, so enabling it ships new logs without backfilling the whole retention window; a busy cluster's backlog drains within a tick rather than one batch per poll interval. The iris controller wires this up thinly (`controller/finelog_relay.py`): it supplies its local client, a per-cluster delegation credential, and a state path on its persistent dir; the forwarding/watermark logic lives entirely in finelog. A cluster's finelog is one flat config block — `finelog.config` names the local store; `finelog.relay_address` (optional) points at the shared global one, with `delegation_key`/`static_token` for the credential — and setting `relay_address` is what turns the forwarder on. Inert until then: no forwarder thread, no egress, byte-identical single-cluster behavior. **Cross-cluster log namespacing is deferred.** Bare per-cluster keys (`/user/<job>/...`) collide when many clusters land in one store; rather than have the relay rewrite keys mid-stream, namespacing will be done on the write side (tasks/controller write the namespaced key directly) or via a finelog `cluster` column, in a follow-up. This PR ships the durable relay + auth mechanism; a single cluster relaying to a global store works today. ## Authenticated ingress (new security surface) finelog's server (`lib/finelog/rust/src/server/auth.rs`) now gates every RPC with an ordered stack of typed auth **layers** — `cidr` (admit by transport-peer network) and `jwt` (HS256 bearer verified against per-cluster delegation keys) — that is **default-deny** and **always installed**. An unconfigured server falls back to allow-localhost (loopback only), never open. HS256 verification uses vetted RustCrypto primitives (constant-time HMAC, no `ring`), rejects `alg:none`/algorithm-confusion, and tolerates a small `exp` clock-skew leeway. The `/debug/*` admin routes are gated by the same policy. It is configured by a single `FINELOG_AUTH_POLICY` JSON layer list, which the finelog deploy config, GCP bootstrap script, and k8s manifest all plumb (k8s refuses to inline jwt secrets — those must come through a Secret). Each relaying controller authenticates its pushes with a short-lived HS256 delegation JWT, minted via the same `JwtTokenManager` the control plane uses but signed with a **dedicated** per-cluster key the global store verifies — so the shared store can verify a cluster's tokens without gaining the power to mint control-plane tokens. Because auth is always installed, the three committed local finelog configs (`marin`, `marin-dev`, `cw-us-east-02a`) gain an explicit `cidr` layer (RFC1918 + loopback) so intra-cluster/pod ingest keeps working — reachable from the private network, never the public internet. ## Cost Relaying every cluster's logs to one region is deliberate cross-region **egress** — the trade for a uniform, always-available, cross-cluster-queryable log surface. finelog already batches/compresses and tiers cold segments to object storage, which bounds steady-state cost. The read side (surfacing a federated job's logs through the parent: `job_id`→peer-namespace remap + read-through auth) is deferred to a follow-up (#6862). Part of #6718 — PR4b of the federation rollout (global finelog + relay + authed ingress).
…olumn (#6881) A global finelog collects pushes relayed from many federated clusters, whose per-cluster log keys (`/user/<job>/<task>:<attempt>`, `/system/...`) collide once more than one cluster forwards into the same store. This namespaces them by origin with a `cluster` column on the log row. Every finelog writer is authenticated, so the origin cluster is a trusted field on the push: `PushLogsRequest` gains a `cluster` field, and `push_logs` stamps the column from it. A relay sets it to the cluster it forwards for; a local single-cluster push leaves it empty. No writer needs to change its keys, and the forwarder keeps forwarding bare keys. The `cluster` column is added to the `log` schema as nullable and additive: - an already-registered `log` namespace is evolved on boot — the schema is registered/merged into the catalog *before* rehydrate, so the engine opens once from the evolved catalog schema with no live-engine rebuild (which would need a runtime `block_on` illegal on the async `main` task). The evolution preserves the namespace's persisted storage policy; - segments written before the column existed null-fill it on read. FetchLogs gains an optional `cluster` filter: non-empty restricts to one origin (the parent's federated read filters `cluster = <peer>`), empty is unfiltered so a local single-cluster read behaves exactly as before. This is the Rust server side. The Python `logging.proto` copy and its generated stubs are intentionally left unchanged — the Python wire contract lands with its consumer, the federated read client (#6862). Part of #6718. Foundation for the federated read-remap (#6862). Fixes #6876.
Design for multi-backend Iris — one controller at
iris.oa.devfronting GCP TPU pools, Kubernetes GPU clusters, and remote CoreWeave clusters — under the Model D architecture:RemoteBackendWorkerStore.This reframes an earlier "single root DB + per-cluster poll agent" proposal (where remote was a backend reconciling into the root DB); that exploration is preserved as the
multi-backend-proposal/multi-backend-control-planeweaver artifacts. Design-only — no implementation here.TaskBackendcontrol fold, the WorkerJobService split,BackendWorkerStore, the two work tracksWork split: Track 1 (local backend hygiene — WorkerJobService split, published worker projection, autoscaler single-writer) proceeds now; Track 2 (federation) is a separate greenfield project for when a real remote cluster is on the table.
Discussion welcome — see Open Questions in
spec.mdand the decision doc.