Skip to content

[Design] Iris multi-backend: Model D (federation peers + local backend contract)#6718

Merged
rjpower merged 10 commits into
mainfrom
weaver/iris-implement-multi-backend-sup
Jul 1, 2026
Merged

[Design] Iris multi-backend: Model D (federation peers + local backend contract)#6718
rjpower merged 10 commits into
mainfrom
weaver/iris-implement-multi-backend-sup

Conversation

@rjpower

@rjpower rjpower commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Design for multi-backend Iris — one controller at iris.oa.dev fronting GCP TPU pools, Kubernetes GPU clusters, and remote CoreWeave clusters — under the Model D architecture:

  • Backends are local execution substrates (GCP, k8s). Tasks on a backend live in the controller's one job DAG; the backend authors effects, the controller folds them. Cheap and correct for the local case.
  • Remote Iris clusters are federation peers, not backends. A remote cluster is a full Iris that owns its own DAG + backends; whole root jobs are handed off to it and it reports status + spend back. Consequence: the backend contract does not have to be remote-safe — no per-backend DB-file split, no RemoteBackendWorkerStore.
  • Job trees are locked to the parent's peer, which is largely self-enforcing: a running task's client targets the controller that launched it, so children spawned on a peer submit back to that peer.

This reframes an earlier "single root DB + per-cluster poll agent" proposal (where remote was a backend reconciling into the root DB); that exploration is preserved as the multi-backend-proposal / multi-backend-control-plane weaver artifacts. Design-only — no implementation here.

  • delegation_model.mdthe decision: Model A (controller owns all DAGs) vs B/C (every backend a mini-Iris) vs D (federation ≠ backend), 2× codex-reviewed
  • design.md — the "why" (local backend contract under Model D)
  • spec.md — the contract: TaskBackend control fold, the WorkerJobService split, BackendWorkerStore, the two work tracks
  • research.md — the code-seam trace behind the design

Work split: Track 1 (local backend hygiene — WorkerJobService split, published worker projection, autoscaler single-writer) proceeds now; Track 2 (federation) is a separate greenfield project for when a real remote cluster is on the table.

Discussion welcome — see Open Questions in spec.md and the decision doc.

Proposes multi-backend mode for Iris: one controller (iris.oa.dev) fronting many clusters (GCP TPU pools, multiple CoreWeave/k8s) behind one job namespace and dashboard. A single authoritative root DB owns all state; a per-cluster recoverable-cache "agent" reconciles desired vs. actual over one Poll() wire, with four safety mechanisms (root epoch, monotonic launch-lease, execution self-fence, ack-gated terminal retention + CAS) in place of a distributed database.

Design-only, for early feedback. design.md is the 1-pager; research.md holds prior art + the current seam + the codex review log; spec.md pins the proto, config schema, DB migration, file layout, and auth.
@rjpower rjpower added agent-generated Created by automation/agent design Design doc / 1-pager for review labels Jun 26, 2026
@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Claude finished @rjpower's task in 2m 37s —— View job


Code review — iris multi-backend design

This is a design-only PR (three Markdown docs, no code), so the usual bug/CLAUDE.md pass doesn't apply in the normal sense. I read all three docs end-to-end, spot-checked the load-bearing code references against origin/main, and reviewed the safety model for internal consistency. No blocking issues — the design is coherent and unusually well-specified. A few things worth nailing down before implementation, plus one verification note.

PR description: clean — leads with what the change is, links the three artifacts, no Testing/template/checkbox/self-credit boilerplate. (The ## Testing section inside design.md is appropriate doc content, not a PR-body violation.)

Code references verified accurate:

  • The DB-less TaskBackend claim is confirmed verbatim — the Protocol docstring says implementations "never touch the controller database" (backend.py:340).
  • The TaskAttempt.adopt() port claim holds: adopt threads a port_allocator but rebuilds from DiscoveredContainer without re-reserving the discovered ports (task_attempt.py:303-353) — so the "stamp ports on the substrate + re-reserve on adopt" requirement is real and well-grounded.

Things to nail down before building

1. acted_root_epoch vs. terminal observations across a root restart. spec.md §1 says the root applies an observation "only if (attempt_uid, desired_generation) is still the task's current attempt under the current root_epoch," and AttemptObservation.acted_root_epoch is "the epoch the agent acted under (root rejects non-current)." Consider: an attempt launches and terminates under epoch N; the root restarts → epoch N+1 before the terminal is acked. On the next Poll the agent re-reports that retained terminal — but with which acted_root_epoch?

  • If it reports N (the epoch it genuinely acted/launched under, which is what the field name says), the root rejects it as non-current — permanently — and the real terminal status is stranded; the attempt looks alive forever.
  • If it re-stamps to N+1 (the epoch it now sees), then the field isn't "the epoch the agent acted under" at all — it's just root_epoch_seen, and the rename is misleading.

The CAS on attempt_uid already fences stale desired-state actions across an epoch bump, so it's worth spelling out exactly what the epoch check on observations adds beyond that, and how a pre-restart terminal reconciles after a leadership change. This is the one place the four-mechanism story felt under-determined.

2. desired_generation is redundant in the observation CAS key. §0 establishes attempt_uid is fresh per attempt and already UNIQUE, so it alone identifies both the attempt and its generation — (attempt_uid, desired_generation) carries no more information than attempt_uid for the observation CAS. desired_generation clearly earns its keep for ordering desired-state edits (a removal can't be undone by a late upsert), but for the observation-apply path the pair reads as belt-and-suspenders. Either state explicitly that it's a defensive cross-check, or drop it from that key to avoid implying the two fields can disagree for a single attempt_uid.

3. Agent-side gang admission atomicity. research.md lists "atomic gang" admission as a borrowed Armada property, and §9 rules out cross-backend gangs — but §3.1 (meta-scheduler) and §2 (agent) don't restate that the agent's local schedule(local) must admit a gang all-or-nothing. Since the root routes a gang job to a backend off a stale largest_gang summary and two gang jobs can race onto the same backend, the agent is the layer that must avoid wedging on a partially-placed gang (PENDING-until-capacity is mentioned, but the atomicity guarantee that backs it isn't). Worth one sentence in the spec so the implementer doesn't reach for incremental per-task placement.

These are clarifications, not objections — the core "single authoritative DB + recoverable-cache agent, safety from four bounded mechanisms" structure is sound and the codex-log provenance shows the sharp edges (launch-lease gating the launch, two-phase reroute, response-applied conservative lease renewal) were already found and closed.
· branch weaver/iris-implement-multi-backend-sup

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3ca478d3df

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

capabilities: ClassVar[frozenset[BackendCapability]] # {CLUSTER_VIEW} — root never places workers
autoscaler: Autoscaler | None # None; capacity rides over Poll, not Iris autoscaler

def schedule(self, snapshot: ScheduleInput) -> ScheduleResult: ...

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Add a backend-placement result before reusing schedule()

When RemoteTaskBackend.schedule() is defined as the root meta-scheduler, the existing ScheduleResult cannot persist the task→backend decision it needs to make: today _commit_schedule_decisions only applies result.assignments via ops.task.assign, whose Assignment is a task→worker pair and is dropped unless the worker_id resolves to a healthy worker. In the multi-backend path this leaves no commit path for jobs.backend_id/tasks.backend_id or for selecting which CLUSTER_VIEW dispatch drain owns the task, so remote-backend tasks would either remain pending or be treated as invalid worker assignments. The contract needs an explicit backend-placement result and commit path rather than the existing task→worker ScheduleResult alone.

Useful? React with 👍 / 👎.

S1 (recoverability): invariant holds; ports need substrate stamping (#6721), service endpoints stay root-authoritative as a lease (#6722). S2 (two-level placement): no meaningful fidelity loss; pinned the CapacitySummary (pruned static + pending_leases, added free-bin maxima, split gang), plus set-membership matcher + --backend-stripping fixes. S3 (fence timing): a dedicated short lease gives ~20-30s re-placement vs ~10min today; k8s needs a lease-sidecar self-fence (activeDeadlineSeconds insufficient). S4 (transport): dial-home Connect works; interactive over Poll + fast-follow is ~0.5x cadence, held stream opt-in.
rjpower pushed a commit that referenced this pull request Jun 26, 2026
…tart (#6723)

On worker restart an adopted container kept running but its host-port
reservations were lost: TaskAttempt.adopt() rebuilt ports={} and never
re-marked the PortAllocator, which started empty. The worker could then
re-hand those in-use ports (e.g. 30000/30001) to the next scheduled
task, causing a bind clash.

The port set had no substrate footprint, so it could not be recovered.
This change gives it one and restores it on adopt:

- Stamp the allocated host ports as an iris.ports Docker label at
container create (ContainerConfig.ports).
- Recover the ports into DiscoveredContainer.ports during
discover_containers().
- Add PortAllocator.reserve() to re-mark a known port set as taken.
- Have adopt() restore attempt.ports and reserve() them before any new
work is scheduled.

Independent of multi-backend, and a prerequisite for the per-cluster
agent recoverable-cache port recovery in #6718.

Fixes #6721
k8s needs no pod self-fence: the apiserver is a durable, independently-reachable authority and the agent is recoverable, so any live agent reconciles an undesired pod away (today's poll-and-delete), idempotent across agents via attempt_uid pod naming. A brief reroute overlap is benign (fresh attempt -> fresh output path); a lease sidecar survives only as an opt-in hard-fence for external-side-effect jobs. Validate locally with kind, not gated infra. Self-fence is now scoped to worker-daemon (the worker is the sole authority over its own process during a partition).

Operator cordon/drain: a sparse root-owned worker_policy overlay pushed down in the Poll desired-state; the agent skips draining workers; k8s reuses native node cordon and a worker may persist the flag locally. Resolved both prior open questions.
rjpower added a commit that referenced this pull request Jun 28, 2026
Regroup `cluster/backends/` by concern into sibling trees, with no
behavior change:

- `backends/` — the pure `TaskBackend` implementations (`rpc/`,
`k8s/tasks.py`) plus the shared protocol seam (`protocols.py`,
`types.py`).
- `platforms/` (new) — runtime substrate drivers: `WorkerInfraProvider`
impls (gcp/manual workers), cloud-API wrappers (`GcpService`,
`CloudK8sService`), worker bootstrap, remote-exec health. admin
bring-up/ops: `ControllerProvider` impls, `vm_lifecycle.py`, controller
bootstrap, the provider `factory`.
Part of the multi-backend design (#6718).
rjpower added a commit that referenced this pull request Jun 29, 2026
…-renew) (#6728)

Make service endpoints a lease instead of task-lifetime-bound state.

**Problem.** Endpoints were registered once and lived as long as their
task row:
liveness was a function of FK CASCADE + attempt freshness, not of
whether the
registrant is still serving. A task that crashes hard leaves a
served-but-dead
endpoint until the row is pruned. Spike S1 (#6718) flagged endpoints as
the one
piece of worker-adjacent state that isn't a pure function of the
recovery
sources, so it must stay root-authoritative; a lease makes that robust.

**Approach.**
- Endpoints carry a lease (`endpoints.lease_deadline_ms`). Registration
grants a
deadline; re-registering with the same `endpoint_id` renews it. A row
past its
deadline is hidden from reads immediately and swept by the pruner,
independent
  of the FK CASCADE.
- A new `EndpointService` owns the leased registry. `EndpointClient`
combines the
RPC stub with background renewal: `register` registers and renews (at
1/3 of
the lease, 10s→1m backoff on a failed renewal); `unregister`/`close`
stop
renewing and delete, waiting out any in-flight renewal so an explicit
delete
can't be undone mid-RPC. A crashed task stops renewing and its endpoint
expires
  on its own.
- The registrant chooses its lease:
`RegisterEndpointRequest.lease_duration`,
clamped server-side to `[3m, 72h]`. `EndpointClient` requests 10m; an
unset
  field selects the 72h default.
- The legacy `ControllerService.{Register,Unregister,List}Endpoint` RPCs
delegate
in-process to the same backend (no localhost round-trip or re-auth), so
existing
clients keep working unchanged — and since they leave `lease_duration`
unset,
  they keep the 72h lease.

**Rollout.** Because the lease is client-chosen, a renewing client
auto-enrolls
in the short (10m) lease as it upgrades, while a client that doesn't
renew keeps
72h — no server-side flip that would shorten the lease out from under
it.
Migration 0031 leaves existing rows NULL (never-expiring) rather than
backfilling, so loading the new schema can't retroactively expire a live
endpoint. This supersedes #6729 (a server-side drop to ~10m); the
task/attempt
culling can be removed once the unset-lease (old-client) registration
rate hits
zero.

Part of #6722

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
rjpower added a commit that referenced this pull request Jun 30, 2026
…6788)

A worker-daemon backend held a raw `ControllerDB` (via `BackendRuntime`
→
`DbWorkerSource`) and exposed `db`/`endpoints`/`worker_attrs` as public
properties, so the controller↔backend boundary held only by convention.
Replace
that with a typed, operation-scoped `BackendWorkerStore`: the backend
reads its
owned workers, builds its scheduling/reconcile snapshots, resolves a
worker
address, and reaps its dead workers through a fixed set of named
operations
instead of a raw DB handle.

In-process the store runs those operations as scale-group-scoped
reads/writes
over the shared controller DB (so the scheduler's task↔worker joins
survive); a
remote backend will later swap the same operations for RPCs. This is the
structural seam the multi-backend "controller of controllers" needs — an
in-process backend and a remote Iris become the same thing modulo
transport.

Faithful wrap, byte-identical (N=1 controller suite green, no
expected-value
changes): the store's read methods are the verbatim `DbWorkerSource`
bodies,
`reap_workers` is the verbatim `teardown_dead_workers` algorithm, and
the store
still closes over the backend's own `WorkerHealthTracker` (assembled in
`bind_runtime`). `WorkerSource` is deleted; `BackendRuntime`'s fields
are
unchanged.

Phase 1 of the encapsulated-sub-controller backend contract (design and
spec in
#6718). `teardown`, registration, worker views, autoscaler persistence,
and
pruning stay on the controller for later phases.

Part of #6718
Spec the end state explicitly: the controller becomes a thin proxy/router and
each backend a self-contained sub-controller that owns its workers, attrs,
endpoints, liveness, autoscaler, slices, and its tasks' execution state, and
commits its own effects. Add the ownership-partition table and the storage
model: backends share one ControllerDB instance (locking opaque) until remote;
table ownership moves to the backend well before any DB file split, which is
deferred entirely to the remote phase.

Reorder the phasing accordingly — pull the per-backend pruner split to the
front (it unblocks moving the worker_attrs/endpoints projections into the
backend), then projection ownership, then the commit-ownership pivot, then
published status, autoscaler single-writer, and remote. Soften the per-phase
gate from byte-identical to semantic-behavior-preserving.

Add service_and_construction_options.md: the codex-reviewed analysis of
service topology vs construction ownership that motivates this reframing.
rjpower added a commit that referenced this pull request Jun 30, 2026
…6795)

The controller's background prune loop reached into every backend's
health tracker and worker-attributes projection to delete stale DEAD
workers. Make each backend garbage-collect its own dead workers instead.

`prune_dead_workers` is added to the `BackendWorkerStore` protocol —
implemented on `DbBackendWorkerStore`, which already holds the `db`,
`health` tracker, and `worker_attrs` it needs — and to the `TaskBackend`
protocol, delegated by `RpcTaskBackend` and a no-op on the Kubernetes
backend (it tracks no Iris workers). `prune_old_data` now takes the
backends collection and sums each backend's own GC, so the controller
keeps only the cross-backend prune concerns: terminal jobs, orphan
slices, and expired endpoints.

This continues the `BackendWorkerStore` ownership transfer (P3): the
controller moves toward a thin router while each backend owns its
workers, attributes, and liveness. The worker prune still runs on the
controller's background prune thread — it touches only worker rows,
attributes, and tracker entries, never the autoscaler — and preserves
the cutoff semantics, the one-delete-per-transaction-plus-`pause`
cadence, the `PruneResult.workers_deleted` count, and the
`worker_pruned` audit event. The `prune_old_data` replay golden is
unchanged.

Design note: `prune_old_data` takes the backends collection
(`self._backends.values()`) rather than a list of stores, since the
controller holds backends and each backend already encapsulates its
store.

Part of #6718.
rjpower added 2 commits July 1, 2026 13:31
Records the decision: a remote Iris cluster is a federation peer that owns its
own DAG and backends (whole jobs are handed off to it), not a TaskBackend.
Backends are local execution substrates that share the controller's one DAG.

Consequence: the backend contract does not have to be remote-safe -- no
per-backend DB-file split, no RemoteBackendWorkerStore. Remaining backend work
is local hygiene (Track 1: the WorkerJobService split, a published worker
projection, autoscaler single-writer). Remote becomes a separate federation
project (Track 2). The controller-side DAG-fold commit pivot is stood down.

Job trees are locked to the parent's peer, which is largely self-enforcing: a
running task's client targets the controller that launched it, so children
spawned on a peer submit back to that peer.

Adds delegation_model.md (the decision doc, twice codex-reviewed, with the
rejected Model A/B/C alternatives); compacts spec.md to today->end-state and
realigns design.md.
…options doc

Model D is the decided architecture, so the iris_multi_backend proposal -- the
"single root DB + per-cluster poll agent" model, where a remote cluster is a
backend reconciling into the root DB -- is superseded and removed. It is
preserved as the multi-backend-proposal / multi-backend-control-plane weaver
artifacts and in git history.

Also drops iris_backend_contract/service_and_construction_options.md: its live
conclusion (bootstrap builds the store, no two-phase bind_runtime) is already in
spec.md, and its remote-service analysis is superseded by federation.

The remaining docs (delegation_model, design, spec, research) are the current
Model D plan.
@rjpower rjpower changed the title [Design] iris_multi_backend [Design] Iris multi-backend: Model D (federation peers + local backend contract) Jul 1, 2026
@rjpower rjpower merged commit fda97ee into main Jul 1, 2026
30 checks passed
@rjpower rjpower deleted the weaver/iris-implement-multi-backend-sup branch July 1, 2026 13:41
rjpower added a commit that referenced this pull request Jul 1, 2026
…for federated tasks

Federation (Model D, Track 2) hands whole jobs off to a peer cluster. Those
rows live in the local jobs/tasks tables so listings render, but the local
scheduler must never act on them. Add a child_cluster column (server_default
"", parallel to backend_id) to jobs and tasks: "" means locally owned,
"<peer>" means handed off to that peer (backend_id then "").

Enforce the exclusion at one source. Define a local_tasks Core selectable
(tasks WHERE child_cluster = "") and repoint every control-plane tasks reader
to it: routing/scheduling, capacity/preemption, budget/admission, the direct-
provider dispatch drain, the reconcile snapshot loaders, the timeout scan, and
the worker-bound reconcile rows. Federated rows are structurally invisible to
the fold, so a newly written control-plane query is safe by default and must go
out of its way (raw tasks) to see one. Two partial indexes (... WHERE
child_cluster = "") keep the hot pending/state scans free at read time. The
pruner excludes federated jobs — a peer tombstone is their only deletion path.
User-facing reads (job/task detail, list) keep reading raw tasks so federated
rows still render.

Add the federated_jobs / federation_sync_state / federated_tasks join sidecars
(job/task state stays in the main rows) and migration 0034 for the upgrade
path. Add JobStatus.child_cluster and TaskStatus.child_cluster.

No peer can be configured yet, so zero federated rows exist in production and
this is behavior-preserving: the replay goldens change only by additive empty
sidecars and child_cluster: "" fields. New tests inject synthetic federated
rows and assert they are never routed, dispatched, finalized, timed out,
counted as local budget/admission spend, or pruned.

Part of #6718.
rjpower added a commit that referenced this pull request Jul 1, 2026
…for federated tasks (#6821)

Federation hands whole jobs off to a peer cluster. Those rows live in
the local jobs/tasks tables so listings render, but the local scheduler
must never act on them. This adds a `child_cluster` column
(server_default `""`, parallel to `backend_id`) to `jobs` and `tasks`:
`""` means locally owned, `"<peer>"` means handed off to that peer (with
`backend_id` then `""`).

The exclusion is enforced at one source rather than by convention across
~30 call sites. A `local_tasks` SQLAlchemy Core selectable (`tasks WHERE
child_cluster = ''`) is defined once, and every control-plane tasks
reader is repointed to it:

- routing/scheduling (pending-task projection), capacity/preemption,
budget/admission
- the direct-provider dispatch drain — the canonical silent break: a
federated PENDING/ASSIGNED row would otherwise be promoted and run
locally
- the reconcile snapshot loaders, the execution-timeout scan, and the
worker-bound reconcile rows

Federated rows are structurally invisible to the fold, so a newly
written control-plane query is safe by default and must go out of its
way (raw `tasks`) to see one. Two partial indexes (`… WHERE
child_cluster = ''`) keep the hot pending/state scans free at read time
— SQLite flattens the subquery and drives off them. User-facing reads
(job/task detail, list) keep reading raw `tasks` so federated rows still
render. The pruner excludes federated jobs, since a peer tombstone is
their only deletion path.

Also adds the `federated_jobs` / `federation_sync_state` /
`federated_tasks` join sidecars (job/task state stays authoritative in
the main rows), migration `0034` for the upgrade path, and
`JobStatus.child_cluster` / `TaskStatus.child_cluster`.

No peer can be configured yet, so zero federated rows exist and this is
behavior-preserving — the replay goldens change only by additive empty
sidecars and `child_cluster: ""` fields. New tests inject synthetic
federated rows and assert they are never routed, dispatched, finalized,
timed out, counted as local budget/admission spend, or pruned, while the
parallel local rows still flow through the same readers.

Part of #6718.
rjpower added a commit that referenced this pull request Jul 1, 2026
…#6814)

Concrete design for the multi-cluster federation project. PR #6718
decided the architectural cut — **Model D: a remote Iris cluster is not
a backend, it is a federation peer** the parent hands whole root jobs to
and then passively caches — but deferred the federation project itself
as greenfield "Track 2." This is Track 2, grounded against today's code.

The organizing idea is ownership. A **backend** is what a controller
*drives*: a `TaskBackend` sharing the one DAG, tasks are local rows with
`backend_id`, effects folded in memory. A **peer** is what a controller
*delegates to*: a full remote Iris with its own DAG/DB/backends/finelog,
reached by RPC, its status cached. The backend seam never learns
federation exists and federation never touches the DAG fold; the two
meet only at the router, which gains a second target kind.

The doc traces a user's program end to end through the multi-cluster
world and pins every insertion point:

- **Tracking (`jobs.child_cluster`).** A new column, sibling of
`backend_id`, as the local/federated discriminator — on `jobs` only,
because a federated root has no local task rows (Model D forbids
mirroring remote tasks). A `federated_jobs` sidecar holds the mutable
handle/cache state (`remote_job_id`, idempotency key, cached status,
spend, sync cursor), keeping the hot `jobs` table lean.
- **Sync.** Submission is a synchronous idempotent handoff backed by a
durable handle with retry — not a queue. Status is a parent-driven
incremental pull, so the peer stays a vanilla Iris unaware it is
federated; terminal handles stop syncing and become the permanent
passive cache.
- **Logs.** The child controller is the sole reachable ingress to its
siloed workers, so workers ship to the child's finelog intra-cluster
unchanged, and the parent proxies log/exec queries through the peer
controller at read time (generalizing the existing `EndpointProxy` the
dashboard already uses) — no cross-region log shipping.
- **Visualization.** A `cluster` scope dimension cloned from the shipped
`backend_id` dashboard UI (column, detail row, scope selector), plus a
`ListPeers`/Clusters overview distinct from `ListBackends` because
ownership differs.
- **Rollout.** Additive and invisible to single-cluster deployments
(`child_cluster` always empty, no peers configured), staged so each PR
is inert until the next.

Design only — no implementation. Full doc:
`.agents/projects/iris_federation/design.md`.

Part of #6718.
rjpower added a commit that referenced this pull request Jul 1, 2026
Add the `iris.cluster.federation` package: a controller's federation
peers —
remote Iris clusters it may delegate whole jobs to — become configurable
and
observable, without anything yet executing on them. This is PR2 of the
Iris
multi-cluster federation rollout, building on the fold-exclusion schema
seam
from PR1 (#6821).

- **`peers:` config** on `IrisClusterConfig` declares identity +
reachability +
trust, never capabilities (those are dynamic). Trust reuses the rigging
  `credentials_for` path, so a peer connection presents the same client
credentials every cross-cluster client does — no second credential
system.
- **`FederationPeer`** wraps one authenticated `RemoteClusterClient` per
peer.
  **`FederationManager`** owns the registry and a background capability
heartbeat that forwards each peer's backends — the underlying cluster's
static
  topology (kind, advertised devices) and dynamic state (availability,
worker/task counts). The heartbeat reuses the peer's existing
`ListBackends`
read; there is no bespoke capability RPC or invented marker vocabulary.
- **`ListPeers` / `PeerSummary`** RPC, kept distinct from `ListBackends`
  (backends are what a controller runs; peers are what it delegates to).
  `PeerSummary.backends` carries the forwarded `BackendSummary` list.
- A submit-time **`PeerRouter`**, separate from the static
meta-scheduler index,
  that selects local execution for every job — so nothing is handed off.

The master derisking lever holds: federation is inert until a `peers:`
entry
exists. With no peers, no connections are built, the heartbeat never
starts, and
every path is a no-op, so a single-cluster deployment is byte-identical.
Even
with a peer configured, submissions still route and dispatch locally.
The
backend seam never learns federation exists, and federation adds nothing
to the
DAG fold.

Part of #6718
rjpower added a commit that referenced this pull request Jul 2, 2026
…6835)

Turn the federation router's peer arm live and add the first federated
job that runs on a remote Iris cluster and reports back.

Routing decides at submit time: prefer-local when any local backend is
feasible, else match the job's constraints against a peer's live
capability heartbeat and hand off to the first that can host it, and an
explicit `cluster=<peer>` directive forces that peer (mutually exclusive
with a local `backend` pin; the peer must be configured). A job no local
backend can host is no longer fatal if a peer can take it.

A handed-off job persists as a SENT `federated_jobs` handle (no local
tasks) under a deterministic `remote_job_id` that folds this cluster's
id into the job-name component, is delivered synchronously to the peer's
`LaunchJob` with an explicit `FederationHandoff` field recording the
requester and owner principal, and flips to `HANDED_OFF` on ack. A
failed delivery is not fatal — the handle persists in `PENDING_HANDOFF`
and the sync loop re-drives it under the same id (the peer's KEEP policy
dedups), so handoff is exactly-once across a parent crash. Only a root
job is handed off, and the `FederationHandoff` field is honored only
from a trusted (admin) peer — a non-admin that sets it is denied, so it
cannot forge a handoff to run a job as another user.

The peer records the received job as a RECEIVED `federated_jobs` row
naming the requester (the same table holds both directions,
discriminated by `direction`), and writes an append-only
`federation_changelog` in the same transaction as each job/task
mutation. Each changelog row is stamped with the requester it belongs
to, so `FederationSync` reports a requester only its own jobs without a
join — and a tombstone survives the job delete. A second background loop
pulls each peer once per tick via a new `FederationSync` RPC: the peer
returns the jobs changed since the requester's cursor — or its full
active set when the cursor is stale — and the parent mirrors job/task
state onto its handle, applies tombstones, and advances the cursor in
one transaction. A stale cursor (first contact, or below the retained
changelog window) drives a full-resync set-replacement, so a job the
parent never saw tombstoned is still reclaimed.

Cancel bumps a versioned cancel intent and routes an idempotent
`TerminateJob(remote_job_id)` to the peer; a peer `NOT_FOUND` (already
terminal-and-pruned) satisfies the cancel, a transient failure is
re-driven by the sync loop, and a cancel-before-handoff is terminated
locally and never delivered.

Federation stays inert until a `peers:` entry exists: with no peers,
neither loop starts, routing is local, and the changelog is never
written, so a single-cluster deployment is byte-identical apart from one
new empty table and a `direction` discriminator column on
`federated_jobs` (migration 0035). The backend seam is untouched — a
peer is a remote cluster the job is handed to, not a backend.

Two pieces are deferred to #6843: cross-cluster spend caps for federated
roots (a zero-local-task federated root is not yet charged against the
user's budget), and pruning the append-only `federation_changelog` (the
sync staleness path already handles a pruned changelog, so a pruner
needs no protocol change).

Part of #6718 — PR3 of the multi-cluster federation rollout (follows
#6821 PR1 and #6826 PR2).
rjpower added a commit that referenced this pull request Jul 2, 2026
…ler (#6846)

An on-demand `ProfileTask`/`ExecInContainer` against a federated job
cannot be resolved locally: the target task lives in the peer's DAG, so
the parent holds no worker or attempt rows for it. Left alone,
`_backend_for_id("")` falls back to the representative local backend and
the request would run against the wrong cluster.

Both handlers now branch before local `task -> worker` resolution: if
the target task's root job resolves to a SENT federated handle, the
request is forwarded to the owning peer controller, which does its own
resolution and returns the result verbatim. The task id is rewritten
onto the peer's remote root job — the tilde-joined `JobName` from
`encode_remote_job_id` — preserving the task index and any attempt
qualifier (new `JobName.with_root_job`). Forwarding goes through a
public `FederationManager.proxy_profile`/`proxy_exec` helper, never the
private peer dict, mirroring `cancel_federated`. The backend seam never
learns federation exists; there is no `RemoteTaskBackend`.

The peer stays authoritative: a proxied call to a task it has since
moved or finished surfaces the peer's live answer or `NOT_FOUND`, and a
tombstoned handle (its rows already gone) resolves `NOT_FOUND` locally
with no peer round-trip. The parent remains the trust boundary — the
client never reaches the peer. With no peers configured the branch never
fires, so a single-cluster deployment is a byte-identical no-op.

The tilde convention this proxy rewrites onto is now a first-class,
reversible API rather than an ad-hoc string.
`JobName.federated_remote_root(cluster_id, root)` builds the
`/<user>/<cluster>~<name>` remote root and `split_federated_root`
recovers `(cluster_id, root)` losslessly, with the delimiter a single
`FEDERATION_DELIMITER` constant that `encode_remote_job_id` delegates
to. To keep `cluster~name` unambiguous, `~` is reserved: `launch_job`
rejects it in a job's own name unless the submit is a received handoff
(whose root name IS the encoded id). Only the leaf name is checked, so a
child spawned on a peer under a `~`-bearing federated root still
submits.

Two placement choices worth flagging for review:

- The branch sits *after* the local task read, so the parent's mirror
gates existence: a not-yet-synced task returns `NOT_FOUND` rather than
proxying blindly. This is symmetric with the tombstone case and avoids a
wasted peer round-trip for unknown/invalid task indices; the window
where a task is running on the peer but unmirrored is negligible (an
unmirrored task is still pending, hence not exec-able).
- The parent→peer exec hop mirrors the worker backend's timeout
contract: a negative ("no caller limit") timeout is capped at
`EXEC_IN_CONTAINER_MAX_TIMEOUT` on the peer, so the proxy deadline
clears that cap rather than collapsing to the margin.

`get_process_status` is scoped out: it has no per-task target today, and
a federated job's workers are peer-siloed, so a task-target
process-status API is net-new rather than a proxy branch. Filed as
#6845.

Part of #6718 — PR4a of the federation rollout (exec/profile proxy;
process-status follow-up filed).

---------

Co-authored-by: rjpower <rjpower@users.noreply.github.com>
rjpower added a commit that referenced this pull request Jul 2, 2026
…ress (#6871)

Adds the write side of the federation global-log plane (design §6): each
cluster's controller durably forwards its local finelog to one shared
**global finelog**, and finelog gains a **mandatory authenticated
ingress** front.

## Log forwarding
A new `finelog.forwarder.LogForwarder` (in finelog, generic) tails a
source finelog and re-appends each new batch to a target under the same
keys. Durability leans on the source finelog already being durable
(parquet + retention): the forwarder persists only a forward
**watermark** in a small JSON state file — not a second copy of the data
— and advances it only after an ack'd `PushLogs` (which returns once the
global store has durably persisted the batch). A crash or transient
egress failure re-forwards the in-flight batch (at-least-once; duplicate
lines are bounded to the failure boundary — federation degrades
observability, never job correctness). First run seeds the watermark at
the source's current cursor, so enabling it ships new logs without
backfilling the whole retention window; a busy cluster's backlog drains
within a tick rather than one batch per poll interval.

The iris controller wires this up thinly
(`controller/finelog_relay.py`): it supplies its local client, a
per-cluster delegation credential, and a state path on its persistent
dir; the forwarding/watermark logic lives entirely in finelog. A
cluster's finelog is one flat config block — `finelog.config` names the
local store; `finelog.relay_address` (optional) points at the shared
global one, with `delegation_key`/`static_token` for the credential —
and setting `relay_address` is what turns the forwarder on. Inert until
then: no forwarder thread, no egress, byte-identical single-cluster
behavior.

**Cross-cluster log namespacing is deferred.** Bare per-cluster keys
(`/user/<job>/...`) collide when many clusters land in one store; rather
than have the relay rewrite keys mid-stream, namespacing will be done on
the write side (tasks/controller write the namespaced key directly) or
via a finelog `cluster` column, in a follow-up. This PR ships the
durable relay + auth mechanism; a single cluster relaying to a global
store works today.

## Authenticated ingress (new security surface)
finelog's server (`lib/finelog/rust/src/server/auth.rs`) now gates every
RPC with an ordered stack of typed auth **layers** — `cidr` (admit by
transport-peer network) and `jwt` (HS256 bearer verified against
per-cluster delegation keys) — that is **default-deny** and **always
installed**. An unconfigured server falls back to allow-localhost
(loopback only), never open. HS256 verification uses vetted RustCrypto
primitives (constant-time HMAC, no `ring`), rejects
`alg:none`/algorithm-confusion, and tolerates a small `exp` clock-skew
leeway. The `/debug/*` admin routes are gated by the same policy. It is
configured by a single `FINELOG_AUTH_POLICY` JSON layer list, which the
finelog deploy config, GCP bootstrap script, and k8s manifest all plumb
(k8s refuses to inline jwt secrets — those must come through a Secret).

Each relaying controller authenticates its pushes with a short-lived
HS256 delegation JWT, minted via the same `JwtTokenManager` the control
plane uses but signed with a **dedicated** per-cluster key the global
store verifies — so the shared store can verify a cluster's tokens
without gaining the power to mint control-plane tokens.

Because auth is always installed, the three committed local finelog
configs (`marin`, `marin-dev`, `cw-us-east-02a`) gain an explicit `cidr`
layer (RFC1918 + loopback) so intra-cluster/pod ingest keeps working —
reachable from the private network, never the public internet.

## Cost
Relaying every cluster's logs to one region is deliberate cross-region
**egress** — the trade for a uniform, always-available,
cross-cluster-queryable log surface. finelog already batches/compresses
and tiers cold segments to object storage, which bounds steady-state
cost.

The read side (surfacing a federated job's logs through the parent:
`job_id`→peer-namespace remap + read-through auth) is deferred to a
follow-up (#6862).

Part of #6718 — PR4b of the federation rollout (global finelog + relay +
authed ingress).
rjpower pushed a commit that referenced this pull request Jul 2, 2026
…olumn (#6881)

A global finelog collects pushes relayed from many federated clusters,
whose per-cluster log keys (`/user/<job>/<task>:<attempt>`,
`/system/...`) collide once more than one cluster forwards into the same
store. This namespaces them by origin with a `cluster` column on the log
row.

Every finelog writer is authenticated, so the origin cluster is a
trusted field on the push: `PushLogsRequest` gains a `cluster` field,
and `push_logs` stamps the column from it. A relay sets it to the
cluster it forwards for; a local single-cluster push leaves it empty. No
writer needs to change its keys, and the forwarder keeps forwarding bare
keys.

The `cluster` column is added to the `log` schema as nullable and
additive:

- an already-registered `log` namespace is evolved on boot — the schema
is registered/merged into the catalog *before* rehydrate, so the engine
opens once from the evolved catalog schema with no live-engine rebuild
(which would need a runtime `block_on` illegal on the async `main`
task). The evolution preserves the namespace's persisted storage policy;
- segments written before the column existed null-fill it on read.

FetchLogs gains an optional `cluster` filter: non-empty restricts to one
origin (the parent's federated read filters `cluster = <peer>`), empty
is unfiltered so a local single-cluster read behaves exactly as before.

This is the Rust server side. The Python `logging.proto` copy and its
generated stubs are intentionally left unchanged — the Python wire
contract lands with its consumer, the federated read client (#6862).

Part of #6718. Foundation for the federated read-remap (#6862).

Fixes #6876.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent design Design doc / 1-pager for review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant