[iris] Move per-backend dead-worker pruning into the backend store by rjpower · Pull Request #6795 · marin-community/marin

rjpower · 2026-06-30T22:34:40Z

The controller's background prune loop reached into every backend's health tracker and worker-attributes projection to delete stale DEAD workers. Make each backend garbage-collect its own dead workers instead.

prune_dead_workers is added to the BackendWorkerStore protocol — implemented on DbBackendWorkerStore, which already holds the db, health tracker, and worker_attrs it needs — and to the TaskBackend protocol, delegated by RpcTaskBackend and a no-op on the Kubernetes backend (it tracks no Iris workers). prune_old_data now takes the backends collection and sums each backend's own GC, so the controller keeps only the cross-backend prune concerns: terminal jobs, orphan slices, and expired endpoints.
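The protocol split described above can be sketched roughly as follows. This is an illustration only, not the code in the PR: the `cutoff` and `stop_event` parameters, the `InMemoryStore` stand-in for DbBackendWorkerStore, and the `DelegatingBackend` / `NoWorkerBackend` classes are all hypothetical names chosen for the example.

```python
import threading
from typing import Dict, Optional, Protocol


class BackendWorkerStore(Protocol):
    """Store-level protocol; the signature here is an assumption."""

    def prune_dead_workers(
        self, cutoff: float, stop_event: Optional[threading.Event] = None
    ) -> int: ...


class TaskBackend(Protocol):
    """Backend-level protocol exposing the same hook to the controller."""

    def prune_dead_workers(
        self, cutoff: float, stop_event: Optional[threading.Event] = None
    ) -> int: ...


class InMemoryStore:
    """Hypothetical stand-in for DbBackendWorkerStore (no real DB)."""

    def __init__(self, dead_since: Dict[str, float]) -> None:
        self.dead_since = dict(dead_since)

    def prune_dead_workers(self, cutoff, stop_event=None) -> int:
        # Delete every worker that has been dead since before the cutoff.
        stale = [w for w, ts in self.dead_since.items() if ts < cutoff]
        for worker_id in stale:
            del self.dead_since[worker_id]
        return len(stale)


class DelegatingBackend:
    """Plays the role of RpcTaskBackend: forwards to its store."""

    def __init__(self, store: Optional[BackendWorkerStore]) -> None:
        self._store = store

    def prune_dead_workers(self, cutoff, stop_event=None) -> int:
        assert self._store is not None
        return self._store.prune_dead_workers(cutoff, stop_event)


class NoWorkerBackend:
    """Plays the role of the Kubernetes backend: nothing to prune."""

    def prune_dead_workers(self, cutoff, stop_event=None) -> int:
        return 0


store = InMemoryStore({"w1": 10.0, "w2": 50.0})
rpc_like = DelegatingBackend(store)
k8s_like = NoWorkerBackend()
```

With a cutoff of 20.0, `rpc_like` removes only `w1`, while `k8s_like` always reports zero deletions.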

This continues the BackendWorkerStore ownership transfer (P3): the controller moves toward a thin router while each backend owns its workers, attributes, and liveness. The worker prune still runs on the controller's background prune thread — it touches only worker rows, attributes, and tracker entries, never the autoscaler — and preserves the cutoff semantics, the one-delete-per-transaction-plus-pause cadence, the PruneResult.workers_deleted count, and the worker_pruned audit event. The prune_old_data replay golden is unchanged.
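A minimal sketch of the preserved cadence, assuming the store is handed callables for lookup, deletion, and auditing. The callable parameters and the free-function shape are hypothetical; only the names prune_dead_workers, remove_worker, worker_pruned, and the stop_event guard come from this PR.

```python
import threading
import time
from typing import Callable, List, Optional, Tuple


def prune_dead_workers(
    find_prunable: Callable[[], Optional[str]],
    remove_worker: Callable[[str], None],
    audit: Callable[[str, str], None],
    pause: float,
    stop_event: Optional[threading.Event] = None,
) -> int:
    """Delete stale workers one at a time, pausing between deletes."""
    deleted = 0
    while stop_event is None or not stop_event.is_set():
        worker_id = find_prunable()
        if worker_id is None:
            break  # nothing left past the cutoff
        remove_worker(worker_id)  # stands in for one DB transaction
        audit("worker_pruned", worker_id)  # one audit event per delete
        deleted += 1
        time.sleep(pause)  # yield between deletes
    return deleted


# Usage with in-memory stand-ins for the DB and the audit log.
dead: List[str] = ["w1", "w2", "w3"]
events: List[Tuple[str, str]] = []
count = prune_dead_workers(
    find_prunable=lambda: dead[0] if dead else None,
    remove_worker=dead.remove,
    audit=lambda kind, wid: events.append((kind, wid)),
    pause=0.0,
)

# A stop event that is already set prevents any deletion.
stopped = threading.Event()
stopped.set()
remaining: List[str] = ["w4"]
count_when_stopped = prune_dead_workers(
    find_prunable=lambda: remaining[0] if remaining else None,
    remove_worker=remaining.remove,
    audit=lambda kind, wid: None,
    pause=0.0,
    stop_event=stopped,
)
```

Keeping each delete in its own transaction and sleeping between iterations bounds how long any single prune step holds the database, which is presumably why the cadence is kept unchanged.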

Design note: prune_old_data takes the backends collection (self._backends.values()) rather than a list of stores, since the controller holds backends and each backend already encapsulates its store.
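The controller-side shape this implies might look like the sketch below. It is an assumption-laden illustration: the real prune_old_data also handles terminal jobs, orphan slices, and expired endpoints, the real PruneResult likely has more fields than workers_deleted, and `FixedBackend` and the `cutoff` parameter are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Protocol


class PrunesWorkers(Protocol):
    def prune_dead_workers(self, cutoff: float) -> int: ...


@dataclass
class PruneResult:
    # Only this field is named in the PR; others are omitted here.
    workers_deleted: int = 0


def prune_old_data(backends: Iterable[PrunesWorkers], cutoff: float) -> PruneResult:
    """Sum each backend's own dead-worker GC into one result."""
    total = sum(backend.prune_dead_workers(cutoff) for backend in backends)
    return PruneResult(workers_deleted=total)


class FixedBackend:
    """Hypothetical backend that reports a fixed number of deletions."""

    def __init__(self, deleted: int) -> None:
        self._deleted = deleted

    def prune_dead_workers(self, cutoff: float) -> int:
        return self._deleted


# The controller holds a mapping of backends and passes its values.
_backends: Dict[str, PrunesWorkers] = {
    "worker_daemon": FixedBackend(3),
    "k8s": FixedBackend(0),  # no-op backend contributes nothing
}
result = prune_old_data(_backends.values(), cutoff=0.0)
```

Passing backends rather than stores keeps the controller unaware of how, or whether, a backend persists workers.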

Part of #6718.

The controller's prune loop reached into every backend's health tracker and worker-attributes projection to delete stale DEAD workers. Make each backend garbage-collect its own dead workers instead: add prune_dead_workers to the BackendWorkerStore protocol (implemented on DbBackendWorkerStore, which already holds db, health, and worker_attrs) and to the TaskBackend protocol, delegating from RpcTaskBackend and no-opping on the Kubernetes backend (it tracks no Iris workers). prune_old_data now takes the backends collection and sums each backend's own GC, keeping only cross-backend concerns (terminal jobs, orphan slices, expired endpoints) in the controller. This continues the BackendWorkerStore ownership transfer: the controller becomes a thin router and each backend owns its workers, attributes, and liveness. The worker prune still runs on the controller's background prune thread with the same cutoff, one-delete-per-transaction cadence, count, and worker_pruned audit event; the replay golden for prune_old_data is unchanged.

claude · 2026-06-30T22:34:55Z

Claude finished @rjpower's task in 2m 14s.

Code review

Read the PR diff against origin/main
Analyze the moved pruning logic for correctness
Check protocol/implementation consistency across backends
Review tests and golden replay
Post review feedback

No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

What I verified:

Logic parity — DbBackendWorkerStore.prune_dead_workers preserves the original cadence exactly: one remove_worker (with its attribute CASCADE) per transaction, worker_pruned audit event per delete, time.sleep(pause) between deletes, and early stop on stop_event. The loop guard while stop_event is None or not stop_event.is_set() matches the old while not _stopped(stop_event) semantics.
_find_prunable_worker moved cleanly — relocated from pruner.py to backend_store.py with no duplicate left behind; WorkerUsability/WorkerId/WorkerHealthTracker are all imported there, and the removed WorkerUsability import in pruner.py is no longer referenced.
Controller-side equivalence — the old _worker_daemon_healths() filtered to WORKER_DAEMON backends with non-null trackers; the new self._backends.values() includes the k8s CLUSTER_VIEW backend, but its prune_dead_workers is a no-op returning 0, so the summed result is identical.
Protocol consistency — prune_dead_workers added to both TaskBackend and BackendWorkerStore protocols and implemented across every conformer: RpcTaskBackend (delegates, same _store is not None assert as teardown), K8sTaskProvider (no-op), and the test fakes (FakeProvider, FakeDirectProvider, _ScriptedProvider, _UnreachableProvider). threading is imported in each touched file.
Residual imports — _stopped and time.sleep are still used by _prune_terminal_jobs/_prune_orphan_slices, so removing them from _prune_dead_workers left no dead imports.
Test helper — worker_daemon_backends_for_prune wires a FakeProvider to state's db/health/worker_attrs (health set before bind_runtime, which threads it through store_from_runtime), so the single backend prunes the same rows the old direct-tracker call did. BackendRuntime.budget_defaults → defaults mapping matches the existing store_from_runtime, and UserBudgetDefaults is imported. The unchanged prune_old_data replay golden is consistent with this being behavior-preserving.
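The loop-guard equivalence noted under "Logic parity" can be checked directly. The body of `_stopped` below is an assumption (the PR page does not show it); under that assumption the two guards agree for every possible stop_event state by De Morgan's law.

```python
import threading
from typing import Optional


def _stopped(stop_event: Optional[threading.Event]) -> bool:
    # Assumed body of the helper in pruner.py; not shown in the PR.
    return stop_event is not None and stop_event.is_set()


def old_guard(stop_event: Optional[threading.Event]) -> bool:
    return not _stopped(stop_event)


def new_guard(stop_event: Optional[threading.Event]) -> bool:
    return stop_event is None or not stop_event.is_set()


unset_event = threading.Event()
set_event = threading.Event()
set_event.set()

# Each pair is (old guard, new guard) for: no event, unset event, set event.
results = [
    (old_guard(e), new_guard(e)) for e in (None, unset_event, set_event)
]
```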

The PR description is also well-formed — it leads with what the change does and the design note is informative rather than boilerplate; no "Testing" section or template scaffold.

rjpower added 2 commits June 30, 2026 22:31

[iris] Trim prune-test backend helper docstring to its contract

04a966d

rjpower added the agent-generated label (Created by automation/agent) Jun 30, 2026

rjpower merged commit a3a8312 into main Jun 30, 2026
34 checks passed

rjpower deleted the weaver/iris-mb-3-pruner-split branch June 30, 2026 22:44

claude Bot mentioned this pull request Jul 2, 2026

[iris] Drop controller-side worker status overlay once backends own their DB reads #6823

Closed
