[iris] Move per-backend dead-worker pruning into the backend store#6795

Merged
rjpower merged 2 commits into main from weaver/iris-mb-3-pruner-split
Jun 30, 2026

@rjpower rjpower commented Jun 30, 2026

Collaborator

The controller's background prune loop reached into every backend's health tracker and worker-attributes projection to delete stale DEAD workers. Make each backend garbage-collect its own dead workers instead.

prune_dead_workers is added to the BackendWorkerStore protocol — implemented on DbBackendWorkerStore, which already holds the db, health tracker, and worker_attrs it needs — and to the TaskBackend protocol, delegated by RpcTaskBackend and a no-op on the Kubernetes backend (it tracks no Iris workers). prune_old_data now takes the backends collection and sums each backend's own GC, so the controller keeps only the cross-backend prune concerns: terminal jobs, orphan slices, and expired endpoints.
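
The split described above can be sketched roughly as follows. This is an illustration only: the `prune_dead_workers` signature, the `InMemoryStore` stand-in for `DbBackendWorkerStore`, the `RpcBackendSketch` and `K8sBackendSketch` class names, and the "died before the cutoff" rule are all assumptions, since the PR text does not show the real code.

```python
import threading
from typing import Optional, Protocol


class BackendWorkerStore(Protocol):
    # Hypothetical signature: the real parameters are not shown in this PR.
    def prune_dead_workers(
        self, cutoff_ms: int, stop_event: Optional[threading.Event] = None
    ) -> int: ...


class InMemoryStore:
    """Stand-in for DbBackendWorkerStore: maps worker id -> time of death (ms)."""

    def __init__(self, dead_since_ms: dict[str, int]) -> None:
        self.dead_since_ms = dict(dead_since_ms)

    def prune_dead_workers(self, cutoff_ms, stop_event=None) -> int:
        # Assumed cutoff rule: a worker is stale if it died before the cutoff.
        stale = [w for w, t in self.dead_since_ms.items() if t < cutoff_ms]
        for worker_id in stale:
            del self.dead_since_ms[worker_id]
        return len(stale)


class RpcBackendSketch:
    """Delegates to its store, mirroring what the PR says RpcTaskBackend does."""

    def __init__(self, store: Optional[BackendWorkerStore]) -> None:
        self._store = store

    def prune_dead_workers(self, cutoff_ms, stop_event=None) -> int:
        assert self._store is not None
        return self._store.prune_dead_workers(cutoff_ms, stop_event)


class K8sBackendSketch:
    """Tracks no Iris workers, so there is nothing to prune."""

    def prune_dead_workers(self, cutoff_ms, stop_event=None) -> int:
        return 0
```

The point of the shape is that the controller can call the same method on every backend without knowing whether a store exists behind it.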

This continues the BackendWorkerStore ownership transfer (P3): the controller moves toward a thin router while each backend owns its workers, attributes, and liveness. The worker prune still runs on the controller's background prune thread — it touches only worker rows, attributes, and tracker entries, never the autoscaler — and preserves the cutoff semantics, the one-delete-per-transaction-plus-pause cadence, the PruneResult.workers_deleted count, and the worker_pruned audit event. The prune_old_data replay golden is unchanged.
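
As a rough sketch of the preserved cadence (one delete per transaction, an audit event per delete, then a pause), the loop might look like the following. The function name `prune_loop` and its callback parameters are hypothetical placeholders for the store's db, health tracker, and audit plumbing, which this PR text does not show.

```python
import threading
import time
from typing import Callable, Optional


def prune_loop(
    find_prunable: Callable[[], Optional[str]],  # hypothetical: next stale DEAD worker, or None
    delete_in_txn: Callable[[str], None],        # hypothetical: one delete in its own transaction
    emit_audit: Callable[[str, str], None],      # hypothetical: (event kind, worker id)
    pause: float,
    stop_event: Optional[threading.Event] = None,
) -> int:
    """Delete stale workers one at a time; return how many were deleted."""
    deleted = 0
    while stop_event is None or not stop_event.is_set():
        worker_id = find_prunable()
        if worker_id is None:
            break  # nothing left to prune
        delete_in_txn(worker_id)
        emit_audit("worker_pruned", worker_id)
        deleted += 1
        time.sleep(pause)  # yield between deletes so other writers can proceed
    return deleted
```

Keeping each delete in its own short transaction, followed by a pause, limits how long the prune thread holds the database at a time.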

Design note: prune_old_data takes the backends collection (self._backends.values()) rather than a list of stores, since the controller holds backends and each backend already encapsulates its store.
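
A minimal sketch of that call shape, assuming a simplified `prune_old_data` signature and a `PruneResult` reduced to the one field named in this PR; `FixedCountBackend` is a hypothetical test double, and the real function also handles terminal jobs, orphan slices, and expired endpoints.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


class PrunesDeadWorkers(Protocol):
    # Hypothetical minimal slice of the TaskBackend protocol.
    def prune_dead_workers(self, cutoff_ms: int) -> int: ...


@dataclass
class PruneResult:
    workers_deleted: int = 0


class FixedCountBackend:
    """Hypothetical double that reports a fixed number of pruned workers."""

    def __init__(self, count: int) -> None:
        self.count = count

    def prune_dead_workers(self, cutoff_ms: int) -> int:
        return self.count


def prune_old_data(backends: Iterable[PrunesDeadWorkers], cutoff_ms: int) -> PruneResult:
    # Each backend garbage-collects its own workers; the controller only sums.
    total = sum(b.prune_dead_workers(cutoff_ms) for b in backends)
    return PruneResult(workers_deleted=total)
```

Passing an iterable of backends means a dict view such as `backends.values()` works directly, and a backend whose prune is a no-op simply contributes 0 to the sum.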

Part of #6718.

rjpower added 2 commits June 30, 2026 22:31
The controller's prune loop reached into every backend's health tracker and
worker-attributes projection to delete stale DEAD workers. Make each backend
garbage-collect its own dead workers instead: add prune_dead_workers to the
BackendWorkerStore protocol (implemented on DbBackendWorkerStore, which already
holds db, health, and worker_attrs) and to the TaskBackend protocol, delegating
from RpcTaskBackend and no-opping on the Kubernetes backend (it tracks no Iris
workers). prune_old_data now takes the backends collection and sums each
backend's own GC, keeping only cross-backend concerns (terminal jobs, orphan
slices, expired endpoints) in the controller.

This continues the BackendWorkerStore ownership transfer: the controller becomes
a thin router and each backend owns its workers, attributes, and liveness. The
worker prune still runs on the controller's background prune thread with the
same cutoff, one-delete-per-transaction cadence, count, and worker_pruned audit
event; the replay golden for prune_old_data is unchanged.
@rjpower rjpower added the agent-generated Created by automation/agent label Jun 30, 2026

claude Bot commented Jun 30, 2026

Contributor

Claude finished @rjpower's task in 2m 14s


Code review

  • Read the PR diff against origin/main
  • Analyze the moved pruning logic for correctness
  • Check protocol/implementation consistency across backends
  • Review tests and golden replay
  • Post review feedback

No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

What I verified:

  • Logic parity: DbBackendWorkerStore.prune_dead_workers preserves the original cadence exactly, with one remove_worker (with its attribute CASCADE) per transaction, a worker_pruned audit event per delete, time.sleep(pause) between deletes, and an early stop on stop_event. The loop guard while stop_event is None or not stop_event.is_set() matches the old while not _stopped(stop_event) semantics.
  • _find_prunable_worker moved cleanly — relocated from pruner.py to backend_store.py with no duplicate left behind; WorkerUsability/WorkerId/WorkerHealthTracker are all imported there, and the removed WorkerUsability import in pruner.py is no longer referenced.
  • Controller-side equivalence — the old _worker_daemon_healths() filtered to WORKER_DAEMON backends with non-null trackers; the new self._backends.values() includes the k8s CLUSTER_VIEW backend, but its prune_dead_workers is a no-op returning 0, so the summed result is identical.
  • Protocol consistency: prune_dead_workers added to both TaskBackend and BackendWorkerStore protocols and implemented across every conformer: RpcTaskBackend (delegates, same _store is not None assert as teardown), K8sTaskProvider (no-op), and the test fakes (FakeProvider, FakeDirectProvider, _ScriptedProvider, _UnreachableProvider). threading is imported in each touched file.
  • Residual imports: _stopped and time.sleep are still used by _prune_terminal_jobs/_prune_orphan_slices, so removing them from _prune_dead_workers left no dead imports.
  • Test helper: worker_daemon_backends_for_prune wires a FakeProvider to state's db/health/worker_attrs (health set before bind_runtime, which threads it through store_from_runtime), so the single backend prunes the same rows the old direct-tracker call did. The BackendRuntime.budget_defaults → defaults mapping matches the existing store_from_runtime, and UserBudgetDefaults is imported. The unchanged prune_old_data replay golden is consistent with this being behavior-preserving.
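
The loop-guard equivalence noted under logic parity can be checked with a small sketch. The body of `_stopped` below is an assumption about what the helper in pruner.py does (true only when an event exists and is set); the PR text does not show it.

```python
import threading
from typing import Optional


def _stopped(stop_event: Optional[threading.Event]) -> bool:
    # Assumed behavior of the existing helper; not taken from the diff.
    return stop_event is not None and stop_event.is_set()


def old_guard(stop_event: Optional[threading.Event]) -> bool:
    return not _stopped(stop_event)


def new_guard(stop_event: Optional[threading.Event]) -> bool:
    return stop_event is None or not stop_event.is_set()
```

Under that assumption the two guards agree for all three cases (no event, unset event, set event), which is just De Morgan's law applied to the helper.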

The PR description is also well-formed — it leads with what the change does and the design note is informative rather than boilerplate; no "Testing" section or template scaffold.

@rjpower rjpower merged commit a3a8312 into main Jun 30, 2026
34 checks passed
@rjpower rjpower deleted the weaver/iris-mb-3-pruner-split branch June 30, 2026 22:44