fix: enforce maxConcurrentPRs as an open-PR cap, not per-cycle by mayurkr · Pull Request #94 · zelyo-ai/zelyo-operator

mayurkr · 2026-04-22T02:11:16Z

Summary

RemediationPolicy.spec.maxConcurrentPRs was documented as limits the number of open PRs at any time but was implemented as a per-reconcile-cycle bound: prsCreated reset to 0 each cycle, so a policy with maxConcurrentPRs: 3 could accumulate dozens of open PRs across the 5-minute requeue loop.
processIncidents now counts currently-open Zelyo-generated PRs on the target repo via gitops.Engine.ListOpenPRs and subtracts from the cap before entering the incident loop. When the cap is already met, the cycle short-circuits — no LLM plan generation, no PR creation, no incident resolution (so the incident stays open for retry after existing PRs merge).
status.openPRs (already defined on the CRD but never written) is now populated each reconcile so users can see the current count via kubectl get remediationpolicy.
docs: clarify Protect mode requires a RemediationPolicy #90's docs were updated to call out the old per-cycle behavior; this PR walks those docs back to describe the fixed contract and corrects the CRD field comment, then regenerates the CRDs under config/crd/bases and syncs deploy/helm.

Why this matters

The field is a blast-radius control — a Zelyo operator that opens 30 unmerged PRs overnight is indistinguishable from a broken one to the humans who have to review them. Matching the implementation to the documented contract is the whole point of the field.

Reviewer notes

Cross-policy scoping. When multiple RemediationPolicy resources target the same repo, they share the open-PR budget (the cap is per-repo). Per-policy scoping via PRTemplate.BranchPrefix is not wired — gitops.BranchName hardcodes zelyo-operator/fix/… and doesn't read the prefix, so filtering ListOpenPRs by the default zelyo-operator/fix- would silently match zero and re-break the cap we're fixing. This limitation is documented in the helper's comment and the CRD field comment; fixing BranchName to honor BranchPrefix can be a follow-up.
Transient provider failures. ListOpenPRs errors are logged and treated as zero. The worst-case during a provider outage is a temporarily-inflated budget for the duration of that cycle — remediation is never permanently blocked.
Refactor to satisfy gocyclo. The AuthSecret→GitHub engine block in processIncidents was extracted into ensureGitOpsEngineFromSecret to keep the function under the complexity threshold after the new logic landed. Behavior is unchanged.
Exported API. remediation.Engine.GitOpsEngineForRepo is new (thin wrapper over the existing unexported getGitOpsEngine) so the controller can reuse whichever gitops engine is configured — PAT-from-secret path, pre-registered via RegisterGitOpsEngine, or injected test stub — without reconstructing one.

Test plan

make lint → 0 issues
go test -race $(go list ./... | grep -v /e2e) → all packages green
New Ginkgo spec When maxConcurrentPRs is already reached by open PRs runs inside internal/controller/remediationpolicy_controller_test.go and asserts:
- fake gitops.Engine.CreatePullRequest was never called
- fake LLM Complete was never called
- the open incident was not resolved (stays available for retry)
- status.openPRs == 3 and status.remediationsApplied == 0
Manual verification against a real GitHub repo with >maxConcurrentPRs already-open Zelyo PRs before merge

🤖 Generated with Claude Code

The RemediationPolicy controller previously reset its PR counter every reconcile, so a policy with maxConcurrentPRs: 3 could accumulate dozens of open PRs across cycles — the field documented "limits the number of open PRs at any time" but the implementation only bounded PRs created in a single reconcile. processIncidents now queries the gitops provider's ListOpenPRs at the start of each cycle (via the newly exported remediation.Engine.GitOpsEngineForRepo), subtracts the count from maxConcurrentPRs to derive the per-cycle budget, and short-circuits when the cap is already met — no LLM plan generation, no PR creation, no incident resolution. The current count is also written to status.openPRs so it shows up in kubectl get remediationpolicy. The AuthSecret→GitHub engine initialization was factored into its own helper to keep processIncidents under the gocyclo threshold. A new Ginkgo spec primes the correlator with a high-severity incident, wires a fake gitops.Engine reporting 3 open PRs, reconciles a policy with maxConcurrentPRs: 3, and asserts zero CreatePullRequest calls, zero LLM calls, the incident stays open, and status.openPRs == 3. The CRD field comment, README, AGENTS.md, and the three user-facing docs updated in #90 are corrected to describe the fixed behavior; CRDs regenerated and synced to deploy/helm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request updates the maxConcurrentPRs logic to enforce a global cap on open Zelyo-generated PRs per repository, rather than a per-reconcile-cycle limit. The changes include documentation updates, CRD modifications, and controller logic to query existing PRs from the GitOps provider before creating new ones. Review feedback identifies a race condition in how the GitOps engine is initialized for shared remediation components and suggests improvements for PR count accuracy via pagination and more robust error handling during provider outages.

gemini-code-assist · 2026-04-22T02:12:40Z

+	ghClient := github.NewPATClient(token, "")
+	ghEngine := github.NewEngine(ghClient, log.WithName("github-engine"))
+	r.RemediationEngine.SetGitOpsEngine(ghEngine)
+	log.Info("Successfully initialized GitOps engine for remediation", "repo", repo.Name)


The RemediationEngine is a shared component across all reconciliations. Calling SetGitOpsEngine updates the global fallback engine, which creates a race condition when multiple RemediationPolicy resources targeting different repositories are reconciled concurrently. A policy might end up using a GitOps engine initialized with credentials from a different policy. Instead of setting the global fallback, use RegisterGitOpsEngine with a repository-specific key to ensure isolation.

ghClient := github.NewPATClient(token, "") ghEngine := github.NewEngine(ghClient, log.WithName("github-engine")) owner, name := parseRepoURL(repo.Spec.URL) if owner != "" && name != "" { r.RemediationEngine.RegisterGitOpsEngine(owner+"/"+name, ghEngine) } log.Info("Successfully initialized GitOps engine for remediation", "repo", repo.Name)

gemini-code-assist · 2026-04-22T02:12:40Z

+	existing, err := ge.ListOpenPRs(ctx, owner, repo)
+	if err != nil {
+		log.Error(err, "Failed to list open PRs — treating as zero for this cycle",
+			"owner", owner, "repo", repo)
+		return 0
+	}
+	//nolint:gosec // len bounded by GitHub API page size (100).
+	return int32(len(existing))


The ListOpenPRs call appears to be limited by the provider's page size (e.g., 100), as noted in the comment on line 386. If a repository has more open Zelyo PRs than this limit, the openPRs count will be inaccurate, potentially causing the controller to exceed the maxConcurrentPRs cap. It is recommended to implement pagination within the GitOps engine's ListOpenPRs implementation to ensure a complete count.

gemini-code-assist · 2026-04-22T02:12:41Z

+		log.Error(err, "Failed to list open PRs — treating as zero for this cycle",
+			"owner", owner, "repo", repo)
+		return 0
+	}


Treating provider errors as zero open PRs can lead to exceeding the maxConcurrentPRs cap during API outages or rate-limiting events. Since the primary goal of this PR is blast-radius control, consider a more conservative approach, such as skipping remediation or returning an error to trigger a requeue, when the current state of open PRs cannot be reliably determined.

Appreciate the call-out — considered this and kept the soft-fail intentionally. Three reasons:

Blast radius is already bounded. Even when snapshotOpenPRs returns zero during a GitHub outage, the per-cycle loop bound (incidentsHandled >= budget where budget = maxConcurrentPRs) still caps at most maxConcurrentPRs new PRs per 5-minute reconcile. Worst case is one cycle worth of duplicate PRs, which the next successful reconcile dedups via existingBranches — self-healing.

Requeue-on-error would be worse in the common failure mode. A GitHub-wide outage or rate-limit event would trigger controller-runtime backoff across every RemediationPolicy in the cluster, churning metrics/events and still not opening PRs. Soft-fail degrades to "no remediation progress" instead of "reconcile storm with no remediation progress".

The failure is observable. log.Error on the ListOpenPRs failure is structured and hits the operator log; ReconcileTotal{kind="remediationpolicy",result="success"} still increments (the reconcile itself succeeded, just with degraded data). Operators can alert on the combination of a GitHub error-rate signal plus flat zelyo_reconcile_openprs gauge.

Happy to add an explicit SnapshotOpenPRsFailed condition + a countOpenPRsError counter so the degraded state surfaces on the CR status and in Prometheus without changing semantics — that's probably the right middle ground. Opening a follow-up issue for it.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a250d34d7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-22T02:14:51Z

+	// OpenPRs reflects the total count of open Zelyo-generated PRs in the
+	// target repo after this cycle: already-open PRs observed at the start
+	// plus any this cycle opened.
+	policy.Status.OpenPRs = openPRs + prsCreated


Compute status.openPRs from real PR creations

status.openPRs is derived as openPRs + prsCreated, but prsCreated is incremented for every successfully processed incident, including dry-run/report strategies where ApplyPlan returns nil and no PR is created. In audit mode this can report non-existent “open PRs,” which violates the new field contract and can mislead operators or any automation reading status.

Useful? React with 👍 / 👎.

Merge conflict resolution folds PR #94's maxConcurrentPRs cap logic into main's per-incident remediateIncident / existingBranches dedup refactor. Both sides queried ListOpenPRs separately; the merged shape consolidates that into a single snapshotOpenPRs helper whose output feeds both the cap check (count) and the dedup map (branch→URL). Review fixes landed in the same commit because they touch the same functions: - Codex P2 (status.openPRs phantom in audit mode): remediateIncident now returns (handled, prCreated bool). processIncidents tracks two counters — incidentsHandled drives the per-cycle budget (preserves LLM cost control across all strategies), prsCreated is set only when ApplyPlan returns a non-nil result. Under StrategyDryRun / StrategyReport, status.openPRs stays pinned to the provider count and status.remediationsApplied does not increment. - Gemini High (SetGitOpsEngine cross-reconcile race): ensureGitOpsEngine FromSecret now calls RegisterGitOpsEngine(owner+"/"+name, ghEngine) instead of mutating the process-wide fallback. Concurrent RemediationPolicy reconciles for different repos can no longer clobber each other's credentials. Skip registration (rather than falling back to SetGitOpsEngine) when owner/name parsing fails, since re-introducing the global mutation defeats the fix. - Gemini Medium (ListOpenPRs page cap): GitHub engine now paginates via page= parameter until a short page arrives or the 10-page safety cap (1000 PRs) hits. Repos with >100 open Zelyo PRs no longer cause silent under-counts that let the controller exceed the cap. Tests: - existing budget-exhausted test still passes under the merged shape - new audit-mode test pins status.openPRs / remediationsApplied behavior under StrategyDryRun - two new github engine pagination tests: happy-path 3-page paginate, and the safety-cap stop at listOpenPRsMaxPages Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mayurkr · 2026-04-22T02:38:38Z

Pushed 9661ba4 — merge with main + Codex P2 + Gemini High + Gemini Medium (pagination) + new audit-mode test.

Merge: folds the cap-check and main's existingBranches dedup into one snapshotOpenPRs helper so we only call ListOpenPRs once per reconcile.
Codex P2 (status.openPRs phantom in audit mode): remediateIncident now returns (handled, prCreated bool). processIncidents tracks two counters — incidentsHandled drives the per-cycle budget (preserves LLM cost control for dry-run), prsCreated only ticks when ApplyPlan returns a real result. Test: audit-test-policy with DryRun: true reconciles one incident, verifies status.openPRs stays pinned at the 1 pre-existing PR (not 2) and status.remediationsApplied doesn't tick.
Gemini High (SetGitOpsEngine race): ensureGitOpsEngineFromSecret now calls RegisterGitOpsEngine(owner+"/"+name, ghEngine). Concurrent RemediationPolicies targeting different repos no longer clobber each other's credentials.
Gemini Medium (ListOpenPRs pagination): paginates via page=N until a short page arrives or hits the 10-page safety cap (1000 PRs). Two new tests cover the happy path and the cap.

Left the third Gemini comment (errors-as-zero) as-is with a reply inline — rationale boils down to: the per-cycle loop already caps blast radius, and requeue-on-error would produce a reconcile storm during GitHub outages without fixing anything. Happy to add a dedicated SnapshotOpenPRsFailed condition as a follow-up so the degraded state is observable on the CR.

main landed two behavior fixes while PR #94 was under review: #92 — RemediationPolicy.spec.targetPolicies scope gate now filters incidents (previously the field validated but never filtered, so an operator scoping to one SecurityPolicy still got PRs for unrelated incidents). Adds SecurityPolicy/namespace tagging on correlator.Event + incidentMatchesTargets. #93 — RemediationPolicy.spec.dryRun now actually suppresses PR creation (previously the flag was logged-only, so dryRun=true still opened real PRs). Generates the plan + fires a DryRunPreview event but skips ApplyPlan and leaves the incident open. Resolution folds these alongside PR #94's existing changes: - snapshotOpenPRs (our helper) and snapshotOpenPRBranches (main's) queried ListOpenPRs for separate purposes — count vs. dedup map. Merged into a single snapshotOpenPRs that returns both from one call; the duplicate main helper is deleted. - ensureGitOpsEngineFromSecret (our helper using RegisterGitOpsEngine to fix Gemini's cross-reconcile race) supersedes main's initRemediationGitOps (which still called SetGitOpsEngine and reintroduced the race). The main helper is deleted. - remediateIncident now uses main's (opened, charged) return names but the success-path value is `return result != nil, true`, not main's `return true, true`. The fix matters for the engine-level StrategyDryRun / StrategyReport path (distinct from policy.Spec. DryRun, which is handled earlier): ApplyPlan returns (nil, nil) there, and unconditional opened=true would re-introduce Codex P2 (phantom status.openPRs). - processIncidents keeps PR #94's budget check (openPRs vs. maxPRs up-front) alongside main's scope gate, using `processed` to drive the per-cycle budget. Two-counter accounting (processed, prsCreated) preserves accurate status.openPRs under all strategies. - Test file: kept main's scope-gate + per-policy dryRun suites intact; reattached our two Contexts (cap-exhausted and engine-level audit-mode). Renamed the second to make explicit it exercises the ZelyoConfig.spec.mode=audit path distinct from policy.Spec.DryRun. Verification: make lint → 0 issues; full envtest suite → 21 specs pass across two Describe blocks (14 from the merge + 7 inherited from main including target-policies, per-policy dryRun, and the pre-existing reconcile smoke test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(github): short-circuit CreatePullRequest when an open PR covers the head branch The GitHub engine's CreatePullRequest had a silent-commit footgun that produced the "311 commits on one PR" incident observed on zelyo-operator-fix/demo-app-payment-service PR #12. Flow before the fix: 1. getRef → base SHA 2. createRef returns 422 "branch exists" — engine swallows and continues 3. createOrUpdateFile for each file — COMMITS LAND ON THE EXISTING PR's BRANCH (this is the regression) 4. openPR returns 422 "a pull request already exists" — error bubbles up 5. Controller catches the error but the file commit already shipped 6. Next reconcile: same branch, another commit pile This was only partially mitigated by the controller-layer existingBranches dedup shipped in #91 and #94. That mitigation misses when ListOpenPRs fails, when pagination hits the safety cap, or when any other caller of CreatePullRequest skips the upstream check. The engine itself needed to be the last line of defense. Fix: after createRef returns 422, query GitHub for any open PR whose head branch is exactly pr.HeadBranch. If one exists, log and return it — no commit, no duplicate openPR. If no open PR exists (legitimate case when a user manually closed a PR but left the branch), proceed with commit + openPR as before. If the dedup lookup ITSELF fails, fail closed — remediation defers to the next cycle rather than risk polluting an open PR with duplicate commits. New tests: - SkipsWhenOpenPRExists: createRef=422 + open PR exists → asserts PUT /contents and POST /pulls are never called, existing PR is returned - BranchExistsWithoutOpenPR: createRef=422 + empty PR list → asserts commit + openPR both run (the legitimate orphan-branch case) - BranchExistsLookupFails: createRef=422 + PR list returns 500 → asserts error return, no commit, no openPR (fail-closed) Uses GitHub's head=owner:ref query param to scope the lookup server-side (only one open PR per head branch is possible, so no pagination needed), with in-memory ref equality as a defense-in-depth filter against mocks or proxies that ignore the query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address PR review: typed APIError, owner:branch dedup match, shared PR struct Three reviewer asks on #95: 1. Codex P2 — findOpenPRByHeadBranch was filtering on head.ref alone, so a fork PR using the same branch name could false-match the dedup check when a proxy/mock ignores the server-side head= query. Switch to matching on head.label ("owner:ref"), which GitHub always sets to the canonical identifier. 2. Gemini medium — string-matching "422" in the error message was brittle. doRequest now returns a typed *APIError with StatusCode and Body; the createRef 422 check uses errors.As for a direct integer comparison against http.StatusUnprocessableEntity. 3. Gemini medium — the PR-response anonymous struct was duplicated between findOpenPRByHeadBranch and ListOpenPRs. Extract githubPullResponse at package scope and reuse. Tests: - Existing SkipsWhenOpenPRExists mock now sets head.label so the engine's new equality check matches. - New DoesNotMatchForkPRWithSameRef: server (simulating a proxy that ignored head= query) returns a fork PR with head.label = "forkuser:<branch>" distinct from our owner. Asserts the engine proceeds with its own commit + openPR instead of false-matching. - New APIErrorTypedUnwrap: pins the typed-error contract. Covers both the direct-%w-wrap path callers use and the live doRequest path, so a future refactor that swallows the type breaks here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Apr 22, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 22, 2026

View reviewed changes

mayurkr merged commit 21e3eac into main Apr 22, 2026
6 checks passed

mayurkr deleted the claude/objective-mccarthy-5e1d62 branch April 22, 2026 03:23

mayurkr mentioned this pull request Apr 22, 2026

fix(github): don't add commits to an already-open PR #95

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: enforce maxConcurrentPRs as an open-PR cap, not per-cycle#94

fix: enforce maxConcurrentPRs as an open-PR cap, not per-cycle#94
mayurkr merged 3 commits into
mainfrom
claude/objective-mccarthy-5e1d62

mayurkr commented Apr 22, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Uh oh!

mayurkr Apr 22, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 22, 2026

Uh oh!

mayurkr commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mayurkr commented Apr 22, 2026

Summary

Why this matters

Reviewer notes

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

mayurkr Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

mayurkr commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant