fix: enforce maxConcurrentPRs as an open-PR cap, not per-cycle#94
Conversation
The RemediationPolicy controller previously reset its PR counter every reconcile, so a policy with maxConcurrentPRs: 3 could accumulate dozens of open PRs across cycles — the field documented "limits the number of open PRs at any time" but the implementation only bounded PRs created in a single reconcile. processIncidents now queries the gitops provider's ListOpenPRs at the start of each cycle (via the newly exported remediation.Engine.GitOpsEngineForRepo), subtracts the count from maxConcurrentPRs to derive the per-cycle budget, and short-circuits when the cap is already met — no LLM plan generation, no PR creation, no incident resolution. The current count is also written to status.openPRs so it shows up in kubectl get remediationpolicy. The AuthSecret→GitHub engine initialization was factored into its own helper to keep processIncidents under the gocyclo threshold. A new Ginkgo spec primes the correlator with a high-severity incident, wires a fake gitops.Engine reporting 3 open PRs, reconciles a policy with maxConcurrentPRs: 3, and asserts zero CreatePullRequest calls, zero LLM calls, the incident stays open, and status.openPRs == 3. The CRD field comment, README, AGENTS.md, and the three user-facing docs updated in #90 are corrected to describe the fixed behavior; CRDs regenerated and synced to deploy/helm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request updates the maxConcurrentPRs logic to enforce a global cap on open Zelyo-generated PRs per repository, rather than a per-reconcile-cycle limit. The changes include documentation updates, CRD modifications, and controller logic to query existing PRs from the GitOps provider before creating new ones. Review feedback identifies a race condition in how the GitOps engine is initialized for shared remediation components and suggests improvements for PR count accuracy via pagination and more robust error handling during provider outages.
| ghClient := github.NewPATClient(token, "") | ||
| ghEngine := github.NewEngine(ghClient, log.WithName("github-engine")) | ||
| r.RemediationEngine.SetGitOpsEngine(ghEngine) | ||
| log.Info("Successfully initialized GitOps engine for remediation", "repo", repo.Name) |
There was a problem hiding this comment.
The RemediationEngine is a shared component across all reconciliations. Calling SetGitOpsEngine updates the global fallback engine, which creates a race condition when multiple RemediationPolicy resources targeting different repositories are reconciled concurrently. A policy might end up using a GitOps engine initialized with credentials from a different policy. Instead of setting the global fallback, use RegisterGitOpsEngine with a repository-specific key to ensure isolation.
ghClient := github.NewPATClient(token, "")
ghEngine := github.NewEngine(ghClient, log.WithName("github-engine"))
owner, name := parseRepoURL(repo.Spec.URL)
if owner != "" && name != "" {
r.RemediationEngine.RegisterGitOpsEngine(owner+"/"+name, ghEngine)
}
log.Info("Successfully initialized GitOps engine for remediation", "repo", repo.Name)| existing, err := ge.ListOpenPRs(ctx, owner, repo) | ||
| if err != nil { | ||
| log.Error(err, "Failed to list open PRs — treating as zero for this cycle", | ||
| "owner", owner, "repo", repo) | ||
| return 0 | ||
| } | ||
| //nolint:gosec // len bounded by GitHub API page size (100). | ||
| return int32(len(existing)) |
There was a problem hiding this comment.
The ListOpenPRs call appears to be limited by the provider's page size (e.g., 100), as noted in the comment on line 386. If a repository has more open Zelyo PRs than this limit, the openPRs count will be inaccurate, potentially causing the controller to exceed the maxConcurrentPRs cap. It is recommended to implement pagination within the GitOps engine's ListOpenPRs implementation to ensure a complete count.
| log.Error(err, "Failed to list open PRs — treating as zero for this cycle", | ||
| "owner", owner, "repo", repo) | ||
| return 0 | ||
| } |
There was a problem hiding this comment.
Treating provider errors as zero open PRs can lead to exceeding the maxConcurrentPRs cap during API outages or rate-limiting events. Since the primary goal of this PR is blast-radius control, consider a more conservative approach, such as skipping remediation or returning an error to trigger a requeue, when the current state of open PRs cannot be reliably determined.
There was a problem hiding this comment.
Appreciate the call-out — considered this and kept the soft-fail intentionally. Three reasons:
- Blast radius is already bounded. Even when
snapshotOpenPRsreturns zero during a GitHub outage, the per-cycle loop bound (incidentsHandled >= budgetwherebudget = maxConcurrentPRs) still caps at mostmaxConcurrentPRsnew PRs per 5-minute reconcile. Worst case is one cycle worth of duplicate PRs, which the next successful reconcile dedups viaexistingBranches— self-healing. - Requeue-on-error would be worse in the common failure mode. A GitHub-wide outage or rate-limit event would trigger controller-runtime backoff across every
RemediationPolicyin the cluster, churning metrics/events and still not opening PRs. Soft-fail degrades to "no remediation progress" instead of "reconcile storm with no remediation progress". - The failure is observable.
log.Erroron the ListOpenPRs failure is structured and hits the operator log;ReconcileTotal{kind="remediationpolicy",result="success"}still increments (the reconcile itself succeeded, just with degraded data). Operators can alert on the combination of a GitHub error-rate signal plus flatzelyo_reconcile_openprsgauge.
Happy to add an explicit SnapshotOpenPRsFailed condition + a countOpenPRsError counter so the degraded state surfaces on the CR status and in Prometheus without changing semantics — that's probably the right middle ground. Opening a follow-up issue for it.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8a250d34d7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // OpenPRs reflects the total count of open Zelyo-generated PRs in the | ||
| // target repo after this cycle: already-open PRs observed at the start | ||
| // plus any this cycle opened. | ||
| policy.Status.OpenPRs = openPRs + prsCreated |
There was a problem hiding this comment.
Compute status.openPRs from real PR creations
status.openPRs is derived as openPRs + prsCreated, but prsCreated is incremented for every successfully processed incident, including dry-run/report strategies where ApplyPlan returns nil and no PR is created. In audit mode this can report non-existent “open PRs,” which violates the new field contract and can mislead operators or any automation reading status.
Useful? React with 👍 / 👎.
Merge conflict resolution folds PR #94's maxConcurrentPRs cap logic into main's per-incident remediateIncident / existingBranches dedup refactor. Both sides queried ListOpenPRs separately; the merged shape consolidates that into a single snapshotOpenPRs helper whose output feeds both the cap check (count) and the dedup map (branch→URL). Review fixes landed in the same commit because they touch the same functions: - Codex P2 (status.openPRs phantom in audit mode): remediateIncident now returns (handled, prCreated bool). processIncidents tracks two counters — incidentsHandled drives the per-cycle budget (preserves LLM cost control across all strategies), prsCreated is set only when ApplyPlan returns a non-nil result. Under StrategyDryRun / StrategyReport, status.openPRs stays pinned to the provider count and status.remediationsApplied does not increment. - Gemini High (SetGitOpsEngine cross-reconcile race): ensureGitOpsEngine FromSecret now calls RegisterGitOpsEngine(owner+"/"+name, ghEngine) instead of mutating the process-wide fallback. Concurrent RemediationPolicy reconciles for different repos can no longer clobber each other's credentials. Skip registration (rather than falling back to SetGitOpsEngine) when owner/name parsing fails, since re-introducing the global mutation defeats the fix. - Gemini Medium (ListOpenPRs page cap): GitHub engine now paginates via page= parameter until a short page arrives or the 10-page safety cap (1000 PRs) hits. Repos with >100 open Zelyo PRs no longer cause silent under-counts that let the controller exceed the cap. Tests: - existing budget-exhausted test still passes under the merged shape - new audit-mode test pins status.openPRs / remediationsApplied behavior under StrategyDryRun - two new github engine pagination tests: happy-path 3-page paginate, and the safety-cap stop at listOpenPRsMaxPages Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Pushed 9661ba4 — merge with main + Codex P2 + Gemini High + Gemini Medium (pagination) + new audit-mode test.
Left the third Gemini comment (errors-as-zero) as-is with a reply inline — rationale boils down to: the per-cycle loop already caps blast radius, and requeue-on-error would produce a reconcile storm during GitHub outages without fixing anything. Happy to add a dedicated |
main landed two behavior fixes while PR #94 was under review: #92 — RemediationPolicy.spec.targetPolicies scope gate now filters incidents (previously the field validated but never filtered, so an operator scoping to one SecurityPolicy still got PRs for unrelated incidents). Adds SecurityPolicy/namespace tagging on correlator.Event + incidentMatchesTargets. #93 — RemediationPolicy.spec.dryRun now actually suppresses PR creation (previously the flag was logged-only, so dryRun=true still opened real PRs). Generates the plan + fires a DryRunPreview event but skips ApplyPlan and leaves the incident open. Resolution folds these alongside PR #94's existing changes: - snapshotOpenPRs (our helper) and snapshotOpenPRBranches (main's) queried ListOpenPRs for separate purposes — count vs. dedup map. Merged into a single snapshotOpenPRs that returns both from one call; the duplicate main helper is deleted. - ensureGitOpsEngineFromSecret (our helper using RegisterGitOpsEngine to fix Gemini's cross-reconcile race) supersedes main's initRemediationGitOps (which still called SetGitOpsEngine and reintroduced the race). The main helper is deleted. - remediateIncident now uses main's (opened, charged) return names but the success-path value is `return result != nil, true`, not main's `return true, true`. The fix matters for the engine-level StrategyDryRun / StrategyReport path (distinct from policy.Spec. DryRun, which is handled earlier): ApplyPlan returns (nil, nil) there, and unconditional opened=true would re-introduce Codex P2 (phantom status.openPRs). - processIncidents keeps PR #94's budget check (openPRs vs. maxPRs up-front) alongside main's scope gate, using `processed` to drive the per-cycle budget. Two-counter accounting (processed, prsCreated) preserves accurate status.openPRs under all strategies. - Test file: kept main's scope-gate + per-policy dryRun suites intact; reattached our two Contexts (cap-exhausted and engine-level audit-mode). Renamed the second to make explicit it exercises the ZelyoConfig.spec.mode=audit path distinct from policy.Spec.DryRun. Verification: make lint → 0 issues; full envtest suite → 21 specs pass across two Describe blocks (14 from the merge + 7 inherited from main including target-policies, per-policy dryRun, and the pre-existing reconcile smoke test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(github): short-circuit CreatePullRequest when an open PR covers the head branch The GitHub engine's CreatePullRequest had a silent-commit footgun that produced the "311 commits on one PR" incident observed on zelyo-operator-fix/demo-app-payment-service PR #12. Flow before the fix: 1. getRef → base SHA 2. createRef returns 422 "branch exists" — engine swallows and continues 3. createOrUpdateFile for each file — COMMITS LAND ON THE EXISTING PR's BRANCH (this is the regression) 4. openPR returns 422 "a pull request already exists" — error bubbles up 5. Controller catches the error but the file commit already shipped 6. Next reconcile: same branch, another commit pile This was only partially mitigated by the controller-layer existingBranches dedup shipped in #91 and #94. That mitigation misses when ListOpenPRs fails, when pagination hits the safety cap, or when any other caller of CreatePullRequest skips the upstream check. The engine itself needed to be the last line of defense. Fix: after createRef returns 422, query GitHub for any open PR whose head branch is exactly pr.HeadBranch. If one exists, log and return it — no commit, no duplicate openPR. If no open PR exists (legitimate case when a user manually closed a PR but left the branch), proceed with commit + openPR as before. If the dedup lookup ITSELF fails, fail closed — remediation defers to the next cycle rather than risk polluting an open PR with duplicate commits. New tests: - SkipsWhenOpenPRExists: createRef=422 + open PR exists → asserts PUT /contents and POST /pulls are never called, existing PR is returned - BranchExistsWithoutOpenPR: createRef=422 + empty PR list → asserts commit + openPR both run (the legitimate orphan-branch case) - BranchExistsLookupFails: createRef=422 + PR list returns 500 → asserts error return, no commit, no openPR (fail-closed) Uses GitHub's head=owner:ref query param to scope the lookup server-side (only one open PR per head branch is possible, so no pagination needed), with in-memory ref equality as a defense-in-depth filter against mocks or proxies that ignore the query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * address PR review: typed APIError, owner:branch dedup match, shared PR struct Three reviewer asks on #95: 1. Codex P2 — findOpenPRByHeadBranch was filtering on head.ref alone, so a fork PR using the same branch name could false-match the dedup check when a proxy/mock ignores the server-side head= query. Switch to matching on head.label ("owner:ref"), which GitHub always sets to the canonical identifier. 2. Gemini medium — string-matching "422" in the error message was brittle. doRequest now returns a typed *APIError with StatusCode and Body; the createRef 422 check uses errors.As for a direct integer comparison against http.StatusUnprocessableEntity. 3. Gemini medium — the PR-response anonymous struct was duplicated between findOpenPRByHeadBranch and ListOpenPRs. Extract githubPullResponse at package scope and reuse. Tests: - Existing SkipsWhenOpenPRExists mock now sets head.label so the engine's new equality check matches. - New DoesNotMatchForkPRWithSameRef: server (simulating a proxy that ignored head= query) returns a fork PR with head.label = "forkuser:<branch>" distinct from our owner. Asserts the engine proceeds with its own commit + openPR instead of false-matching. - New APIErrorTypedUnwrap: pins the typed-error contract. Covers both the direct-%w-wrap path callers use and the live doRequest path, so a future refactor that swallows the type breaks here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
RemediationPolicy.spec.maxConcurrentPRswas documented as limits the number of open PRs at any time but was implemented as a per-reconcile-cycle bound:prsCreatedreset to 0 each cycle, so a policy withmaxConcurrentPRs: 3could accumulate dozens of open PRs across the 5-minute requeue loop.processIncidentsnow counts currently-open Zelyo-generated PRs on the target repo viagitops.Engine.ListOpenPRsand subtracts from the cap before entering the incident loop. When the cap is already met, the cycle short-circuits — no LLM plan generation, no PR creation, no incident resolution (so the incident stays open for retry after existing PRs merge).status.openPRs(already defined on the CRD but never written) is now populated each reconcile so users can see the current count viakubectl get remediationpolicy.config/crd/basesand syncsdeploy/helm.Why this matters
The field is a blast-radius control — a Zelyo operator that opens 30 unmerged PRs overnight is indistinguishable from a broken one to the humans who have to review them. Matching the implementation to the documented contract is the whole point of the field.
Reviewer notes
RemediationPolicyresources target the same repo, they share the open-PR budget (the cap is per-repo). Per-policy scoping viaPRTemplate.BranchPrefixis not wired —gitops.BranchNamehardcodeszelyo-operator/fix/…and doesn't read the prefix, so filteringListOpenPRsby the defaultzelyo-operator/fix-would silently match zero and re-break the cap we're fixing. This limitation is documented in the helper's comment and the CRD field comment; fixingBranchNameto honorBranchPrefixcan be a follow-up.ListOpenPRserrors are logged and treated as zero. The worst-case during a provider outage is a temporarily-inflated budget for the duration of that cycle — remediation is never permanently blocked.AuthSecret→GitHub engine block inprocessIncidentswas extracted intoensureGitOpsEngineFromSecretto keep the function under the complexity threshold after the new logic landed. Behavior is unchanged.remediation.Engine.GitOpsEngineForRepois new (thin wrapper over the existing unexportedgetGitOpsEngine) so the controller can reuse whichever gitops engine is configured — PAT-from-secret path, pre-registered viaRegisterGitOpsEngine, or injected test stub — without reconstructing one.Test plan
make lint→ 0 issuesgo test -race $(go list ./... | grep -v /e2e)→ all packages greenWhen maxConcurrentPRs is already reached by open PRsruns insideinternal/controller/remediationpolicy_controller_test.goand asserts:gitops.Engine.CreatePullRequestwas never calledCompletewas never calledstatus.openPRs == 3andstatus.remediationsApplied == 0🤖 Generated with Claude Code