Skip to content

fix: enforce maxConcurrentPRs as an open-PR cap, not per-cycle#94

Merged
mayurkr merged 3 commits into
mainfrom
claude/objective-mccarthy-5e1d62
Apr 22, 2026
Merged

fix: enforce maxConcurrentPRs as an open-PR cap, not per-cycle#94
mayurkr merged 3 commits into
mainfrom
claude/objective-mccarthy-5e1d62

Conversation

@mayurkr
Copy link
Copy Markdown
Contributor

@mayurkr mayurkr commented Apr 22, 2026

Summary

  • RemediationPolicy.spec.maxConcurrentPRs was documented as limits the number of open PRs at any time but was implemented as a per-reconcile-cycle bound: prsCreated reset to 0 each cycle, so a policy with maxConcurrentPRs: 3 could accumulate dozens of open PRs across the 5-minute requeue loop.
  • processIncidents now counts currently-open Zelyo-generated PRs on the target repo via gitops.Engine.ListOpenPRs and subtracts from the cap before entering the incident loop. When the cap is already met, the cycle short-circuits — no LLM plan generation, no PR creation, no incident resolution (so the incident stays open for retry after existing PRs merge).
  • status.openPRs (already defined on the CRD but never written) is now populated each reconcile so users can see the current count via kubectl get remediationpolicy.
  • docs: clarify Protect mode requires a RemediationPolicy #90's docs were updated to call out the old per-cycle behavior; this PR walks those docs back to describe the fixed contract and corrects the CRD field comment, then regenerates the CRDs under config/crd/bases and syncs deploy/helm.

Why this matters

The field is a blast-radius control — a Zelyo operator that opens 30 unmerged PRs overnight is indistinguishable from a broken one to the humans who have to review them. Matching the implementation to the documented contract is the whole point of the field.

Reviewer notes

  • Cross-policy scoping. When multiple RemediationPolicy resources target the same repo, they share the open-PR budget (the cap is per-repo). Per-policy scoping via PRTemplate.BranchPrefix is not wired — gitops.BranchName hardcodes zelyo-operator/fix/… and doesn't read the prefix, so filtering ListOpenPRs by the default zelyo-operator/fix- would silently match zero and re-break the cap we're fixing. This limitation is documented in the helper's comment and the CRD field comment; fixing BranchName to honor BranchPrefix can be a follow-up.
  • Transient provider failures. ListOpenPRs errors are logged and treated as zero. The worst-case during a provider outage is a temporarily-inflated budget for the duration of that cycle — remediation is never permanently blocked.
  • Refactor to satisfy gocyclo. The AuthSecret→GitHub engine block in processIncidents was extracted into ensureGitOpsEngineFromSecret to keep the function under the complexity threshold after the new logic landed. Behavior is unchanged.
  • Exported API. remediation.Engine.GitOpsEngineForRepo is new (thin wrapper over the existing unexported getGitOpsEngine) so the controller can reuse whichever gitops engine is configured — PAT-from-secret path, pre-registered via RegisterGitOpsEngine, or injected test stub — without reconstructing one.

Test plan

  • make lint → 0 issues
  • go test -race $(go list ./... | grep -v /e2e) → all packages green
  • New Ginkgo spec When maxConcurrentPRs is already reached by open PRs runs inside internal/controller/remediationpolicy_controller_test.go and asserts:
    • fake gitops.Engine.CreatePullRequest was never called
    • fake LLM Complete was never called
    • the open incident was not resolved (stays available for retry)
    • status.openPRs == 3 and status.remediationsApplied == 0
  • Manual verification against a real GitHub repo with >maxConcurrentPRs already-open Zelyo PRs before merge

🤖 Generated with Claude Code

The RemediationPolicy controller previously reset its PR counter every
reconcile, so a policy with maxConcurrentPRs: 3 could accumulate dozens
of open PRs across cycles — the field documented "limits the number of
open PRs at any time" but the implementation only bounded PRs created
in a single reconcile.

processIncidents now queries the gitops provider's ListOpenPRs at the
start of each cycle (via the newly exported
remediation.Engine.GitOpsEngineForRepo), subtracts the count from
maxConcurrentPRs to derive the per-cycle budget, and short-circuits
when the cap is already met — no LLM plan generation, no PR creation,
no incident resolution. The current count is also written to
status.openPRs so it shows up in kubectl get remediationpolicy. The
AuthSecret→GitHub engine initialization was factored into its own
helper to keep processIncidents under the gocyclo threshold.

A new Ginkgo spec primes the correlator with a high-severity incident,
wires a fake gitops.Engine reporting 3 open PRs, reconciles a policy
with maxConcurrentPRs: 3, and asserts zero CreatePullRequest calls,
zero LLM calls, the incident stays open, and status.openPRs == 3.

The CRD field comment, README, AGENTS.md, and the three user-facing
docs updated in #90 are corrected to describe the fixed behavior; CRDs
regenerated and synced to deploy/helm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the maxConcurrentPRs logic to enforce a global cap on open Zelyo-generated PRs per repository, rather than a per-reconcile-cycle limit. The changes include documentation updates, CRD modifications, and controller logic to query existing PRs from the GitOps provider before creating new ones. Review feedback identifies a race condition in how the GitOps engine is initialized for shared remediation components and suggests improvements for PR count accuracy via pagination and more robust error handling during provider outages.

Comment on lines +328 to +331
ghClient := github.NewPATClient(token, "")
ghEngine := github.NewEngine(ghClient, log.WithName("github-engine"))
r.RemediationEngine.SetGitOpsEngine(ghEngine)
log.Info("Successfully initialized GitOps engine for remediation", "repo", repo.Name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The RemediationEngine is a shared component across all reconciliations. Calling SetGitOpsEngine updates the global fallback engine, which creates a race condition when multiple RemediationPolicy resources targeting different repositories are reconciled concurrently. A policy might end up using a GitOps engine initialized with credentials from a different policy. Instead of setting the global fallback, use RegisterGitOpsEngine with a repository-specific key to ensure isolation.

	ghClient := github.NewPATClient(token, "")
	ghEngine := github.NewEngine(ghClient, log.WithName("github-engine"))
	owner, name := parseRepoURL(repo.Spec.URL)
	if owner != "" && name != "" {
		r.RemediationEngine.RegisterGitOpsEngine(owner+"/"+name, ghEngine)
	}
	log.Info("Successfully initialized GitOps engine for remediation", "repo", repo.Name)

Comment on lines +380 to +387
existing, err := ge.ListOpenPRs(ctx, owner, repo)
if err != nil {
log.Error(err, "Failed to list open PRs — treating as zero for this cycle",
"owner", owner, "repo", repo)
return 0
}
//nolint:gosec // len bounded by GitHub API page size (100).
return int32(len(existing))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The ListOpenPRs call appears to be limited by the provider's page size (e.g., 100), as noted in the comment on line 386. If a repository has more open Zelyo PRs than this limit, the openPRs count will be inaccurate, potentially causing the controller to exceed the maxConcurrentPRs cap. It is recommended to implement pagination within the GitOps engine's ListOpenPRs implementation to ensure a complete count.

Comment on lines +382 to +385
log.Error(err, "Failed to list open PRs — treating as zero for this cycle",
"owner", owner, "repo", repo)
return 0
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Treating provider errors as zero open PRs can lead to exceeding the maxConcurrentPRs cap during API outages or rate-limiting events. Since the primary goal of this PR is blast-radius control, consider a more conservative approach, such as skipping remediation or returning an error to trigger a requeue, when the current state of open PRs cannot be reliably determined.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the call-out — considered this and kept the soft-fail intentionally. Three reasons:

  1. Blast radius is already bounded. Even when snapshotOpenPRs returns zero during a GitHub outage, the per-cycle loop bound (incidentsHandled >= budget where budget = maxConcurrentPRs) still caps at most maxConcurrentPRs new PRs per 5-minute reconcile. Worst case is one cycle worth of duplicate PRs, which the next successful reconcile dedups via existingBranches — self-healing.
  2. Requeue-on-error would be worse in the common failure mode. A GitHub-wide outage or rate-limit event would trigger controller-runtime backoff across every RemediationPolicy in the cluster, churning metrics/events and still not opening PRs. Soft-fail degrades to "no remediation progress" instead of "reconcile storm with no remediation progress".
  3. The failure is observable. log.Error on the ListOpenPRs failure is structured and hits the operator log; ReconcileTotal{kind="remediationpolicy",result="success"} still increments (the reconcile itself succeeded, just with degraded data). Operators can alert on the combination of a GitHub error-rate signal plus flat zelyo_reconcile_openprs gauge.

Happy to add an explicit SnapshotOpenPRsFailed condition + a countOpenPRsError counter so the degraded state surfaces on the CR status and in Prometheus without changing semantics — that's probably the right middle ground. Opening a follow-up issue for it.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a250d34d7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// OpenPRs reflects the total count of open Zelyo-generated PRs in the
// target repo after this cycle: already-open PRs observed at the start
// plus any this cycle opened.
policy.Status.OpenPRs = openPRs + prsCreated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Compute status.openPRs from real PR creations

status.openPRs is derived as openPRs + prsCreated, but prsCreated is incremented for every successfully processed incident, including dry-run/report strategies where ApplyPlan returns nil and no PR is created. In audit mode this can report non-existent “open PRs,” which violates the new field contract and can mislead operators or any automation reading status.

Useful? React with 👍 / 👎.

Merge conflict resolution folds PR #94's maxConcurrentPRs cap logic
into main's per-incident remediateIncident / existingBranches dedup
refactor. Both sides queried ListOpenPRs separately; the merged shape
consolidates that into a single snapshotOpenPRs helper whose output
feeds both the cap check (count) and the dedup map (branch→URL).

Review fixes landed in the same commit because they touch the same
functions:

- Codex P2 (status.openPRs phantom in audit mode): remediateIncident
  now returns (handled, prCreated bool). processIncidents tracks two
  counters — incidentsHandled drives the per-cycle budget (preserves
  LLM cost control across all strategies), prsCreated is set only
  when ApplyPlan returns a non-nil result. Under StrategyDryRun /
  StrategyReport, status.openPRs stays pinned to the provider count
  and status.remediationsApplied does not increment.

- Gemini High (SetGitOpsEngine cross-reconcile race): ensureGitOpsEngine
  FromSecret now calls RegisterGitOpsEngine(owner+"/"+name, ghEngine)
  instead of mutating the process-wide fallback. Concurrent
  RemediationPolicy reconciles for different repos can no longer
  clobber each other's credentials. Skip registration (rather than
  falling back to SetGitOpsEngine) when owner/name parsing fails,
  since re-introducing the global mutation defeats the fix.

- Gemini Medium (ListOpenPRs page cap): GitHub engine now paginates
  via page= parameter until a short page arrives or the 10-page safety
  cap (1000 PRs) hits. Repos with >100 open Zelyo PRs no longer cause
  silent under-counts that let the controller exceed the cap.

Tests:
- existing budget-exhausted test still passes under the merged shape
- new audit-mode test pins status.openPRs / remediationsApplied
  behavior under StrategyDryRun
- two new github engine pagination tests: happy-path 3-page paginate,
  and the safety-cap stop at listOpenPRsMaxPages

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mayurkr
Copy link
Copy Markdown
Contributor Author

mayurkr commented Apr 22, 2026

Pushed 9661ba4 — merge with main + Codex P2 + Gemini High + Gemini Medium (pagination) + new audit-mode test.

  • Merge: folds the cap-check and main's existingBranches dedup into one snapshotOpenPRs helper so we only call ListOpenPRs once per reconcile.
  • Codex P2 (status.openPRs phantom in audit mode): remediateIncident now returns (handled, prCreated bool). processIncidents tracks two counters — incidentsHandled drives the per-cycle budget (preserves LLM cost control for dry-run), prsCreated only ticks when ApplyPlan returns a real result. Test: audit-test-policy with DryRun: true reconciles one incident, verifies status.openPRs stays pinned at the 1 pre-existing PR (not 2) and status.remediationsApplied doesn't tick.
  • Gemini High (SetGitOpsEngine race): ensureGitOpsEngineFromSecret now calls RegisterGitOpsEngine(owner+"/"+name, ghEngine). Concurrent RemediationPolicies targeting different repos no longer clobber each other's credentials.
  • Gemini Medium (ListOpenPRs pagination): paginates via page=N until a short page arrives or hits the 10-page safety cap (1000 PRs). Two new tests cover the happy path and the cap.

Left the third Gemini comment (errors-as-zero) as-is with a reply inline — rationale boils down to: the per-cycle loop already caps blast radius, and requeue-on-error would produce a reconcile storm during GitHub outages without fixing anything. Happy to add a dedicated SnapshotOpenPRsFailed condition as a follow-up so the degraded state is observable on the CR.

main landed two behavior fixes while PR #94 was under review:

  #92 — RemediationPolicy.spec.targetPolicies scope gate now filters
        incidents (previously the field validated but never filtered,
        so an operator scoping to one SecurityPolicy still got PRs for
        unrelated incidents). Adds SecurityPolicy/namespace tagging on
        correlator.Event + incidentMatchesTargets.

  #93 — RemediationPolicy.spec.dryRun now actually suppresses PR
        creation (previously the flag was logged-only, so dryRun=true
        still opened real PRs). Generates the plan + fires a
        DryRunPreview event but skips ApplyPlan and leaves the
        incident open.

Resolution folds these alongside PR #94's existing changes:

- snapshotOpenPRs (our helper) and snapshotOpenPRBranches (main's)
  queried ListOpenPRs for separate purposes — count vs. dedup map.
  Merged into a single snapshotOpenPRs that returns both from one
  call; the duplicate main helper is deleted.

- ensureGitOpsEngineFromSecret (our helper using RegisterGitOpsEngine
  to fix Gemini's cross-reconcile race) supersedes main's
  initRemediationGitOps (which still called SetGitOpsEngine and
  reintroduced the race). The main helper is deleted.

- remediateIncident now uses main's (opened, charged) return names
  but the success-path value is `return result != nil, true`, not
  main's `return true, true`. The fix matters for the engine-level
  StrategyDryRun / StrategyReport path (distinct from policy.Spec.
  DryRun, which is handled earlier): ApplyPlan returns (nil, nil)
  there, and unconditional opened=true would re-introduce Codex P2
  (phantom status.openPRs).

- processIncidents keeps PR #94's budget check (openPRs vs. maxPRs
  up-front) alongside main's scope gate, using `processed` to drive
  the per-cycle budget. Two-counter accounting (processed, prsCreated)
  preserves accurate status.openPRs under all strategies.

- Test file: kept main's scope-gate + per-policy dryRun suites
  intact; reattached our two Contexts (cap-exhausted and engine-level
  audit-mode). Renamed the second to make explicit it exercises the
  ZelyoConfig.spec.mode=audit path distinct from policy.Spec.DryRun.

Verification: make lint → 0 issues; full envtest suite → 21 specs pass
across two Describe blocks (14 from the merge + 7 inherited from main
including target-policies, per-policy dryRun, and the pre-existing
reconcile smoke test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mayurkr mayurkr merged commit 21e3eac into main Apr 22, 2026
6 checks passed
@mayurkr mayurkr deleted the claude/objective-mccarthy-5e1d62 branch April 22, 2026 03:23
mayurkr added a commit that referenced this pull request Apr 22, 2026
* fix(github): short-circuit CreatePullRequest when an open PR covers the head branch

The GitHub engine's CreatePullRequest had a silent-commit footgun that
produced the "311 commits on one PR" incident observed on
zelyo-operator-fix/demo-app-payment-service PR #12.

Flow before the fix:
  1. getRef → base SHA
  2. createRef returns 422 "branch exists" — engine swallows and continues
  3. createOrUpdateFile for each file — COMMITS LAND ON THE EXISTING PR's
     BRANCH (this is the regression)
  4. openPR returns 422 "a pull request already exists" — error bubbles up
  5. Controller catches the error but the file commit already shipped
  6. Next reconcile: same branch, another commit pile

This was only partially mitigated by the controller-layer existingBranches
dedup shipped in #91 and #94. That mitigation misses when ListOpenPRs
fails, when pagination hits the safety cap, or when any other caller of
CreatePullRequest skips the upstream check. The engine itself needed to
be the last line of defense.

Fix: after createRef returns 422, query GitHub for any open PR whose head
branch is exactly pr.HeadBranch. If one exists, log and return it — no
commit, no duplicate openPR. If no open PR exists (legitimate case when a
user manually closed a PR but left the branch), proceed with commit +
openPR as before. If the dedup lookup ITSELF fails, fail closed —
remediation defers to the next cycle rather than risk polluting an open
PR with duplicate commits.

New tests:
  - SkipsWhenOpenPRExists: createRef=422 + open PR exists → asserts PUT
    /contents and POST /pulls are never called, existing PR is returned
  - BranchExistsWithoutOpenPR: createRef=422 + empty PR list → asserts
    commit + openPR both run (the legitimate orphan-branch case)
  - BranchExistsLookupFails: createRef=422 + PR list returns 500 → asserts
    error return, no commit, no openPR (fail-closed)

Uses GitHub's head=owner:ref query param to scope the lookup server-side
(only one open PR per head branch is possible, so no pagination needed),
with in-memory ref equality as a defense-in-depth filter against mocks
or proxies that ignore the query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address PR review: typed APIError, owner:branch dedup match, shared PR struct

Three reviewer asks on #95:

1. Codex P2 — findOpenPRByHeadBranch was filtering on head.ref alone,
   so a fork PR using the same branch name could false-match the dedup
   check when a proxy/mock ignores the server-side head= query. Switch
   to matching on head.label ("owner:ref"), which GitHub always sets
   to the canonical identifier.

2. Gemini medium — string-matching "422" in the error message was
   brittle. doRequest now returns a typed *APIError with StatusCode
   and Body; the createRef 422 check uses errors.As for a direct
   integer comparison against http.StatusUnprocessableEntity.

3. Gemini medium — the PR-response anonymous struct was duplicated
   between findOpenPRByHeadBranch and ListOpenPRs. Extract
   githubPullResponse at package scope and reuse.

Tests:
 - Existing SkipsWhenOpenPRExists mock now sets head.label so the
   engine's new equality check matches.
 - New DoesNotMatchForkPRWithSameRef: server (simulating a proxy that
   ignored head= query) returns a fork PR with head.label =
   "forkuser:<branch>" distinct from our owner. Asserts the engine
   proceeds with its own commit + openPR instead of false-matching.
 - New APIErrorTypedUnwrap: pins the typed-error contract. Covers
   both the direct-%w-wrap path callers use and the live doRequest
   path, so a future refactor that swallows the type breaks here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant