Skip to content

fix(remediation): honor RemediationPolicy.spec.dryRun#93

Merged
mayurkr merged 4 commits into
mainfrom
claude/condescending-mclean-17ec9a
Apr 22, 2026
Merged

fix(remediation): honor RemediationPolicy.spec.dryRun#93
mayurkr merged 4 commits into
mainfrom
claude/condescending-mclean-17ec9a

Conversation

@mayurkr
Copy link
Copy Markdown
Contributor

@mayurkr mayurkr commented Apr 21, 2026

Summary

  • RemediationPolicy.spec.dryRun was defined in api/v1alpha1/remediationpolicy_types.go:42 as "when true, generates fixes but does not create actual PRs", but the controller only logged the flag and still called RemediationEngine.ApplyPlan unconditionally — so dryRun: true happily opened real GitOps PRs. The only working preview path was the operator-wide ZelyoConfig.spec.mode: audit, which contradicts the per-policy CRD contract.
  • Fixed by gating in processIncidents: when policy.Spec.DryRun is true we still generate the plan (operators get fix count + risk score in the log and a DryRunPreview Kubernetes event), but skip ApplyPlan, skip ResolveIncident, and do not bump status.remediationsApplied. The incident stays open so a later reconcile with dryRun=false picks it up.
  • Added an envtest-level integration test using a fake llm.Client and fake gitops.Engine. Asserts CreatePullRequest is never called when dryRun=true, the seeded incident stays unresolved, and status.remediationsApplied stays at 0. A counter-case with dryRun=false hits the same fakes and asserts CreatePullRequest IS called and the incident is resolved — this stops the dry-run assertion from passing by accident if the harness breaks.

Why this gating site

Could alternatively thread a per-call StrategyDryRun override into the engine, but gating at the controller call site is simpler: it also naturally skips the ResolveIncident + counter increment that sit just below ApplyPlan, which a pure engine-level override would require separate logic to suppress. CRD schema is unchanged; only the controller contract now matches the documented behavior.

Test plan

  • make test — full suite passes (16 controller specs, 22s)
  • Focused -ginkgo.focus=Dry-Run run — 2/2 new specs green
  • Manually apply a RemediationPolicy with dryRun: true to a cluster with an open incident and confirm no PR shows up in the target repo

🤖 Generated with Claude Code

The CRD field spec.dryRun was documented as "generates fixes but does
not create actual PRs" but the controller only logged the flag — it
always called RemediationEngine.ApplyPlan, so setting dryRun=true
still opened real GitOps PRs. The only reliable preview path was
ZelyoConfig.spec.mode=audit at the operator level, which contradicts
the per-policy contract.

Gate ApplyPlan + ResolveIncident in processIncidents: when dryRun is
true, generate the plan (so operators see fix count / risk in the log
and a DryRunPreview event) but skip PR creation, leave the incident
open for a later non-dry-run reconcile, and do not bump
status.remediationsApplied.

Adds an integration test in the controller envtest suite using a fake
llm.Client and fake gitops.Engine: asserts CreatePullRequest is never
called when dryRun=true, the seeded incident stays open, and the
status counter stays at 0. A counter-case with dryRun=false exercises
the same fakes to prove CreatePullRequest is called and the incident
is resolved — this guards the dry-run assertion from passing via a
broken test harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 21, 2026

Warning

Rate limit exceeded

@mayurkr has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 19 minutes and 48 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 19 minutes and 48 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 00c34a09-6adf-4c86-a42b-b3f891ad316c

📥 Commits

Reviewing files that changed from the base of the PR and between bdad37e and 2738995.

📒 Files selected for processing (2)
  • internal/controller/remediationpolicy_controller.go
  • internal/controller/remediationpolicy_controller_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/condescending-mclean-17ec9a

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a dry-run feature for the remediation policy controller, enabling the generation of fix plans for operator review without submitting pull requests or resolving incidents. It includes comprehensive test coverage with mock LLM and GitOps engines. Feedback was provided regarding the lack of a processing limit in dry-run mode, which could lead to unbounded LLM calls and associated costs since the standard concurrency limit is bypassed.

Comment on lines +272 to +277
if policy.Spec.DryRun {
r.Recorder.Event(policy, corev1.EventTypeNormal, "DryRunPreview",
fmt.Sprintf("Dry-run: would remediate incident %s (fixes=%d, risk=%d) — no PR opened",
incident.ID, len(plan.Fixes), plan.RiskScore))
continue
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The continue statement here bypasses the prsCreated increment, which means the maxConcurrentPRs check at the start of the loop (line 232) will never be triggered when DryRun is enabled. If a policy has a large number of open incidents, the controller will attempt to generate LLM plans for all of them in a single reconciliation cycle, potentially leading to excessive LLM costs and execution timeouts. Consider using a separate counter to limit the number of incidents processed per cycle, regardless of whether they result in a PR or a dry-run preview.

mayurkr and others added 3 commits April 21, 2026 23:51
…complexity

The dry-run gating added one branch to processIncidents, which pushed
gocyclo from 15 to 16 and tripped the repo's threshold (15). Extract
the PAT-token lookup + GitOps engine wiring into a new helper
`maybeSetGitOpsEngineFromSecret` — a flat early-return style that
reads better than the previous deeply-nested if/if/if block and
drops 3 branches from processIncidents. Pure refactor, no behavior
change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Gemini PR review (PR #93) flagged that prsCreated never increments on
the dryRun path, so a policy with N open incidents fires N LLM calls
per reconcile regardless of spec.maxConcurrentPRs. On clusters with
many correlated incidents that's unbounded LLM cost and reconcile-
timeout risk.

Introduce a separate `processed` counter that increments once per
incident that makes it to the LLM call (before GeneratePlan, so
plan-generation failures still count against the budget). Use it as
the loop ceiling; keep prsCreated driving the status counter so
status semantics are unchanged.

Add a regression test seeding 5 incidents against maxConcurrentPRs=2
with dryRun=true and asserting the LLM is hit exactly 2 times and
every incident stays open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve conflicts in internal/controller/remediationpolicy_controller.go.
Main (PR #91) refactored processIncidents to extract remediateIncident and
added one-shot PR dedup via ListOpenPRs. This branch's dry-run gating and
per-cycle LLM budget sit naturally inside the new remediateIncident helper.

Resolution:
- Drop this branch's maybeSetGitOpsEngineFromSecret extraction; keep
  main's inline gitops-engine setup block. Main already trimmed
  processIncidents' complexity by extracting remediateIncident, so the
  helper extraction is no longer needed.
- Thread dryRun into remediateIncident: on dryRun, emit DryRunPreview
  event after GeneratePlan and return (opened=false, charged=true); skip
  ApplyPlan + ResolveIncident.
- Change remediateIncident's return from bool to (opened, charged bool)
  so the caller can drive two counters independently: prsCreated for
  status.RemediationsApplied, processed for the per-cycle LLM budget.
- Adapt the dedup branch for dryRun: skip ResolveIncident when dryRun
  is true so the incident survives for a later real reconcile.
- Align the dry-run test's fake LLM response file_path (clusters/...)
  with the test's repo.Spec.Paths so it clears main's new
  filterFixesToAllowedPaths gate. Without this, GeneratePlan returns
  a zero-fix error and the counter-case test silently "passes" for the
  wrong reason.

Validated: make lint (0 issues), controller + remediation packages
green under `go test -race`, all 3 dry-run specs pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mayurkr mayurkr merged commit 7fa2819 into main Apr 22, 2026
6 checks passed
@mayurkr mayurkr deleted the claude/condescending-mclean-17ec9a branch April 22, 2026 02:39
mayurkr added a commit that referenced this pull request Apr 22, 2026
PR #93 split remediateIncident's single bool return into (opened,
charged) so spec.dryRun previews can count against MaxConcurrentPRs
without falsely incrementing status.RemediationsApplied. Merged that
shape into this branch's scope-gate filter:

  - Scope gate still runs in the outer loop before remediateIncident,
    so filtered incidents consume neither the LLM budget (charged)
    nor the PR counter (opened), nor do they trigger ResolveIncident.
  - Combined the doc comments for remediateIncident so the scope-gate
    delegation note sits alongside #93's opened/charged explanation.
  - Kept both test-support blocks: the scope-gate specs and #93's
    dry-run specs plus their fake LLM/gitops doubles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mayurkr added a commit that referenced this pull request Apr 22, 2026
main landed two behavior fixes while PR #94 was under review:

  #92 — RemediationPolicy.spec.targetPolicies scope gate now filters
        incidents (previously the field validated but never filtered,
        so an operator scoping to one SecurityPolicy still got PRs for
        unrelated incidents). Adds SecurityPolicy/namespace tagging on
        correlator.Event + incidentMatchesTargets.

  #93 — RemediationPolicy.spec.dryRun now actually suppresses PR
        creation (previously the flag was logged-only, so dryRun=true
        still opened real PRs). Generates the plan + fires a
        DryRunPreview event but skips ApplyPlan and leaves the
        incident open.

Resolution folds these alongside PR #94's existing changes:

- snapshotOpenPRs (our helper) and snapshotOpenPRBranches (main's)
  queried ListOpenPRs for separate purposes — count vs. dedup map.
  Merged into a single snapshotOpenPRs that returns both from one
  call; the duplicate main helper is deleted.

- ensureGitOpsEngineFromSecret (our helper using RegisterGitOpsEngine
  to fix Gemini's cross-reconcile race) supersedes main's
  initRemediationGitOps (which still called SetGitOpsEngine and
  reintroduced the race). The main helper is deleted.

- remediateIncident now uses main's (opened, charged) return names
  but the success-path value is `return result != nil, true`, not
  main's `return true, true`. The fix matters for the engine-level
  StrategyDryRun / StrategyReport path (distinct from policy.Spec.
  DryRun, which is handled earlier): ApplyPlan returns (nil, nil)
  there, and unconditional opened=true would re-introduce Codex P2
  (phantom status.openPRs).

- processIncidents keeps PR #94's budget check (openPRs vs. maxPRs
  up-front) alongside main's scope gate, using `processed` to drive
  the per-cycle budget. Two-counter accounting (processed, prsCreated)
  preserves accurate status.openPRs under all strategies.

- Test file: kept main's scope-gate + per-policy dryRun suites
  intact; reattached our two Contexts (cap-exhausted and engine-level
  audit-mode). Renamed the second to make explicit it exercises the
  ZelyoConfig.spec.mode=audit path distinct from policy.Spec.DryRun.

Verification: make lint → 0 issues; full envtest suite → 21 specs pass
across two Describe blocks (14 from the merge + 7 inherited from main
including target-policies, per-policy dryRun, and the pre-existing
reconcile smoke test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant