fix(remediation): honor RemediationPolicy.spec.dryRun #93
The CRD field spec.dryRun was documented as "generates fixes but does not create actual PRs" but the controller only logged the flag: it always called RemediationEngine.ApplyPlan, so setting dryRun=true still opened real GitOps PRs. The only reliable preview path was ZelyoConfig.spec.mode=audit at the operator level, which contradicts the per-policy contract.

Gate ApplyPlan + ResolveIncident in processIncidents: when dryRun is true, generate the plan (so operators see fix count / risk in the log and a DryRunPreview event) but skip PR creation, leave the incident open for a later non-dry-run reconcile, and do not bump status.remediationsApplied.

Adds an integration test in the controller envtest suite using a fake llm.Client and fake gitops.Engine: asserts CreatePullRequest is never called when dryRun=true, the seeded incident stays open, and the status counter stays at 0. A counter-case with dryRun=false exercises the same fakes to prove CreatePullRequest is called and the incident is resolved; this guards the dry-run assertion from passing via a broken test harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
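The gating described in this commit can be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: `Plan`, `Incident`, `Engine`, `fakeEngine`, and the `emit` callback are hypothetical stand-ins for the real engine, incident store, and Kubernetes event recorder, and only the control flow follows the commit message.

```go
package main

import "fmt"

// All types below are hypothetical stand-ins for the controller's real
// dependencies; only the control flow mirrors the commit message.

// Plan is a generated remediation plan.
type Plan struct {
	Fixes     []string // one entry per proposed fix
	RiskScore int
}

// Incident is an open incident awaiting remediation.
type Incident struct {
	ID       string
	Resolved bool
}

// Engine abstracts plan generation (LLM) and plan application (GitOps PR).
type Engine interface {
	GeneratePlan(inc *Incident) (*Plan, error)
	ApplyPlan(plan *Plan) error
}

// processIncidents returns how many remediations were actually applied.
// emit receives (reason, message) pairs and stands in for an event recorder.
func processIncidents(dryRun bool, incidents []*Incident, eng Engine, emit func(reason, msg string)) int {
	applied := 0
	for _, inc := range incidents {
		plan, err := eng.GeneratePlan(inc)
		if err != nil {
			continue // a real controller would log and record the failure
		}
		if dryRun {
			// Preview only: no PR, incident stays open, counter untouched.
			emit("DryRunPreview", fmt.Sprintf(
				"Dry-run: would remediate incident %s (fixes=%d, risk=%d), no PR opened",
				inc.ID, len(plan.Fixes), plan.RiskScore))
			continue
		}
		if err := eng.ApplyPlan(plan); err != nil {
			continue
		}
		inc.Resolved = true
		applied++
	}
	return applied
}

// fakeEngine counts ApplyPlan calls so a test can assert none happened.
type fakeEngine struct{ applyCalls int }

func (f *fakeEngine) GeneratePlan(*Incident) (*Plan, error) {
	return &Plan{Fixes: []string{"clusters/demo/app.yaml"}, RiskScore: 3}, nil
}

func (f *fakeEngine) ApplyPlan(*Plan) error {
	f.applyCalls++
	return nil
}
```

Running `processIncidents` once with `dryRun=true` and once with `dryRun=false` against the same `fakeEngine` type mirrors the case/counter-case pairing the commit describes: the second run shows the harness can observe an `ApplyPlan` call at all.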
Code Review
This pull request implements a dry-run feature for the remediation policy controller, enabling the generation of fix plans for operator review without submitting pull requests or resolving incidents. It includes comprehensive test coverage with mock LLM and GitOps engines. Feedback was provided regarding the lack of a processing limit in dry-run mode, which could lead to unbounded LLM calls and associated costs since the standard concurrency limit is bypassed.
    if policy.Spec.DryRun {
        r.Recorder.Event(policy, corev1.EventTypeNormal, "DryRunPreview",
            fmt.Sprintf("Dry-run: would remediate incident %s (fixes=%d, risk=%d) — no PR opened",
                incident.ID, len(plan.Fixes), plan.RiskScore))
        continue
    }
The continue statement here bypasses the prsCreated increment, which means the maxConcurrentPRs check at the start of the loop (line 232) will never be triggered when DryRun is enabled. If a policy has a large number of open incidents, the controller will attempt to generate LLM plans for all of them in a single reconciliation cycle, potentially leading to excessive LLM costs and execution timeouts. Consider using a separate counter to limit the number of incidents processed per cycle, regardless of whether they result in a PR or a dry-run preview.
…complexity The dry-run gating added one branch to processIncidents, which pushed gocyclo from 15 to 16 and tripped the repo's threshold (15). Extract the PAT-token lookup + GitOps engine wiring into a new helper `maybeSetGitOpsEngineFromSecret` — a flat early-return style that reads better than the previous deeply-nested if/if/if block and drops 3 branches from processIncidents. Pure refactor, no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Gemini PR review (PR #93) flagged that prsCreated never increments on the dryRun path, so a policy with N open incidents fires N LLM calls per reconcile regardless of spec.maxConcurrentPRs. On clusters with many correlated incidents that's unbounded LLM cost and reconcile-timeout risk.

Introduce a separate `processed` counter that increments once per incident that makes it to the LLM call (before GeneratePlan, so plan-generation failures still count against the budget). Use it as the loop ceiling; keep prsCreated driving the status counter so status semantics are unchanged.

Add a regression test seeding 5 incidents against maxConcurrentPRs=2 with dryRun=true and asserting the LLM is hit exactly 2 times and every incident stays open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
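A rough sketch of the two-counter loop described above, under stated assumptions: `runCycle` and its callback parameters are hypothetical names, with `generatePlan` standing in for the LLM call and `openPR` for PR creation. Only the `processed` / `prsCreated` split comes from the commit message.

```go
package main

// runCycle is a hypothetical reduction of one reconcile pass.
// processed bounds LLM calls per cycle; prsCreated feeds the status counter.
func runCycle(incidentIDs []string, maxConcurrentPRs int, dryRun bool,
	generatePlan func(id string) error, openPR func(id string) error) (processed, prsCreated int) {
	for _, id := range incidentIDs {
		if processed >= maxConcurrentPRs {
			break // per-cycle budget exhausted, remaining incidents wait
		}
		// Charge the budget before the LLM call so that plan-generation
		// failures still count against it.
		processed++
		if err := generatePlan(id); err != nil {
			continue
		}
		if dryRun {
			continue // preview only: budget charged, no PR, no status bump
		}
		if err := openPR(id); err != nil {
			continue
		}
		prsCreated++
	}
	return processed, prsCreated
}
```

With five incident IDs, `maxConcurrentPRs=2`, and `dryRun=true`, this loop calls `generatePlan` twice and `openPR` never, which is the shape of the regression test the commit adds.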
Resolve conflicts in internal/controller/remediationpolicy_controller.go.

Main (PR #91) refactored processIncidents to extract remediateIncident and added one-shot PR dedup via ListOpenPRs. This branch's dry-run gating and per-cycle LLM budget sit naturally inside the new remediateIncident helper.

Resolution:
- Drop this branch's maybeSetGitOpsEngineFromSecret extraction; keep main's inline gitops-engine setup block. Main already trimmed processIncidents' complexity by extracting remediateIncident, so the helper extraction is no longer needed.
- Thread dryRun into remediateIncident: on dryRun, emit DryRunPreview event after GeneratePlan and return (opened=false, charged=true); skip ApplyPlan + ResolveIncident.
- Change remediateIncident's return from bool to (opened, charged bool) so the caller can drive two counters independently: prsCreated for status.RemediationsApplied, processed for the per-cycle LLM budget.
- Adapt the dedup branch for dryRun: skip ResolveIncident when dryRun is true so the incident survives for a later real reconcile.
- Align the dry-run test's fake LLM response file_path (clusters/...) with the test's repo.Spec.Paths so it clears main's new filterFixesToAllowedPaths gate. Without this, GeneratePlan returns a zero-fix error and the counter-case test silently "passes" for the wrong reason.

Validated: make lint (0 issues), controller + remediation packages green under `go test -race`, all 3 dry-run specs pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
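To illustrate why the fake LLM response's file_path has to line up with the allowed paths, here is a hedged sketch of an allowed-path filter. It is not main's filterFixesToAllowedPaths: `filterFixes`, `Fix`, `errNoFixes`, and the use of prefix matching are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"strings"
)

// Fix is a hypothetical single-file change proposed by the LLM.
type Fix struct{ FilePath string }

// errNoFixes is an assumed sentinel for "nothing left after filtering".
var errNoFixes = errors.New("no fixes within allowed paths")

// filterFixes keeps only fixes whose path starts with one of the allowed
// prefixes. Prefix matching is an assumption; the real gate may differ.
func filterFixes(fixes []Fix, allowed []string) ([]Fix, error) {
	var kept []Fix
	for _, f := range fixes {
		for _, prefix := range allowed {
			if strings.HasPrefix(f.FilePath, prefix) {
				kept = append(kept, f)
				break // one matching prefix is enough
			}
		}
	}
	if len(kept) == 0 {
		// Zero surviving fixes is reported as an error, so the caller
		// never reaches PR creation for this incident.
		return nil, errNoFixes
	}
	return kept, nil
}
```

Under this sketch, a fake response whose path falls outside the allowed list is filtered down to zero fixes and never reaches PR creation, whatever dryRun says, so an assertion about PR calls would no longer be exercising the dry-run gate.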
PR #93 split remediateIncident's single bool return into (opened, charged) so spec.dryRun previews can count against MaxConcurrentPRs without falsely incrementing status.RemediationsApplied. Merged that shape into this branch's scope-gate filter:
- Scope gate still runs in the outer loop before remediateIncident, so filtered incidents consume neither the LLM budget (charged) nor the PR counter (opened), nor do they trigger ResolveIncident.
- Combined the doc comments for remediateIncident so the scope-gate delegation note sits alongside #93's opened/charged explanation.
- Kept both test-support blocks: the scope-gate specs and #93's dry-run specs plus their fake LLM/gitops doubles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
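The ordering described above (scope gate first, budget second) might look roughly like this. The types, the `selectForRemediation` helper, the signature given to `incidentMatchesTargets`, and the "empty target list matches everything" rule are assumptions for illustration, not the branch's actual code.

```go
package main

// TargetRef identifies a SecurityPolicy by namespace and name (hypothetical shape).
type TargetRef struct{ Namespace, Name string }

// Incident carries the SecurityPolicy that produced it (hypothetical shape).
type Incident struct {
	ID     string
	Policy TargetRef
}

// incidentMatchesTargets reports whether the incident is in scope.
// Assumption: an empty target list means "no scoping", so everything matches.
func incidentMatchesTargets(inc Incident, targets []TargetRef) bool {
	if len(targets) == 0 {
		return true
	}
	for _, t := range targets {
		if t == inc.Policy {
			return true
		}
	}
	return false
}

// selectForRemediation applies the scope gate before the budget check, so
// out-of-scope incidents are skipped without consuming any budget.
func selectForRemediation(incidents []Incident, targets []TargetRef, budget int) (selected []Incident, skipped int) {
	for _, inc := range incidents {
		if !incidentMatchesTargets(inc, targets) {
			skipped++ // filtered: neither charged nor opened
			continue
		}
		if len(selected) >= budget {
			break
		}
		selected = append(selected, inc)
	}
	return selected, skipped
}
```

Because the gate runs first, a long run of out-of-scope incidents at the head of the list cannot starve in-scope ones of their budget.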
main landed two behavior fixes while PR #94 was under review:

- #92: RemediationPolicy.spec.targetPolicies scope gate now filters incidents (previously the field validated but never filtered, so an operator scoping to one SecurityPolicy still got PRs for unrelated incidents). Adds SecurityPolicy/namespace tagging on correlator.Event + incidentMatchesTargets.
- #93: RemediationPolicy.spec.dryRun now actually suppresses PR creation (previously the flag was logged-only, so dryRun=true still opened real PRs). Generates the plan + fires a DryRunPreview event but skips ApplyPlan and leaves the incident open.

Resolution folds these alongside PR #94's existing changes:

- snapshotOpenPRs (our helper) and snapshotOpenPRBranches (main's) queried ListOpenPRs for separate purposes (count vs. dedup map). Merged into a single snapshotOpenPRs that returns both from one call; the duplicate main helper is deleted.
- ensureGitOpsEngineFromSecret (our helper using RegisterGitOpsEngine to fix Gemini's cross-reconcile race) supersedes main's initRemediationGitOps (which still called SetGitOpsEngine and reintroduced the race). The main helper is deleted.
- remediateIncident now uses main's (opened, charged) return names but the success-path value is `return result != nil, true`, not main's `return true, true`. The fix matters for the engine-level StrategyDryRun / StrategyReport path (distinct from policy.Spec.DryRun, which is handled earlier): ApplyPlan returns (nil, nil) there, and unconditional opened=true would re-introduce Codex P2 (phantom status.openPRs).
- processIncidents keeps PR #94's budget check (openPRs vs. maxPRs up-front) alongside main's scope gate, using `processed` to drive the per-cycle budget. Two-counter accounting (processed, prsCreated) preserves accurate status.openPRs under all strategies.
- Test file: kept main's scope-gate + per-policy dryRun suites intact; reattached our two Contexts (cap-exhausted and engine-level audit-mode). Renamed the second to make explicit it exercises the ZelyoConfig.spec.mode=audit path distinct from policy.Spec.DryRun.

Verification: make lint → 0 issues; full envtest suite → 21 specs pass across two Describe blocks (14 from the merge + 7 inherited from main including target-policies, per-policy dryRun, and the pre-existing reconcile smoke test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
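The `result != nil` detail can be shown in isolation. This is a sketch under assumptions: `PRResult` and the `Applier` function type are hypothetical, plan generation is assumed to have already happened (hence `charged` is always true here), and treating an apply error as charged-but-not-opened is also an assumption. Only the (opened, charged) names and the `result != nil` success path come from the commit message.

```go
package main

// PRResult is a hypothetical handle to an opened pull request.
type PRResult struct{ URL string }

// Applier stands in for the engine's ApplyPlan. Per the commit message it
// can return (nil, nil) under an engine-level dry-run or report strategy.
type Applier func(incidentID string) (*PRResult, error)

// remediateIncident reports whether a PR was opened and whether the
// per-cycle budget was charged. The plan is assumed to exist already,
// so the budget is charged on every path shown here.
func remediateIncident(incidentID string, policyDryRun bool, apply Applier) (opened, charged bool) {
	if policyDryRun {
		// Per-policy dry run: handled before ApplyPlan is ever reached.
		return false, true
	}
	result, err := apply(incidentID)
	if err != nil {
		return false, true // assumption: failures are charged but not opened
	}
	// A nil result means the engine decided not to open a PR, so opened
	// must be false or the caller would count a PR that does not exist.
	return result != nil, true
}
```

Returning `true, true` unconditionally on the success path would mark a PR as opened even when the engine returned `(nil, nil)`, which is the phantom-count problem the commit calls out.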
Summary
- `RemediationPolicy.spec.dryRun` was defined in api/v1alpha1/remediationpolicy_types.go:42 as "when true, generates fixes but does not create actual PRs", but the controller only logged the flag and still called `RemediationEngine.ApplyPlan` unconditionally, so `dryRun: true` happily opened real GitOps PRs. The only working preview path was the operator-wide `ZelyoConfig.spec.mode: audit`, which contradicts the per-policy CRD contract.
- Gate `ApplyPlan` + `ResolveIncident` in `processIncidents`: when `policy.Spec.DryRun` is true we still generate the plan (operators get fix count + risk score in the log and a `DryRunPreview` Kubernetes event), but skip `ApplyPlan`, skip `ResolveIncident`, and do not bump `status.remediationsApplied`. The incident stays open so a later reconcile with `dryRun=false` picks it up.
- Adds an integration test in the controller envtest suite using a fake `llm.Client` and fake `gitops.Engine`. Asserts `CreatePullRequest` is never called when `dryRun=true`, the seeded incident stays unresolved, and `status.remediationsApplied` stays at `0`. A counter-case with `dryRun=false` hits the same fakes and asserts `CreatePullRequest` IS called and the incident is resolved; this stops the dry-run assertion from passing by accident if the harness breaks.

Why this gating site

Could alternatively thread a per-call `StrategyDryRun` override into the engine, but gating at the controller call site is simpler: it also naturally skips the `ResolveIncident` + counter increment that sit just below `ApplyPlan`, which a pure engine-level override would require separate logic to suppress. CRD schema is unchanged; only the controller contract now matches the documented behavior.

Test plan

- `make test`: full suite passes (16 controller specs, 22s)
- `-ginkgo.focus=Dry-Run` run: 2/2 new specs green
- Apply a `RemediationPolicy` with `dryRun: true` to a cluster with an open incident and confirm no PR shows up in the target repo

🤖 Generated with Claude Code