Skip to content

Commit 087091f

Browse files
mayurkrclaude
andcommitted
Merge origin/main — integrate #92 (targetPolicies) and #93 (dryRun)
main landed two behavior fixes while PR #94 was under review: #92 — RemediationPolicy.spec.targetPolicies scope gate now filters incidents (previously the field validated but never filtered, so an operator scoping to one SecurityPolicy still got PRs for unrelated incidents). Adds SecurityPolicy/namespace tagging on correlator.Event + incidentMatchesTargets. #93 — RemediationPolicy.spec.dryRun now actually suppresses PR creation (previously the flag was logged-only, so dryRun=true still opened real PRs). Generates the plan + fires a DryRunPreview event but skips ApplyPlan and leaves the incident open. Resolution folds these alongside PR #94's existing changes: - snapshotOpenPRs (our helper) and snapshotOpenPRBranches (main's) queried ListOpenPRs for separate purposes — count vs. dedup map. Merged into a single snapshotOpenPRs that returns both from one call; the duplicate main helper is deleted. - ensureGitOpsEngineFromSecret (our helper using RegisterGitOpsEngine to fix Gemini's cross-reconcile race) supersedes main's initRemediationGitOps (which still called SetGitOpsEngine and reintroduced the race). The main helper is deleted. - remediateIncident now uses main's (opened, charged) return names but the success-path value is `return result != nil, true`, not main's `return true, true`. The fix matters for the engine-level StrategyDryRun / StrategyReport path (distinct from policy.Spec. DryRun, which is handled earlier): ApplyPlan returns (nil, nil) there, and unconditional opened=true would re-introduce Codex P2 (phantom status.openPRs). - processIncidents keeps PR #94's budget check (openPRs vs. maxPRs up-front) alongside main's scope gate, using `processed` to drive the per-cycle budget. Two-counter accounting (processed, prsCreated) preserves accurate status.openPRs under all strategies. - Test file: kept main's scope-gate + per-policy dryRun suites intact; reattached our two Contexts (cap-exhausted and engine-level audit-mode). Renamed the second to make explicit it exercises the ZelyoConfig.spec.mode=audit path distinct from policy.Spec.DryRun. Verification: make lint → 0 issues; full envtest suite → 21 specs pass across two Describe blocks (14 from the merge + 7 inherited from main including target-policies, per-policy dryRun, and the pre-existing reconcile smoke test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 parents 9661ba4 + c659ac4 commit 087091f

4 files changed

Lines changed: 757 additions & 68 deletions

File tree

internal/controller/remediationpolicy_controller.go

Lines changed: 122 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -215,6 +215,24 @@ func (r *RemediationPolicyReconciler) processIncidents(
215215

216216
log.Info("Found open incidents", "count", len(incidents))
217217

218+
// Build target-policy allowlist. Empty spec.targetPolicies means "all
219+
// SecurityPolicies apply" — skip the scope filter entirely.
220+
//
221+
// Keyed by NamespacedName because SecurityPolicy is a namespaced CRD:
222+
// two tenants can legally register policies with the same name, and
223+
// matching on name alone would let an incident from team-b/baseline
224+
// satisfy team-a's target "baseline". The upstream validation loop
225+
// already resolves targetPolicies against policy.Namespace, so using
226+
// that namespace here keeps the runtime gate consistent with what
227+
// was actually validated.
228+
var targetSet map[types.NamespacedName]struct{}
229+
if len(policy.Spec.TargetPolicies) > 0 {
230+
targetSet = make(map[types.NamespacedName]struct{}, len(policy.Spec.TargetPolicies))
231+
for _, name := range policy.Spec.TargetPolicies {
232+
targetSet[types.NamespacedName{Name: name, Namespace: policy.Namespace}] = struct{}{}
233+
}
234+
}
235+
218236
// Determine severity threshold.
219237
severityFilter := policy.Spec.SeverityFilter
220238
if severityFilter == "" {
@@ -228,11 +246,12 @@ func (r *RemediationPolicyReconciler) processIncidents(
228246
maxPRs = 5
229247
}
230248

231-
// Snapshot the set of open Zelyo-generated PRs once. The result feeds
232-
// two concerns that both need the same data:
249+
// Snapshot open Zelyo-generated PRs once. Feeds two concerns:
233250
// - openPRs count → enforces the maxConcurrentPRs cap across reconciles
251+
// (the headline fix of this PR)
234252
// - existingBranches map → per-finding dedup in remediateIncident so we
235-
// never open a second PR against a branch that already has one open
253+
// never push a second commit/PR against a branch that's already open
254+
// (the fix originally added in #91)
236255
openPRs, existingBranches := r.snapshotOpenPRs(ctx, repoOwner, repoName)
237256
budget := maxPRs - openPRs
238257
if budget <= 0 {
@@ -241,23 +260,39 @@ func (r *RemediationPolicyReconciler) processIncidents(
241260
return 0, openPRs
242261
}
243262

244-
// Two counters: incidentsHandled drives the per-cycle budget (caps the
245-
// cost of LLM calls + apply attempts regardless of strategy), while
246-
// prsCreated only counts real PRs so status.openPRs stays accurate in
247-
// dry-run / report strategies where ApplyPlan returns nil.
248-
var incidentsHandled int32
263+
// Two counters:
264+
// - processed drives the per-cycle budget — ticks for every incident
265+
// that consumed an LLM plan generation, whether the outcome was a
266+
// real PR or a dryRun preview. This bounds BOTH token cost and
267+
// reconcile duration regardless of strategy.
268+
// - prsCreated only ticks when a real PR was opened (result != nil),
269+
// so status.openPRs stays accurate in audit / dryRun / report modes
270+
// where ApplyPlan returns nil.
271+
var processed int32
249272
for _, incident := range incidents {
250-
if incidentsHandled >= budget {
273+
if processed >= budget {
251274
log.Info("MaxConcurrentPRs budget reached this cycle",
252-
"limit", maxPRs, "openPRs", openPRs, "createdThisCycle", prsCreated)
275+
"limit", maxPRs, "openPRs", openPRs, "createdThisCycle", prsCreated,
276+
"dryRun", policy.Spec.DryRun)
253277
break
254278
}
255-
handled, prCreated := r.remediateIncident(ctx, policy, repo, incident,
279+
// Scope gate: when spec.targetPolicies is set, only remediate
280+
// incidents carrying at least one event from a listed
281+
// SecurityPolicy. Checked in the loop (not inside
282+
// remediateIncident) so filtered incidents are skipped without
283+
// paying the dedup cost and without ever calling
284+
// ResolveIncident on them — another RemediationPolicy may own
285+
// that scope. Not charged against the budget since no LLM call
286+
// is made.
287+
if targetSet != nil && !incidentMatchesTargets(incident, targetSet) {
288+
continue
289+
}
290+
opened, charged := r.remediateIncident(ctx, policy, repo, incident,
256291
minSev, repoOwner, repoName, existingBranches)
257-
if handled {
258-
incidentsHandled++
292+
if charged {
293+
processed++
259294
}
260-
if prCreated {
295+
if opened {
261296
prsCreated++
262297
}
263298
}
@@ -362,17 +397,21 @@ func (r *RemediationPolicyReconciler) snapshotOpenPRs(
362397
return int32(len(existing)), branchesByName
363398
}
364399

365-
// remediateIncident handles the full severity-check → dedup →
366-
// GeneratePlan → ApplyPlan → resolve flow for a single incident. Factored
367-
// out of processIncidents to keep each unit under the gocyclo threshold.
400+
// remediateIncident handles the full severity-check → dedup → GeneratePlan
401+
// → (dry-run preview | ApplyPlan) → resolve flow for a single incident.
402+
// Factored out of processIncidents to keep each unit under the gocyclo
403+
// threshold. The targetPolicies scope gate is applied by the caller
404+
// (processIncidents), not here.
368405
//
369-
// Returns two signals:
370-
// - handled: an LLM call + apply attempt took place — drives the per-cycle
371-
// budget so we cap cost regardless of strategy (dry-run LLM calls still
372-
// count). Fast-path skips (severity miss, dedup hit) do not set handled.
373-
// - prCreated: a real PR was opened (result != nil). Only set under
374-
// StrategyGitOpsPR on success. Drives status.openPRs so audit-mode
375-
// (StrategyDryRun / StrategyReport) does not report phantom PRs.
406+
// Returns two flags so the caller can drive independent counters:
407+
// - opened: a real PR was created (counts against status.RemediationsApplied
408+
// and status.OpenPRs). Only true when ApplyPlan returned a non-nil result
409+
// — covers the engine-level StrategyDryRun / StrategyReport case where
410+
// ApplyPlan returns (nil, nil) and we must NOT report a phantom PR.
411+
// - charged: this incident consumed an LLM plan generation (counts against
412+
// the per-cycle MaxConcurrentPRs budget — covers both real PRs and
413+
// dryRun previews, but NOT incidents skipped by severity or dedup since
414+
// no LLM call is made)
376415
func (r *RemediationPolicyReconciler) remediateIncident(
377416
ctx context.Context,
378417
policy *zelyov1alpha1.RemediationPolicy,
@@ -381,7 +420,7 @@ func (r *RemediationPolicyReconciler) remediateIncident(
381420
minSev int,
382421
repoOwner, repoName string,
383422
existingBranches map[string]string,
384-
) (handled, prCreated bool) {
423+
) (opened, charged bool) {
385424
log := logf.FromContext(ctx)
386425

387426
// Severity filter — fast path, no cost, no budget consumption.
@@ -398,11 +437,16 @@ func (r *RemediationPolicyReconciler) remediateIncident(
398437
branch := gitops.BranchName(finding.ResourceName, finding.ResourceNamespace, finding.Title)
399438
if existingURL, exists := existingBranches[branch]; exists {
400439
log.Info("Skipping remediation — open PR already exists",
401-
"incident", incident.ID, "branch", branch, "prURL", existingURL)
402-
// Still mark the incident resolved so we don't loop on it; a
403-
// future scan will regenerate the incident if the PR is closed
404-
// without merging and the finding remains.
405-
r.CorrelatorEngine.ResolveIncident(incident.ID)
440+
"incident", incident.ID, "branch", branch, "prURL", existingURL,
441+
"dryRun", policy.Spec.DryRun)
442+
// In a real reconcile, mark the incident resolved so we don't
443+
// loop on it; a future scan will regenerate the incident if the
444+
// PR is closed without merging and the finding remains. In
445+
// dryRun we must NOT touch correlator state — leave it for the
446+
// next non-dryRun reconcile.
447+
if !policy.Spec.DryRun {
448+
r.CorrelatorEngine.ResolveIncident(incident.ID)
449+
}
406450
return false, false
407451
}
408452

@@ -413,8 +457,8 @@ func (r *RemediationPolicyReconciler) remediateIncident(
413457
"resource", fmt.Sprintf("%s/%s", incident.Namespace, incident.Resource))
414458
r.Recorder.Event(policy, corev1.EventTypeWarning, zelyov1alpha1.EventReasonReconcileError,
415459
fmt.Sprintf("LLM plan generation failed for incident %s: %v", incident.ID, err))
416-
// LLM call was attempted (and cost incurred) — counts against budget.
417-
return true, false
460+
// Still counts against the budget — the LLM call was made.
461+
return false, true
418462
}
419463

420464
log.Info("Generated remediation plan",
@@ -423,13 +467,24 @@ func (r *RemediationPolicyReconciler) remediateIncident(
423467
"riskScore", plan.RiskScore,
424468
"dryRun", policy.Spec.DryRun)
425469

470+
// spec.dryRun is a per-policy preview switch: generate the plan so
471+
// operators can review fix count / risk, but do not submit a PR and
472+
// do not resolve the incident — a later reconcile with dryRun=false
473+
// should still pick it up and remediate.
474+
if policy.Spec.DryRun {
475+
r.Recorder.Event(policy, corev1.EventTypeNormal, "DryRunPreview",
476+
fmt.Sprintf("Dry-run: would remediate incident %s (fixes=%d, risk=%d) — no PR opened",
477+
incident.ID, len(plan.Fixes), plan.RiskScore))
478+
return false, true
479+
}
480+
426481
result, err := r.RemediationEngine.ApplyPlan(ctx, plan, repoOwner, repoName)
427482
if err != nil {
428483
log.Error(err, "Failed to apply remediation plan",
429484
"incident", incident.ID)
430485
r.Recorder.Event(policy, corev1.EventTypeWarning, zelyov1alpha1.EventReasonReconcileError,
431486
fmt.Sprintf("Failed to apply fix for incident %s: %v", incident.ID, err))
432-
return true, false
487+
return false, true
433488
}
434489

435490
if result != nil {
@@ -443,7 +498,40 @@ func (r *RemediationPolicyReconciler) remediateIncident(
443498
}
444499

445500
r.CorrelatorEngine.ResolveIncident(incident.ID)
446-
return true, result != nil
501+
// result==nil happens when the engine is in StrategyDryRun / StrategyReport
502+
// (set by ZelyoConfig.spec.mode=audit, distinct from policy.Spec.DryRun).
503+
// In that case: LLM was called (charged=true) but no PR was opened
504+
// (opened=false). Guards Codex P2 — previously the success path
505+
// returned (true, true) unconditionally and inflated status.openPRs
506+
// under audit mode.
507+
return result != nil, true
508+
}
509+
510+
// incidentMatchesTargets reports whether the incident carries at least one
511+
// event from a SecurityPolicy in the given allowlist. An incident may be a
512+
// correlation of events from multiple SecurityPolicies on the same
513+
// resource; if any one of them is targeted, the RemediationPolicy applies.
514+
//
515+
// Matching is on the (name, namespace) pair because SecurityPolicy is a
516+
// namespaced CRD. An event missing either half is treated as unmatched —
517+
// anomaly/deployment events legitimately have both blank, and a
518+
// SecurityPolicy-originated event with an empty namespace is malformed
519+
// (see securitypolicy_controller.ingestFindingsToCorrelator) and must
520+
// not be allowed to satisfy the gate by coincidence.
521+
func incidentMatchesTargets(incident *correlator.Incident, targets map[types.NamespacedName]struct{}) bool {
522+
if incident == nil || len(targets) == 0 {
523+
return false
524+
}
525+
for _, ev := range incident.Events {
526+
if ev == nil || ev.SecurityPolicy == "" || ev.SecurityPolicyNamespace == "" {
527+
continue
528+
}
529+
key := types.NamespacedName{Name: ev.SecurityPolicy, Namespace: ev.SecurityPolicyNamespace}
530+
if _, ok := targets[key]; ok {
531+
return true
532+
}
533+
}
534+
return false
447535
}
448536

449537
// incidentToFinding converts a correlator incident to a scanner finding for the

0 commit comments

Comments
 (0)