Merged
2 changes: 1 addition & 1 deletion AGENTS.md
@@ -186,7 +186,7 @@ Setting `ZelyoConfig.spec.mode: protect` only flips the remediation engine's str
Only then does the full loop fire:

1. Correlator emits an incident
-2. `RemediationPolicy` controller filters open incidents by `spec.severityFilter` and caps PR submissions per reconcile cycle via `spec.maxConcurrentPRs`
+2. `RemediationPolicy` controller filters open incidents by `spec.severityFilter` and caps the number of open PRs via `spec.maxConcurrentPRs` — already-open Zelyo PRs on the target repo count against the budget, so new PRs only open when existing ones merge or close
3. Remediation engine asks the LLM for a structured JSON fix plan and scores the risk
4. GitHub engine creates a branch, commits the fix, and opens a PR (skipped globally when `ZelyoConfig.spec.mode: audit`, which leaves the engine in `dry-run`)
5. Human team reviews and merges the PR
2 changes: 1 addition & 1 deletion README.md
@@ -76,7 +76,7 @@ graph LR
| Mode | Behavior |
| ----------------------------- | --------------------------------------------------------------------------------------- |
| **Audit** *(default)* | Detects, correlates, and alerts. The remediation engine runs in `dry-run` — fix plans are logged but no PRs are opened. |
-| **Protect** | Switches the remediation engine to the `gitops-pr` strategy. PRs are opened only when at least one `RemediationPolicy` CR points at a configured `GitOpsRepository`. The policy's `severityFilter` decides which incidents qualify, and `maxConcurrentPRs` caps submissions per reconcile cycle. |
+| **Protect** | Switches the remediation engine to the `gitops-pr` strategy. PRs are opened only when at least one `RemediationPolicy` CR points at a configured `GitOpsRepository`. The policy's `severityFilter` decides which incidents qualify, and `maxConcurrentPRs` caps the number of open Zelyo PRs on the target repo. |

> **Note:** `ZelyoConfig.spec.mode: protect` by itself does not produce any PRs — it only authorizes the pipeline. See [Enable GitOps Remediation](docs/quickstart.md#enable-gitops-remediation) for the full `GitOpsRepository` + `RemediationPolicy` setup.

6 changes: 5 additions & 1 deletion api/v1alpha1/remediationpolicy_types.go
@@ -41,7 +41,11 @@ type RemediationPolicySpec struct {
// +optional
DryRun bool `json:"dryRun,omitempty"`

-// maxConcurrentPRs limits the number of open PRs at any time.
+// maxConcurrentPRs limits the number of open Zelyo-generated PRs on
+// the target repo at any time. Already-open PRs count against the
+// budget, so new PRs only open when existing ones merge or close.
+// Current count is surfaced on status.openPRs. Multiple
+// RemediationPolicies targeting the same repo share this budget.
// +kubebuilder:validation:Minimum=1
// +kubebuilder:default=5
// +optional
8 changes: 6 additions & 2 deletions config/crd/bases/zelyo.ai_remediationpolicies.yaml
@@ -73,8 +73,12 @@ spec:
type: string
maxConcurrentPRs:
default: 5
-description: maxConcurrentPRs limits the number of open PRs at any
-time.
+description: |-
+maxConcurrentPRs limits the number of open Zelyo-generated PRs on
+the target repo at any time. Already-open PRs count against the
+budget, so new PRs only open when existing ones merge or close.
+Current count is surfaced on status.openPRs. Multiple
+RemediationPolicies targeting the same repo share this budget.
format: int32
minimum: 1
type: integer
@@ -73,8 +73,12 @@ spec:
type: string
maxConcurrentPRs:
default: 5
-description: maxConcurrentPRs limits the number of open PRs at any
-time.
+description: |-
+maxConcurrentPRs limits the number of open Zelyo-generated PRs on
+the target repo at any time. Already-open PRs count against the
+budget, so new PRs only open when existing ones merge or close.
+Current count is surfaced on status.openPRs. Multiple
+RemediationPolicies targeting the same repo share this budget.
format: int32
minimum: 1
type: integer
2 changes: 1 addition & 1 deletion docs/gitops-onboarding.md
@@ -181,7 +181,7 @@ metadata:
spec:
gitOpsRepository: production-manifests # must match a GitOpsRepository CR
severityFilter: high # critical | high | medium | low
-maxConcurrentPRs: 3 # cap per reconcile cycle (not a global open-PR count)
+maxConcurrentPRs: 3 # cap on open Zelyo PRs in the target repo; surfaced on status.openPRs
prTemplate:
titlePrefix: "[Zelyo Operator]"
labels: ["auto-fix", "security"]
2 changes: 1 addition & 1 deletion docs/index.md
@@ -116,7 +116,7 @@ Aggregate views, cross-cluster correlation, and centralized policy management ac
| Mode | When | Behavior |
|---|---|---|
| **:material-magnify: Audit Mode** (default) | `ZelyoConfig.spec.mode: audit` | Detects, diagnoses, and sends alerts. The remediation engine runs in `dry-run` — fix plans are logged but no PRs are opened. Zero cluster modifications. |
-| **:material-shield-check: Protect Mode** | `ZelyoConfig.spec.mode: protect` **and** at least one `RemediationPolicy` targeting a `GitOpsRepository` | Switches the remediation engine to the `gitops-pr` strategy. The `RemediationPolicy` controller drives PR creation — `severityFilter` gates which incidents qualify, `maxConcurrentPRs` caps submissions per reconcile cycle. |
+| **:material-shield-check: Protect Mode** | `ZelyoConfig.spec.mode: protect` **and** at least one `RemediationPolicy` targeting a `GitOpsRepository` | Switches the remediation engine to the `gitops-pr` strategy. The `RemediationPolicy` controller drives PR creation — `severityFilter` gates which incidents qualify, `maxConcurrentPRs` caps the number of open Zelyo PRs on the target repo. |

!!! note
`ZelyoConfig.spec.mode: protect` only flips the engine strategy from `dry-run` to `gitops-pr`. **No PRs are opened until you also create at least one `RemediationPolicy` that points at a `GitOpsRepository`.** See [GitOps Onboarding](gitops-onboarding.md) for the full setup.
4 changes: 2 additions & 2 deletions docs/quickstart.md
@@ -325,7 +325,7 @@ All three pieces are required — skipping any one of them means no PRs:
| --- | --- |
| `ZelyoConfig.spec.mode: protect` | Flips the remediation engine from `dry-run` to `gitops-pr`. Without this, plans are logged but never submitted. |
| `GitOpsRepository` | Tells Zelyo which repo, branch, and paths to write fixes into, and provides Git auth. |
-| `RemediationPolicy` | The only controller that calls `GeneratePlan` + `ApplyPlan`. `severityFilter` gates which incidents qualify; `maxConcurrentPRs` caps PR submissions per reconcile cycle (not a global limit on open PRs). |
+| `RemediationPolicy` | The only controller that calls `GeneratePlan` + `ApplyPlan`. `severityFilter` gates which incidents qualify; `maxConcurrentPRs` caps the number of open Zelyo PRs on the target repo — already-open PRs count against the budget, so new PRs only open when existing ones merge or close. The current count surfaces on `status.openPRs`. |

**0. Switch `ZelyoConfig` to Protect mode** (`ZelyoConfig` is cluster-scoped — no `-n` flag):

@@ -373,7 +373,7 @@ spec:
labels: ["security", "automated"]
branchPrefix: "zelyo/fix-"
severityFilter: high
-maxConcurrentPRs: 3 # per reconcile cycle
+maxConcurrentPRs: 3 # caps total open Zelyo PRs on the target repo
dryRun: false
autoMerge: false
```
145 changes: 121 additions & 24 deletions internal/controller/remediationpolicy_controller.go
@@ -146,9 +146,9 @@ func (r *RemediationPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Re
}

// ── Step 3: Query correlator for open incidents ──
-var prsCreated int32
+var prsCreated, openPRs int32
if r.CorrelatorEngine != nil && r.RemediationEngine != nil {
-prsCreated = r.processIncidents(ctx, policy, repo)
+prsCreated, openPRs = r.processIncidents(ctx, policy, repo)
} else {
log.Info("Correlator or remediation engine not configured — skipping active remediation")
}
@@ -158,6 +158,10 @@ func (r *RemediationPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Re
policy.Status.Phase = zelyov1alpha1.PhaseActive
policy.Status.LastRun = &now
policy.Status.RemediationsApplied += prsCreated
+// OpenPRs reflects the total count of open Zelyo-generated PRs in the
+// target repo after this cycle: already-open PRs observed at the start
+// plus any this cycle opened.
+policy.Status.OpenPRs = openPRs + prsCreated
P2: Compute status.openPRs from real PR creations

status.openPRs is derived as openPRs + prsCreated, but prsCreated is incremented for every successfully processed incident, including dry-run/report strategies where ApplyPlan returns nil and no PR is created. In audit mode this can report non-existent “open PRs,” which violates the new field contract and can mislead operators or any automation reading status.
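
One way to address the point above is to count handled incidents and real PR creations separately. The following is a minimal sketch, not the controller's actual code: `applyResult`, `cycleCounts`, and `statusOpenPRs` are hypothetical names, and it assumes the apply step can report whether a PR was really opened (here, via a non-empty URL).

```go
package main

import "fmt"

// applyResult is a hypothetical stand-in for what the apply step reports.
// PRURL is empty when the strategy (dry-run, report) opened no PR.
type applyResult struct {
	PRURL string
}

// cycleCounts tracks handled incidents and real PR creations separately.
type cycleCounts struct {
	incidentsHandled int32
	prsOpened        int32
}

// record updates the counters for one successfully processed incident.
// A nil result or an empty URL means no PR exists for this incident.
func (c *cycleCounts) record(res *applyResult) {
	c.incidentsHandled++
	if res != nil && res.PRURL != "" {
		c.prsOpened++
	}
}

// statusOpenPRs derives the status value from the PRs observed at cycle
// start plus the PRs that were really opened this cycle.
func statusOpenPRs(openAtStart int32, c cycleCounts) int32 {
	return openAtStart + c.prsOpened
}

func main() {
	var c cycleCounts
	c.record(nil) // dry-run: plan logged, no PR
	c.record(&applyResult{PRURL: "https://example.invalid/pr/1"})
	fmt.Println(c.incidentsHandled, c.prsOpened, statusOpenPRs(2, c))
}
```

With this split, a dry-run cycle still advances the handled-incident counter but leaves the open-PR figure untouched.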


policy.Status.ObservedGeneration = policy.Generation
conditions.MarkTrue(&policy.Status.Conditions, zelyov1alpha1.ConditionReady,
zelyov1alpha1.ReasonReconcileSuccess,
@@ -178,17 +182,24 @@ func (r *RemediationPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Re

// processIncidents queries the correlator for open incidents, filters by severity,
// generates remediation plans, and optionally submits PRs.
+//
+// Returns (prsCreated, openPRs) where openPRs is the number of Zelyo-generated
+// PRs already open on the target repo *at the start of this cycle* — i.e.
+// before any PR this cycle may have created. Callers combine them to derive
+// status.openPRs.
func (r *RemediationPolicyReconciler) processIncidents(
ctx context.Context,
policy *zelyov1alpha1.RemediationPolicy,
repo *zelyov1alpha1.GitOpsRepository,
-) int32 {
+) (prsCreated, openPRs int32) {
log := logf.FromContext(ctx)

incidents := r.CorrelatorEngine.GetOpenIncidents()
if len(incidents) == 0 {
log.Info("No open incidents found — nothing to remediate")
-return 0
+// Even with no incidents, surface the current open-PR count to
+// status so users can see it via `kubectl get remediationpolicy`.
+return 0, r.countOpenPRs(ctx, policy, repo)
}

log.Info("Found open incidents", "count", len(incidents))
@@ -201,22 +212,7 @@ func (r *RemediationPolicyReconciler) processIncidents(
minSev := severityOrder[severityFilter]

// ── Step 3: Initialize GitOps Engine from Secret ──
-if repo.Spec.AuthSecret != "" {
-secret := &corev1.Secret{}
-secretKey := types.NamespacedName{Name: repo.Spec.AuthSecret, Namespace: repo.Namespace}
-if err := r.Get(ctx, secretKey, secret); err == nil {
-token := string(secret.Data["token"])
-if token == "" {
-token = string(secret.Data["api-key"])
-}
-if token != "" {
-ghClient := github.NewPATClient(token, "")
-ghEngine := github.NewEngine(ghClient, log.WithName("github-engine"))
-r.RemediationEngine.SetGitOpsEngine(ghEngine)
-log.Info("Successfully initialized GitOps engine for remediation", "repo", repo.Name)
-}
-}
-}
+r.ensureGitOpsEngineFromSecret(ctx, repo)

// Respect MaxConcurrentPRs limit.
maxPRs := policy.Spec.MaxConcurrentPRs
@@ -227,10 +223,21 @@ func (r *RemediationPolicyReconciler) processIncidents(
// Parse repo owner/name from URL for PR submission.
repoOwner, repoName := parseRepoURL(repo.Spec.URL)

-var prsCreated int32
+// Count already-open Zelyo-generated PRs on the target repo so the
+// MaxConcurrentPRs cap is honored across reconciles, not just within
+// a single cycle.
+openPRs = r.countOpenPRsForProvider(ctx, policy, repoOwner, repoName)
+budget := maxPRs - openPRs
+if budget <= 0 {
+log.Info("MaxConcurrentPRs budget exhausted by already-open PRs — skipping",
+"limit", maxPRs, "openPRs", openPRs)
+return 0, openPRs
+}

for _, incident := range incidents {
-if prsCreated >= maxPRs {
-log.Info("MaxConcurrentPRs limit reached", "limit", maxPRs)
+if prsCreated >= budget {
+log.Info("MaxConcurrentPRs budget reached this cycle",
+"limit", maxPRs, "openPRs", openPRs, "createdThisCycle", prsCreated)
break
}

@@ -287,7 +294,97 @@ func (r *RemediationPolicyReconciler) processIncidents(
prsCreated++
}

-return prsCreated
+return prsCreated, openPRs
}
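
The budget arithmetic in `processIncidents` can be illustrated in isolation. This is a simplified sketch under stated assumptions: `remainingBudget` and `planCycle` are hypothetical helpers that do not exist in the controller, and the severity filter is reduced to a count of qualifying incidents.

```go
package main

import "fmt"

// remainingBudget returns how many new PRs may still be opened, given the
// configured cap (maxConcurrentPRs) and the PRs already open. It never
// returns a negative number.
func remainingBudget(maxPRs, openPRs int32) int32 {
	if openPRs >= maxPRs {
		return 0
	}
	return maxPRs - openPRs
}

// planCycle returns how many of the qualifying incidents would get a PR
// this cycle and the resulting open-PR total.
func planCycle(maxPRs, openPRs, qualifying int32) (created, openAfter int32) {
	created = remainingBudget(maxPRs, openPRs)
	if qualifying < created {
		created = qualifying
	}
	return created, openPRs + created
}

func main() {
	// Cap 5, 3 already open, 4 qualifying incidents: only 2 new PRs fit.
	created, openAfter := planCycle(5, 3, 4)
	fmt.Println(created, openAfter)

	// Cap 3, 3 already open: the budget is exhausted, nothing opens.
	created, openAfter = planCycle(3, 3, 4)
	fmt.Println(created, openAfter)
}
```

The second call mirrors the early return in the hunk above: when already-open PRs consume the whole cap, the cycle creates nothing and reports the observed count unchanged.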

// ensureGitOpsEngineFromSecret reads the repo's AuthSecret (if any) and,
// when a usable PAT/app token is present, constructs a GitHub engine and
// registers it on the remediation engine. The function is deliberately
// permissive: a missing secret, unreadable secret, or empty token
// silently leaves whatever GitOps engine is already configured in place
// (including injected test engines) — there is no visible error
// condition because the surrounding reconciler handles missing creds by
// degrading gracefully to no-op remediation.
func (r *RemediationPolicyReconciler) ensureGitOpsEngineFromSecret(
ctx context.Context,
repo *zelyov1alpha1.GitOpsRepository,
) {
if repo.Spec.AuthSecret == "" {
return
}
log := logf.FromContext(ctx)
secret := &corev1.Secret{}
secretKey := types.NamespacedName{Name: repo.Spec.AuthSecret, Namespace: repo.Namespace}
if err := r.Get(ctx, secretKey, secret); err != nil {
return
}
token := string(secret.Data["token"])
if token == "" {
token = string(secret.Data["api-key"])
}
if token == "" {
return
}
ghClient := github.NewPATClient(token, "")
ghEngine := github.NewEngine(ghClient, log.WithName("github-engine"))
r.RemediationEngine.SetGitOpsEngine(ghEngine)
log.Info("Successfully initialized GitOps engine for remediation", "repo", repo.Name)
Contributor

high

The RemediationEngine is a shared component across all reconciliations. Calling SetGitOpsEngine updates the global fallback engine, which creates a race condition when multiple RemediationPolicy resources targeting different repositories are reconciled concurrently. A policy might end up using a GitOps engine initialized with credentials from a different policy. Instead of setting the global fallback, use RegisterGitOpsEngine with a repository-specific key to ensure isolation.

	ghClient := github.NewPATClient(token, "")
	ghEngine := github.NewEngine(ghClient, log.WithName("github-engine"))
	owner, name := parseRepoURL(repo.Spec.URL)
	if owner != "" && name != "" {
		r.RemediationEngine.RegisterGitOpsEngine(owner+"/"+name, ghEngine)
	}
	log.Info("Successfully initialized GitOps engine for remediation", "repo", repo.Name)

}
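
The isolation the reviewer asks for can be sketched as a registry keyed by `owner/repo` and guarded by a mutex. This is an illustration only: `engineRegistry`, `gitOpsEngine`, and `fakeEngine` are hypothetical types, and the real `RegisterGitOpsEngine` and `GitOpsEngineForRepo` methods named in this PR may be implemented differently.

```go
package main

import (
	"fmt"
	"sync"
)

// gitOpsEngine is a hypothetical minimal interface for this sketch.
type gitOpsEngine interface {
	Name() string
}

type fakeEngine struct{ name string }

func (f fakeEngine) Name() string { return f.name }

// engineRegistry maps "owner/repo" keys to engines. Concurrent reconciles
// register under their own key, so they never overwrite each other.
type engineRegistry struct {
	mu       sync.RWMutex
	engines  map[string]gitOpsEngine
	fallback gitOpsEngine
}

func newEngineRegistry(fallback gitOpsEngine) *engineRegistry {
	return &engineRegistry{engines: make(map[string]gitOpsEngine), fallback: fallback}
}

// Register stores the engine for one repository key.
func (r *engineRegistry) Register(key string, e gitOpsEngine) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.engines[key] = e
}

// ForRepo returns the repo-specific engine, or the fallback if none exists.
func (r *engineRegistry) ForRepo(owner, repo string) gitOpsEngine {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if e, ok := r.engines[owner+"/"+repo]; ok {
		return e
	}
	return r.fallback
}

func main() {
	reg := newEngineRegistry(fakeEngine{name: "fallback"})
	var wg sync.WaitGroup
	for _, key := range []string{"acme/prod", "acme/staging"} {
		wg.Add(1)
		go func(k string) { // simulates two policies reconciling concurrently
			defer wg.Done()
			reg.Register(k, fakeEngine{name: k})
		}(key)
	}
	wg.Wait()
	fmt.Println(reg.ForRepo("acme", "prod").Name(), reg.ForRepo("acme", "other").Name())
}
```

Because each policy writes only its own key, the credentials one reconcile registers cannot leak into a reconcile targeting a different repository.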

// countOpenPRs resolves the repo owner/name from the GitOpsRepository spec
// and delegates to countOpenPRsForProvider.
func (r *RemediationPolicyReconciler) countOpenPRs(
ctx context.Context,
policy *zelyov1alpha1.RemediationPolicy,
repo *zelyov1alpha1.GitOpsRepository,
) int32 {
repoOwner, repoName := parseRepoURL(repo.Spec.URL)
return r.countOpenPRsForProvider(ctx, policy, repoOwner, repoName)
}

// countOpenPRsForProvider queries the configured GitOps provider for the
// number of currently-open Zelyo-generated PRs on owner/repo. The provider's
// ListOpenPRs implementation is already expected to filter out non-Zelyo
// PRs (by branch-prefix convention or labels).
//
// Errors are logged and treated as zero: a transient provider failure
// must not permanently block remediation. Callers still respect the
// per-cycle loop bound, so the worst case is a temporarily-inflated
// per-cycle budget during provider outages.
//
// When multiple RemediationPolicies target the same repo, they share the
// open-PR count (the cap is applied per repo, not per policy). Per-policy
// scoping requires PRTemplate.BranchPrefix to be both configurable and
// actually propagated into the branch name — that wiring is not yet in
// place (BranchName hardcodes its prefix), so adding a prefix filter here
// would silently match zero PRs under the default config and re-break
// the cap we are fixing.
func (r *RemediationPolicyReconciler) countOpenPRsForProvider(
ctx context.Context,
_ *zelyov1alpha1.RemediationPolicy,
owner, repo string,
) int32 {
log := logf.FromContext(ctx)

if owner == "" || repo == "" {
return 0
}
if r.RemediationEngine == nil {
return 0
}
ge := r.RemediationEngine.GitOpsEngineForRepo(owner, repo)
if ge == nil {
return 0
}

existing, err := ge.ListOpenPRs(ctx, owner, repo)
if err != nil {
log.Error(err, "Failed to list open PRs — treating as zero for this cycle",
"owner", owner, "repo", repo)
return 0
}
Contributor

medium

Treating provider errors as zero open PRs can lead to exceeding the maxConcurrentPRs cap during API outages or rate-limiting events. Since the primary goal of this PR is blast-radius control, consider a more conservative approach, such as skipping remediation or returning an error to trigger a requeue, when the current state of open PRs cannot be reliably determined.

Contributor Author

Appreciate the call-out — considered this and kept the soft-fail intentionally. Three reasons:

  1. Blast radius is already bounded. Even when snapshotOpenPRs returns zero during a GitHub outage, the per-cycle loop bound (incidentsHandled >= budget where budget = maxConcurrentPRs) still caps at most maxConcurrentPRs new PRs per 5-minute reconcile. Worst case is one cycle worth of duplicate PRs, which the next successful reconcile dedups via existingBranches — self-healing.
  2. Requeue-on-error would be worse in the common failure mode. A GitHub-wide outage or rate-limit event would trigger controller-runtime backoff across every RemediationPolicy in the cluster, churning metrics/events and still not opening PRs. Soft-fail degrades to "no remediation progress" instead of "reconcile storm with no remediation progress".
  3. The failure is observable. log.Error on the ListOpenPRs failure is structured and hits the operator log; ReconcileTotal{kind="remediationpolicy",result="success"} still increments (the reconcile itself succeeded, just with degraded data). Operators can alert on the combination of a GitHub error-rate signal plus flat zelyo_reconcile_openprs gauge.

Happy to add an explicit SnapshotOpenPRsFailed condition + a countOpenPRsError counter so the degraded state surfaces on the CR status and in Prometheus without changing semantics — that's probably the right middle ground. Opening a follow-up issue for it.

//nolint:gosec // len bounded by GitHub API page size (100).
return int32(len(existing))
Contributor

medium

The ListOpenPRs call appears to be limited by the provider's page size (e.g., 100), as noted in the comment on line 386. If a repository has more open Zelyo PRs than this limit, the openPRs count will be inaccurate, potentially causing the controller to exceed the maxConcurrentPRs cap. It is recommended to implement pagination within the GitOps engine's ListOpenPRs implementation to ensure a complete count.

}
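
The pagination the reviewer recommends can be sketched independently of any provider SDK. The example below assumes a hypothetical `pageFetcher` callback that returns one page of items and the next page number (0 when there are no more pages); the actual `ListOpenPRs` signature in this PR takes different arguments.

```go
package main

import "fmt"

// pageFetcher is a hypothetical callback: it returns one page of PR
// identifiers and the next page number, or 0 when no pages remain.
type pageFetcher func(page int) (items []string, nextPage int, err error)

// countAllOpenPRs walks every page and sums the items, so the count is not
// truncated at a single page.
func countAllOpenPRs(fetch pageFetcher) (int, error) {
	total := 0
	page := 1
	for {
		items, next, err := fetch(page)
		if err != nil {
			return 0, err
		}
		total += len(items)
		if next <= page { // 0 or a non-advancing page number ends the walk
			return total, nil
		}
		page = next
	}
}

// fakePages simulates a provider holding `total` PRs served in fixed pages.
func fakePages(total, pageSize int) pageFetcher {
	return func(page int) ([]string, int, error) {
		start := (page - 1) * pageSize
		if start >= total {
			return nil, 0, nil
		}
		end := start + pageSize
		if end > total {
			end = total
		}
		next := 0
		if end < total {
			next = page + 1
		}
		return make([]string, end-start), next, nil
	}
}

func main() {
	n, err := countAllOpenPRs(fakePages(250, 100))
	fmt.Println(n, err)
}
```

With a full count, the result may exceed a single page, so any narrowing conversion such as the `int32` cast above would need its own bound check.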

// incidentToFinding converts a correlator incident to a scanner finding for the