Skip to content

Commit d32a0ee

Browse files
mayurkrclaude
andauthored
fix: audit wave 1 — dashboard reliability, controller correctness, remediation safety, concurrency (#87)
* fix(dashboard): use server-lifecycle context for background goroutines Preset propose spawns a demoAdvancePreset goroutine that outlives the HTTP request. The previous code used context.Background(), which meant the goroutine would continue mutating shared state (preset store, remediation store) after server shutdown — introducing a shutdown race. Switch to a server-lifecycle context published by Start() via a mutex-guarded field. Handlers that spawn background work derive from it via backgroundContext() so goroutines observe server shutdown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dashboard): close audit gaps in compliance + SSE + preset flow Consolidated dashboard-layer fixes surfaced during the audit pass: - GitOps badge no longer lies. ConfigStatus was seeded with a demo default (GitOpsConfigured: true, GitOpsRepo: "zelyo-ai/platform-gitops") and SetConfigStatus was never wired, so the Compliance page claimed GitOps was connected even when no GitOpsRepository existed. Derive the connected/repo fields live from GitOpsRepositoryList at request time and drop the demo default. - SSE send-on-closed-channel race. handleSSE's defer closed the subscriber channel outside the map-mutation lock while Broadcast sent on it under RLock — a classic send-on-closed panic window. Delete from the map under Lock and let GC reap the channel. - SSE stream poisoning on unmarshalable event.Data. An ignored json.Marshal error emitted "data: null", breaking every subscriber's stream. Skip the event and log at V(1) instead. - Start() no longer returns before Shutdown finishes. The goroutine-only Shutdown pattern let ListenAndServe return http.ErrServerClosed immediately while active handlers were still draining; join the two via a buffered error channel so the manager sees runnable exit only after Shutdown completes. - backgroundContext() fallback now returns a cancelled ctx instead of context.Background() so handlers invoked before Start (test paths) can't silently re-introduce the goroutine leak. - Preset state race fixed. handlePresetPropose wrote state=Proposing, spawned demoAdvancePreset, then wrote state=PendingMerge — racing with the goroutine's later merged→enabled writes. Complete both synchronous transitions before spawning the goroutine, and thread the GitOps repo slug through so the goroutine's Emit call doesn't re-read shared state. - Preset ID validation. Four preset endpoints now reject IDs that don't match the catalog shape (lowercase, digits, dashes, ≤64 chars) before they reach FindPreset or the remediation store. - /explain body cap. handleExplain now wraps r.Body in http.MaxBytesReader(16 KiB) so a hostile client can't make the handler allocate unbounded memory while decoding the request. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(controller): ScanReport GenerateName, target validation, LLM redaction - ClusterScan controller now creates ScanReports via GenerateName rather than a composed "<scan>-<unix-timestamp>" Name. The previous form violated CLAUDE.md (which mandates GenerateName for ScanReports) and could exceed the 63-char DNS-1123 label limit on long scan names. Added scanReportBasename to bound the prefix at 56 chars so the API server's 5-char suffix always fits. - truncateRunes replaces unsafe f.Title[:min(...)] byte slicing in the Finding ID composition. Multi-byte UTF-8 titles (any CJK/emoji) were being chopped mid-codepoint, producing invalid-UTF-8 ID strings. - SecurityPolicy controller adds the RBAC markers that match its runtime behavior: it lists NotificationChannelList and reads the referenced Slack webhook Secret. Without these markers the operator silently fails notification delivery with RBAC-forbidden errors at runtime. - RemediationPolicy controller no longer swallows non-NotFound errors when validating TargetPolicies. Missing targets now mark the policy Ready=False with Reason=ReconcileFailed, set Phase=Error, and requeue with backoff instead of ticking forward into active remediation. - ZelyoConfig controller redacts the LLM probe error before putting it into the Event message and Condition. *llm.APIError.Error() includes the raw provider response body which, on 401/403, often echoes submitted header prefixes or request fragments — visible to anyone with namespace-scoped get-events RBAC. Full detail still goes to the operator log via log.Error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(remediation,github): validate LLM fix plans and escape PR file paths - safeRepoPath in the GitHub engine now validates and URL-escapes every repository-relative path before embedding it in a GitHub contents API URL. GetFile/createOrUpdateFile/deleteFile previously concatenated the path directly, so an LLM-emitted "../secret.yaml" or "%2e%2e/passwd" could overwrite or delete files outside the intended tree. We reject absolute paths, backslashes, and any ../.. segment, and url.PathEscape each remaining segment so spaces/percent-encodings don't corrupt the URL. branch and ref query args are now url.QueryEscape'd too. - extractFixes no longer trusts the LLM response. The previous fallback wrapped any unstructured LLM prose as a single "patch" fix and committed it — exactly the kind of arbitrary-content-in-a-PR outcome the auto-remediation design is supposed to prevent. The new extractFixes rejects any response that doesn't conform to the JSON schema, validates each fix's operation against an allowlist (create/update/delete — no silent "update" default for unknown ops), validates the file path against the same safety rules the GitHub engine enforces, and bounds patch/description sizes. - Added test coverage for the three regressions this closes: unstructured responses must be rejected (0 fixes), path traversal in LLM-emitted paths must be filtered out, and unknown operations must be dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(concurrency): deep-copy shared state + move callback outside lock - correlator.Engine.Ingest no longer holds the engine mutex across onIncident callback invocation. A callback that calls back into the engine (ResolveIncident, GetOpenIncidents) would deadlock waiting for the Ingest-held lock. Snapshot the incident to notify, release the lock, then fire the callback with a copy. - correlator.Engine.Ingest and GetOpenIncidents now return *Incident copies. The previous code returned a pointer into e.incidents that a concurrent Ingest would then mutate via `append(incident.Events, ...)` — producing a data race any caller ranging over Events could hit. New copyIncident clones the Incident struct and its Events slice header (shielding callers from future appends while keeping the *Event pointers shared, which are treated as immutable post-ingest). - anomaly.Detector.GetBaseline now deep-copies Baseline.Values. The shallow struct copy aliased the slice backing array with the engine's own, so a caller ranging over Values raced with Observe() appending. - threat.Aggregator.LookupImage stores and returns ImageThreat copies instead of aliasing one pointer across the cache and every caller. cloneImageThreat handles the CVEs slice too. None of these were hypothetical — each would trip `-race` under realistic load. No behavior change for single-threaded callers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(lint): rename url locals and fix cancelled/canceled misspell Renaming github/engine.go local url -> reqURL removes the importShadow gocritic warnings that surfaced once net/url was imported for safeRepoPath. No behavior change; mechanical rename of six locals and one parameter. Comment spellings in dashboard/server.go adjusted to "canceled" to match stdlib (context.Canceled) and the misspell linter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address codex review — close post-unlock race, fail on empty plans Two P1 findings from codex review of PR #87: 1. correlator.Engine.Ingest still had a post-unlock race in the callback dispatch path. notifySnapshot previously captured the raw internal *Incident pointer under the lock, but the copyIncident call happened AFTER e.mu.Unlock() — so a concurrent Ingest could append to incident.Events between the unlock and the copy, producing the exact data race the fix was meant to eliminate. Snapshot the incident via copyIncident WHILE holding the lock, then pass the immutable snapshot to the callback after unlock. Added TestEngine_ConcurrentIngestIsRaceFree which drives 16×50 concurrent Ingests against the same (ns, resource) plus a parallel GetOpenIncidents reader — the old code tripped -race immediately; the new code is clean. 2. remediation.Engine.GeneratePlan no longer returns a successful plan with zero fixes. When every LLM-proposed fix fails validation (path traversal, unknown operation, size cap, or the response is unstructured prose), extractFixes correctly returns zero fixes — but GeneratePlan was still returning (plan, nil). In dry-run mode ApplyPlan then returns nil error too, and processIncidents resolves the incident anyway: malformed/unsafe LLM output was silently closing incidents with no remediation applied. GeneratePlan now returns an error in the zero-fix case so the caller keeps the incident open for retry/manual triage. Added TestGeneratePlan_ZeroValidatedFixes_ReturnsError covering three representative rejection paths. Verified: go build, go vet, golangci-lint, full `go test`, and `go test -race ./internal/correlator/... ./internal/remediation/...` are all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address coderabbit review comments on PR #87 Six major-severity items flagged by coderabbit, each independently verifiable: 1. securitypolicy_controller RBAC marker reduced `secrets` to verbs=get. The controller only fetches named secrets via r.Get — list/watch were overscoped. (Note: aggregate role.yaml still shows get;list;watch because four other controllers legitimately need the wider verbs; tightening those is out of scope for this review and tracked separately.) 2. resolveConfigStatus now explicitly clears GitOpsConfigured/GitOpsRepo when the live GitOpsRepositoryList is empty. The previous branch returned the store snapshot unchanged, so if SetConfigStatus ever seeded stale values the badge could still read "connected" after the live query proved there are no repos. Live state is the source of truth. 3. dashboard Server.Start now surfaces ListenAndServe startup errors even when ctx.Done() wins the select. A real bind/cert-load failure could race ctx cancellation and be silently discarded by the old `<-serveErr` discard; we now check the drained error and return it if it's not http.ErrServerClosed. 4. GitHub engine safeRepoPath validation errors are now wrapped with operation context (GetFile / upsert / delete), per CLAUDE.md's rule against bare `return err`. 5. remediation.Engine.GeneratePlan no longer embeds raw LLM analysis in the returned error. The error is emitted as a Kubernetes Event by the RemediationPolicy controller — LLM output can echo secrets, request fragments, or arbitrary text and must not land on cluster-visible sinks. Full analysis stays in the operator log. The now-unused truncateString helper is removed. 6. extractFixes now rejects create/update fixes with empty/whitespace patches. An `operation:"update"` + `patch:""` previously passed validation and would generate a PR that blanks the target manifest — empty payloads are only valid for `delete` (where the op carries the intent). Added TestExtractFixes_EmptyPatchForCreateUpdateDropped covering the mixed case (blank create/update dropped, legit delete + non-blank update kept). Verified: build clean, full test suite + -race on dashboard/remediation/ correlator green, golangci-lint 0 issues. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dashboard): clarify SSE marshal-failure log message Per coderabbit review: the log read "unmarshalable Data" but the op is json.Marshal. Renamed to "json.Marshal failed" and added the error to the structured fields so V(1) logs carry enough context to debug the poisoning event without reaching for the code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(dashboard): SSE loop observes shutdown + surface Shutdown errors Two major items from coderabbit review: 1. SSE handler's inner select now watches s.backgroundContext().Done() in addition to r.Context().Done(). http.Server.Shutdown waits for active handlers to return on their own up to the 10s timeout — an SSE loop that only observed the per-request ctx would stall every graceful shutdown until that timeout expired. Wiring the server-lifecycle ctx into the select makes active dashboard tabs drop out promptly when the manager ctx cancels. 2. Start()'s ctx.Done() branch no longer masks Shutdown failures. The previous code logged Shutdown errors and still returned nil once ListenAndServe reported the expected ErrServerClosed — so a real stop failure (DeadlineExceeded, close error) disappeared. Capture shutdownErr, drain serveErr, and return wrapped errors per CLAUDE.md (no bare return err) on both branches: dashboard listen, dashboard listen during shutdown, dashboard shutdown. Verified: build clean, dashboard tests green under -race, lint 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 314269c commit d32a0ee

15 files changed

Lines changed: 784 additions & 144 deletions

internal/anomaly/detector.go

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -145,7 +145,10 @@ func (d *Detector) Observe(key string, value float64) *Anomaly {
145145
}
146146
}
147147

148-
// GetBaseline returns the current baseline for a key, or nil if not found.
148+
// GetBaseline returns a copy of the current baseline for key, or nil if
149+
// not found. The Values slice is deep-copied: the previous shallow struct
150+
// copy aliased the backing array, so a caller iterating Values raced with
151+
// the next Observe() `append` mutating the same memory.
149152
func (d *Detector) GetBaseline(key string) *Baseline {
150153
d.mu.RLock()
151154
defer d.mu.RUnlock()
@@ -154,6 +157,9 @@ func (d *Detector) GetBaseline(key string) *Baseline {
154157
return nil
155158
}
156159
result := *b
160+
if len(b.Values) > 0 {
161+
result.Values = append([]float64(nil), b.Values...)
162+
}
157163
return &result
158164
}
159165

internal/controller/clusterscan_controller.go

Lines changed: 37 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -258,7 +258,7 @@ func (r *ClusterScanReconciler) executeScan(ctx context.Context, scan *zelyov1al
258258
for i := range results {
259259
f := &results[i]
260260
finding := zelyov1alpha1.Finding{
261-
ID: fmt.Sprintf("%s-%s-%s-%s", f.RuleType, f.ResourceNamespace, f.ResourceName, f.Title[:min(len(f.Title), 20)]),
261+
ID: fmt.Sprintf("%s-%s-%s-%s", f.RuleType, f.ResourceNamespace, f.ResourceName, truncateRunes(f.Title, 20)),
262262
Severity: f.Severity,
263263
Category: f.RuleType,
264264
Title: f.Title,
@@ -354,11 +354,15 @@ func (r *ClusterScanReconciler) evaluateCompliance(ctx context.Context, scan *ze
354354
func (r *ClusterScanReconciler) createScanReport(ctx context.Context, scan *zelyov1alpha1.ClusterScan, findings []zelyov1alpha1.Finding, summary *zelyov1alpha1.ScanSummary, complianceResults []zelyov1alpha1.ComplianceResult) (string, error) {
355355
log := logf.FromContext(ctx)
356356

357-
reportName := fmt.Sprintf("%s-%d", scan.Name, time.Now().Unix())
357+
// Use GenerateName (not Name) so the API server appends a 5-char random
358+
// suffix: this guarantees the result never exceeds the DNS-1123 label
359+
// limit of 63 chars regardless of scan.Name length, and stays unique
360+
// even when two scans complete in the same second (the unix-timestamp
361+
// approach previously collided under parallel execution).
358362
report := &zelyov1alpha1.ScanReport{
359363
ObjectMeta: metav1.ObjectMeta{
360-
Name: reportName,
361-
Namespace: scan.Namespace,
364+
GenerateName: scanReportBasename(scan.Name),
365+
Namespace: scan.Namespace,
362366
Labels: map[string]string{
363367
"zelyo.ai/scan": scan.Name,
364368
},
@@ -388,7 +392,21 @@ func (r *ClusterScanReconciler) createScanReport(ctx context.Context, scan *zely
388392
log.Error(err, "Failed to update ScanReport status")
389393
}
390394

391-
return reportName, nil
395+
return report.Name, nil
396+
}
397+
398+
// scanReportBasename returns a GenerateName prefix that leaves at least
399+
// 6 chars for the API server's random suffix while respecting the 63-char
400+
// DNS-1123 label limit. ScanReports use the scan name as a prefix so
401+
// human operators can still correlate reports to their scan in `kubectl
402+
// get scanreports`.
403+
func scanReportBasename(scanName string) string {
404+
const maxPrefix = 56 // 63 - 5 suffix - 1 dash - 1 safety = plenty.
405+
base := scanName
406+
if len(base) > maxPrefix {
407+
base = base[:maxPrefix]
408+
}
409+
return base + "-"
392410
}
393411

394412
// resolveTargetPods lists running pods matching the scan's scope.
@@ -499,3 +517,17 @@ func (r *ClusterScanReconciler) SetupWithManager(mgr ctrl.Manager) error {
499517
Named("clusterscan").
500518
Complete(r)
501519
}
520+
521+
// truncateRunes returns s truncated to at most maxRunes runes. Slicing a
522+
// string by byte index (s[:n]) chops mid-codepoint on multi-byte UTF-8
523+
// input, producing invalid sequences in downstream Finding IDs.
524+
func truncateRunes(s string, maxRunes int) string {
525+
if maxRunes <= 0 {
526+
return ""
527+
}
528+
r := []rune(s)
529+
if len(r) <= maxRunes {
530+
return s
531+
}
532+
return string(r[:maxRunes])
533+
}

internal/controller/remediationpolicy_controller.go

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ package controller
1919
import (
2020
"context"
2121
"fmt"
22+
"strings"
2223
"time"
2324

2425
corev1 "k8s.io/api/core/v1"
@@ -112,18 +113,37 @@ func (r *RemediationPolicyReconciler) Reconcile(ctx context.Context, req ctrl.Re
112113
fmt.Sprintf("GitOpsRepository %q is available (phase: %s)", repo.Name, repo.Status.Phase), policy.Generation)
113114

114115
// ── Step 2: Validate targeted SecurityPolicies ──
116+
// Non-NotFound errors previously fell through silently; track missing
117+
// and errored targets explicitly so we can mark a degraded condition
118+
// and requeue with backoff.
119+
var missingTargets []string
115120
if len(policy.Spec.TargetPolicies) > 0 {
116121
for _, policyName := range policy.Spec.TargetPolicies {
117122
sp := &zelyov1alpha1.SecurityPolicy{}
118123
spKey := types.NamespacedName{Name: policyName, Namespace: policy.Namespace}
119124
if err := r.Get(ctx, spKey, sp); err != nil {
120125
if errors.IsNotFound(err) {
126+
missingTargets = append(missingTargets, policyName)
121127
r.Recorder.Event(policy, corev1.EventTypeWarning, zelyov1alpha1.EventReasonReconcileError,
122128
fmt.Sprintf("Target SecurityPolicy %q not found", policyName))
129+
continue
123130
}
131+
return ctrl.Result{}, fmt.Errorf("fetching target SecurityPolicy %q: %w", policyName, err)
124132
}
125133
}
126134
}
135+
if len(missingTargets) > 0 {
136+
conditions.MarkFalse(&policy.Status.Conditions, zelyov1alpha1.ConditionReady,
137+
zelyov1alpha1.ReasonReconcileFailed,
138+
fmt.Sprintf("target SecurityPolicies not found: %s", strings.Join(missingTargets, ", ")),
139+
policy.Generation)
140+
policy.Status.Phase = zelyov1alpha1.PhaseError
141+
policy.Status.ObservedGeneration = policy.Generation
142+
if err := r.Status().Update(ctx, policy); err != nil {
143+
return ctrl.Result{}, fmt.Errorf("updating status: %w", err)
144+
}
145+
return ctrl.Result{RequeueAfter: 2 * time.Minute}, nil
146+
}
127147

128148
// ── Step 3: Query correlator for open incidents ──
129149
var prsCreated int32

internal/controller/securitypolicy_controller.go

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -70,8 +70,10 @@ type SecurityPolicyReconciler struct {
7070
// +kubebuilder:rbac:groups=zelyo.ai,resources=securitypolicies,verbs=get;list;watch;create;update;patch;delete
7171
// +kubebuilder:rbac:groups=zelyo.ai,resources=securitypolicies/status,verbs=get;update;patch
7272
// +kubebuilder:rbac:groups=zelyo.ai,resources=securitypolicies/finalizers,verbs=update
73+
// +kubebuilder:rbac:groups=zelyo.ai,resources=notificationchannels,verbs=get;list;watch
7374
// +kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch
7475
// +kubebuilder:rbac:groups="",resources=namespaces,verbs=get;list;watch
76+
// +kubebuilder:rbac:groups="",resources=secrets,verbs=get
7577
// +kubebuilder:rbac:groups="",resources=events,verbs=create;patch
7678

7779
// Reconcile implements the main reconciliation loop for SecurityPolicy.

internal/controller/zelyoconfig_controller.go

Lines changed: 33 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@ package controller
1818

1919
import (
2020
"context"
21+
stderrors "errors"
2122
"fmt"
2223
"os"
2324
"time"
@@ -254,13 +255,20 @@ func (r *ZelyoConfigReconciler) reconcileRemediationEngine(ctx context.Context,
254255
if ns == "" {
255256
ns = "zelyo-system"
256257
}
258+
// Events and Conditions are visible to anyone with namespace
259+
// read access; upstream *APIError.Error() includes the raw
260+
// provider response body which can echo request fragments or
261+
// header prefixes on 401/403. Use a redacted status-code-only
262+
// summary for any externally-visible surface. Full detail
263+
// stays in the operator log above.
264+
probeSummary := redactLLMProbeError(probeErr)
257265
r.Recorder.Event(config, corev1.EventTypeWarning, "LLMKeyVerificationFailed",
258-
fmt.Sprintf("LLM API key verification failed: %v. Verify your key with: kubectl get secret %s -n %s -o jsonpath='{.data.api-key}' | base64 -d",
259-
probeErr, config.Spec.LLM.APIKeySecret, ns))
266+
fmt.Sprintf("LLM API key verification failed: %s. Verify your key with: kubectl get secret %s -n %s -o jsonpath='{.data.api-key}' | base64 -d",
267+
probeSummary, config.Spec.LLM.APIKeySecret, ns))
260268
config.Status.LLMKeyStatus = "Invalid"
261269
conditions.MarkFalse(&config.Status.Conditions, zelyov1alpha1.ConditionLLMKeyVerified,
262270
zelyov1alpha1.ReasonLLMKeyInvalid,
263-
fmt.Sprintf("API key probe failed: %v", probeErr), config.Generation)
271+
fmt.Sprintf("API key probe failed: %s", probeSummary), config.Generation)
264272
} else {
265273
now := metav1.Now()
266274
config.Status.LLMKeyStatus = "Verified"
@@ -297,3 +305,25 @@ func (r *ZelyoConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
297305
Named("zelyoconfig").
298306
Complete(r)
299307
}
308+
309+
// redactLLMProbeError returns a short status-code-only summary for the
310+
// LLM probe error, suitable for Kubernetes Events and Condition messages.
311+
// The full error (including upstream provider response body) stays in the
312+
// operator log; Events are visible to any namespace reader and must not
313+
// carry raw upstream bodies that may echo prompts or header prefixes.
314+
func redactLLMProbeError(err error) string {
315+
if err == nil {
316+
return ""
317+
}
318+
var apiErr *llm.APIError
319+
if stderrors.As(err, &apiErr) {
320+
if apiErr.StatusCode == 429 {
321+
return "rate limited by provider"
322+
}
323+
if apiErr.StatusCode >= 400 && apiErr.StatusCode < 500 {
324+
return fmt.Sprintf("provider returned HTTP %d (auth/config error)", apiErr.StatusCode)
325+
}
326+
return fmt.Sprintf("provider returned HTTP %d", apiErr.StatusCode)
327+
}
328+
return "provider unreachable"
329+
}

internal/correlator/engine.go

Lines changed: 59 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -83,18 +83,26 @@ func NewEngine(cfg *Config) *Engine {
8383
}
8484

8585
// Ingest adds an event and attempts to correlate it.
86+
//
87+
// The onIncident callback is invoked OUTSIDE the engine lock so callbacks
88+
// that call back into the engine (ResolveIncident, GetOpenIncidents) do
89+
// not deadlock. Both the callback payload and the returned *Incident are
90+
// independent copies snapshotted WHILE the lock is held — a concurrent
91+
// Ingest must not be able to append to incident.Events between our
92+
// snapshot and the callback/return.
8693
func (e *Engine) Ingest(event *Event) *Incident {
87-
e.mu.Lock()
88-
defer e.mu.Unlock()
94+
var notifySnapshot *Incident
95+
var result *Incident
8996

97+
e.mu.Lock()
9098
if event.Timestamp.IsZero() {
9199
event.Timestamp = time.Now()
92100
}
93101

94102
e.events = append(e.events, event)
95103
e.pruneOldEvents()
96104

97-
for id, incident := range e.incidents {
105+
for _, incident := range e.incidents {
98106
if incident.Resolved {
99107
continue
100108
}
@@ -104,50 +112,74 @@ func (e *Engine) Ingest(event *Event) *Incident {
104112
if severityOrder(event.Severity) > severityOrder(incident.Severity) {
105113
incident.Severity = event.Severity
106114
}
107-
if e.onIncident != nil {
108-
e.onIncident(incident)
109-
}
110-
return e.incidents[id]
115+
notifySnapshot = copyIncident(incident)
116+
result = copyIncident(incident)
117+
break
111118
}
112119
}
113120

114-
related := e.findRelated(event)
115-
if len(related) >= 2 {
116-
e.incidentSeq++
117-
incident := &Incident{
118-
ID: fmt.Sprintf("INC-%06d", e.incidentSeq),
119-
Severity: highestSeverity(related),
120-
Title: fmt.Sprintf("Correlated incident on %s/%s", event.Namespace, event.Resource),
121-
Events: related,
122-
Namespace: event.Namespace,
123-
Resource: event.Resource,
124-
CreatedAt: time.Now(),
125-
UpdatedAt: time.Now(),
126-
}
127-
e.incidents[incident.ID] = incident
128-
if e.onIncident != nil {
129-
e.onIncident(incident)
121+
if result == nil {
122+
related := e.findRelated(event)
123+
if len(related) >= 2 {
124+
e.incidentSeq++
125+
incident := &Incident{
126+
ID: fmt.Sprintf("INC-%06d", e.incidentSeq),
127+
Severity: highestSeverity(related),
128+
Title: fmt.Sprintf("Correlated incident on %s/%s", event.Namespace, event.Resource),
129+
Events: related,
130+
Namespace: event.Namespace,
131+
Resource: event.Resource,
132+
CreatedAt: time.Now(),
133+
UpdatedAt: time.Now(),
134+
}
135+
e.incidents[incident.ID] = incident
136+
notifySnapshot = copyIncident(incident)
137+
result = copyIncident(incident)
130138
}
131-
return incident
132139
}
140+
cb := e.onIncident
141+
e.mu.Unlock()
133142

134-
return nil
143+
if notifySnapshot != nil && cb != nil {
144+
cb(notifySnapshot)
145+
}
146+
return result
135147
}
136148

137-
// GetOpenIncidents returns all unresolved incidents.
149+
// GetOpenIncidents returns copies of all unresolved incidents. Returning
150+
// copies keeps the internal *Incident records from being aliased into
151+
// caller code, where a concurrent Ingest appending to incident.Events
152+
// would race with a range-loop reader.
138153
func (e *Engine) GetOpenIncidents() []*Incident {
139154
e.mu.Lock()
140155
defer e.mu.Unlock()
141156

142157
result := make([]*Incident, 0)
143158
for _, inc := range e.incidents {
144159
if !inc.Resolved {
145-
result = append(result, inc)
160+
result = append(result, copyIncident(inc))
146161
}
147162
}
148163
return result
149164
}
150165

166+
// copyIncident returns a shallow-deep copy of the incident: a new
167+
// *Incident with a new Events slice header containing the same *Event
168+
// pointers. Events themselves are treated as immutable once ingested;
169+
// cloning the slice header is sufficient to shield callers from future
170+
// `append`s mutating the engine's backing array.
171+
func copyIncident(inc *Incident) *Incident {
172+
if inc == nil {
173+
return nil
174+
}
175+
cp := *inc
176+
if len(inc.Events) > 0 {
177+
cp.Events = make([]*Event, len(inc.Events))
178+
copy(cp.Events, inc.Events)
179+
}
180+
return &cp
181+
}
182+
151183
// ResolveIncident marks an incident as resolved.
152184
func (e *Engine) ResolveIncident(id string) {
153185
e.mu.Lock()

0 commit comments

Comments
 (0)