Commit d32a0ee
fix: audit wave 1 — dashboard reliability, controller correctness, remediation safety, concurrency (#87)
* fix(dashboard): use server-lifecycle context for background goroutines
Preset propose spawns a demoAdvancePreset goroutine that outlives the
HTTP request. The previous code used context.Background(), which meant
the goroutine would continue mutating shared state (preset store,
remediation store) after server shutdown — introducing a shutdown race.
Switch to a server-lifecycle context published by Start() via a
mutex-guarded field. Handlers that spawn background work derive from it
via backgroundContext() so goroutines observe server shutdown.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(dashboard): close audit gaps in compliance + SSE + preset flow
Consolidated dashboard-layer fixes surfaced during the audit pass:
- GitOps badge no longer lies. ConfigStatus was seeded with a demo
default (GitOpsConfigured: true, GitOpsRepo: "zelyo-ai/platform-gitops")
and SetConfigStatus was never wired, so the Compliance page claimed
GitOps was connected even when no GitOpsRepository existed. Derive
the connected/repo fields live from GitOpsRepositoryList at request
time and drop the demo default.
- SSE send-on-closed-channel race. handleSSE's defer closed the
subscriber channel outside the map-mutation lock while Broadcast sent
on it under RLock — a classic send-on-closed panic window. Delete
from the map under Lock and let GC reap the channel.
- SSE stream poisoning on unmarshalable event.Data. An ignored
json.Marshal error emitted "data: null", breaking every subscriber's
stream. Skip the event and log at V(1) instead.
- Start() no longer returns before Shutdown finishes. The goroutine-only
Shutdown pattern let ListenAndServe return http.ErrServerClosed
immediately while active handlers were still draining; join the two
via a buffered error channel so the manager sees runnable exit only
after Shutdown completes.
- backgroundContext() fallback now returns a cancelled ctx instead of
context.Background() so handlers invoked before Start (test paths)
can't silently re-introduce the goroutine leak.
- Preset state race fixed. handlePresetPropose wrote state=Proposing,
spawned demoAdvancePreset, then wrote state=PendingMerge — racing
with the goroutine's later merged→enabled writes. Complete both
synchronous transitions before spawning the goroutine, and thread
the GitOps repo slug through so the goroutine's Emit call doesn't
re-read shared state.
- Preset ID validation. Four preset endpoints now reject IDs that
don't match the catalog shape (lowercase, digits, dashes, ≤64 chars)
before they reach FindPreset or the remediation store.
- /explain body cap. handleExplain now wraps r.Body in
http.MaxBytesReader(16 KiB) so a hostile client can't make the
handler allocate unbounded memory while decoding the request.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(controller): ScanReport GenerateName, target validation, LLM redaction
- ClusterScan controller now creates ScanReports via GenerateName rather
than a composed "<scan>-<unix-timestamp>" Name. The previous form
violated CLAUDE.md (which mandates GenerateName for ScanReports) and
could exceed the 63-char DNS-1123 label limit on long scan names.
Added scanReportBasename to bound the prefix at 56 chars so the API
server's 5-char suffix always fits.
- truncateRunes replaces unsafe f.Title[:min(...)] byte slicing in the
Finding ID composition. Multi-byte UTF-8 titles (any CJK/emoji) were
being chopped mid-codepoint, producing invalid-UTF-8 ID strings.
- SecurityPolicy controller adds the RBAC markers that match its runtime
behavior: it lists NotificationChannelList and reads the referenced
Slack webhook Secret. Without these markers the operator silently
fails notification delivery with RBAC-forbidden errors at runtime.
- RemediationPolicy controller no longer swallows non-NotFound errors
when validating TargetPolicies. Missing targets now mark the policy
Ready=False with Reason=ReconcileFailed, set Phase=Error, and requeue
with backoff instead of ticking forward into active remediation.
- ZelyoConfig controller redacts the LLM probe error before putting it
into the Event message and Condition. *llm.APIError.Error() includes
the raw provider response body which, on 401/403, often echoes
submitted header prefixes or request fragments — visible to anyone
with namespace-scoped get-events RBAC. Full detail still goes to the
operator log via log.Error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(remediation,github): validate LLM fix plans and escape PR file paths
- safeRepoPath in the GitHub engine now validates and URL-escapes every
repository-relative path before embedding it in a GitHub contents API
URL. GetFile/createOrUpdateFile/deleteFile previously concatenated the
path directly, so an LLM-emitted "../secret.yaml" or "%2e%2e/passwd"
could overwrite or delete files outside the intended tree. We reject
absolute paths, backslashes, and any ../.. segment, and url.PathEscape
each remaining segment so spaces/percent-encodings don't corrupt the
URL. branch and ref query args are now url.QueryEscape'd too.
- extractFixes no longer trusts the LLM response. The previous
fallback wrapped any unstructured LLM prose as a single "patch" fix
and committed it — exactly the kind of arbitrary-content-in-a-PR
outcome the auto-remediation design is supposed to prevent. The new
extractFixes rejects any response that doesn't conform to the JSON
schema, validates each fix's operation against an allowlist
(create/update/delete — no silent "update" default for unknown ops),
validates the file path against the same safety rules the GitHub
engine enforces, and bounds patch/description sizes.
- Added test coverage for the three regressions this closes:
unstructured responses must be rejected (0 fixes), path traversal
in LLM-emitted paths must be filtered out, and unknown operations
must be dropped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(concurrency): deep-copy shared state + move callback outside lock
- correlator.Engine.Ingest no longer holds the engine mutex across
onIncident callback invocation. A callback that calls back into the
engine (ResolveIncident, GetOpenIncidents) would deadlock waiting
for the Ingest-held lock. Snapshot the incident to notify, release
the lock, then fire the callback with a copy.
- correlator.Engine.Ingest and GetOpenIncidents now return *Incident
copies. The previous code returned a pointer into e.incidents that a
concurrent Ingest would then mutate via `append(incident.Events, ...)`
— producing a data race any caller ranging over Events could hit.
New copyIncident clones the Incident struct and its Events slice
header (shielding callers from future appends while keeping the
*Event pointers shared, which are treated as immutable post-ingest).
- anomaly.Detector.GetBaseline now deep-copies Baseline.Values. The
shallow struct copy aliased the slice backing array with the engine's
own, so a caller ranging over Values raced with Observe() appending.
- threat.Aggregator.LookupImage stores and returns ImageThreat copies
instead of aliasing one pointer across the cache and every caller.
cloneImageThreat handles the CVEs slice too.
None of these were hypothetical — each would trip `-race` under
realistic load. No behavior change for single-threaded callers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(lint): rename url locals and fix cancelled/canceled misspell
Renaming github/engine.go local url -> reqURL removes the
importShadow gocritic warnings that surfaced once net/url was
imported for safeRepoPath. No behavior change; mechanical rename of
six locals and one parameter.
Comment spellings in dashboard/server.go adjusted to "canceled" to
match stdlib (context.Canceled) and the misspell linter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address codex review — close post-unlock race, fail on empty plans
Two P1 findings from codex review of PR #87:
1. correlator.Engine.Ingest still had a post-unlock race in the
callback dispatch path. notifySnapshot previously captured the raw
internal *Incident pointer under the lock, but the copyIncident
call happened AFTER e.mu.Unlock() — so a concurrent Ingest could
append to incident.Events between the unlock and the copy,
producing the exact data race the fix was meant to eliminate.
Snapshot the incident via copyIncident WHILE holding the lock,
then pass the immutable snapshot to the callback after unlock.
Added TestEngine_ConcurrentIngestIsRaceFree which drives 16×50
concurrent Ingests against the same (ns, resource) plus a parallel
GetOpenIncidents reader — the old code tripped -race immediately;
the new code is clean.
2. remediation.Engine.GeneratePlan no longer returns a successful
plan with zero fixes. When every LLM-proposed fix fails validation
(path traversal, unknown operation, size cap, or the response is
unstructured prose), extractFixes correctly returns zero fixes —
but GeneratePlan was still returning (plan, nil). In dry-run mode
ApplyPlan then returns nil error too, and processIncidents resolves
the incident anyway: malformed/unsafe LLM output was silently
closing incidents with no remediation applied. GeneratePlan now
returns an error in the zero-fix case so the caller keeps the
incident open for retry/manual triage. Added
TestGeneratePlan_ZeroValidatedFixes_ReturnsError covering three
representative rejection paths.
Verified: go build, go vet, golangci-lint, full `go test`, and
`go test -race ./internal/correlator/... ./internal/remediation/...`
are all clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address coderabbit review comments on PR #87
Six major-severity items flagged by coderabbit, each independently
verifiable:
1. securitypolicy_controller RBAC marker reduced `secrets` to verbs=get.
The controller only fetches named secrets via r.Get — list/watch were
overscoped. (Note: aggregate role.yaml still shows get;list;watch
because four other controllers legitimately need the wider verbs;
tightening those is out of scope for this review and tracked
separately.)
2. resolveConfigStatus now explicitly clears GitOpsConfigured/GitOpsRepo
when the live GitOpsRepositoryList is empty. The previous branch
returned the store snapshot unchanged, so if SetConfigStatus ever
seeded stale values the badge could still read "connected" after
the live query proved there are no repos. Live state is the source
of truth.
3. dashboard Server.Start now surfaces ListenAndServe startup errors
even when ctx.Done() wins the select. A real bind/cert-load failure
could race ctx cancellation and be silently discarded by the old
`<-serveErr` discard; we now check the drained error and return it
if it's not http.ErrServerClosed.
4. GitHub engine safeRepoPath validation errors are now wrapped with
operation context (GetFile / upsert / delete), per CLAUDE.md's rule
against bare `return err`.
5. remediation.Engine.GeneratePlan no longer embeds raw LLM analysis
in the returned error. The error is emitted as a Kubernetes Event
by the RemediationPolicy controller — LLM output can echo secrets,
request fragments, or arbitrary text and must not land on
cluster-visible sinks. Full analysis stays in the operator log.
The now-unused truncateString helper is removed.
6. extractFixes now rejects create/update fixes with empty/whitespace
patches. An `operation:"update"` + `patch:""` previously passed
validation and would generate a PR that blanks the target manifest
— empty payloads are only valid for `delete` (where the op carries
the intent). Added TestExtractFixes_EmptyPatchForCreateUpdateDropped
covering the mixed case (blank create/update dropped, legit
delete + non-blank update kept).
Verified: build clean, full test suite + -race on dashboard/remediation/
correlator green, golangci-lint 0 issues.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(dashboard): clarify SSE marshal-failure log message
Per coderabbit review: the log read "unmarshalable Data" but the op is
json.Marshal. Renamed to "json.Marshal failed" and added the error to
the structured fields so V(1) logs carry enough context to debug the
poisoning event without reaching for the code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(dashboard): SSE loop observes shutdown + surface Shutdown errors
Two major items from coderabbit review:
1. SSE handler's inner select now watches s.backgroundContext().Done()
in addition to r.Context().Done(). http.Server.Shutdown waits for
active handlers to return on their own up to the 10s timeout — an
SSE loop that only observed the per-request ctx would stall every
graceful shutdown until that timeout expired. Wiring the
server-lifecycle ctx into the select makes active dashboard tabs
drop out promptly when the manager ctx cancels.
2. Start()'s ctx.Done() branch no longer masks Shutdown failures. The
previous code logged Shutdown errors and still returned nil once
ListenAndServe reported the expected ErrServerClosed — so a real
stop failure (DeadlineExceeded, close error) disappeared. Capture
shutdownErr, drain serveErr, and return wrapped errors per CLAUDE.md
(no bare return err) on both branches: dashboard listen, dashboard
listen during shutdown, dashboard shutdown.
Verified: build clean, dashboard tests green under -race, lint 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent 314269c commit d32a0ee
15 files changed
Lines changed: 784 additions & 144 deletions
File tree
- internal
- anomaly
- controller
- correlator
- dashboard
- github
- remediation
- threat
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
145 | 145 | | |
146 | 146 | | |
147 | 147 | | |
148 | | - | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
149 | 152 | | |
150 | 153 | | |
151 | 154 | | |
| |||
154 | 157 | | |
155 | 158 | | |
156 | 159 | | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
157 | 163 | | |
158 | 164 | | |
159 | 165 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
258 | 258 | | |
259 | 259 | | |
260 | 260 | | |
261 | | - | |
| 261 | + | |
262 | 262 | | |
263 | 263 | | |
264 | 264 | | |
| |||
354 | 354 | | |
355 | 355 | | |
356 | 356 | | |
357 | | - | |
| 357 | + | |
| 358 | + | |
| 359 | + | |
| 360 | + | |
| 361 | + | |
358 | 362 | | |
359 | 363 | | |
360 | | - | |
361 | | - | |
| 364 | + | |
| 365 | + | |
362 | 366 | | |
363 | 367 | | |
364 | 368 | | |
| |||
388 | 392 | | |
389 | 393 | | |
390 | 394 | | |
391 | | - | |
| 395 | + | |
| 396 | + | |
| 397 | + | |
| 398 | + | |
| 399 | + | |
| 400 | + | |
| 401 | + | |
| 402 | + | |
| 403 | + | |
| 404 | + | |
| 405 | + | |
| 406 | + | |
| 407 | + | |
| 408 | + | |
| 409 | + | |
392 | 410 | | |
393 | 411 | | |
394 | 412 | | |
| |||
499 | 517 | | |
500 | 518 | | |
501 | 519 | | |
| 520 | + | |
| 521 | + | |
| 522 | + | |
| 523 | + | |
| 524 | + | |
| 525 | + | |
| 526 | + | |
| 527 | + | |
| 528 | + | |
| 529 | + | |
| 530 | + | |
| 531 | + | |
| 532 | + | |
| 533 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
19 | 19 | | |
20 | 20 | | |
21 | 21 | | |
| 22 | + | |
22 | 23 | | |
23 | 24 | | |
24 | 25 | | |
| |||
112 | 113 | | |
113 | 114 | | |
114 | 115 | | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
115 | 120 | | |
116 | 121 | | |
117 | 122 | | |
118 | 123 | | |
119 | 124 | | |
120 | 125 | | |
| 126 | + | |
121 | 127 | | |
122 | 128 | | |
| 129 | + | |
123 | 130 | | |
| 131 | + | |
124 | 132 | | |
125 | 133 | | |
126 | 134 | | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
127 | 147 | | |
128 | 148 | | |
129 | 149 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
70 | 70 | | |
71 | 71 | | |
72 | 72 | | |
| 73 | + | |
73 | 74 | | |
74 | 75 | | |
| 76 | + | |
75 | 77 | | |
76 | 78 | | |
77 | 79 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
18 | 18 | | |
19 | 19 | | |
20 | 20 | | |
| 21 | + | |
21 | 22 | | |
22 | 23 | | |
23 | 24 | | |
| |||
254 | 255 | | |
255 | 256 | | |
256 | 257 | | |
| 258 | + | |
| 259 | + | |
| 260 | + | |
| 261 | + | |
| 262 | + | |
| 263 | + | |
| 264 | + | |
257 | 265 | | |
258 | | - | |
259 | | - | |
| 266 | + | |
| 267 | + | |
260 | 268 | | |
261 | 269 | | |
262 | 270 | | |
263 | | - | |
| 271 | + | |
264 | 272 | | |
265 | 273 | | |
266 | 274 | | |
| |||
297 | 305 | | |
298 | 306 | | |
299 | 307 | | |
| 308 | + | |
| 309 | + | |
| 310 | + | |
| 311 | + | |
| 312 | + | |
| 313 | + | |
| 314 | + | |
| 315 | + | |
| 316 | + | |
| 317 | + | |
| 318 | + | |
| 319 | + | |
| 320 | + | |
| 321 | + | |
| 322 | + | |
| 323 | + | |
| 324 | + | |
| 325 | + | |
| 326 | + | |
| 327 | + | |
| 328 | + | |
| 329 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
83 | 83 | | |
84 | 84 | | |
85 | 85 | | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
86 | 93 | | |
87 | | - | |
88 | | - | |
| 94 | + | |
| 95 | + | |
89 | 96 | | |
| 97 | + | |
90 | 98 | | |
91 | 99 | | |
92 | 100 | | |
93 | 101 | | |
94 | 102 | | |
95 | 103 | | |
96 | 104 | | |
97 | | - | |
| 105 | + | |
98 | 106 | | |
99 | 107 | | |
100 | 108 | | |
| |||
104 | 112 | | |
105 | 113 | | |
106 | 114 | | |
107 | | - | |
108 | | - | |
109 | | - | |
110 | | - | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
111 | 118 | | |
112 | 119 | | |
113 | 120 | | |
114 | | - | |
115 | | - | |
116 | | - | |
117 | | - | |
118 | | - | |
119 | | - | |
120 | | - | |
121 | | - | |
122 | | - | |
123 | | - | |
124 | | - | |
125 | | - | |
126 | | - | |
127 | | - | |
128 | | - | |
129 | | - | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
130 | 138 | | |
131 | | - | |
132 | 139 | | |
| 140 | + | |
| 141 | + | |
133 | 142 | | |
134 | | - | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
135 | 147 | | |
136 | 148 | | |
137 | | - | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
138 | 153 | | |
139 | 154 | | |
140 | 155 | | |
141 | 156 | | |
142 | 157 | | |
143 | 158 | | |
144 | 159 | | |
145 | | - | |
| 160 | + | |
146 | 161 | | |
147 | 162 | | |
148 | 163 | | |
149 | 164 | | |
150 | 165 | | |
| 166 | + | |
| 167 | + | |
| 168 | + | |
| 169 | + | |
| 170 | + | |
| 171 | + | |
| 172 | + | |
| 173 | + | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
| 180 | + | |
| 181 | + | |
| 182 | + | |
151 | 183 | | |
152 | 184 | | |
153 | 185 | | |
| |||
0 commit comments