Fix Prometheus alerting gaps in v4#614
Conversation
Closes #400. AlertManager descriptions wrapped subject_CN, issuer_CN, secret_namespace/secret_name and filepath in double quotes, which broke parsing when the description was reinjected into JSON-bearing webhook payloads (Teams Adaptive Cards, custom receivers). Switch to backticks: no JSON escaping required, Slack/Teams/Discord render them as inline code spans, plain-text destinations still show the value clearly. The surrounding Helm template machinery is unchanged — only the literal quote characters around the placeholders flip from " to `.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdded collision-dropped metric tracking and new health alerts. The exporter now records certificate data loss from collision conflicts, added alerts for source down and certificate validity, introduced Helm toggles for alert management, and updated Grafana and documentation to reflect new metrics and alert rules. ChangesCollision metric instrumentation and source health alerting
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@chart/templates/prometheusrule.yaml`:
- Line 24: The alert uses delta(x509_source_errors_total[15m]) which is
reset-sensitive; replace delta(...) with increase(x509_source_errors_total[15m])
in the PrometheusRule expression so the counter alert uses increase() over the
15m window (keep the same metric name x509_source_errors_total and time range)
to avoid under-reporting after restarts.
In `@chart/values.yaml`:
- Around line 750-752: The new boolean toggles alertOnSourceDown,
alertOnPassphraseFailures, alertOnCertificateNotYetValid, and
alertOnCertificateCollisions are missing per-field # `@schema` annotations; add a
# `@schema` block immediately above each field (alertOnSourceDown,
alertOnPassphraseFailures, alertOnCertificateNotYetValid,
alertOnCertificateCollisions) declaring the field as a boolean (with an explicit
default where appropriate) and include a short description and strict
constraints for chart-controlled values while allowing permissive rules for any
Kubernetes pass-through fields per the chart guidelines. Ensure each block
accompanies its respective YAML key so the chart schema generator and linters
pick up the type and metadata.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b14c89f4-1c5d-437c-ac82-5312ba08ef54
⛔ Files ignored due to path filters (1)
chart/README.mdis excluded by!chart/README.md
📒 Files selected for processing (8)
chart/grafana-dashboards/x509-certificate-exporter.jsonchart/templates/prometheusrule.yamlchart/values.schema.jsonchart/values.yamldocs/metrics.mdpkg/registry/collector.gopkg/registry/registry.gopkg/registry/registry_test.go
…se increase() The X509ExporterReadErrors PrometheusRule alert matched x509_read_errors, the v3 metric name. The v4 exporter emits x509_source_errors_total instead — the alert never fired since the v4 rewrite. Same stale name lingered in the values.yaml extraAlertGroups example and across 8 Grafana dashboard panels. All three sites renamed in lockstep. Switch the alert's range function from delta() to increase() while we're here. delta() is documented for gauges; for monotonic counters increase() is the canonical choice and correctly handles counter resets across exporter restarts.
…on alerts
Four new opt-in alerts close observability holes the existing rules
left uncovered:
- SourceDown (critical) — x509_source_up == 0 for 5m. Catches RBAC
failures, persistent K8s API errors, unreadable file paths. A source
that never converges would otherwise be silent while certs it should
watch are not checked.
- KeystorePassphraseFailures (warning) — increase on
x509_{pkcs12,jks}_passphrase_failures_total. A misconfigured
passphraseKey / passphraseSecretRef previously only leaked into
logs and an unwatched counter.
- CertificateNotYetValid (warning) — x509_cert_not_before > time().
Depends on exposeNotBeforeMetric: true; gated for explicit opt-in.
- CertificateCollision (warning) — increase on x509_cert_collision_total.
Two certs sharing a label set means one is silently invisible.
Each alert ships with an alertOnXxx toggle (default true) and its own
severity in values.yaml. Schema + README regenerated. helm lint, helm
template, schema fixtures and helm-examples all pass.
d6eb7b9 to
5f335eb
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
chart/values.schema.json (1)
1-1460:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRegenerate Helm artifacts to resolve schema drift.
CI already shows
chart/values.schema.jsonis stale relative tochart/values.yaml; please runtask doc:helmand commit the regeneratedchart/values.schema.json(andchart/README.md) in this PR.As per coding guidelines, "Regenerate chart documentation and schema after touching
chart/values.yaml,chart/templates/**, or anything influencing chart output viatask doc:helm— commit bothchart/README.mdandchart/values.schema.jsonalongside changes."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/values.schema.json` around lines 1 - 1460, The values.schema.json is stale after changes to chart inputs; regenerate Helm docs and schema by running the helm doc task and commit the updated artifacts. Run the project task "task doc:helm" (or the equivalent make/npm script) to regenerate chart/values.schema.json and chart/README.md, verify changes against chart/values.yaml, and add/commit both regenerated files to the PR so CI schema checks pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@chart/values.schema.json`:
- Around line 1-1460: The values.schema.json is stale after changes to chart
inputs; regenerate Helm docs and schema by running the helm doc task and commit
the updated artifacts. Run the project task "task doc:helm" (or the equivalent
make/npm script) to regenerate chart/values.schema.json and chart/README.md,
verify changes against chart/values.yaml, and add/commit both regenerated files
to the PR so CI schema checks pass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 01eb8446-80bf-495d-8353-16670583678b
⛔ Files ignored due to path filters (1)
chart/README.mdis excluded by!chart/README.md
📒 Files selected for processing (8)
chart/grafana-dashboards/x509-certificate-exporter.jsonchart/templates/prometheusrule.yamlchart/values.schema.jsonchart/values.yamldocs/metrics.mdpkg/registry/collector.gopkg/registry/registry.gopkg/registry/registry_test.go
✅ Files skipped from review due to trivial changes (2)
- docs/metrics.md
- chart/grafana-dashboards/x509-certificate-exporter.json
Detect-only x509_cert_collision_total ticks on every scrape under the default CollisionAuto policy, even when the discriminator label silently resolves the overlap and no certificate is dropped. Alerting on it produced perma-firing pages for a benign config quirk. Introduce x509_cert_collision_dropped_total: counts only the items the registry actually threw away (CollisionNever policy). The CertificateCollision PrometheusRule now points at the new counter so the alert fires only when data is genuinely lost. The detect-only counter stays available for dashboards and diagnostics. Tests in pkg/registry cement the contract: dropped stays at 0 under CollisionAuto and is positive under CollisionNever. docs/metrics.md documents both counters and cross-links them.
9e63fa6 to
db04e3d
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
chart/values.schema.json (1)
1-1487:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPipeline failure: schema out of date — run
task doc:helm.CI reports
chart/values.schema.jsonis out of sync withchart/values.yaml. Per coding guidelines, regenerate the schema and commit bothchart/README.mdandchart/values.schema.json.task doc:helm git add chart/README.md chart/values.schema.json🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/values.schema.json` around lines 1 - 1487, Schema is out of sync with values.yaml causing CI failure; regenerate the Helm docs and schema by running the documented task and commit the updated artifacts. Run task doc:helm to regenerate the chart README and values.schema.json, then git-add and commit the regenerated README and schema (the JSON controls top-level keys such as "prometheusRules" and "secretsExporter" so you can verify changes there) and push the commit to trigger CI again.
🧹 Nitpick comments (2)
chart/templates/prometheusrule.yaml (1)
40-40: 💤 Low value
nindent 0after helper'snindent 8produces incorrect indentation.The
alertExtraLabelshelper already appliesnindent 8. Calling$extraLabels | nindent 0here resets to column 0, but you're not piping—you're just emitting the pre-indented output. The helper returns content starting with newline + 8 spaces, which should land correctly at this position. However, if no extra labels exist, the helper returns empty string andnindent 0is a no-op.Actually, re-reading:
$extraLabelsis assigned at line 3 and contains the result of the include (already indented). The| nindent 0on an already-rendered string doesn't re-indent—it's effectively dead code. Consider removing the redundant| nindent 0for clarity:- {{- $extraLabels | nindent 0 }} + {{- $extraLabels }}Same applies to all
$extraLabels | nindent 0and$extraAnnotations | nindent 0occurrences.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/templates/prometheusrule.yaml` at line 40, The rendered helper outputs ($extraLabels and $extraAnnotations) already contain leading indentation because alertExtraLabels/alertExtraAnnotations is called with nindent 8 when assigned; the trailing " | nindent 0" is redundant and can confuse readers, so remove the " | nindent 0" from usages of $extraLabels and $extraAnnotations in promtheusrule.yaml (locate the template references to the variables named $extraLabels and $extraAnnotations and delete the pipe to nindent 0), leaving the pre-indented helper output intact.test/schema/valid/severities-all-values.yaml (1)
1-7: 💤 Low valueConsider extending fixture to cover all new severity fields.
The fixture validates some severity enums but doesn't cover the new fields added in this PR:
sourceErrorsSustainedSeveritykubeTransportErrorsSeveritykubeTransportErrorsSustainedSeveritykeystorePassphraseFailuresSeveritycertificateNotYetValidSeveritycertificateCollisionSeveritysourceDownSeverityAdding them ensures the enum constraints are exercised.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/schema/valid/severities-all-values.yaml` around lines 1 - 7, Update the fixture file "severities-all-values.yaml" by adding the new severity fields under the prometheusRules block so the enum constraints are tested: add sourceErrorsSustainedSeverity, kubeTransportErrorsSeverity, kubeTransportErrorsSustainedSeverity, keystorePassphraseFailuresSeverity, certificateNotYetValidSeverity, certificateCollisionSeverity, and sourceDownSeverity; assign values (info, warning, critical) across these new fields so that all enum values are exercised (follow the pattern used by existing keys like sourceErrorsSeverity, certificateErrorSeverity, certificateExpirationsSeverity) and keep the same YAML indentation/formatting.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@chart/values.schema.json`:
- Around line 1-1487: Schema is out of sync with values.yaml causing CI failure;
regenerate the Helm docs and schema by running the documented task and commit
the updated artifacts. Run task doc:helm to regenerate the chart README and
values.schema.json, then git-add and commit the regenerated README and schema
(the JSON controls top-level keys such as "prometheusRules" and
"secretsExporter" so you can verify changes there) and push the commit to
trigger CI again.
---
Nitpick comments:
In `@chart/templates/prometheusrule.yaml`:
- Line 40: The rendered helper outputs ($extraLabels and $extraAnnotations)
already contain leading indentation because
alertExtraLabels/alertExtraAnnotations is called with nindent 8 when assigned;
the trailing " | nindent 0" is redundant and can confuse readers, so remove the
" | nindent 0" from usages of $extraLabels and $extraAnnotations in
promtheusrule.yaml (locate the template references to the variables named
$extraLabels and $extraAnnotations and delete the pipe to nindent 0), leaving
the pre-indented helper output intact.
In `@test/schema/valid/severities-all-values.yaml`:
- Around line 1-7: Update the fixture file "severities-all-values.yaml" by
adding the new severity fields under the prometheusRules block so the enum
constraints are tested: add sourceErrorsSustainedSeverity,
kubeTransportErrorsSeverity, kubeTransportErrorsSustainedSeverity,
keystorePassphraseFailuresSeverity, certificateNotYetValidSeverity,
certificateCollisionSeverity, and sourceDownSeverity; assign values (info,
warning, critical) across these new fields so that all enum values are exercised
(follow the pattern used by existing keys like sourceErrorsSeverity,
certificateErrorSeverity, certificateExpirationsSeverity) and keep the same YAML
indentation/formatting.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8c0ce5d8-1f8b-4f6f-b25b-bef7644dac6c
⛔ Files ignored due to path filters (1)
chart/README.mdis excluded by!chart/README.md
📒 Files selected for processing (18)
chart/templates/_helpers.tplchart/templates/prometheusrule.yamlchart/values.schema.jsonchart/values.yamlcmd/x509-certificate-exporter/main.godocs/examples/secrets-tuned.values.yamldocs/metrics.mdpkg/cert/reason.gopkg/registry/collector.gopkg/registry/registry.gopkg/registry/registry_test.gopkg/source/k8s/k8s.gopkg/source/k8s/k8s_passphrase_test.gotest/schema/invalid/prometheusrules-readerrorsseverity-typo.expect.txttest/schema/invalid/prometheusrules-readerrorsseverity-typo.yamltest/schema/invalid/prometheusrules-sourceerrorsseverity-typo.expect.txttest/schema/invalid/prometheusrules-sourceerrorsseverity-typo.yamltest/schema/valid/severities-all-values.yaml
💤 Files with no reviewable changes (2)
- test/schema/invalid/prometheusrules-readerrorsseverity-typo.expect.txt
- test/schema/invalid/prometheusrules-readerrorsseverity-typo.yaml
✅ Files skipped from review due to trivial changes (2)
- pkg/cert/reason.go
- test/schema/invalid/prometheusrules-sourceerrorsseverity-typo.yaml
…ailures
The Kubernetes source's transport layer had 9 distinct failure sites
that surfaced only as logs: LIST failure (with retry/backoff), WATCH
start failure, WATCH stream Error event, watch flap (close within 5s
of open), and namespace informer sync timeout — once for secrets, once
for configmaps. None had a corresponding metric, so an exporter that
was functionally up but operationally degraded (token rotation drift,
apiserver throttling, intermittent network) was invisible to ops.
Introduce x509_kube_transport_errors_total{source_name,resource,reason}
incremented at each of the 9 sites via a new Recorder interface on
k8s.Options. The interface keeps the source decoupled from
*registry.Registry; nil is valid (the unmetered path keeps the old
log-only behaviour) and tests plug a fake. main.go wires the actual
registry through.
Ship a paired KubeTransportErrors PrometheusRule alert (opt-in via
alertOnKubeTransportErrors, default true). docs/metrics.md documents
the new metric, the five reason codes, and cross-links the alert.
MAJOR CHANGE: PrometheusRule alert names and chart value keys
renamed. Users with AlertManager routing or silencing rules that
reference the old names must update them on upgrade. Migration map:
X509ExporterReadErrors -> SourceErrors (+ SourceErrorsSustained)
alertOnReadErrors -> alertOnSourceErrors
readErrorsSeverity -> sourceErrorsSeverity (+ ...SustainedSeverity)
alertOnPassphraseFailures -> alertOnKeystorePassphraseFailures
passphraseFailuresSeverity -> keystorePassphraseFailuresSeverity
alertOnCertificateErrors -> alertOnCertificateError (singular)
certificateErrorsSeverity -> certificateErrorSeverity (singular)
alertOnCertificateCollisions -> alertOnCertificateCollision (singular)
certificateCollisionsSeverity -> certificateCollisionSeverity
alertOnCRLs -> (removed; CRL alerts always-on like Renewal/Expiration)
Substantive changes alongside the renames:
- SourceErrors gets the same two-band shape as KubeTransportErrors
(warning at >5/15min for 5m, critical for 30m). The previous "fire
on any single error" expression paged on a single malformed PEM,
which was indistinguishable from a real outage.
- SourceErrors and KubeTransportErrors both aggregate over
reason/resource via 'sum without (...)', so the alert routes by
source. The breakdown stays available on the metric for triage.
- CRLNeedsRefresh and CRLStale no longer require alertOnCRLs — same
policy as CertificateRenewal/Expiration. The underlying x509_crl_*
series only exist when a CRL is actually observed, so installs that
don't watch CRLs get nothing.
- CertificateRenewal expression now anchors to (not_after - now) > 0,
so it doesn't keep firing alongside CertificateExpiration once a
cert is past its NotAfter (those two alerts had identical truth
values in the late-life window).
- CertificateExpiration description distinguishes 'expires in X' from
'expired X ago' via humanizeDuration's sign, so the message stays
accurate after the cert is past NotAfter.
- CertificateError, CertificateNotYetValid, CertificateRenewal and
CertificateExpiration each get inline comments explaining their
dependency or always-on status — consistent with the recent
KubeTransportErrors / CertificateCollision additions.
Template hygiene:
- Three helper templates in _helpers.tpl: alertExtraLabels and
alertExtraAnnotations (replace 12 duplicated if-blocks);
alertLocationSuffix (replaces the 5 sites that re-emit the
"{{if $labels.secret_name}}in Kubernetes secret …{{else}}at
location …{{end}}" literal — when we flipped quotes to backticks
recently, 5 edits were required; the helper would have made that
one).
- All alerts stay in a single PrometheusRule group. Groups in
PrometheusRule only matter for sequential recording-rule ordering,
per-group evaluation intervals, or execution isolation — none of
which apply here. AlertManager routes on labels (alertname,
severity), not on group names, so splitting by domain would
fragment without buying anything; worse, it would trap users who
add recording rules via extraAlertGroups into the wrong group.
Category-based routing on the consumer side is straightforward via
alertname regex (Source.*, Cert.*, CRL.*, Kube.*).
db04e3d to
5aa04f9
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
chart/values.yaml (2)
743-915:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRegenerate the chart artifacts for this values change.
CI is already failing because
chart/values.schema.jsonno longer matches thisprometheusRulesblock. Please reruntask doc:helmafter finalizing these values and commit the generated artifacts.As per coding guidelines, "Regenerate chart documentation and schema after touching
chart/values.yaml,chart/templates/**, or anything influencing chart output viatask doc:helm— commit bothchart/README.mdandchart/values.schema.jsonalongside changes."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/values.yaml` around lines 743 - 915, The CI is failing because chart/values.schema.json and chart/README.md are out of sync with the edits to chart/values.yaml (the prometheusRules/alerting blocks); regenerate the Helm docs and schema by running the repo task "task doc:helm" (or the equivalent Helm doc/schema generation command), then commit the updated chart/values.schema.json and chart/README.md alongside your changes to chart/values.yaml so the generated artifacts match the new values.
761-846:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep a compatibility path for the renamed alert values.
This block introduces new
prometheusRuleskeys in place of earlier ones such asalertOnReadErrors/alertOnCertificateErrors. With the chart's strict schema, upgraded releases carrying the old keys will hard-fail unless you keep aliases or explicitly ship this as a breaking values migration.As per coding guidelines, "Flag changes that would break upgrades from a previous chart version (renamed values, removed defaults, changed selectors)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/values.yaml` around lines 761 - 846, The new prometheusRules naming breaks upgrades because older keys like alertOnReadErrors / alertOnCertificateErrors (and the specific keys in this diff such as alertOnSourceErrors, alertOnCertificateError, alertOnKubeTransportErrors, etc.) were renamed; add a compatibility path that prefers the new prometheusRules.* values but falls back to the old top-level keys when the new ones are unset. Implement this by updating the chart templates (where you reference prometheusRules) to check .Values.prometheusRules.<key> ?: .Values.<oldKey> (or equivalent Helm default/fallback logic) for each renamed key, and update the values schema to include the deprecated old keys as allowed (or mark them deprecated) so upgrades do not hard-fail. Ensure you cover all renamed booleans and severity fields shown (sourceErrorsSeverity, certificateErrorSeverity, kubeTransportErrorsSeverity, etc.).chart/values.schema.json (1)
582-777:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThese renamed alert keys are upgrade-breaking under the strict schema.
Because
prometheusRulesisadditionalProperties: false, existing release values using the previous keys will fail validation on upgrade instead of mapping forward. That applies to the renamed alert toggles/severities here and to removed options like the old CRL gate.As per coding guidelines, "Flag changes that would break upgrades from a previous chart version (renamed values, removed defaults, changed selectors)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/values.schema.json` around lines 582 - 777, The schema is strict (prometheusRules additionalProperties: false) so renamed/removed alert keys (e.g., alertOnCertificateCollision, alertOnCertificateError, alertOnCertificateNotYetValid, alertOnKeystorePassphraseFailures, alertOnKubeTransportErrors, alertOnSourceDown, alertOnSourceErrors and their *Severity counterparts like certificateCollisionSeverity, certificateErrorSeverity, certificateExpirationsSeverity, etc.) will break upgrades; to fix, update the JSON schema to accept legacy names by either adding the old property names back as deprecated entries with the same types/defaults (so existing values still validate) or relax prometheusRules (allow additionalProperties) and add migration/validation guidance, ensuring the legacy keys are referenced in the schema alongside the new keys so upgrades don’t fail.
🧹 Nitpick comments (1)
pkg/registry/registry_test.go (1)
357-369: ⚡ Quick winConvert this registry test to table-driven form.
This case is likely to grow with new
(source, resource, reason)combinations; table-driven structure will keep additions safer and less repetitive.Proposed refactor
func TestMarkTransportError(t *testing.T) { r := New(Config{}, nopLogger()) - // Two distinct (resource, reason) pairs land on separate series. - r.MarkTransportError("kube-secrets", "secrets", cert.ReasonWatchFlapped) - r.MarkTransportError("kube-secrets", "secrets", cert.ReasonWatchFlapped) - r.MarkTransportError("kube-cms", "configmaps", cert.ReasonListFailed) - if got := testutil.ToFloat64(r.transportErrors.WithLabelValues("kube-secrets", "secrets", "watch_flapped")); got != 2 { - t.Errorf("watch_flapped counter = %v, want 2", got) - } - if got := testutil.ToFloat64(r.transportErrors.WithLabelValues("kube-cms", "configmaps", "list_failed")); got != 1 { - t.Errorf("list_failed counter = %v, want 1", got) - } + + inc := []struct { + source string + resource string + reason string + }{ + {"kube-secrets", "secrets", cert.ReasonWatchFlapped}, + {"kube-secrets", "secrets", cert.ReasonWatchFlapped}, + {"kube-cms", "configmaps", cert.ReasonListFailed}, + } + for _, c := range inc { + r.MarkTransportError(c.source, c.resource, c.reason) + } + + asserts := []struct { + source string + resource string + reason string + want float64 + }{ + {"kube-secrets", "secrets", cert.ReasonWatchFlapped, 2}, + {"kube-cms", "configmaps", cert.ReasonListFailed, 1}, + } + for _, a := range asserts { + if got := testutil.ToFloat64(r.transportErrors.WithLabelValues(a.source, a.resource, a.reason)); got != a.want { + t.Errorf("%s/%s/%s = %v, want %v", a.source, a.resource, a.reason, got, a.want) + } + } }As per coding guidelines,
**/*_test.go: Tests use the standard*_test.gocolocated layout. Prefer table-driven tests for the parser/registry packages.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/registry/registry_test.go` around lines 357 - 369, Rewrite TestMarkTransportError into a table-driven form by defining a slice of cases (each with fields like name, source, resource, reason, and expectedCount) and iterating over them to call r.MarkTransportError for each input; after applying all marks, loop over expected cases and assert counts using r.transportErrors.WithLabelValues(...). Use the existing helpers: New(Config{}, nopLogger()), r.MarkTransportError, and testutil.ToFloat64(r.transportErrors.WithLabelValues(...)) to locate and update assertions; ensure the test remains equivalent (two calls for the same (source,resource,reason) should yield count 2) and keep the single registry instance for the aggregated counts as in the original.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@chart/README.md.gotmpl`:
- Around line 232-253: The chart docs/schema are out of sync after changes to
alert docs/values; run the Helm doc generator (task doc:helm) to regenerate
chart/README.md and chart/values.schema.json from the updated templates/values,
then commit both regenerated artifacts alongside your changes (ensure the
generated chart/README.md reflects the updated alert table and that
chart/values.schema.json is updated for any alert-related keys such as
prometheusRules.alertOn<Name> and alertExpr/alertFor overrides).
In `@chart/values.schema.json`:
- Around line 574-580: The current JSON Schema for alertForOverrides uses a
restrictive pattern (^[0-9]+(ms|s|m|h|d|w|y)$) that only allows single
integer+unit durations; update the pattern for
alertForOverrides.additionalProperties to accept Prometheus duration literals by
allowing either (a) a single number that may be integer or fractional followed
by one unit (e.g., 1.5h) or (b) a compound sequence of integer+unit pairs (e.g.,
1h30m); keep the property type as string and preserve additionalProperties usage
so any key under alertForOverrides validates against the new regex.
---
Outside diff comments:
In `@chart/values.schema.json`:
- Around line 582-777: The schema is strict (prometheusRules
additionalProperties: false) so renamed/removed alert keys (e.g.,
alertOnCertificateCollision, alertOnCertificateError,
alertOnCertificateNotYetValid, alertOnKeystorePassphraseFailures,
alertOnKubeTransportErrors, alertOnSourceDown, alertOnSourceErrors and their
*Severity counterparts like certificateCollisionSeverity,
certificateErrorSeverity, certificateExpirationsSeverity, etc.) will break
upgrades; to fix, update the JSON schema to accept legacy names by either adding
the old property names back as deprecated entries with the same types/defaults
(so existing values still validate) or relax prometheusRules (allow
additionalProperties) and add migration/validation guidance, ensuring the legacy
keys are referenced in the schema alongside the new keys so upgrades don’t fail.
In `@chart/values.yaml`:
- Around line 743-915: The CI is failing because chart/values.schema.json and
chart/README.md are out of sync with the edits to chart/values.yaml (the
prometheusRules/alerting blocks); regenerate the Helm docs and schema by running
the repo task "task doc:helm" (or the equivalent Helm doc/schema generation
command), then commit the updated chart/values.schema.json and chart/README.md
alongside your changes to chart/values.yaml so the generated artifacts match the
new values.
- Around line 761-846: The new prometheusRules naming breaks upgrades because
older keys like alertOnReadErrors / alertOnCertificateErrors (and the specific
keys in this diff such as alertOnSourceErrors, alertOnCertificateError,
alertOnKubeTransportErrors, etc.) were renamed; add a compatibility path that
prefers the new prometheusRules.* values but falls back to the old top-level
keys when the new ones are unset. Implement this by updating the chart templates
(where you reference prometheusRules) to check .Values.prometheusRules.<key> ?:
.Values.<oldKey> (or equivalent Helm default/fallback logic) for each renamed
key, and update the values schema to include the deprecated old keys as allowed
(or mark them deprecated) so upgrades do not hard-fail. Ensure you cover all
renamed booleans and severity fields shown (sourceErrorsSeverity,
certificateErrorSeverity, kubeTransportErrorsSeverity, etc.).
---
Nitpick comments:
In `@pkg/registry/registry_test.go`:
- Around line 357-369: Rewrite TestMarkTransportError into a table-driven form
by defining a slice of cases (each with fields like name, source, resource,
reason, and expectedCount) and iterating over them to call r.MarkTransportError
for each input; after applying all marks, loop over expected cases and assert
counts using r.transportErrors.WithLabelValues(...). Use the existing helpers:
New(Config{}, nopLogger()), r.MarkTransportError, and
testutil.ToFloat64(r.transportErrors.WithLabelValues(...)) to locate and update
assertions; ensure the test remains equivalent (two calls for the same
(source,resource,reason) should yield count 2) and keep the single registry
instance for the aggregated counts as in the original.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 758d3ee5-9962-4fff-8228-28234986584d
⛔ Files ignored due to path filters (1)
chart/README.mdis excluded by!chart/README.md
📒 Files selected for processing (25)
chart/README.md.gotmplchart/templates/_helpers.tplchart/templates/prometheusrule.yamlchart/values.schema.jsonchart/values.yamlcmd/x509-certificate-exporter/main.godagger/render_alignment.godocs/examples/secrets-tuned.values.yamldocs/metrics.mdpkg/cert/reason.gopkg/registry/collector.gopkg/registry/registry.gopkg/registry/registry_test.gopkg/source/k8s/k8s.gopkg/source/k8s/k8s_passphrase_test.gotest/render/prometheusrules-default-alerts.expect-pass.txttest/render/prometheusrules-default-alerts.yamltest/schema/invalid/prometheusrules-alertforoverrides-bad-duration.expect.txttest/schema/invalid/prometheusrules-alertforoverrides-bad-duration.yamltest/schema/invalid/prometheusrules-readerrorsseverity-typo.expect.txttest/schema/invalid/prometheusrules-readerrorsseverity-typo.yamltest/schema/invalid/prometheusrules-sourceerrorsseverity-typo.expect.txttest/schema/invalid/prometheusrules-sourceerrorsseverity-typo.yamltest/schema/valid/prometheusrules-alert-overrides.yamltest/schema/valid/severities-all-values.yaml
💤 Files with no reviewable changes (2)
- test/schema/invalid/prometheusrules-readerrorsseverity-typo.expect.txt
- test/schema/invalid/prometheusrules-readerrorsseverity-typo.yaml
✅ Files skipped from review due to trivial changes (4)
- test/schema/invalid/prometheusrules-alertforoverrides-bad-duration.expect.txt
- test/render/prometheusrules-default-alerts.expect-pass.txt
- test/schema/invalid/prometheusrules-sourceerrorsseverity-typo.expect.txt
- docs/metrics.md
Two new value maps under `prometheusRules`: - `alertExprOverrides` — alertName -> custom Prometheus expression that replaces the default `expr:`. Closes #253: multi-cluster setups can inject `max by (cluster, …)` aggregations or label filters without having to disableBuiltinAlertGroup and reconstruct everything. - `alertForOverrides` — alertName -> `for:` duration that replaces the default. Schema-validated against Prometheus's duration syntax. Both maps key on bare alert names (no `rulePrefix`). Overrides are taken whole — no merge with the default. Single-line and multi-line strings both render correctly: the template now emits each `expr:` via `| quote` which wraps in a YAML double-quoted scalar (newlines serialise to \n; Prometheus parses the resulting string identically to the multi-line form). Test ratchets added alongside: - test/schema/valid/prometheusrules-alert-overrides.yaml — valid case covering both maps, with a multi-line expr to exercise the quoting path. - test/schema/invalid/prometheusrules-alertforoverrides-bad-duration — paired with .expect.txt, ensures the duration regex rejects natural-language inputs ("30 minutes"). - test/render/prometheusrules-default-alerts.{yaml,expect-pass.txt} — new ratchet under the existing TestHelmRender machinery (extended to support "positive substring" checks via .expect-pass.txt). Locks in the exact 13 alert names the chart ships by default: any rename or accidental removal breaks the test at PR time. Docs: - chart/README.md.gotmpl alert table refreshed — was still on the v3 layout (X509ExporterReadErrors + 3 cert alerts). Now lists all 13. - values.yaml gets the two new map keys with descriptions and commented examples.
5aa04f9 to
71314f6
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
chart/values.yaml (1)
848-852:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftKeep backward compatibility for CRL alert toggle behavior.
Making CRL alerts unconditional removes prior opt-out behavior and can start paging immediately after upgrade for users that had CRL alerting disabled. Keep a deprecated compatibility key (or enforce a clearly documented breaking-version migration path) before merging this behavior change.
As per coding guidelines, "Flag changes that would break upgrades from a previous chart version (renamed values, removed defaults, changed selectors)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/values.yaml` around lines 848 - 852, The CRL alert change must preserve the prior opt-out toggle: add a deprecated compatibility key (for example crl.alerts.enabled or crl.lifecycleAlerts.enabled) and make the alerting logic read that key first (if defined) to preserve existing installs, otherwise fall back to the new unconditional behavior; update the values.yaml comment/@schema block near the "CRL lifecycle alerts" and the x509_crl_* section to document the deprecated key and migration, and ensure the Helm schema (if present) marks the old key as deprecated so upgrades retain previous opt-out behavior unless users explicitly migrate.chart/values.schema.json (1)
1-3:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRegenerate Helm artifacts; schema is currently stale.
CI already reports
chart/values.schema.jsonis out of date relative tochart/values.yaml. Runtask doc:helmand commit the regeneratedchart/values.schema.json(andchart/README.md) in this PR.As per coding guidelines, "Regenerate chart documentation and schema after touching
chart/values.yaml,chart/templates/**, or anything influencing chart output viatask doc:helm— commit bothchart/README.mdandchart/values.schema.jsonalongside changes."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@chart/values.schema.json` around lines 1 - 3, The Helm values JSON schema (chart/values.schema.json) is stale relative to chart/values.yaml; run the documentation generation task (task doc:helm) to regenerate chart/values.schema.json and chart/README.md, verify the updated files reflect the current chart/values.yaml and templates, and commit both regenerated files (chart/values.schema.json and chart/README.md) to this PR so CI stops flagging the mismatch.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@chart/values.schema.json`:
- Around line 1-3: The Helm values JSON schema (chart/values.schema.json) is
stale relative to chart/values.yaml; run the documentation generation task (task
doc:helm) to regenerate chart/values.schema.json and chart/README.md, verify the
updated files reflect the current chart/values.yaml and templates, and commit
both regenerated files (chart/values.schema.json and chart/README.md) to this PR
so CI stops flagging the mismatch.
In `@chart/values.yaml`:
- Around line 848-852: The CRL alert change must preserve the prior opt-out
toggle: add a deprecated compatibility key (for example crl.alerts.enabled or
crl.lifecycleAlerts.enabled) and make the alerting logic read that key first (if
defined) to preserve existing installs, otherwise fall back to the new
unconditional behavior; update the values.yaml comment/@schema block near the
"CRL lifecycle alerts" and the x509_crl_* section to document the deprecated key
and migration, and ensure the Helm schema (if present) marks the old key as
deprecated so upgrades retain previous opt-out behavior unless users explicitly
migrate.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f459fe06-692f-45fa-864e-d223892ff98d
⛔ Files ignored due to path filters (1)
chart/README.mdis excluded by!chart/README.md
📒 Files selected for processing (10)
chart/README.md.gotmplchart/templates/prometheusrule.yamlchart/values.schema.jsonchart/values.yamldagger/render_alignment.gotest/render/prometheusrules-default-alerts.expect-pass.txttest/render/prometheusrules-default-alerts.yamltest/schema/invalid/prometheusrules-alertforoverrides-bad-duration.expect.txttest/schema/invalid/prometheusrules-alertforoverrides-bad-duration.yamltest/schema/valid/prometheusrules-alert-overrides.yaml
✅ Files skipped from review due to trivial changes (5)
- test/schema/invalid/prometheusrules-alertforoverrides-bad-duration.yaml
- test/render/prometheusrules-default-alerts.yaml
- test/schema/valid/prometheusrules-alert-overrides.yaml
- test/schema/invalid/prometheusrules-alertforoverrides-bad-duration.expect.txt
- chart/README.md.gotmpl
Summary by CodeRabbit
New Features
Improvements
Tests