Skip to content

Fix Prometheus alerting gaps in v4#614

Merged
npdgm merged 7 commits into
mainfrom
feat/prom-alerting-gaps
May 25, 2026
Merged

Fix Prometheus alerting gaps in v4#614
npdgm merged 7 commits into
mainfrom
feat/prom-alerting-gaps

Conversation

@npdgm

@npdgm npdgm commented May 25, 2026

Copy link
Copy Markdown
Member

Summary by CodeRabbit

  • New Features

    • New built-in alerts: source-down, keystore passphrase failures, certificates not-yet-valid, certificate collisions, Kubernetes transport errors, plus sustained variants
    • Per-alert override support for expressions and durations
    • New metric reporting certificates actually dropped due to collisions; transport-error metrics per source/resource/reason
  • Improvements

    • Grafana panels updated to use the corrected exporter error metric and refreshed queries
    • Helm values/schema reworked with new toggles, severities, labels, and CRL behavior
  • Tests

    • Added/updated tests for collision/dropped counters, transport-error recording, and chart rendering expectations

Review Change Stack

Closes #400. AlertManager descriptions wrapped subject_CN, issuer_CN,
secret_namespace/secret_name and filepath in double quotes, which
broke parsing when the description was reinjected into JSON-bearing
webhook payloads (Teams Adaptive Cards, custom receivers).

Switch to backticks: no JSON escaping required, Slack/Teams/Discord
render them as inline code spans, plain-text destinations still show
the value clearly. The surrounding Helm template machinery is
unchanged — only the literal quote characters around the placeholders
flip from " to `.
@coderabbitai

coderabbitai Bot commented May 25, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Added collision-dropped metric tracking and new health alerts. The exporter now records certificate data loss from collision conflicts, added alerts for source down and certificate validity, introduced Helm toggles for alert management, and updated Grafana and documentation to reflect new metrics and alert rules.

Changes

Collision metric instrumentation and source health alerting

Layer / File(s) Summary
Collision-dropped metric instrumentation
pkg/registry/registry.go, pkg/registry/collector.go
Registry struct gains collisionDropped counter vector field initialized in initSelfMetrics. Prometheus Describe and Collect methods expose the metric. CollisionNever handler now records dropped certificate count.
Collision metric tests
pkg/registry/registry_test.go
Tests assert collisionTotal increases on collisions and collisionDropped is 0 for auto resolution and >= 1 for CollisionNever drops.
Kubernetes source recorder and wiring
pkg/source/k8s/k8s.go, pkg/source/k8s/k8s_passphrase_test.go, cmd/x509-certificate-exporter/main.go
Adds Recorder interface and Options.Recorder, nil-safe recordTransportErr helper, records transport failures from namespace/Secret/ConfigMap LIST and WATCH error paths, adds unit tests with recordingRecorder, and passes reg as Recorder when constructing the k8s source.
PrometheusRule templates and helpers
chart/templates/_helpers.tpl, chart/templates/prometheusrule.yaml
Adds alert helper templates, replaces read-errors alert with source-errors (increase(x509_source_errors_total[15m]) > 5) and sustained variant, adds SourceDown, KubeTransportErrors/sustained, KeystorePassphraseFailures, CertificateNotYetValid, CertificateCollision alerts, and updates certificate/CRL annotation formatting to use backticks and shared location text.
Helm values schema and defaults
chart/values.schema.json, chart/values.yaml, docs/examples/secrets-tuned.values.yaml
Introduces new alertOn* toggles and severity enums (sourceErrorsSeverity, sourceErrorsSustainedSeverity, sourceDownSeverity, kubeTransportErrorsSustainedSeverity, etc.), updates required fields, and updates example extraAlertGroups to use x509_source_errors_total.
Grafana and metrics docs
chart/grafana-dashboards/x509-certificate-exporter.json, docs/metrics.md
Grafana panels (8 panels) now use x509_source_errors_total in place of x509_read_errors. Documentation adds x509_kube_transport_errors_total and x509_cert_collision_dropped_total, and clarifies x509_cert_collision_total semantics.
Schema tests and fixtures
test/schema/invalid/*, test/schema/valid/*
Updated/added schema test fixtures and expectation files to reflect renamed severity keys (sourceErrorsSeverity, certificateErrorSeverity).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Fix Prometheus alerting gaps in v4' is vague and generic, using non-descriptive terms like 'gaps' that don't convey meaningful information about the specific changes. Consider a more specific title that describes the main change, such as 'Replace read-errors alerts with source-errors and transport-errors model' or 'Restructure Prometheus alerts for source and transport error detection'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/prom-alerting-gaps

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added go Pull requests that update Go code chart docs labels May 25, 2026
@npdgm npdgm self-assigned this May 25, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@chart/templates/prometheusrule.yaml`:
- Line 24: The alert uses delta(x509_source_errors_total[15m]) which is
reset-sensitive; replace delta(...) with increase(x509_source_errors_total[15m])
in the PrometheusRule expression so the counter alert uses increase() over the
15m window (keep the same metric name x509_source_errors_total and time range)
to avoid under-reporting after restarts.

In `@chart/values.yaml`:
- Around line 750-752: The new boolean toggles alertOnSourceDown,
alertOnPassphraseFailures, alertOnCertificateNotYetValid, and
alertOnCertificateCollisions are missing per-field # `@schema` annotations; add a
# `@schema` block immediately above each field (alertOnSourceDown,
alertOnPassphraseFailures, alertOnCertificateNotYetValid,
alertOnCertificateCollisions) declaring the field as a boolean (with an explicit
default where appropriate) and include a short description and strict
constraints for chart-controlled values while allowing permissive rules for any
Kubernetes pass-through fields per the chart guidelines. Ensure each block
accompanies its respective YAML key so the chart schema generator and linters
pick up the type and metadata.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b14c89f4-1c5d-437c-ac82-5312ba08ef54

📥 Commits

Reviewing files that changed from the base of the PR and between 305a284 and d6eb7b9.

⛔ Files ignored due to path filters (1)
  • chart/README.md is excluded by !chart/README.md
📒 Files selected for processing (8)
  • chart/grafana-dashboards/x509-certificate-exporter.json
  • chart/templates/prometheusrule.yaml
  • chart/values.schema.json
  • chart/values.yaml
  • docs/metrics.md
  • pkg/registry/collector.go
  • pkg/registry/registry.go
  • pkg/registry/registry_test.go

Comment thread chart/templates/prometheusrule.yaml Outdated
Comment thread chart/values.yaml
npdgm added 2 commits May 25, 2026 14:48
…se increase()

The X509ExporterReadErrors PrometheusRule alert matched x509_read_errors,
the v3 metric name. The v4 exporter emits x509_source_errors_total
instead — the alert never fired since the v4 rewrite. Same stale name
lingered in the values.yaml extraAlertGroups example and across 8
Grafana dashboard panels. All three sites renamed in lockstep.

Switch the alert's range function from delta() to increase() while
we're here. delta() is documented for gauges; for monotonic counters
increase() is the canonical choice and correctly handles counter
resets across exporter restarts.
…on alerts

Four new opt-in alerts close observability holes the existing rules
left uncovered:

- SourceDown (critical) — x509_source_up == 0 for 5m. Catches RBAC
  failures, persistent K8s API errors, unreadable file paths. A source
  that never converges would otherwise be silent while certs it should
  watch are not checked.
- KeystorePassphraseFailures (warning) — increase on
  x509_{pkcs12,jks}_passphrase_failures_total. A misconfigured
  passphraseKey / passphraseSecretRef previously only leaked into
  logs and an unwatched counter.
- CertificateNotYetValid (warning) — x509_cert_not_before > time().
  Depends on exposeNotBeforeMetric: true; gated for explicit opt-in.
- CertificateCollision (warning) — increase on x509_cert_collision_total.
  Two certs sharing a label set means one is silently invisible.

Each alert ships with an alertOnXxx toggle (default true) and its own
severity in values.yaml. Schema + README regenerated. helm lint, helm
template, schema fixtures and helm-examples all pass.
@npdgm npdgm force-pushed the feat/prom-alerting-gaps branch from d6eb7b9 to 5f335eb Compare May 25, 2026 12:51

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
chart/values.schema.json (1)

1-1460: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Regenerate Helm artifacts to resolve schema drift.

CI already shows chart/values.schema.json is stale relative to chart/values.yaml; please run task doc:helm and commit the regenerated chart/values.schema.json (and chart/README.md) in this PR.

As per coding guidelines, "Regenerate chart documentation and schema after touching chart/values.yaml, chart/templates/**, or anything influencing chart output via task doc:helm — commit both chart/README.md and chart/values.schema.json alongside changes."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/values.schema.json` around lines 1 - 1460, The values.schema.json is
stale after changes to chart inputs; regenerate Helm docs and schema by running
the helm doc task and commit the updated artifacts. Run the project task "task
doc:helm" (or the equivalent make/npm script) to regenerate
chart/values.schema.json and chart/README.md, verify changes against
chart/values.yaml, and add/commit both regenerated files to the PR so CI schema
checks pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@chart/values.schema.json`:
- Around line 1-1460: The values.schema.json is stale after changes to chart
inputs; regenerate Helm docs and schema by running the helm doc task and commit
the updated artifacts. Run the project task "task doc:helm" (or the equivalent
make/npm script) to regenerate chart/values.schema.json and chart/README.md,
verify changes against chart/values.yaml, and add/commit both regenerated files
to the PR so CI schema checks pass.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01eb8446-80bf-495d-8353-16670583678b

📥 Commits

Reviewing files that changed from the base of the PR and between d6eb7b9 and 5f335eb.

⛔ Files ignored due to path filters (1)
  • chart/README.md is excluded by !chart/README.md
📒 Files selected for processing (8)
  • chart/grafana-dashboards/x509-certificate-exporter.json
  • chart/templates/prometheusrule.yaml
  • chart/values.schema.json
  • chart/values.yaml
  • docs/metrics.md
  • pkg/registry/collector.go
  • pkg/registry/registry.go
  • pkg/registry/registry_test.go
✅ Files skipped from review due to trivial changes (2)
  • docs/metrics.md
  • chart/grafana-dashboards/x509-certificate-exporter.json

Detect-only x509_cert_collision_total ticks on every scrape under the
default CollisionAuto policy, even when the discriminator label
silently resolves the overlap and no certificate is dropped. Alerting
on it produced perma-firing pages for a benign config quirk.

Introduce x509_cert_collision_dropped_total: counts only the items
the registry actually threw away (CollisionNever policy). The
CertificateCollision PrometheusRule now points at the new counter so
the alert fires only when data is genuinely lost. The detect-only
counter stays available for dashboards and diagnostics.

Tests in pkg/registry cement the contract: dropped stays at 0 under
CollisionAuto and is positive under CollisionNever. docs/metrics.md
documents both counters and cross-links them.
@npdgm npdgm force-pushed the feat/prom-alerting-gaps branch 3 times, most recently from 9e63fa6 to db04e3d Compare May 25, 2026 14:00

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
chart/values.schema.json (1)

1-1487: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pipeline failure: schema out of date — run task doc:helm.

CI reports chart/values.schema.json is out of sync with chart/values.yaml. Per coding guidelines, regenerate the schema and commit both chart/README.md and chart/values.schema.json.

task doc:helm
git add chart/README.md chart/values.schema.json
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/values.schema.json` around lines 1 - 1487, Schema is out of sync with
values.yaml causing CI failure; regenerate the Helm docs and schema by running
the documented task and commit the updated artifacts. Run task doc:helm to
regenerate the chart README and values.schema.json, then git-add and commit the
regenerated README and schema (the JSON controls top-level keys such as
"prometheusRules" and "secretsExporter" so you can verify changes there) and
push the commit to trigger CI again.
🧹 Nitpick comments (2)
chart/templates/prometheusrule.yaml (1)

40-40: 💤 Low value

nindent 0 after helper's nindent 8 produces incorrect indentation.

The alertExtraLabels helper already applies nindent 8. Calling $extraLabels | nindent 0 here resets to column 0, but you're not piping—you're just emitting the pre-indented output. The helper returns content starting with newline + 8 spaces, which should land correctly at this position. However, if no extra labels exist, the helper returns empty string and nindent 0 is a no-op.

Actually, re-reading: $extraLabels is assigned at line 3 and contains the result of the include (already indented). The | nindent 0 on an already-rendered string doesn't re-indent—it's effectively dead code. Consider removing the redundant | nindent 0 for clarity:

-        {{- $extraLabels | nindent 0 }}
+        {{- $extraLabels }}

Same applies to all $extraLabels | nindent 0 and $extraAnnotations | nindent 0 occurrences.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/templates/prometheusrule.yaml` at line 40, The rendered helper outputs
($extraLabels and $extraAnnotations) already contain leading indentation because
alertExtraLabels/alertExtraAnnotations is called with nindent 8 when assigned;
the trailing " | nindent 0" is redundant and can confuse readers, so remove the
" | nindent 0" from usages of $extraLabels and $extraAnnotations in
promtheusrule.yaml (locate the template references to the variables named
$extraLabels and $extraAnnotations and delete the pipe to nindent 0), leaving
the pre-indented helper output intact.
test/schema/valid/severities-all-values.yaml (1)

1-7: 💤 Low value

Consider extending fixture to cover all new severity fields.

The fixture validates some severity enums but doesn't cover the new fields added in this PR:

  • sourceErrorsSustainedSeverity
  • kubeTransportErrorsSeverity
  • kubeTransportErrorsSustainedSeverity
  • keystorePassphraseFailuresSeverity
  • certificateNotYetValidSeverity
  • certificateCollisionSeverity
  • sourceDownSeverity

Adding them ensures the enum constraints are exercised.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/schema/valid/severities-all-values.yaml` around lines 1 - 7, Update the
fixture file "severities-all-values.yaml" by adding the new severity fields
under the prometheusRules block so the enum constraints are tested: add
sourceErrorsSustainedSeverity, kubeTransportErrorsSeverity,
kubeTransportErrorsSustainedSeverity, keystorePassphraseFailuresSeverity,
certificateNotYetValidSeverity, certificateCollisionSeverity, and
sourceDownSeverity; assign values (info, warning, critical) across these new
fields so that all enum values are exercised (follow the pattern used by
existing keys like sourceErrorsSeverity, certificateErrorSeverity,
certificateExpirationsSeverity) and keep the same YAML indentation/formatting.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@chart/values.schema.json`:
- Around line 1-1487: Schema is out of sync with values.yaml causing CI failure;
regenerate the Helm docs and schema by running the documented task and commit
the updated artifacts. Run task doc:helm to regenerate the chart README and
values.schema.json, then git-add and commit the regenerated README and schema
(the JSON controls top-level keys such as "prometheusRules" and
"secretsExporter" so you can verify changes there) and push the commit to
trigger CI again.

---

Nitpick comments:
In `@chart/templates/prometheusrule.yaml`:
- Line 40: The rendered helper outputs ($extraLabels and $extraAnnotations)
already contain leading indentation because
alertExtraLabels/alertExtraAnnotations is called with nindent 8 when assigned;
the trailing " | nindent 0" is redundant and can confuse readers, so remove the
" | nindent 0" from usages of $extraLabels and $extraAnnotations in
promtheusrule.yaml (locate the template references to the variables named
$extraLabels and $extraAnnotations and delete the pipe to nindent 0), leaving
the pre-indented helper output intact.

In `@test/schema/valid/severities-all-values.yaml`:
- Around line 1-7: Update the fixture file "severities-all-values.yaml" by
adding the new severity fields under the prometheusRules block so the enum
constraints are tested: add sourceErrorsSustainedSeverity,
kubeTransportErrorsSeverity, kubeTransportErrorsSustainedSeverity,
keystorePassphraseFailuresSeverity, certificateNotYetValidSeverity,
certificateCollisionSeverity, and sourceDownSeverity; assign values (info,
warning, critical) across these new fields so that all enum values are exercised
(follow the pattern used by existing keys like sourceErrorsSeverity,
certificateErrorSeverity, certificateExpirationsSeverity) and keep the same YAML
indentation/formatting.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8c0ce5d8-1f8b-4f6f-b25b-bef7644dac6c

📥 Commits

Reviewing files that changed from the base of the PR and between 5f335eb and db04e3d.

⛔ Files ignored due to path filters (1)
  • chart/README.md is excluded by !chart/README.md
📒 Files selected for processing (18)
  • chart/templates/_helpers.tpl
  • chart/templates/prometheusrule.yaml
  • chart/values.schema.json
  • chart/values.yaml
  • cmd/x509-certificate-exporter/main.go
  • docs/examples/secrets-tuned.values.yaml
  • docs/metrics.md
  • pkg/cert/reason.go
  • pkg/registry/collector.go
  • pkg/registry/registry.go
  • pkg/registry/registry_test.go
  • pkg/source/k8s/k8s.go
  • pkg/source/k8s/k8s_passphrase_test.go
  • test/schema/invalid/prometheusrules-readerrorsseverity-typo.expect.txt
  • test/schema/invalid/prometheusrules-readerrorsseverity-typo.yaml
  • test/schema/invalid/prometheusrules-sourceerrorsseverity-typo.expect.txt
  • test/schema/invalid/prometheusrules-sourceerrorsseverity-typo.yaml
  • test/schema/valid/severities-all-values.yaml
💤 Files with no reviewable changes (2)
  • test/schema/invalid/prometheusrules-readerrorsseverity-typo.expect.txt
  • test/schema/invalid/prometheusrules-readerrorsseverity-typo.yaml
✅ Files skipped from review due to trivial changes (2)
  • pkg/cert/reason.go
  • test/schema/invalid/prometheusrules-sourceerrorsseverity-typo.yaml

npdgm added 2 commits May 25, 2026 18:48
…ailures

The Kubernetes source's transport layer had 9 distinct failure sites
that surfaced only as logs: LIST failure (with retry/backoff), WATCH
start failure, WATCH stream Error event, watch flap (close within 5s
of open), and namespace informer sync timeout — once for secrets, once
for configmaps. None had a corresponding metric, so an exporter that
was functionally up but operationally degraded (token rotation drift,
apiserver throttling, intermittent network) was invisible to ops.

Introduce x509_kube_transport_errors_total{source_name,resource,reason}
incremented at each of the 9 sites via a new Recorder interface on
k8s.Options. The interface keeps the source decoupled from
*registry.Registry; nil is valid (the unmetered path keeps the old
log-only behaviour) and tests plug a fake. main.go wires the actual
registry through.

Ship a paired KubeTransportErrors PrometheusRule alert (opt-in via
alertOnKubeTransportErrors, default true). docs/metrics.md documents
the new metric, the five reason codes, and cross-links the alert.
MAJOR CHANGE: PrometheusRule alert names and chart value keys
renamed. Users with AlertManager routing or silencing rules that
reference the old names must update them on upgrade. Migration map:

  X509ExporterReadErrors     -> SourceErrors (+ SourceErrorsSustained)
  alertOnReadErrors          -> alertOnSourceErrors
  readErrorsSeverity         -> sourceErrorsSeverity (+ ...SustainedSeverity)
  alertOnPassphraseFailures  -> alertOnKeystorePassphraseFailures
  passphraseFailuresSeverity -> keystorePassphraseFailuresSeverity
  alertOnCertificateErrors   -> alertOnCertificateError       (singular)
  certificateErrorsSeverity  -> certificateErrorSeverity      (singular)
  alertOnCertificateCollisions -> alertOnCertificateCollision (singular)
  certificateCollisionsSeverity -> certificateCollisionSeverity
  alertOnCRLs                -> (removed; CRL alerts always-on like Renewal/Expiration)

Substantive changes alongside the renames:

- SourceErrors gets the same two-band shape as KubeTransportErrors
  (warning at >5/15min for 5m, critical for 30m). The previous "fire
  on any single error" expression paged on a single malformed PEM,
  which was indistinguishable from a real outage.
- SourceErrors and KubeTransportErrors both aggregate over
  reason/resource via 'sum without (...)', so the alert routes by
  source. The breakdown stays available on the metric for triage.
- CRLNeedsRefresh and CRLStale no longer require alertOnCRLs — same
  policy as CertificateRenewal/Expiration. The underlying x509_crl_*
  series only exist when a CRL is actually observed, so installs that
  don't watch CRLs get nothing.
- CertificateRenewal expression now anchors to (not_after - now) > 0,
  so it doesn't keep firing alongside CertificateExpiration once a
  cert is past its NotAfter (those two alerts had identical truth
  values in the late-life window).
- CertificateExpiration description distinguishes 'expires in X' from
  'expired X ago' via humanizeDuration's sign, so the message stays
  accurate after the cert is past NotAfter.
- CertificateError, CertificateNotYetValid, CertificateRenewal and
  CertificateExpiration each get inline comments explaining their
  dependency or always-on status — consistent with the recent
  KubeTransportErrors / CertificateCollision additions.

Template hygiene:

- Three helper templates in _helpers.tpl: alertExtraLabels and
  alertExtraAnnotations (replace 12 duplicated if-blocks);
  alertLocationSuffix (replaces the 5 sites that re-emit the
  "{{if $labels.secret_name}}in Kubernetes secret …{{else}}at
  location …{{end}}" literal — when we flipped quotes to backticks
  recently, 5 edits were required; the helper would have made that
  one).
- All alerts stay in a single PrometheusRule group. Groups in
  PrometheusRule only matter for sequential recording-rule ordering,
  per-group evaluation intervals, or execution isolation — none of
  which apply here. AlertManager routes on labels (alertname,
  severity), not on group names, so splitting by domain would
  fragment without buying anything; worse, it would trap users who
  add recording rules via extraAlertGroups into the wrong group.
  Category-based routing on the consumer side is straightforward via
  alertname regex (Source.*, Cert.*, CRL.*, Kube.*).
@npdgm npdgm force-pushed the feat/prom-alerting-gaps branch from db04e3d to 5aa04f9 Compare May 25, 2026 16:48

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
chart/values.yaml (2)

743-915: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Regenerate the chart artifacts for this values change.

CI is already failing because chart/values.schema.json no longer matches this prometheusRules block. Please rerun task doc:helm after finalizing these values and commit the generated artifacts.

As per coding guidelines, "Regenerate chart documentation and schema after touching chart/values.yaml, chart/templates/**, or anything influencing chart output via task doc:helm — commit both chart/README.md and chart/values.schema.json alongside changes."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/values.yaml` around lines 743 - 915, The CI is failing because
chart/values.schema.json and chart/README.md are out of sync with the edits to
chart/values.yaml (the prometheusRules/alerting blocks); regenerate the Helm
docs and schema by running the repo task "task doc:helm" (or the equivalent Helm
doc/schema generation command), then commit the updated chart/values.schema.json
and chart/README.md alongside your changes to chart/values.yaml so the generated
artifacts match the new values.

761-846: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep a compatibility path for the renamed alert values.

This block introduces new prometheusRules keys in place of earlier ones such as alertOnReadErrors / alertOnCertificateErrors. With the chart's strict schema, upgraded releases carrying the old keys will hard-fail unless you keep aliases or explicitly ship this as a breaking values migration.

As per coding guidelines, "Flag changes that would break upgrades from a previous chart version (renamed values, removed defaults, changed selectors)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/values.yaml` around lines 761 - 846, The new prometheusRules naming
breaks upgrades because older keys like alertOnReadErrors /
alertOnCertificateErrors (and the specific keys in this diff such as
alertOnSourceErrors, alertOnCertificateError, alertOnKubeTransportErrors, etc.)
were renamed; add a compatibility path that prefers the new prometheusRules.*
values but falls back to the old top-level keys when the new ones are unset.
Implement this by updating the chart templates (where you reference
prometheusRules) to check .Values.prometheusRules.<key> ?: .Values.<oldKey> (or
equivalent Helm default/fallback logic) for each renamed key, and update the
values schema to include the deprecated old keys as allowed (or mark them
deprecated) so upgrades do not hard-fail. Ensure you cover all renamed booleans
and severity fields shown (sourceErrorsSeverity, certificateErrorSeverity,
kubeTransportErrorsSeverity, etc.).
chart/values.schema.json (1)

582-777: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These renamed alert keys are upgrade-breaking under the strict schema.

Because prometheusRules is additionalProperties: false, existing release values using the previous keys will fail validation on upgrade instead of mapping forward. That applies to the renamed alert toggles/severities here and to removed options like the old CRL gate.

As per coding guidelines, "Flag changes that would break upgrades from a previous chart version (renamed values, removed defaults, changed selectors)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/values.schema.json` around lines 582 - 777, The schema is strict
(prometheusRules additionalProperties: false) so renamed/removed alert keys
(e.g., alertOnCertificateCollision, alertOnCertificateError,
alertOnCertificateNotYetValid, alertOnKeystorePassphraseFailures,
alertOnKubeTransportErrors, alertOnSourceDown, alertOnSourceErrors and their
*Severity counterparts like certificateCollisionSeverity,
certificateErrorSeverity, certificateExpirationsSeverity, etc.) will break
upgrades; to fix, update the JSON schema to accept legacy names by either adding
the old property names back as deprecated entries with the same types/defaults
(so existing values still validate) or relax prometheusRules (allow
additionalProperties) and add migration/validation guidance, ensuring the legacy
keys are referenced in the schema alongside the new keys so upgrades don’t fail.
🧹 Nitpick comments (1)
pkg/registry/registry_test.go (1)

357-369: ⚡ Quick win

Convert this registry test to table-driven form.

This case is likely to grow with new (source, resource, reason) combinations; table-driven structure will keep additions safer and less repetitive.

Proposed refactor
 func TestMarkTransportError(t *testing.T) {
 	r := New(Config{}, nopLogger())
-	// Two distinct (resource, reason) pairs land on separate series.
-	r.MarkTransportError("kube-secrets", "secrets", cert.ReasonWatchFlapped)
-	r.MarkTransportError("kube-secrets", "secrets", cert.ReasonWatchFlapped)
-	r.MarkTransportError("kube-cms", "configmaps", cert.ReasonListFailed)
-	if got := testutil.ToFloat64(r.transportErrors.WithLabelValues("kube-secrets", "secrets", "watch_flapped")); got != 2 {
-		t.Errorf("watch_flapped counter = %v, want 2", got)
-	}
-	if got := testutil.ToFloat64(r.transportErrors.WithLabelValues("kube-cms", "configmaps", "list_failed")); got != 1 {
-		t.Errorf("list_failed counter = %v, want 1", got)
-	}
+
+	inc := []struct {
+		source   string
+		resource string
+		reason   string
+	}{
+		{"kube-secrets", "secrets", cert.ReasonWatchFlapped},
+		{"kube-secrets", "secrets", cert.ReasonWatchFlapped},
+		{"kube-cms", "configmaps", cert.ReasonListFailed},
+	}
+	for _, c := range inc {
+		r.MarkTransportError(c.source, c.resource, c.reason)
+	}
+
+	asserts := []struct {
+		source   string
+		resource string
+		reason   string
+		want     float64
+	}{
+		{"kube-secrets", "secrets", cert.ReasonWatchFlapped, 2},
+		{"kube-cms", "configmaps", cert.ReasonListFailed, 1},
+	}
+	for _, a := range asserts {
+		if got := testutil.ToFloat64(r.transportErrors.WithLabelValues(a.source, a.resource, a.reason)); got != a.want {
+			t.Errorf("%s/%s/%s = %v, want %v", a.source, a.resource, a.reason, got, a.want)
+		}
+	}
 }

As per coding guidelines, **/*_test.go: Tests use the standard *_test.go colocated layout. Prefer table-driven tests for the parser/registry packages.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/registry/registry_test.go` around lines 357 - 369, Rewrite
TestMarkTransportError into a table-driven form by defining a slice of cases
(each with fields like name, source, resource, reason, and expectedCount) and
iterating over them to call r.MarkTransportError for each input; after applying
all marks, loop over expected cases and assert counts using
r.transportErrors.WithLabelValues(...). Use the existing helpers: New(Config{},
nopLogger()), r.MarkTransportError, and
testutil.ToFloat64(r.transportErrors.WithLabelValues(...)) to locate and update
assertions; ensure the test remains equivalent (two calls for the same
(source,resource,reason) should yield count 2) and keep the single registry
instance for the aggregated counts as in the original.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@chart/README.md.gotmpl`:
- Around line 232-253: The chart docs/schema are out of sync after changes to
alert docs/values; run the Helm doc generator (task doc:helm) to regenerate
chart/README.md and chart/values.schema.json from the updated templates/values,
then commit both regenerated artifacts alongside your changes (ensure the
generated chart/README.md reflects the updated alert table and that
chart/values.schema.json is updated for any alert-related keys such as
prometheusRules.alertOn<Name> and alertExpr/alertFor overrides).

In `@chart/values.schema.json`:
- Around line 574-580: The current JSON Schema for alertForOverrides uses a
restrictive pattern (^[0-9]+(ms|s|m|h|d|w|y)$) that only allows single
integer+unit durations; update the pattern for
alertForOverrides.additionalProperties to accept Prometheus duration literals by
allowing either (a) a single number that may be integer or fractional followed
by one unit (e.g., 1.5h) or (b) a compound sequence of integer+unit pairs (e.g.,
1h30m); keep the property type as string and preserve additionalProperties usage
so any key under alertForOverrides validates against the new regex.

---

Outside diff comments:
In `@chart/values.schema.json`:
- Around line 582-777: The schema is strict (prometheusRules
additionalProperties: false) so renamed/removed alert keys (e.g.,
alertOnCertificateCollision, alertOnCertificateError,
alertOnCertificateNotYetValid, alertOnKeystorePassphraseFailures,
alertOnKubeTransportErrors, alertOnSourceDown, alertOnSourceErrors and their
*Severity counterparts like certificateCollisionSeverity,
certificateErrorSeverity, certificateExpirationsSeverity, etc.) will break
upgrades; to fix, update the JSON schema to accept legacy names by either adding
the old property names back as deprecated entries with the same types/defaults
(so existing values still validate) or relax prometheusRules (allow
additionalProperties) and add migration/validation guidance, ensuring the legacy
keys are referenced in the schema alongside the new keys so upgrades don’t fail.

In `@chart/values.yaml`:
- Around line 743-915: The CI is failing because chart/values.schema.json and
chart/README.md are out of sync with the edits to chart/values.yaml (the
prometheusRules/alerting blocks); regenerate the Helm docs and schema by running
the repo task "task doc:helm" (or the equivalent Helm doc/schema generation
command), then commit the updated chart/values.schema.json and chart/README.md
alongside your changes to chart/values.yaml so the generated artifacts match the
new values.
- Around line 761-846: The new prometheusRules naming breaks upgrades because
older keys like alertOnReadErrors / alertOnCertificateErrors (and the specific
keys in this diff such as alertOnSourceErrors, alertOnCertificateError,
alertOnKubeTransportErrors, etc.) were renamed; add a compatibility path that
prefers the new prometheusRules.* values but falls back to the old top-level
keys when the new ones are unset. Implement this by updating the chart templates
(where you reference prometheusRules) to check .Values.prometheusRules.<key> ?:
.Values.<oldKey> (or equivalent Helm default/fallback logic) for each renamed
key, and update the values schema to include the deprecated old keys as allowed
(or mark them deprecated) so upgrades do not hard-fail. Ensure you cover all
renamed booleans and severity fields shown (sourceErrorsSeverity,
certificateErrorSeverity, kubeTransportErrorsSeverity, etc.).

---

Nitpick comments:
In `@pkg/registry/registry_test.go`:
- Around line 357-369: Rewrite TestMarkTransportError into a table-driven form
by defining a slice of cases (each with fields like name, source, resource,
reason, and expectedCount) and iterating over them to call r.MarkTransportError
for each input; after applying all marks, loop over expected cases and assert
counts using r.transportErrors.WithLabelValues(...). Use the existing helpers:
New(Config{}, nopLogger()), r.MarkTransportError, and
testutil.ToFloat64(r.transportErrors.WithLabelValues(...)) to locate and update
assertions; ensure the test remains equivalent (two calls for the same
(source,resource,reason) should yield count 2) and keep the single registry
instance for the aggregated counts as in the original.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 758d3ee5-9962-4fff-8228-28234986584d

📥 Commits

Reviewing files that changed from the base of the PR and between db04e3d and 5aa04f9.

⛔ Files ignored due to path filters (1)
  • chart/README.md is excluded by !chart/README.md
📒 Files selected for processing (25)
  • chart/README.md.gotmpl
  • chart/templates/_helpers.tpl
  • chart/templates/prometheusrule.yaml
  • chart/values.schema.json
  • chart/values.yaml
  • cmd/x509-certificate-exporter/main.go
  • dagger/render_alignment.go
  • docs/examples/secrets-tuned.values.yaml
  • docs/metrics.md
  • pkg/cert/reason.go
  • pkg/registry/collector.go
  • pkg/registry/registry.go
  • pkg/registry/registry_test.go
  • pkg/source/k8s/k8s.go
  • pkg/source/k8s/k8s_passphrase_test.go
  • test/render/prometheusrules-default-alerts.expect-pass.txt
  • test/render/prometheusrules-default-alerts.yaml
  • test/schema/invalid/prometheusrules-alertforoverrides-bad-duration.expect.txt
  • test/schema/invalid/prometheusrules-alertforoverrides-bad-duration.yaml
  • test/schema/invalid/prometheusrules-readerrorsseverity-typo.expect.txt
  • test/schema/invalid/prometheusrules-readerrorsseverity-typo.yaml
  • test/schema/invalid/prometheusrules-sourceerrorsseverity-typo.expect.txt
  • test/schema/invalid/prometheusrules-sourceerrorsseverity-typo.yaml
  • test/schema/valid/prometheusrules-alert-overrides.yaml
  • test/schema/valid/severities-all-values.yaml
💤 Files with no reviewable changes (2)
  • test/schema/invalid/prometheusrules-readerrorsseverity-typo.expect.txt
  • test/schema/invalid/prometheusrules-readerrorsseverity-typo.yaml
✅ Files skipped from review due to trivial changes (4)
  • test/schema/invalid/prometheusrules-alertforoverrides-bad-duration.expect.txt
  • test/render/prometheusrules-default-alerts.expect-pass.txt
  • test/schema/invalid/prometheusrules-sourceerrorsseverity-typo.expect.txt
  • docs/metrics.md

Comment thread chart/README.md.gotmpl
Comment thread chart/values.schema.json
Two new value maps under `prometheusRules`:

- `alertExprOverrides` — alertName -> custom Prometheus expression
  that replaces the default `expr:`. Closes #253: multi-cluster setups
  can inject `max by (cluster, …)` aggregations or label filters
  without having to disableBuiltinAlertGroup and reconstruct everything.
- `alertForOverrides` — alertName -> `for:` duration that replaces
  the default. Schema-validated against Prometheus's duration syntax.

Both maps key on bare alert names (no `rulePrefix`). Overrides are
taken whole — no merge with the default. Single-line and multi-line
strings both render correctly: the template now emits each `expr:`
via `| quote` which wraps in a YAML double-quoted scalar (newlines
serialise to \n; Prometheus parses the resulting string identically
to the multi-line form).

Test ratchets added alongside:

- test/schema/valid/prometheusrules-alert-overrides.yaml — valid case
  covering both maps, with a multi-line expr to exercise the quoting
  path.
- test/schema/invalid/prometheusrules-alertforoverrides-bad-duration
  — paired with .expect.txt, ensures the duration regex rejects
  natural-language inputs ("30 minutes").
- test/render/prometheusrules-default-alerts.{yaml,expect-pass.txt} —
  new ratchet under the existing TestHelmRender machinery (extended
  to support "positive substring" checks via .expect-pass.txt). Locks
  in the exact 13 alert names the chart ships by default: any rename
  or accidental removal breaks the test at PR time.

Docs:

- chart/README.md.gotmpl alert table refreshed — was still on the v3
  layout (X509ExporterReadErrors + 3 cert alerts). Now lists all 13.
- values.yaml gets the two new map keys with descriptions and
  commented examples.
@npdgm npdgm force-pushed the feat/prom-alerting-gaps branch from 5aa04f9 to 71314f6 Compare May 25, 2026 17:05

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
chart/values.yaml (1)

848-852: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep backward compatibility for CRL alert toggle behavior.

Making CRL alerts unconditional removes prior opt-out behavior and can start paging immediately after upgrade for users that had CRL alerting disabled. Keep a deprecated compatibility key (or enforce a clearly documented breaking-version migration path) before merging this behavior change.

As per coding guidelines, "Flag changes that would break upgrades from a previous chart version (renamed values, removed defaults, changed selectors)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/values.yaml` around lines 848 - 852, The CRL alert change must preserve
the prior opt-out toggle: add a deprecated compatibility key (for example
crl.alerts.enabled or crl.lifecycleAlerts.enabled) and make the alerting logic
read that key first (if defined) to preserve existing installs, otherwise fall
back to the new unconditional behavior; update the values.yaml comment/@schema
block near the "CRL lifecycle alerts" and the x509_crl_* section to document the
deprecated key and migration, and ensure the Helm schema (if present) marks the
old key as deprecated so upgrades retain previous opt-out behavior unless users
explicitly migrate.
chart/values.schema.json (1)

1-3: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Regenerate Helm artifacts; schema is currently stale.

CI already reports chart/values.schema.json is out of date relative to chart/values.yaml. Run task doc:helm and commit the regenerated chart/values.schema.json (and chart/README.md) in this PR.

As per coding guidelines, "Regenerate chart documentation and schema after touching chart/values.yaml, chart/templates/**, or anything influencing chart output via task doc:helm — commit both chart/README.md and chart/values.schema.json alongside changes."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@chart/values.schema.json` around lines 1 - 3, The Helm values JSON schema
(chart/values.schema.json) is stale relative to chart/values.yaml; run the
documentation generation task (task doc:helm) to regenerate
chart/values.schema.json and chart/README.md, verify the updated files reflect
the current chart/values.yaml and templates, and commit both regenerated files
(chart/values.schema.json and chart/README.md) to this PR so CI stops flagging
the mismatch.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@chart/values.schema.json`:
- Around line 1-3: The Helm values JSON schema (chart/values.schema.json) is
stale relative to chart/values.yaml; run the documentation generation task (task
doc:helm) to regenerate chart/values.schema.json and chart/README.md, verify the
updated files reflect the current chart/values.yaml and templates, and commit
both regenerated files (chart/values.schema.json and chart/README.md) to this PR
so CI stops flagging the mismatch.

In `@chart/values.yaml`:
- Around line 848-852: The CRL alert change must preserve the prior opt-out
toggle: add a deprecated compatibility key (for example crl.alerts.enabled or
crl.lifecycleAlerts.enabled) and make the alerting logic read that key first (if
defined) to preserve existing installs, otherwise fall back to the new
unconditional behavior; update the values.yaml comment/@schema block near the
"CRL lifecycle alerts" and the x509_crl_* section to document the deprecated key
and migration, and ensure the Helm schema (if present) marks the old key as
deprecated so upgrades retain previous opt-out behavior unless users explicitly
migrate.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f459fe06-692f-45fa-864e-d223892ff98d

📥 Commits

Reviewing files that changed from the base of the PR and between 5aa04f9 and 71314f6.

⛔ Files ignored due to path filters (1)
  • chart/README.md is excluded by !chart/README.md
📒 Files selected for processing (10)
  • chart/README.md.gotmpl
  • chart/templates/prometheusrule.yaml
  • chart/values.schema.json
  • chart/values.yaml
  • dagger/render_alignment.go
  • test/render/prometheusrules-default-alerts.expect-pass.txt
  • test/render/prometheusrules-default-alerts.yaml
  • test/schema/invalid/prometheusrules-alertforoverrides-bad-duration.expect.txt
  • test/schema/invalid/prometheusrules-alertforoverrides-bad-duration.yaml
  • test/schema/valid/prometheusrules-alert-overrides.yaml
✅ Files skipped from review due to trivial changes (5)
  • test/schema/invalid/prometheusrules-alertforoverrides-bad-duration.yaml
  • test/render/prometheusrules-default-alerts.yaml
  • test/schema/valid/prometheusrules-alert-overrides.yaml
  • test/schema/invalid/prometheusrules-alertforoverrides-bad-duration.expect.txt
  • chart/README.md.gotmpl

@npdgm npdgm merged commit d73f3c0 into main May 25, 2026
23 checks passed
@npdgm npdgm deleted the feat/prom-alerting-gaps branch May 25, 2026 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

chart docs go Pull requests that update Go code sensitive

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant