Make health probe timeoutSeconds and failureThreshold configurable via values #2316

@lindeskar

Problem

Health probe settings (timeoutSeconds, failureThreshold, etc.) are hardcoded in templates and cannot be overridden via values.yaml. This causes issues in environments with higher network latency, where the default 1s timeout and low failure thresholds trigger unnecessary pod restarts and 502 errors.

For example, harbor-core's liveness and readiness probes hardcode failureThreshold: 2 and never set timeoutSeconds, so the Kubernetes default of 1s applies. In high-latency environments, two consecutive probes that take longer than 1s are enough to mark the pod unready or restart it. In practice this causes pod restarts (roughly every 2 days) and endpoint churn, which surfaces as intermittent 502 Bad Gateway errors.

Current state

Only two components have any probe configurability in values.yaml:

  • core.startupProbe.enabled and core.startupProbe.initialDelaySeconds
  • database.internal.livenessProbe.timeoutSeconds and database.internal.readinessProbe.timeoutSeconds

All other probe parameters across all 9 components (core, portal, jobservice, registry, nginx, exporter, trivy, database, redis) are hardcoded in templates.
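
For reference, the existing database keys listed above would sit in values.yaml roughly as follows. This is a sketch of the shape only: the key paths come from the list above, while the value 1 is an assumption based on the Kubernetes default timeout, not copied from the chart.

```yaml
# Sketch of the existing configurable probe settings (values assumed)
database:
  internal:
    livenessProbe:
      timeoutSeconds: 1   # assumed; matches the Kubernetes default
    readinessProbe:
      timeoutSeconds: 1   # assumed; matches the Kubernetes default
```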

Proposal

Make probe timing parameters (initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold, successThreshold) configurable via values.yaml for all components, following the existing pattern used by database.internal.livenessProbe.timeoutSeconds.

Default values would match the current hardcoded values exactly, so there is zero behavioral change on upgrade.
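
On the template side, the change could look something like the following minimal sketch for the core liveness probe. The excerpt is hypothetical: it assumes the values layout proposed in this issue, and the existing probe handler (path, port, scheme) is left untouched and omitted.

```yaml
# Hypothetical excerpt from the core Deployment template.
# Only the timing fields change; the probe handler stays as it is today.
livenessProbe:
  # ... existing httpGet handler, unchanged ...
  initialDelaySeconds: {{ .Values.core.livenessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.core.livenessProbe.periodSeconds }}
  timeoutSeconds: {{ .Values.core.livenessProbe.timeoutSeconds }}
  failureThreshold: {{ .Values.core.livenessProbe.failureThreshold }}
  successThreshold: {{ .Values.core.livenessProbe.successThreshold }}
```

The sketch reads the values directly instead of piping them through Helm's `default` function, because `default` treats 0 as empty and would silently replace an explicit initialDelaySeconds: 0. Note also that Kubernetes requires successThreshold to be 1 for liveness and startup probes, so that field is only meaningfully tunable for readiness probes.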

Example values.yaml structure

core:
  startupProbe:
    enabled: true
    initialDelaySeconds: 10
    periodSeconds: 10
    timeoutSeconds: 1
    failureThreshold: 360
    successThreshold: 1
  livenessProbe:
    initialDelaySeconds: 0
    periodSeconds: 10
    timeoutSeconds: 1
    failureThreshold: 2
    successThreshold: 1
  readinessProbe:
    initialDelaySeconds: 0
    periodSeconds: 10
    timeoutSeconds: 1
    failureThreshold: 2
    successThreshold: 1

Example usage for high-latency environments

# values-override.yaml
core:
  livenessProbe:
    timeoutSeconds: 5
    failureThreshold: 5
  readinessProbe:
    timeoutSeconds: 5
    failureThreshold: 5
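  startupProbe:
    timeoutSeconds: 5
    # failureThreshold is left at its default (360 above); setting it to 5
    # would cut the startup window from about an hour to about a minute.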

Components affected

All 9 components with probes: core, portal, jobservice, registry (registry + registryctl containers), nginx, exporter, trivy, database, redis.

I'm happy to open a PR for this if the approach looks good.
