
General

First, do the initial triage and collect the basic information to understand the incident.
Next, send a Slack notification to channel "PLEASE REPLACE" with the basic details and a link to the conversation.
Next, work on the incident according to the runbook. Don't take any action automatically; ask for approval first.
If the app is stable, check again 30 seconds later; if it is still stable, you can close the incident automatically. Post all the details to Slack in concise messages.
If you get stuck, send another Slack message and mention that you need help.
Make sure every Slack message includes the link to the conversation, so an engineer can work with you if needed.
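
The "check again 30 seconds later" rule can be sketched as a small helper. This is a minimal illustration, not part of the runbook itself: it assumes "stable" means sync status `Synced` and health status `Healthy`, and `get_status` is a hypothetical callable that returns the current `(sync_status, health_status)` pair however you choose to fetch it.

```python
import time
from typing import Callable, Tuple

Status = Tuple[str, str]  # (sync status, health status)


def is_stable(status: Status) -> bool:
    """Assumption: an app counts as stable when it is Synced and Healthy."""
    sync_status, health_status = status
    return sync_status == "Synced" and health_status == "Healthy"


def confirm_stable(
    get_status: Callable[[], Status],  # hypothetical status fetcher
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the app is stable now and still stable after the delay."""
    if not is_stable(get_status()):
        return False
    sleep(delay_seconds)
    return is_stable(get_status())
```

Passing `sleep` as a parameter keeps the helper easy to exercise without actually waiting; the incident would only be closed when `confirm_stable` returns `True`.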

Symptoms:

ArgoCD application shows "Syncing" status for an extended period without completing.
Sync operation fails with errors in the ArgoCD UI or application events.
Application status shows "Failed" or "Error" in the sync phase.
Repeated sync attempts fail with the same error message.
Resources are not created, updated, or deleted as expected during sync.
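
As a rough illustration of how the symptoms above might be detected programmatically, the sketch below inspects an Application object that has already been loaded as a dictionary (for example from `argocd app get <app> -o json`). It reads `status.operationState.phase`, `message`, and `startedAt`; the function name and the 10-minute "stuck" threshold are assumptions chosen for the example, not values from this runbook.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def detect_sync_symptoms(
    app: dict,
    now: Optional[datetime] = None,
    stuck_after: timedelta = timedelta(minutes=10),  # assumed threshold
) -> List[str]:
    """Return short symptom labels derived from status.operationState."""
    now = now or datetime.now(timezone.utc)
    op = (app.get("status") or {}).get("operationState") or {}
    phase = op.get("phase")
    symptoms: List[str] = []

    # A finished operation that ended in Failed or Error.
    if phase in ("Failed", "Error"):
        symptoms.append(f"sync-{phase.lower()}: {op.get('message', 'no message')}")

    # An operation that is still running well past the threshold.
    started_at = op.get("startedAt")
    if phase == "Running" and started_at:
        started = datetime.fromisoformat(started_at.replace("Z", "+00:00"))
        if now - started > stuck_after:
            symptoms.append("sync-stuck")

    return symptoms
```

An empty list means none of these particular symptoms were found; it does not by itself prove the application is healthy.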

Root cause:

Invalid or malformed manifests in Git repository preventing sync.
Insufficient RBAC permissions for ArgoCD to create/update target resources.
Resource conflicts or dependencies preventing successful application of manifests.
Git repository access issues (authentication, branch/path not found).
Target cluster connectivity problems or cluster resource constraints.
Kubernetes API server rate limiting or timeout issues during large syncs.
Helm chart or Kustomize rendering failures in application source.
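
For first-pass triage, a sync error message can be matched against a few keyword hints to suggest which of the root causes above to investigate first. The sketch below is purely heuristic: the category names and substrings are illustrative assumptions, not an exhaustive or authoritative list of ArgoCD or Kubernetes error texts, and a match should be treated as a starting point rather than a diagnosis.

```python
from typing import List, Tuple

# Illustrative hints only; order matters because the first match wins.
ROOT_CAUSE_HINTS: List[Tuple[str, Tuple[str, ...]]] = [
    ("cluster-capacity-or-connectivity", ("exceeded quota", "connection refused", "no such host")),
    ("rbac", ("is forbidden", "cannot create resource")),
    ("git-access", ("authentication required", "repository not found")),
    ("helm-or-kustomize-rendering", ("helm template", "kustomize build")),
    ("invalid-manifest", ("error validating", "error converting yaml")),
    ("api-rate-limit-or-timeout", ("too many requests", "context deadline exceeded", "timeout")),
]


def guess_root_cause(message: str) -> str:
    """Return the first category whose hint appears in the message, else 'unknown'."""
    lowered = message.lower()
    for cause, needles in ROOT_CAUSE_HINTS:
        if any(needle in lowered for needle in needles):
            return cause
    return "unknown"
```

Quota errors are checked before RBAC because Kubernetes reports both with "is forbidden" in the message.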

Solution:

Check ArgoCD application status and recent sync attempts to identify specific failure reasons and error messages.
Review application events to understand the sync operation timeline and failure points.
Inspect individual managed resources for creation/update failures; check whether RBAC permissions allow the ArgoCD service account to perform the required operations.
For manifest validation issues, review Git repository content for syntax errors, invalid resource definitions, or missing dependencies.
Check target cluster health, API server responsiveness, and available resources (CPU, memory, storage) that might prevent resource creation.
If the application source uses Helm or Kustomize, verify that template rendering succeeds and produces valid Kubernetes manifests.
After addressing the root cause, trigger a manual sync or wait for the next automatic sync cycle; monitor for successful completion and resource health.
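
The resource-inspection and RBAC steps above can be sketched as follows. The first helper lists resources whose last sync result was `SyncFailed`, read from `status.operationState.syncResult.resources` of an Application loaded as a dictionary. The second only builds (it does not run) a `kubectl auth can-i` command for one failed resource. The function names are hypothetical, and the default service account assumes a standard install in the `argocd` namespace; adjust it to your setup. Lowercasing the kind to get a resource name is a simplification that may not work for every resource type.

```python
from typing import Dict, List

# Assumption: default controller service account of a standard install.
DEFAULT_SERVICE_ACCOUNT = "system:serviceaccount:argocd:argocd-application-controller"


def failed_resources(app: dict) -> List[Dict[str, str]]:
    """List resources from the last sync result whose status is SyncFailed."""
    op = (app.get("status") or {}).get("operationState") or {}
    resources = (op.get("syncResult") or {}).get("resources") or []
    return [
        {
            "kind": r.get("kind", ""),
            "namespace": r.get("namespace", ""),
            "name": r.get("name", ""),
            "message": r.get("message", ""),
        }
        for r in resources
        if r.get("status") == "SyncFailed"
    ]


def rbac_check_command(
    resource: Dict[str, str],
    verb: str = "create",
    service_account: str = DEFAULT_SERVICE_ACCOUNT,
) -> List[str]:
    """Build a `kubectl auth can-i` command to review; nothing is executed here."""
    return [
        "kubectl", "auth", "can-i", verb, resource["kind"].lower(),
        "--namespace", resource["namespace"],
        "--as", service_account,
    ]
```

Returning the command as a list rather than executing it fits the runbook's rule that no action is taken without approval; once the root cause is fixed, `argocd app sync <app>` followed by `argocd app wait <app> --sync --health` is one way to trigger and monitor the manual sync.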