Version: 1.0
Date: December 28, 2024
Author: Engineering Team
Status: Complete
1. Executive Summary
This document describes the implementation of a platform for ephemeral per-Pull-Request environments on Kubernetes. Each PR gets its own isolated environment (application + database + observability), accessible via a public URL and automatically destroyed when the PR is closed or merged.
The project was developed in phases: Phase 1 (VPS infrastructure), Phase 1.5 (Epic 7 improvements), and Phase 2 (Epic 8, simplified onboarding) are complete. Phase 3 (future) will migrate the platform to Amazon EKS.
2. Problem
| # | Problem | Impact | How We'll Measure |
| --- | --- | --- | --- |
| 1 | Review is slow because there's no public URL for testing | Elevated time-to-market | % of PRs with review < 4h |
| 2 | Devs overwrite the shared staging environment | Defects escape to production | # of post-release hotfixes |
| 3 | PR logs/metrics are scattered | Slow and complex debugging | Average time to locate root cause |
| 4 | Test infrastructure is created manually | Inconsistent environments | # of incidents due to config differences |
| 5 | Reviewers need to run code locally | Slow feedback cycle | Average review time |
3. Objectives and Key Results (OKRs)
| Objective | Key Result |
| --- | --- |
| O1. Every PR has an isolated environment | KR1.1: ≥ 95% of PRs with URL delivered in < 10 min |
| ID | Requirement | Priority |
| --- | --- | --- |
| | Destroy namespace and volumes when PR is closed/merged (< 5 min) | Must |
| RF-03 | Unique URL per PR: `https://{project-id}-pr-{number}.preview.domain.com` | Must |
| RF-04 | Automatic re-deploy on new commit push | Must |
| RF-05 | Automatic comment on PR with preview URL and status | Should |
| RF-06 | Apply ResourceQuota and LimitRange to PR namespaces | Should |
| RF-07 | Allow environment "pin" via `preserve=true` label (max 48h) | Could |
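The RF-07 pin could be applied by hand while the label is honored by the cleanup job; a minimal sketch, assuming the RF-03 naming scheme and a hypothetical project id `myproj`:

```shell
# Pin PR 123's environment so cleanup skips it (max 48h per RF-07)
kubectl label namespace myproj-pr-123 preserve=true

# Release the pin so the next cleanup pass removes the environment
kubectl label namespace myproj-pr-123 preserve-
```

The trailing `-` in the second command is standard `kubectl label` syntax for removing a label.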
8.2 Database
| ID | Requirement | Priority |
| --- | --- | --- |
| RF-08 | Isolated database instance per PR | Must |
| RF-09 | Exclusive credentials per PR stored in Secrets | Must |
| RF-10 | Database destroyed along with the namespace | Must |
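RF-08 maps naturally onto a CloudNativePG `Cluster` resource created inside the PR namespace; a minimal sketch (names, namespace, and sizes are illustrative, not taken from the actual charts):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pr-db
  namespace: myproj-pr-123   # illustrative PR namespace
spec:
  instances: 1               # a single instance is enough for a preview env
  storage:
    size: 1Gi                # PVC backed by Local Path Provisioner
```

CloudNativePG generates an application credentials Secret (`<cluster-name>-app`) automatically, which covers RF-09, and because the Cluster and its PVCs live in the PR namespace, deleting the namespace satisfies RF-10.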
8.3 Observability
| ID | Requirement | Priority |
| --- | --- | --- |
| RF-11 | Logs from all pods collected by Loki | Must |
| RF-12 | CPU/memory/network metrics collected by Prometheus | Must |
| RF-13 | Pre-configured dashboards in Grafana | Should |
| RF-14 | Basic alerts (disk, memory, pod restarts) | Could |
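With logs labeled by namespace (RF-11), a reviewer can scope Loki to a single PR environment; an illustrative LogQL query (the namespace name is hypothetical):

```logql
{namespace="myproj-pr-123"} |= "error"
```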
8.4 GitHub Runners
| ID | Requirement | Priority |
| --- | --- | --- |
| RF-15 | Auto-registered runners on GitHub | Must |
| RF-16 | Runners with cluster access via ServiceAccount | Must |
| RF-17 | Runners with kubectl, helm, docker installed | Must |
| RF-18 | Ephemeral runners (one per job) via ARC | Should |
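RF-15 through RF-18 together look like a single actions-runner-controller deployment; a sketch using the classic summerwind ARC CRDs (repository, replica count, and ServiceAccount name are placeholders):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: preview-runners
spec:
  replicas: 2
  template:
    spec:
      repository: my-org/my-repo         # placeholder repo; auto-registers (RF-15)
      ephemeral: true                    # one runner per job (RF-18)
      serviceAccountName: preview-runner # grants cluster access (RF-16)
```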
9. Non-Functional Requirements
9.1 Performance
| ID | Requirement | Target |
| --- | --- | --- |
| RNF-01 | Environment creation time | ≤ 10 min (p95) |
| RNF-02 | Namespace destruction time | < 5 min |
| RNF-03 | Observability stack overhead | < 6 GB RAM |
9.2 Availability
| ID | Requirement | Target |
| --- | --- | --- |
| RNF-04 | Cluster uptime (business hours) | ≥ 95% |
| RNF-05 | Automatic recovery after VPS reboot | Yes |
9.3 Security
| ID | Requirement | Target |
| --- | --- | --- |
| RNF-06 | Secrets never in plain text in repos | 100% |
| RNF-07 | Network isolation between PRs (NetworkPolicy) | Required |
| RNF-08 | Short-lived GitHub tokens | Yes |
| RNF-09 | CIS Kubernetes Benchmark level 1 | Yes |
9.4 Capacity
| ID | Requirement | Target |
| --- | --- | --- |
| RNF-10 | Simultaneous PRs supported | ≥ 5 |
| RNF-11 | Log retention | 7 days |
| RNF-12 | Metric retention | 7 days |
| RNF-13 | Limits per PR namespace | Dynamic based on enabled databases (base: 300m CPU, 512Mi RAM; scales up with each database) |
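The base figures in RNF-13 could be expressed as a per-namespace ResourceQuota (RF-06); a sketch using the base values from the table — scaling per enabled database would be templated in the Helm chart:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pr-quota
  namespace: myproj-pr-123   # illustrative PR namespace
spec:
  hard:
    requests.cpu: 300m       # base figure from RNF-13
    requests.memory: 512Mi   # base figure from RNF-13
```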
10. Technology Stack
| Component | Technology | Justification |
| --- | --- | --- |
| Kubernetes | k3s | Lightweight, production-ready, ideal for single-node, includes containerd |
| Ingress | Traefik | Included in k3s, native Let's Encrypt support |
| CI/CD | GitHub Actions | Already used by the team, native integration |
| Logs | Loki + Promtail | Lightweight, native Grafana integration |
| Metrics | Prometheus | Industry standard, broad ecosystem |
| Dashboards | Grafana | Unified interface for logs and metrics |
| Runners | actions-runner-controller (ARC) | Ephemeral and scalable runners in the cluster |
| DB Operator | CloudNativePG | Manages PostgreSQL lifecycle in the cluster |
| MariaDB | mariadb:11 | Simple deployment for MySQL-compatible databases |
| Secrets | Sealed Secrets | Basic security, encrypted secrets in git |
| Storage | Local Path Provisioner | Simple, adequate for MVP using VPS NVMe |
| DNS | Wildcard | `*.preview.domain.com` → VPS IP |
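With the wildcard DNS record in place, Traefik only needs one host rule per PR; an illustrative Ingress (host, names, and port are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: myproj-pr-123   # illustrative PR namespace
spec:
  rules:
    - host: myproj-pr-123.preview.domain.com  # resolves via the wildcard record
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```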
11. Design Decisions
| Topic | Decision | Rationale |
| --- | --- | --- |
| K8s Runtime | k3s | Fast install, small footprint, ideal for VPS |
| DB per PR | PostgreSQL via CloudNativePG | Complete isolation, automated lifecycle |
| Storage | Local Path Provisioner | Avoids CSI complexity; acceptable for MVP |
| DNS | Wildcard `*.preview.domain.com` | Avoids creating a DNS record per PR |
| Secrets | Sealed Secrets | Allows versioning encrypted secrets |
| Manifests | Helm charts | Flexible templating, large community |
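The Sealed Secrets decision means plain Secrets are encrypted with `kubeseal` before they ever reach git; a sketch (file names are illustrative):

```shell
# Encrypt a plain Secret manifest with the cluster controller's public key;
# only the sealed form is committed to git (RNF-06)
kubeseal --format yaml < db-secret.yaml > db-sealed-secret.yaml
```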
12. User Flow (Happy Path)
1. Dev creates branch `feat/new-feature` and opens a PR
2. GitHub Actions detects the `pull_request: opened` event
3. Pipeline creates namespace `{project-id}-pr-{number}` with a ResourceQuota
4. Deploys the application with image `:pr-{number}` plus the database
5. Ingress created; URL `{project-id}-pr-{number}.preview.domain.com` becomes active
6. Bot comments on the PR with the preview URL and Grafana link
7. Dev/QA/Reviewer test and give feedback
8. Additional commits trigger automatic re-deploy
9. PR merged → Actions executes namespace cleanup
10. Loki/Prometheus data retained for 7 days for troubleshooting
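Steps 3 and 5 hinge on a deterministic naming scheme; a minimal shell sketch of how a pipeline step might derive the namespace and preview URL (variable names and the project id are illustrative):

```shell
project_id="myproj"   # illustrative project id
pr_number="123"       # would come from the pull_request event payload

# Deterministic names: the same inputs always yield the same namespace/URL
namespace="${project_id}-pr-${pr_number}"
preview_url="https://${namespace}.preview.domain.com"

echo "$namespace"     # myproj-pr-123
echo "$preview_url"   # https://myproj-pr-123.preview.domain.com
```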
13. Metrics and SLIs
13.1 Service Level Indicators (SLIs)
| SLI | Target |
| --- | --- |
| % of PRs with URL delivered in < 10 min | ≥ 95% |
| % of namespaces removed in < 5 min after close | ≥ 98% |
| % of pods with metrics/logs collected | ≥ 95% |
13.2 Observability Metrics
```promql
# Provisioning time
github_actions_workflow_run_duration_seconds{job="provision"}

# Resource usage vs allocated (kube-state-metrics names)
sum(kube_pod_container_resource_requests{resource="cpu"})
  / sum(kube_node_status_allocatable{resource="cpu"})

# API server availability
up{job="kubernetes-apiservers"}

# Disk usage
node_filesystem_avail_bytes / node_filesystem_size_bytes
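```

The disk expression above could feed one of the RF-14 basic alerts; an illustrative Prometheus alerting rule (the 10% threshold, durations, and labels are assumptions):

```yaml
groups:
  - name: preview-platform
    rules:
      - alert: NodeDiskLow
        # Fires when any filesystem has less than 10% space left for 10 minutes
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space available"
```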