🏗️ Deploy multi-architecture NodePool for CI/CD builds #16

@kingdonb

🎯 Objective

Deploy multi-architecture NodePool infrastructure to support hephy-builder CI/CD workloads without disrupting existing cluster operations.

📋 Current Infrastructure Challenge

Our current CI/CD builds are limited by single-architecture runner availability. To enable true multi-arch builds (AMD64 + ARM64), we need dedicated infrastructure that:

  • Supports both architectures efficiently
  • Isolates CI/CD workloads from production applications
  • Uses cost-effective spot instances for ephemeral builds
  • Scales automatically based on demand

🏗️ Proposed Solution: Multi-Architecture Spot NodePool

Infrastructure Design

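# New NodePool: multiarch-spot
apiVersion: karpenter.sh/v1  # assumption: Karpenter v1 API
kind: NodePool
metadata:
  name: multiarch-spot
spec: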
  disruption:
    consolidateAfter: 30s  # Fast consolidation for ephemeral builds
    consolidationPolicy: WhenEmpty
  template:
    metadata:
      labels:
        lifecycle: Ec2Spot
        intent: cicd-builds
    spec:
      taints:
        - key: cicd-builds
          value: "true"
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws  # Karpenter v1 uses group here rather than apiVersion
        kind: EC2NodeClass
        name: multiarch-spot-nodeclass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
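          operator: In
          values: ["amd64", "arm64"]
        - key: node.kubernetes.io/instance-type
          operator: In
          # m6i/m6a are x86_64 only; m6g (Graviton) is an assumed addition so arm64 can be satisfied
          values: ["m6i.large", "m6i.xlarge", "m6a.large", "m6a.xlarge", "m6g.large", "m6g.xlarge"]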

🔧 Implementation Plan

Phase 1: Infrastructure Deployment

  • Create multiarch-spot-nodeclass EC2NodeClass
  • Deploy multiarch-spot NodePool with the cicd-builds taint (CI pods carry the matching toleration)
  • Configure IAM permissions for Karpenter node management
  • Validate node provisioning for both AMD64 and ARM64
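
As a starting point for the first step above, a minimal EC2NodeClass might look like the sketch below. It assumes the Karpenter v1 API and an Amazon Linux 2023 AMI alias. The IAM role name and the discovery tag value are hypothetical placeholders, not values taken from this cluster.

```yaml
# Sketch only: the role name and every selector value below are placeholders.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: multiarch-spot-nodeclass  # name referenced by the NodePool's nodeClassRef
spec:
  amiSelectorTerms:
    - alias: al2023@latest        # assumption: Amazon Linux 2023, latest release
  role: KarpenterNodeRole-example-cluster  # hypothetical IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster  # hypothetical discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster  # hypothetical discovery tag
```

Pinning the AMI alias to a specific version instead of `latest` would make node images reproducible, at the cost of manual updates.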

Phase 2: GitLab Runner Deployment

  • Deploy AMD64 GitLab runner with redacted-sandbox-amd64 tag
  • Deploy ARM64 GitLab runner with redacted-sandbox-arm64 tag
  • Configure runners with appropriate tolerations for CI taints
  • Test ECR authentication from both runners
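
For the toleration step above, a hedged sketch of Helm values for the ARM64 runner follows. It assumes the runners are installed with the official gitlab-runner Helm chart and use the Kubernetes executor. The file name and GitLab URL are placeholders, and the runner tag is assumed to be assigned when the runner is created in GitLab rather than in this file.

```yaml
# values-arm64.yaml (hypothetical file name) for the gitlab-runner Helm chart.
# The redacted-sandbox-arm64 tag is assumed to be set when the runner is created in GitLab.
gitlabUrl: https://gitlab.example.com/  # placeholder URL
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "{{.Release.Namespace}}"
        # Pin job pods to ARM64 nodes in the CI NodePool.
        [runners.kubernetes.node_selector]
          "kubernetes.io/arch" = "arm64"
          "intent" = "cicd-builds"
        # Tolerate the cicd-builds=true:NoSchedule taint defined on the NodePool.
        [runners.kubernetes.node_tolerations]
          "cicd-builds=true" = "NoSchedule"
```

The AMD64 runner would use the same values with `"kubernetes.io/arch" = "amd64"`.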

Phase 3: Pipeline Integration

  • Update GitLab CI variables for new runner tags
  • Test multi-arch builds with isolated NodePool
  • Validate build performance and cost efficiency
  • Monitor resource utilization and scaling behavior
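
To illustrate how a pipeline could target the two runners, here is a minimal `.gitlab-ci.yml` sketch. The job names and the script line are illustrative only. The tags are the ones listed in Phase 2.

```yaml
# .gitlab-ci.yml sketch: job names and the script line are illustrative.
stages:
  - build

.build-template:
  stage: build
  script:
    - uname -m  # reports the machine architecture of the node that ran the job

build-amd64:
  extends: .build-template
  tags:
    - redacted-sandbox-amd64

build-arm64:
  extends: .build-template
  tags:
    - redacted-sandbox-arm64
```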

Phase 4: Production Readiness

  • Implement monitoring and alerting for NodePool health
  • Create runbook for NodePool management
  • Document cost optimization and scaling patterns
  • Establish backup/disaster recovery procedures

📊 Technical Specifications

NodePool Configuration

  • Capacity Type: Spot instances (cost optimization)
  • Architectures: AMD64 + ARM64 dual support
  • Instance Types: m6i/m6a family (balanced compute/memory); these are x86_64 only, so ARM64 nodes also need a Graviton family such as m6g
  • Scaling: Automatic based on demand
  • Taints: Dedicated for CI/CD workloads

Security & Isolation

  • Network Isolation: Same VPC, isolated subnet (optional)
  • Workload Isolation: Taints prevent non-CI workload scheduling
  • IAM Permissions: Minimal required permissions for ECR/S3 access
  • Security Groups: Restricted access for build operations
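
The taint-based isolation can be exercised with a throwaway pod like the sketch below. The taint keeps other workloads off the CI nodes, while the node selector is what steers this pod onto them. The pod name and the image are assumptions chosen for illustration.

```yaml
# Smoke-test pod: should trigger provisioning of an ARM64 node in the CI NodePool.
apiVersion: v1
kind: Pod
metadata:
  name: multiarch-smoke-arm64  # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    kubernetes.io/arch: arm64  # switch to amd64 to test the other architecture
    intent: cicd-builds        # label set in the NodePool template
  tolerations:
    - key: cicd-builds
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: check
      image: busybox:1.36      # assumption: any multi-arch image works here
      command: ["uname", "-m"]
      resources:
        requests:
          cpu: 100m
          memory: 64Mi
```

A pod without the toleration should stay off these nodes, which gives a quick negative test for the "no impact on existing workloads" requirement.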

Cost Optimization

  • Spot Instances: 60-70% cost savings vs on-demand
  • Fast Consolidation: 30s empty node termination
  • Right-sizing: Build-optimized instance types
  • Auto-scaling: Zero cost when no builds running

🎯 Success Criteria

Functional Requirements

  • AMD64 and ARM64 runners successfully provision nodes
  • Multi-arch builds complete successfully
  • ECR authentication works from both architectures
  • Build performance meets or exceeds current single-arch builds

Operational Requirements

  • NodePool scales from 0 to required capacity automatically
  • Spot instance interruptions handled gracefully
  • Cost tracking and optimization metrics available
  • No impact on existing cluster workloads
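
One possible way to absorb spot interruptions on the pipeline side is GitLab's job-level retry, sketched below for `.gitlab-ci.yml`. This assumes a reclaimed node surfaces to GitLab as one of the listed failure types, which depends on the executor and should be verified against real interruptions.

```yaml
# .gitlab-ci.yml fragment: retry jobs whose runner or pod disappeared mid-build.
default:
  retry:
    max: 2  # GitLab allows at most 2 retries
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
```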

Performance Targets

  • Node Provisioning: < 2 minutes from request to ready
  • Build Performance: Comparable to current single-arch builds
  • Cost Efficiency: < 70% of on-demand equivalent costs
  • Availability: 99%+ uptime for build operations

🔗 Dependencies & Prerequisites

Infrastructure Access

  • Crossplane cluster with NodePool management permissions
  • AWS IAM permissions for Karpenter operations
  • VPC and subnet configuration for additional nodes
  • ECR registry access for both architectures

Configuration Files

  • Update multiarch-spot-nodepool.yaml with final specifications
  • Create GitLab runner deployment manifests
  • Configure runner authentication and ECR credentials

Testing Infrastructure

  • Sample multi-arch builds for validation
  • Monitoring and observability stack
  • Cost tracking and reporting tools

📚 Reference Materials

Existing Configuration

  • multiarch-spot-nodepool.yaml: Current NodePool specification draft
  • attic/crossplane-node-pool-objects.yaml: Reference implementation
  • attic/gitlab-runner-*.yaml: Runner deployment examples

Documentation

  • Karpenter NodePool configuration guide
  • GitLab runner Kubernetes deployment
  • AWS Spot instance best practices

Priority: High - This infrastructure enables the core multi-architecture vision of hephy-builder and removes current build limitations.

Timeline: Target completion within 1-2 weeks for full multi-arch CI/CD capability.
