Commit 89cac18

feat: use dynamodb instead of ssm for JIT config
Add extra capabilities to support ephemeral multi-arch runners. In workflow jobs with a large matrix, GitHub can request a large number of runners simultaneously, which can overwhelm SSM. This PR replaces SSM with DynamoDB to store the JIT config.

Also fixed a couple of things:
- `runners_ssm_housekeeper` schema was not aligned across modules
- syntax for `env` in `github_agent.ubuntu.pkr.hcl`

And updated the baseline provisioning script for prebuilt runners. It is now closer to what GitHub-hosted runners use.
1 parent 2f6d9e0 commit 89cac18

38 files changed, +3411 -2105 lines changed
@@ -0,0 +1,108 @@
# Ephemeral Multi-Architecture Prebuilt Runners

This example demonstrates how to create GitHub Actions runners with the following features:

- **Ephemeral Runners**: Runners are used for one job only and terminated after completion
- **Multi-Architecture Support**: Configures both x64 and ARM64 runners
- **Prebuilt AMIs**: Uses custom prebuilt AMIs for faster startup times
- **DynamoDB Storage**: Uses DynamoDB instead of Parameter Store to avoid rate limiting issues
- **Cleanup for Offline Runners**: Includes a Lambda function to clean up registered offline runners from the organization

## Usage

Steps for the full setup, such as creating a GitHub App, can be found in the [docs](https://github-aws-runners.github.io/terraform-aws-github-runner/getting-started/). First, download the Lambda releases from GitHub. Alternatively, you can build the lambdas locally with Node or Docker; there is a simple build script in `<root>/.ci/build.sh`. In that case you can remove the location of the lambda zip files from `main.tf`, because the default location will work.

> The default example assumes locally built lambdas are available. Ensure you have built the lambdas. Alternatively, you can download them. The version must be set to a GitHub release version, see https://github.com/github-aws-runners/terraform-aws-github-runner/releases

```bash
cd ../lambdas-download
terraform init
terraform apply -var=module_version=<VERSION>
cd -
```
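

### Packer Images

You will need to build your images for both x64 and ARM64 architectures. This example deployment uses the images in `/images/linux-al2023`. You must build these images with Packer in your AWS account first. Once you have built them, you need to provide your owner ID as a variable. Note that the runner configurations in this example select AMIs owned by `self` whose names match `github-runner-ubuntu-jammy-amd64-*` and `github-runner-ubuntu-jammy-arm64-*`, so the images you build must satisfy these filters.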
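
Image selection is driven by the runner configuration files under `templates/runner-configs`. The excerpt below is adapted from the ARM64 configuration in this example; the only changed value is the owner, where the account ID `123456789012` is a placeholder. It assumes the module passes `ami_owners` through to the AMI lookup unchanged, so treat it as a sketch rather than a verified setting.

```yaml
runner_config:
  ami_owners:
    - "123456789012" # placeholder account ID; the example itself uses "self"
  ami_filter:
    name:
      - github-runner-ubuntu-jammy-arm64-* # the x64 config uses github-runner-ubuntu-jammy-amd64-*
    state:
      - available
```

With `"self"`, only images owned by the deploying account are considered, so an explicit account ID should only be needed when the images are built in a different account.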

### Deploy

Before running Terraform, ensure the GitHub App is configured. See the [configuration details](https://github-aws-runners.github.io/terraform-aws-github-runner/configuration/#ephemeral-runners) for more information.

```bash
terraform init
terraform apply
```


The module will try to update the GitHub App webhook and secret (Linux/macOS only). You can retrieve the webhook secret by running:

```bash
terraform output webhook_secret
```


## Features

### Ephemeral Runners

Ephemeral runners are used for one job only. Each job requires a fresh instance. This feature should be used in combination with the `workflow_job` event. See the GitHub webhook endpoint configuration in the documentation.

### Multi-Architecture Support

This example configures both x64 and ARM64 runners with appropriate labels. The module selects the runner for a workflow job by matching the labels requested by the job against the labels defined in each runner configuration.
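
A workflow targets one of these runner configurations through its `runs-on` labels. The sketch below is a hypothetical workflow (the file name, workflow name, and steps are illustrative and not part of this example) that fans a job out over both architectures. It requests the same label sets as the `labelMatchers` entries in this example's runner configs, which is the safe choice when `exactMatch: true` is set.

```yaml
# .github/workflows/multiarch.yml (hypothetical file name)
name: multiarch-build
on: push

jobs:
  build:
    strategy:
      matrix:
        arch: [x64, arm64] # one job per runner configuration
    runs-on:
      - self-hosted
      - linux
      - ${{ matrix.arch }}
      - ephemeral
    steps:
      - uses: actions/checkout@v4
      # Reports x86_64 on the x64 runners and aarch64 on the ARM64 runners
      - run: uname -m
```

Each matrix entry is a separate job, so with ephemeral runners each one is expected to run on its own fresh instance.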

### DynamoDB Storage

This example uses DynamoDB instead of Parameter Store to store runner configuration (such as the JIT config) and state. This avoids the rate limiting that Parameter Store can run into when many runners are requested at once, for example by a workflow with a large job matrix.

### Cleanup for Offline Runners

The example includes a Lambda function that periodically checks for and removes registered offline runners from the organization. This is particularly useful for handling cases where spot instances are terminated by AWS while still running a job.

<!-- BEGIN_TF_DOCS -->
## Requirements

| Name | Version |
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.3.0 |
| <a name="requirement_aws"></a> [aws](#requirement\_aws) | ~> 5.27 |
| <a name="requirement_local"></a> [local](#requirement\_local) | ~> 2.0 |
| <a name="requirement_random"></a> [random](#requirement\_random) | ~> 3.0 |

## Providers

| Name | Version |
|------|---------|
| <a name="provider_random"></a> [random](#provider\_random) | 3.6.3 |

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_base"></a> [base](#module\_base) | ../base | n/a |
| <a name="module_runners"></a> [runners](#module\_runners) | ../../modules/multi-runner | n/a |
| <a name="module_webhook_github_app"></a> [webhook\_github\_app](#module\_webhook\_github\_app) | ../../modules/webhook-github-app | n/a |

## Resources

| Name | Type |
|------|------|
| [random_id.random](https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/id) | resource |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_aws_region"></a> [aws\_region](#input\_aws\_region) | AWS region to deploy to | `string` | `"eu-west-1"` | no |
| <a name="input_environment"></a> [environment](#input\_environment) | Environment name, used as prefix | `string` | `null` | no |
| <a name="input_github_app"></a> [github\_app](#input\_github\_app) | GitHub for API usages. | <pre>object({<br/> id = string<br/> key_base64 = string<br/> })</pre> | n/a | yes |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_webhook_endpoint"></a> [webhook\_endpoint](#output\_webhook\_endpoint) | n/a |
| <a name="output_webhook_secret"></a> [webhook\_secret](#output\_webhook\_secret) | n/a |
<!-- END_TF_DOCS -->
@@ -0,0 +1,103 @@
locals {
  webhook_secret = random_id.random.hex

  multi_runner_config = { for c in fileset("${path.module}/templates/runner-configs", "*.yaml") : trimsuffix(c, ".yaml") => yamldecode(file("${path.module}/templates/runner-configs/${c}")) }
}

resource "random_id" "random" {
  byte_length = 20
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "${var.environment}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["${var.aws_region}a", "${var.aws_region}b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_dns_hostnames    = true
  enable_nat_gateway      = false
  map_public_ip_on_launch = true

  tags = {
    Environment = var.environment
  }
}

module "dynamodb" {
  source = "../../modules/dynamodb"

  table_name   = "${var.environment}-runner-config"
  billing_mode = "PAY_PER_REQUEST"
  tags = {
    Environment = var.environment
  }
}

module "runners" {
  source                            = "../../modules/multi-runner"
  aws_region                        = var.aws_region
  multi_runner_config               = local.multi_runner_config
  vpc_id                            = module.vpc.vpc_id
  subnet_ids                        = module.vpc.public_subnets
  runners_scale_up_lambda_timeout   = 60
  runners_scale_down_lambda_timeout = 60
  cleanup_org_runners               = var.cleanup_org_runners
  prefix                            = var.environment
  dynamodb_arn                      = module.dynamodb.table_arn
  dynamodb_table_name               = module.dynamodb.table_name
  tags = {
    Environment = var.environment
  }
  github_app = {
    key_base64     = var.github_app.key_base64
    id             = var.github_app.id
    webhook_secret = random_id.random.hex
  }

  logging_retention_in_days = 7

  # Deploy the webhook using EventBridge
  eventbridge = {
    enable = true
    # Adjust the accepted events to allow only specific events, such as workflow_job
    accept_events = ["workflow_job"]
  }

  webhook_lambda_zip = "../../lambda_output/webhook.zip"
  runners_lambda_zip = "../../lambda_output/runners.zip"

  instance_termination_watcher = {
    enable = true
  }

  runners_ssm_housekeeper = {
    state  = "DISABLED"
    config = {}
  }

  metrics = {
    enable = true
    metric = {
      enable_github_app_rate_limit    = true
      enable_job_retry                = true
      enable_spot_termination_warning = true
    }
  }
}

module "webhook_github_app" {
  source     = "../../modules/webhook-github-app"
  depends_on = [module.runners]

  github_app = {
    key_base64     = var.github_app.key_base64
    id             = var.github_app.id
    webhook_secret = local.webhook_secret
  }
  webhook_endpoint = module.runners.webhook.endpoint
}
@@ -0,0 +1,8 @@
output "webhook_endpoint" {
  value = module.runners.webhook.endpoint
}

output "webhook_secret" {
  sensitive = true
  value     = random_id.random.hex
}
@@ -0,0 +1,9 @@
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = var.environment
    }
  }
}
@@ -0,0 +1,64 @@
matcherConfig:
  exactMatch: true
  labelMatchers:
    - [self-hosted, linux, x64, ephemeral]
fifo: true
redrive_build_queue:
  enabled: true
  maxReceiveCount: 3
runner_config:
  runner_os: linux
  runner_architecture: x64
  runner_run_as: ubuntu
  runner_name_prefix: ubuntu-2204-amd64_
  enable_ssm_on_runners: true
  credit_specification: standard
  instance_types:
    - m7a.large
    - m7i.large
    - m7i-flex.large
    - m6a.large
    - m6i.large
  runners_maximum_count: 256
  delay_webhook_event: 0
  scale_down_schedule_expression: cron(* * * * ? *)
  userdata_template: ./templates/user-data.sh
  enable_userdata: true
  ami_owners:
    - "self"
  ami_filter:
    name:
      - github-runner-ubuntu-jammy-amd64-*
    state:
      - available
  enable_organization_runners: true
  enable_ephemeral_runners: true
  enable_job_queued_check: true
  minimum_running_time_in_minutes: 2
  enable_runner_binaries_syncer: false
  create_service_linked_role_spot: true
  scale_up_reserved_concurrent_executions: 12
  lambda_architecture: arm64
  job_retry:
    enabled: true
    max_attempts: 3
    delay_in_seconds: 180
  block_device_mappings:
    - device_name: /dev/xvda
      delete_on_termination: true
      volume_type: gp3
      volume_size: 30
      encrypted: true
      iops: null
      throughput: null
      kms_key_id: null
      snapshot_id: null
  runner_log_files:
    - log_group_name: syslog
      prefix_log_group: true
      file_path: /var/log/syslog
      log_stream_name: "{instance_id}"
  runner_hook_job_started: |
    echo "Running pre job hook as $(whoami)"
  runner_hook_job_completed: |
    echo "Running post job hook as $(whoami)"
@@ -0,0 +1,62 @@
matcherConfig:
  exactMatch: true
  labelMatchers:
    - [self-hosted, linux, arm64, ephemeral]
fifo: true
redrive_build_queue:
  enabled: true
  maxReceiveCount: 3
runner_config:
  runner_os: linux
  runner_architecture: arm64
  runner_run_as: ubuntu
  runner_name_prefix: ubuntu-2204-arm64_
  enable_ssm_on_runners: true
  credit_specification: standard
  instance_types:
    - m8g.large
    - m7g.large
    - m6g.large
  runners_maximum_count: 256
  delay_webhook_event: 0
  scale_down_schedule_expression: cron(* * * * ? *)
  userdata_template: ./templates/user-data.sh
  enable_userdata: true
  ami_owners:
    - "self"
  ami_filter:
    name:
      - github-runner-ubuntu-jammy-arm64-*
    state:
      - available
  enable_organization_runners: true
  enable_ephemeral_runners: true
  enable_job_queued_check: true
  minimum_running_time_in_minutes: 2
  enable_runner_binaries_syncer: false
  create_service_linked_role_spot: true
  scale_up_reserved_concurrent_executions: 12
  lambda_architecture: arm64
  job_retry:
    enabled: true
    max_attempts: 3
    delay_in_seconds: 180
  block_device_mappings:
    - device_name: /dev/xvda
      delete_on_termination: true
      volume_type: gp3
      volume_size: 30
      encrypted: true
      iops: null
      throughput: null
      kms_key_id: null
      snapshot_id: null
  runner_log_files:
    - log_group_name: syslog
      prefix_log_group: true
      file_path: /var/log/syslog
      log_stream_name: "{instance_id}"
  runner_hook_job_started: |
    echo "Running pre job hook as $(whoami)"
  runner_hook_job_completed: |
    echo "Running post job hook as $(whoami)"
@@ -0,0 +1,31 @@
#!/bin/bash
exec > >(tee /var/log/user-data.log | logger -t user-data -s 2>/dev/console) 2>&1


# AWS suggests writing the user-data output to a log for debugging, see https://aws.amazon.com/premiumsupport/knowledge-center/ec2-linux-log-user-data/
# As a side effect, all command output is logged; set +x explicitly disables command tracing.
#
# An alternative for masking tokens could be: exec > >(sed 's/--token\ [^ ]* /--token\ *** /g' > /var/log/user-data.log) 2>&1
set +x

%{ if enable_debug_logging }
set -x
%{ endif }

cd /opt/actions-runner

%{ if hook_job_started != "" }
cat > /opt/actions-runner/hook_job_started.sh <<'EOF'
${hook_job_started}
EOF
echo ACTIONS_RUNNER_HOOK_JOB_STARTED=/opt/actions-runner/hook_job_started.sh | tee -a /opt/actions-runner/.env
%{ endif }

%{ if hook_job_completed != "" }
cat > /opt/actions-runner/hook_job_completed.sh <<'EOF'
${hook_job_completed}
EOF
echo ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/opt/actions-runner/hook_job_completed.sh | tee -a /opt/actions-runner/.env
%{ endif }

${start_runner}
