Skip to content

feat: adjust pool dynamically based on demand #3855

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -166,11 +166,11 @@ Join our discord community via [this invite link](https://discord.gg/bxgXW8jJGh)
| <a name="input_matcher_config_parameter_store_tier"></a> [matcher\_config\_parameter\_store\_tier](#input\_matcher\_config\_parameter\_store\_tier) | The tier of the parameter store for the matcher configuration. Valid values are `Standard`, and `Advanced`. | `string` | `"Standard"` | no |
| <a name="input_metrics"></a> [metrics](#input\_metrics) | Configuration for metrics created by the module, by default disabled to avoid additional costs. When metrics are enable all metrics are created unless explicit configured otherwise. | <pre>object({<br/> enable = optional(bool, false)<br/> namespace = optional(string, "GitHub Runners")<br/> metric = optional(object({<br/> enable_github_app_rate_limit = optional(bool, true)<br/> enable_job_retry = optional(bool, true)<br/> enable_spot_termination_warning = optional(bool, true)<br/> }), {})<br/> })</pre> | `{}` | no |
| <a name="input_minimum_running_time_in_minutes"></a> [minimum\_running\_time\_in\_minutes](#input\_minimum\_running\_time\_in\_minutes) | The time an ec2 action runner should be running at minimum before terminated, if not busy. | `number` | `null` | no |
| <a name="input_pool_config"></a> [pool\_config](#input\_pool\_config) | The configuration for updating the pool. The `pool_size` to adjust to by the events triggered by the `schedule_expression`. For example you can configure a cron expression for weekdays to adjust the pool to 10 and another expression for the weekend to adjust the pool to 1. Use `schedule_expression_timezone` to override the schedule time zone (defaults to UTC). | <pre>list(object({<br/> schedule_expression = string<br/> schedule_expression_timezone = optional(string)<br/> size = number<br/> }))</pre> | `[]` | no |
| <a name="input_pool_config"></a> [pool\_config](#input\_pool\_config) | The configuration for updating the pool. The `pool_size` to adjust to by the events triggered by the `schedule_expression`. For example you can configure a cron expression for weekdays to adjust the pool to 10 and another expression for the weekend to adjust the pool to 1. Use `schedule_expression_timezone` to override the schedule time zone (defaults to UTC). Experimental! Use `dynamic_pool_scaling_enabled` to enable scaling the pool dynamically, up to the `pool_size`, based on the number of queued jobs (defaults to false). | <pre>list(object({<br/> dynamic_pool_scaling_enabled = optional(bool, false)<br/> schedule_expression = string<br/> schedule_expression_timezone = optional(string)<br/> size = number<br/> }))</pre> | `[]` | no |
| <a name="input_pool_lambda_memory_size"></a> [pool\_lambda\_memory\_size](#input\_pool\_lambda\_memory\_size) | Memory size limit for scale-up lambda. | `number` | `512` | no |
| <a name="input_pool_lambda_reserved_concurrent_executions"></a> [pool\_lambda\_reserved\_concurrent\_executions](#input\_pool\_lambda\_reserved\_concurrent\_executions) | Amount of reserved concurrent executions for the scale-up lambda function. A value of 0 disables lambda from being triggered and -1 removes any concurrency limitations. | `number` | `1` | no |
| <a name="input_pool_lambda_timeout"></a> [pool\_lambda\_timeout](#input\_pool\_lambda\_timeout) | Time out for the pool lambda in seconds. | `number` | `60` | no |
| <a name="input_pool_runner_owner"></a> [pool\_runner\_owner](#input\_pool\_runner\_owner) | The pool will deploy runners to the GitHub org ID, set this value to the org to which you want the runners deployed. Repo level is not supported. | `string` | `null` | no |
| <a name="input_pool_runner_owners"></a> [pool\_runner\_owners](#input\_pool\_runner\_owners) | The pool will deploy runners to the GitHub org/repo ID(s), set this value to the org/repo(s) to which you want the runners deployed. Separate the entries by a comma. | `string` | `null` | no |
| <a name="input_prefix"></a> [prefix](#input\_prefix) | The prefix used for naming resources | `string` | `"github-actions"` | no |
| <a name="input_queue_encryption"></a> [queue\_encryption](#input\_queue\_encryption) | Configure how data on queues managed by the modules in ecrypted at REST. Options are encryped via SSE, non encrypted and via KMSS. By default encryptes via SSE is enabled. See for more details the Terraform `aws_sqs_queue` resource https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/sqs_queue. | <pre>object({<br/> kms_data_key_reuse_period_seconds = number<br/> kms_master_key_id = string<br/> sqs_managed_sse_enabled = bool<br/> })</pre> | <pre>{<br/> "kms_data_key_reuse_period_seconds": null,<br/> "kms_master_key_id": null,<br/> "sqs_managed_sse_enabled": true<br/>}</pre> | no |
| <a name="input_redrive_build_queue"></a> [redrive\_build\_queue](#input\_redrive\_build\_queue) | Set options to attach (optional) a dead letter queue to the build queue, the queue between the webhook and the scale up lambda. You have the following options. 1. Disable by setting `enabled` to false. 2. Enable by setting `enabled` to `true`, `maxReceiveCount` to a number of max retries. | <pre>object({<br/> enabled = bool<br/> maxReceiveCount = number<br/> })</pre> | <pre>{<br/> "enabled": false,<br/> "maxReceiveCount": null<br/>}</pre> | no |
Expand Down
13 changes: 11 additions & 2 deletions docs/configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,16 +62,17 @@ module "runners" {

## Pool

The module supports two options for keeping a pool of runners. One is via a pool which only supports org-level runners, the second option is [keeping runners idle](#idle-runners).
The module supports two options for keeping a pool of runners. One is via a pool, the second option is [keeping runners idle](#idle-runners).

The pool is introduced in combination with the ephemeral runners and is primarily meant to ensure if any event is unexpectedly dropped and no runner was created, the pool can pick up the job. The pool is maintained by a lambda. Each time the lambda is triggered a check is performed to ensure the number of idle runners managed by the module matches the expected pool size. If not, the pool will be adjusted. Keep in mind that the scale down function is still active and will terminate instances that are detected as idle.

```hcl
pool_runner_owner = "my-org" # Org to which the runners are added
pool_runner_owners = "my-org" # Org to which the runners are added
pool_config = [{
size = 20 # size of the pool
schedule_expression = "cron(* * * * ? *)" # cron expression to trigger the adjustment of the pool
schedule_expression_timezone = "Australia/Sydney" # optional time zone (defaults to UTC)
dynamic_pool_scaling_enabled = false # EXPERIMENTAL: if optionaly enabled, the pool will be scaled dynamically, up to the pool size, based on the number of queued jobs (defaults to false)
}]
```

Expand Down Expand Up @@ -334,3 +335,11 @@ resource "aws_iam_role_policy" "event_rule_firehose_role" {


NOTE: By default, a runner AMI update requires a re-apply of this terraform config (the runner AMI ID is looked up by a terraform data source). To avoid this, you can use `ami_id_ssm_parameter_name` to have the scale-up lambda dynamically lookup the runner AMI ID from an SSM parameter at instance launch time. Said SSM parameter is managed outside of this module (e.g. by a runner AMI build workflow).

### Dynamic Pool Scaling

This feature allows the pool to grow dynamically based on the number of queued jobs. It can be enabled by setting the `pool_config.dynamic_pool_scaling_enabled` to `true`.

If the feature is enabled, the expected pool size will be calculated based on the number of queued jobs. The effective size of the pool will be set to the minimum of the number of queued jobs and the configured pool size.

This feature is disabled by default because the retrieval of queued jobs may exhause the GitHub API for larger deployments and cause rate limits. For larger deployments with a lot of frequent jobs having a permanent small pool available could be a better choice.
2 changes: 1 addition & 1 deletion examples/ephemeral/main.tf
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ module "runners" {
enable_ephemeral_runners = true

# # Example of simple pool usages
# pool_runner_owner = "YOUR_ORG"
# pool_runner_owners = "YOUR_ORG"
# pool_config = [{
# size = 3
# schedule_expression = "cron(0/3 14 * * ? *)" # every 3 minutes between 14:00 and 15:00
Expand Down
4 changes: 2 additions & 2 deletions lambdas/functions/control-plane/src/lambda.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -152,15 +152,15 @@ describe('Adjust pool.', () => {
resolve();
});
});
await expect(adjustPool({ poolSize: 2 }, context)).resolves.not.toThrow();
await expect(adjustPool({ poolSize: 2, dynamicPoolScalingEnabled: false }, context)).resolves.not.toThrow();
});

it('Handle error for adjusting pool.', async () => {
const mock = mocked(adjust);
const error = new Error('Handle error for adjusting pool.');
mock.mockRejectedValue(error);
const logSpy = jest.spyOn(logger, 'error');
await adjustPool({ poolSize: 0 }, context);
await adjustPool({ poolSize: 0, dynamicPoolScalingEnabled: false }, context);
expect(logSpy).lastCalledWith(expect.stringContaining(error.message), expect.anything());
});
});
Expand Down
2 changes: 1 addition & 1 deletion lambdas/functions/control-plane/src/local-pool.ts
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
import { adjust } from './pool/pool';

export function run(): void {
adjust({ poolSize: 1 })
adjust({ poolSize: 1, dynamicPoolScalingEnabled: false })
.then()
.catch((e) => {
console.log(e);
Expand Down
2 changes: 1 addition & 1 deletion lambdas/functions/control-plane/src/modules.d.ts
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ declare namespace NodeJS {
PARAMETER_GITHUB_APP_CLIENT_SECRET_NAME: string;
PARAMETER_GITHUB_APP_ID_NAME: string;
PARAMETER_GITHUB_APP_KEY_BASE64_NAME: string;
RUNNER_OWNER: string;
RUNNER_OWNERS: string;
SCALE_DOWN_CONFIG: string;
SSM_TOKEN_PATH: string;
SSM_CLEANUP_CONFIG: string;
Expand Down
Loading
Loading