[Issue #2600] scale up, memory scaling, operational docs (#2720)
## Summary

Fixes #2600

### Time to review: __5 mins__

## Changes proposed

- Adds scaling docs, half in `OPERATIONS.md` and half as comments inside the
Terraform files
- Scales [ API, frontend ] x [ dev, staging, prod ] up to calculated values
- Adds memory scaling in addition to the existing CPU scaling
- Makes scale out more aggressive (300 seconds => 60 seconds)

## Context for reviewers

This should be enough for a week or two. I'll re-evaluate prod weekly as
we send out emails advertising the new system.
coilysiren authored Nov 12, 2024
1 parent 3a1903a commit 306d118
Showing 15 changed files with 248 additions and 49 deletions.
51 changes: 51 additions & 0 deletions OPERATIONS.md
@@ -0,0 +1,51 @@
# Maintenance and Operation of the Runtime System

## Scaling

All scaling options can be found in the following files:

API:

- [infra/api/app-config/dev.tf](infra/api/app-config/dev.tf)
- [infra/api/app-config/staging.tf](infra/api/app-config/staging.tf)
- [infra/api/app-config/prod.tf](infra/api/app-config/prod.tf)

Frontend:

- [infra/frontend/app-config/dev.tf](infra/frontend/app-config/dev.tf)
- [infra/frontend/app-config/staging.tf](infra/frontend/app-config/staging.tf)
- [infra/frontend/app-config/prod.tf](infra/frontend/app-config/prod.tf)
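
As a rough illustration, the scaling settings inside one of these files look something like the fragment below, which uses the prod API values from this change. The `source` path is an assumption, and the other arguments the module requires are omitted, so treat this as a sketch rather than a complete configuration.

```hcl
module "prod_config" {
  # Assumed relative path to the shared env-config module.
  source = "./env-config"

  # Other required arguments (environment, has_database, ...) are omitted here.

  # ECS service scaling
  instance_desired_instance_count = 2
  instance_scaling_min_capacity   = 2
  instance_scaling_max_capacity   = 10

  # Aurora Serverless v2 scaling
  database_min_capacity   = 20
  database_max_capacity   = 128
  database_instance_count = 2

  # OpenSearch scaling
  search_master_instance_type    = "m6g.large.search"
  search_data_instance_type      = "or1.medium.search"
  search_data_volume_size        = 20
  search_data_instance_count     = 3
  search_availability_zone_count = 3
}
```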

### ECS

Scaling is handled by configuring the following values:

- instance desired instance count
- instance scaling minimum capacity
- instance scaling maximum capacity

Our ECS instances auto scale based on both memory and CPU. You can view the autoscaling configuration
here: [infra/modules/service/autoscaling.tf](infra/modules/service/autoscaling.tf)
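
For orientation, autoscaling an ECS service on both CPU and memory can be sketched with two target tracking policies attached to one scaling target. This is not a copy of `autoscaling.tf`: the variable names, the 75% targets, and the scale-in cooldown are assumptions, and treating the 60-second value from this change as the scale-out cooldown is also an assumption.

```hcl
# Hypothetical inputs; the real module may name these differently.
variable "cluster_name" { type = string }
variable "service_name" { type = string }
variable "min_capacity" { type = number }
variable "max_capacity" { type = number }

# Registers the ECS service's desired count as something that can be scaled.
resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.cluster_name}/${var.service_name}"
  min_capacity       = var.min_capacity
  max_capacity       = var.max_capacity
}

# Scales on average CPU utilization across the service's tasks.
resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.service_name}-cpu"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  resource_id        = aws_appautoscaling_target.ecs.resource_id

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 75  # assumed target, in percent
    scale_in_cooldown  = 300 # assumed, in seconds
    scale_out_cooldown = 60  # seconds
  }
}

# Scales on average memory utilization across the service's tasks.
resource "aws_appautoscaling_policy" "memory" {
  name               = "${var.service_name}-memory"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  resource_id        = aws_appautoscaling_target.ecs.resource_id

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
    target_value       = 75  # assumed target, in percent
    scale_in_cooldown  = 300 # assumed, in seconds
    scale_out_cooldown = 60  # seconds
  }
}
```

With more than one target tracking policy on the same target, the service scales out when any policy calls for it and scales in only when all of them agree.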

### Database

Scaling is handled by configuring the following values:

- Database minimum capacity
- Database maximum capacity
- Database instance count

In prod, the database maximum capacity is already set as high as it goes. Scaling past that point will require scaling
out the instance count. Using the additional instances effectively might require changes to our application layer.
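
A minimal sketch of how these three values typically map onto an Aurora Serverless v2 cluster, using the prod numbers from this change. The resource names, engine, and credentials settings are assumptions for illustration, not the project's actual database module.

```hcl
resource "aws_rds_cluster" "db" {
  cluster_identifier          = "example-db"        # hypothetical name
  engine                      = "aurora-postgresql" # assumed engine
  engine_mode                 = "provisioned"       # Serverless v2 uses the provisioned engine mode
  master_username             = "example_admin"     # hypothetical
  manage_master_user_password = true

  # Database minimum and maximum capacity, in Aurora capacity units (ACUs).
  serverlessv2_scaling_configuration {
    min_capacity = 20
    max_capacity = 128
  }
}

# Database instance count: one resource per instance in the cluster.
resource "aws_rds_cluster_instance" "db" {
  count              = 2
  cluster_identifier = aws_rds_cluster.db.id
  engine             = aws_rds_cluster.db.engine
  engine_version     = aws_rds_cluster.db.engine_version
  instance_class     = "db.serverless"
}
```

Instances beyond the writer act as readers, so raising the count only helps if the application sends some of its read traffic to them.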

### OpenSearch

Scaling is handled by configuring the following values:

- Search master instance type
- Search data instance type
- Search data volume size
- Search data instance count
- Search availability zone count

When scaling OpenSearch, consider which attribute changes will trigger blue/green deploys and which attributes
can be edited in place. [You can find that information here](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-configuration-changes.html). Requiring blue/green deploys for the average configuration change is a
notable constraint of OpenSearch, relative to ECS and the Database.
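
The sketch below shows roughly where each of these values lands on an OpenSearch domain, using the prod values from this change. The domain name and dedicated master count are assumptions, and the fragment omits the networking, encryption, and access policy settings a real domain needs. The `precondition` is an illustrative addition that enforces the guidance in the Terraform comments that the data instance count should be a multiple of the availability zone count.

```hcl
variable "search_data_instance_count" {
  type    = number
  default = 3
}

variable "search_availability_zone_count" {
  type    = number
  default = 3
}

resource "aws_opensearch_domain" "search" {
  domain_name    = "example-search" # hypothetical name
  engine_version = "OpenSearch_2.15"

  cluster_config {
    # Search data instance type and count
    instance_type  = "or1.medium.search"
    instance_count = var.search_data_instance_count

    # Search master instance type
    dedicated_master_enabled = true
    dedicated_master_type    = "m6g.large.search"
    dedicated_master_count   = 3 # assumed

    # Search availability zone count
    zone_awareness_enabled = true
    zone_awareness_config {
      availability_zone_count = var.search_availability_zone_count
    }
  }

  # Search data volume size, in GiB per data node
  ebs_options {
    ebs_enabled = true
    volume_size = 20
  }

  lifecycle {
    # Fails the plan early if data nodes cannot be spread evenly across zones.
    precondition {
      condition     = var.search_data_instance_count % var.search_availability_zone_count == 0
      error_message = "search_data_instance_count must be a multiple of search_availability_zone_count."
    }
  }
}
```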
21 changes: 18 additions & 3 deletions infra/api/app-config/dev.tf
@@ -4,11 +4,26 @@ module "dev_config" {
default_region = module.project_config.default_region
environment = "dev"
has_database = local.has_database
database_instance_count = 2
database_enable_http_endpoint = true
has_incident_management_service = local.has_incident_management_service
database_max_capacity = 16
database_min_capacity = 2

# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
# https://us-east-1.console.aws.amazon.com/ecs/v2/clusters/api-dev/services/api-dev/health?region=us-east-1
# instance_desired_instance_count and instance_scaling_min_capacity are scaled for the average CPU and Memory
# seen over 12 months, as of November 2024, excluding an outlier range around February 2024.
# The minimum is 2, so that CPU doesn't spike to infinity on deploys.
instance_desired_instance_count = 2
instance_scaling_min_capacity = 2
# instance_scaling_max_capacity is 5x the instance_scaling_min_capacity
instance_scaling_max_capacity = 10

# https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.setting-capacity.html
# https://us-east-1.console.aws.amazon.com/rds/home?region=us-east-1#database:id=api-dev;is-cluster=true;tab=monitoring
# database_min_capacity is average api-dev ServerlessDatabaseCapacity seen over 12 months, as of November 2024
database_min_capacity = 2
# database_max_capacity is 5x the database_min_capacity
database_max_capacity = 10
database_instance_count = 2

has_search = true
# https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html#choosing-version
16 changes: 10 additions & 6 deletions infra/api/app-config/env-config/outputs.tf
@@ -1,10 +1,11 @@
output "search_config" {
value = var.has_search ? {
instance_type = var.search_data_instance_type
instance_count = var.search_data_instance_count
dedicated_master_type = var.search_master_instance_type
engine_version = var.search_engine_version
volume_size = var.search_data_volume_size
instance_type = var.search_data_instance_type
instance_count = var.search_data_instance_count
dedicated_master_type = var.search_master_instance_type
engine_version = var.search_engine_version
volume_size = var.search_data_volume_size
search_availability_zone_count = var.search_availability_zone_count
} : null
}

@@ -27,7 +28,10 @@ output "database_config" {

output "service_config" {
value = {
region = var.default_region
region = var.default_region
instance_desired_instance_count = var.instance_desired_instance_count
instance_scaling_max_capacity = var.instance_scaling_max_capacity
instance_scaling_min_capacity = var.instance_scaling_min_capacity
extra_environment_variables = merge(
local.default_extra_environment_variables,
var.service_override_extra_environment_variables
21 changes: 21 additions & 0 deletions infra/api/app-config/env-config/variables.tf
@@ -41,6 +41,11 @@ variable "search_data_volume_size" {
default = 20
}

variable "search_availability_zone_count" {
type = number
default = 1
}

variable "has_database" {
type = bool
}
@@ -67,6 +72,22 @@ variable "database_min_capacity" {
type = number
}

variable "instance_desired_instance_count" {
description = "Number of desired ECS container instances for the service"
type = number
default = 1
}

variable "instance_scaling_max_capacity" {
description = "Maximum number of ECS container instances for the service"
type = number
}

variable "instance_scaling_min_capacity" {
description = "Minimum number of ECS container instances for the service"
type = number
}

variable "has_incident_management_service" {
type = bool
}
39 changes: 29 additions & 10 deletions infra/api/app-config/prod.tf
@@ -5,22 +5,41 @@ module "prod_config" {
environment = "prod"
has_database = local.has_database
domain = "api.simpler.grants.gov"
database_instance_count = 2
database_enable_http_endpoint = true
has_incident_management_service = local.has_incident_management_service
database_max_capacity = 32
database_min_capacity = 2

# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
# https://us-east-1.console.aws.amazon.com/ecs/v2/clusters/api-prod/services/api-prod/health?region=us-east-1
# instance_desired_instance_count and instance_scaling_min_capacity are scaled for 5x the average CPU and Memory
# seen over 12 months, as of November 2024, excluding an outlier range around February 2024.
# The math is: 5 * max(average CPU or average Memory) * 1.3. The 1.3 is for a buffer.
instance_desired_instance_count = 2
instance_scaling_min_capacity = 2
# instance_scaling_max_capacity is 5x the instance_scaling_min_capacity
instance_scaling_max_capacity = 10

# https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.setting-capacity.html
# https://us-east-1.console.aws.amazon.com/rds/home?region=us-east-1#database:id=api-prod;is-cluster=true;tab=monitoring
# database_min_capacity is 5x the average api-prod ServerlessDatabaseCapacity seen over 12 months, as of November 2024
# The math is: 5 * (ServerlessDatabaseCapacity) * 1.3. The 1.3 is for a buffer.
database_min_capacity = 20
# max capacity is as high as it goes
database_max_capacity = 128
database_instance_count = 2

has_search = true
# https://aws.amazon.com/opensearch-service/pricing/
# Pricing: https://aws.amazon.com/opensearch-service/pricing/
search_master_instance_type = "m6g.large.search"
# 20 is the minimum volume size for the or1.medium.search instance type.
# Scale the `search_data_volume_size` number to meet your storage needs.
search_data_instance_type = "or1.medium.search"
search_data_volume_size = 20
search_data_instance_count = 3
# Scale this number to meet your compute needs.
# https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html#choosing-version
# Scale the search_data_volume_size number to meet your storage needs.
# Scale the search_data_instance_count number to meet your compute needs.
# The search_data_instance_count should be a multiple of the number of availability zones.
# Use the AWS Console to determine the number of availability zones in your region.
search_data_instance_type = "or1.medium.search"
search_data_volume_size = 20
search_data_instance_count = 3
search_availability_zone_count = 3
# Versions: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html#choosing-version
search_engine_version = "OpenSearch_2.15"

service_override_extra_environment_variables = {
21 changes: 18 additions & 3 deletions infra/api/app-config/staging.tf
@@ -4,11 +4,26 @@ module "staging_config" {
default_region = module.project_config.default_region
environment = "staging"
has_database = local.has_database
database_instance_count = 2
database_enable_http_endpoint = true
has_incident_management_service = local.has_incident_management_service
database_max_capacity = 16
database_min_capacity = 2

# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
# https://us-east-1.console.aws.amazon.com/ecs/v2/clusters/api-staging/services/api-staging/health?region=us-east-1
# instance_desired_instance_count and instance_scaling_min_capacity are scaled for the average CPU and Memory
# seen over 12 months, as of November 2024, excluding an outlier range around February 2024.
# The minimum is 2, so that CPU doesn't spike to infinity on deploys.
instance_desired_instance_count = 2
instance_scaling_min_capacity = 2
# instance_scaling_max_capacity is 5x the instance_scaling_min_capacity
instance_scaling_max_capacity = 10

# https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.setting-capacity.html
# https://us-east-1.console.aws.amazon.com/rds/home?region=us-east-1#database:id=api-staging;is-cluster=true;tab=monitoring
# database_min_capacity is average api-staging ServerlessDatabaseCapacity seen over 12 months, as of November 2024
database_min_capacity = 2
# database_max_capacity is 5x the database_min_capacity
database_max_capacity = 10
database_instance_count = 2

has_search = true
# https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html#choosing-version
2 changes: 1 addition & 1 deletion infra/api/database/search.tf
@@ -9,7 +9,7 @@ module "search" {
source = "../../modules/search"

service_name = local.service_name
availability_zone_count = 3
availability_zone_count = local.search_config.search_availability_zone_count
zone_awareness_enabled = var.environment_name == "prod" ? true : false
multi_az_with_standby_enabled = var.environment_name == "prod" ? true : false
dedicated_master_enabled = var.environment_name == "prod" ? true : false
24 changes: 14 additions & 10 deletions infra/api/service/main.tf
@@ -111,16 +111,20 @@ data "aws_ssm_parameter" "incident_management_service_integration_url" {
}

module "service" {
source = "../../modules/service"
service_name = local.service_name
is_temporary = local.is_temporary
image_repository_name = module.app_config.image_repository_name
image_tag = local.image_tag
vpc_id = data.aws_vpc.network.id
public_subnet_ids = data.aws_subnets.public.ids
private_subnet_ids = data.aws_subnets.private.ids
cpu = 1024
memory = 2048
source = "../../modules/service"
service_name = local.service_name
is_temporary = local.is_temporary
image_repository_name = module.app_config.image_repository_name
image_tag = local.image_tag
vpc_id = data.aws_vpc.network.id
public_subnet_ids = data.aws_subnets.public.ids
private_subnet_ids = data.aws_subnets.private.ids
desired_instance_count = local.service_config.instance_desired_instance_count
max_capacity = local.service_config.instance_scaling_max_capacity
min_capacity = local.service_config.instance_scaling_min_capacity
enable_autoscaling = true
cpu = 1024
memory = 2048

cert_arn = local.domain != null ? data.aws_acm_certificate.cert[0].arn : null

10 changes: 10 additions & 0 deletions infra/frontend/app-config/dev.tf
@@ -5,4 +5,14 @@ module "dev_config" {
environment = "dev"
has_database = local.has_database
has_incident_management_service = local.has_incident_management_service

# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
# https://us-east-1.console.aws.amazon.com/ecs/v2/clusters/frontend-dev/services/frontend-dev/health?region=us-east-1
# instance_desired_instance_count and instance_scaling_min_capacity are scaled for the average CPU and Memory
# seen over 12 months, as of November 2024, excluding an outlier range around February 2024.
# The minimum is 2, so that CPU doesn't spike to infinity on deploys.
instance_desired_instance_count = 2
instance_scaling_min_capacity = 2
# instance_scaling_max_capacity is 5x the instance_scaling_min_capacity
instance_scaling_max_capacity = 10
}
5 changes: 4 additions & 1 deletion infra/frontend/app-config/env-config/outputs.tf
@@ -13,7 +13,10 @@ output "database_config" {

output "service_config" {
value = {
region = var.default_region
region = var.default_region
instance_desired_instance_count = var.instance_desired_instance_count
instance_scaling_max_capacity = var.instance_scaling_max_capacity
instance_scaling_min_capacity = var.instance_scaling_min_capacity
extra_environment_variables = merge(
local.default_extra_environment_variables,
var.service_override_extra_environment_variables
15 changes: 15 additions & 0 deletions infra/frontend/app-config/env-config/variables.tf
@@ -20,6 +20,21 @@ variable "has_incident_management_service" {
type = bool
}

variable "instance_desired_instance_count" {
type = number
default = 1
}

variable "instance_scaling_min_capacity" {
type = number
default = 1
}

variable "instance_scaling_max_capacity" {
type = number
default = 5
}

variable "domain" {
description = "Public domain for the website, which is managed by HHS ITS."
type = string
10 changes: 10 additions & 0 deletions infra/frontend/app-config/prod.tf
@@ -6,4 +6,14 @@ module "prod_config" {
has_database = local.has_database
has_incident_management_service = local.has_incident_management_service
domain = "simpler.grants.gov"

# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
# https://us-east-1.console.aws.amazon.com/ecs/v2/clusters/frontend-prod/services/frontend-prod/health?region=us-east-1
# instance_desired_instance_count and instance_scaling_min_capacity are scaled for 5x the average CPU and Memory
# seen over 12 months, as of November 2024, excluding an outlier range around February 2024.
# The math is: 5 * max(average CPU or average Memory) * 1.3. The 1.3 is for a buffer.
instance_desired_instance_count = 4
instance_scaling_min_capacity = 4
# instance_scaling_max_capacity is 5x the instance_scaling_min_capacity
instance_scaling_max_capacity = 20
}
10 changes: 10 additions & 0 deletions infra/frontend/app-config/staging.tf
@@ -5,4 +5,14 @@ module "staging_config" {
environment = "staging"
has_database = local.has_database
has_incident_management_service = local.has_incident_management_service

# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
# https://us-east-1.console.aws.amazon.com/ecs/v2/clusters/frontend-staging/services/frontend-staging/health?region=us-east-1
# instance_desired_instance_count and instance_scaling_min_capacity are scaled for the average CPU and Memory
# seen over 12 months, as of November 2024, excluding an outlier range around February 2024.
# The minimum is 2, so that CPU doesn't spike to infinity on deploys.
instance_desired_instance_count = 2
instance_scaling_min_capacity = 2
# instance_scaling_max_capacity is 5x the instance_scaling_min_capacity
instance_scaling_max_capacity = 10
}
27 changes: 16 additions & 11 deletions infra/frontend/service/main.tf
@@ -114,17 +114,22 @@ output "environment_name" {
value = var.environment_name
}
module "service" {
source = "../../modules/service"
service_name = local.service_name
is_temporary = local.is_temporary
image_repository_name = module.app_config.image_repository_name
image_tag = local.image_tag
vpc_id = data.aws_vpc.network.id
public_subnet_ids = data.aws_subnets.public.ids
private_subnet_ids = data.aws_subnets.private.ids
enable_autoscaling = module.app_config.enable_autoscaling
cert_arn = local.domain != null ? data.aws_acm_certificate.cert[0].arn : null
hostname = module.app_config.hostname
source = "../../modules/service"
service_name = local.service_name
is_temporary = local.is_temporary
image_repository_name = module.app_config.image_repository_name
image_tag = local.image_tag
vpc_id = data.aws_vpc.network.id
public_subnet_ids = data.aws_subnets.public.ids
private_subnet_ids = data.aws_subnets.private.ids
cert_arn = local.domain != null ? data.aws_acm_certificate.cert[0].arn : null
hostname = module.app_config.hostname
desired_instance_count = local.service_config.instance_desired_instance_count
max_capacity = local.service_config.instance_scaling_max_capacity
min_capacity = local.service_config.instance_scaling_min_capacity
enable_autoscaling = true
cpu = 256 // these are probably too small
memory = 512 // these are probably too small

app_access_policy_arn = null
migrator_access_policy_arn = null