feat: support EC2 DescribeInstanceStatus health checks in the interruption controller#8899

Draft
bwagner5 wants to merge 2 commits into aws:main from bwagner5:instance-health

Conversation

bwagner5 (Contributor) commented Jan 29, 2026

Fixes #8821

Description

Adds support for EC2 status checks (i.e., health checks) as an input to the Interruption Controller.
Before this PR, the interruption controller consumed only events delivered through EventBridge (AWS Health, Spot, and EC2 Instance State Changes). Those events cover only EC2 Scheduled Maintenance Events, Spot Interruptions, and EC2 Terminations/Stops.

EC2:DescribeInstanceStatus is an API that provides the result of EC2 health checks.
The currently vended health checks are:

  • Instance Status (reachability of the EC2 instance, checked via ARP to its primary ENI)
  • System Status (reachability to the underlying physical host the EC2 instance is running on)
  • EBS Volume Status (problems with attached EBS volumes)
  • Scheduled Maintenance Events

Instance Status, System Status, and EBS Volume Status are not available via EventBridge. This PR supports only the Instance Status and System Status checks for interruption handling.

Scheduled Maintenance Events are already consumed via EventBridge, so we are ignoring them when parsing DescribeInstanceStatus for now. We may want to revisit this later to remove the need for those events for the interruption queue.

Additionally, EBS Volume Status indicates only that at least one attached volume is unhealthy. That is not enough information for Karpenter to act on, since the unhealthy volume could be a pod's storage attached via a PVC. EBS volume checking should be completed in a follow-up PR that uses the EBS Status Checks API, which can narrow down which volume is unhealthy.
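
To make the scoping above concrete, here is a minimal Go sketch of the filtering rule: only Instance Status and System Status are treated as actionable, while EBS Volume Status and scheduled events are ignored. The `InstanceHealth` struct, the `ImpairedChecks` function, and the returned labels are hypothetical names chosen for illustration; they are not taken from this PR's code. The summary values are the ones EC2 reports for status checks.

```go
package main

import "fmt"

// SummaryStatus holds the summary value EC2 reports for a status check.
type SummaryStatus string

const (
	StatusOK               SummaryStatus = "ok"
	StatusImpaired         SummaryStatus = "impaired"
	StatusInsufficientData SummaryStatus = "insufficient-data"
	StatusInitializing     SummaryStatus = "initializing"
	StatusNotApplicable    SummaryStatus = "not-applicable"
)

// InstanceHealth is a hypothetical, trimmed-down view of one
// DescribeInstanceStatus entry (not the SDK type).
type InstanceHealth struct {
	InstanceID      string
	InstanceStatus  SummaryStatus
	SystemStatus    SummaryStatus
	AttachedEBS     SummaryStatus // ignored: does not say which volume is unhealthy
	ScheduledEvents []string      // ignored: already consumed via EventBridge
}

// ImpairedChecks returns labels for the checks this sketch treats as
// actionable. Only an "impaired" summary counts; "initializing" and
// "insufficient-data" are not treated as failures here (an assumption).
func ImpairedChecks(h InstanceHealth) []string {
	var failed []string
	if h.InstanceStatus == StatusImpaired {
		failed = append(failed, "InstanceStatus")
	}
	if h.SystemStatus == StatusImpaired {
		failed = append(failed, "SystemStatus")
	}
	return failed
}

func main() {
	h := InstanceHealth{
		InstanceID:     "i-023383730ebc0d024",
		InstanceStatus: StatusImpaired,
		SystemStatus:   StatusOK,
		AttachedEBS:    StatusImpaired,
	}
	fmt.Println(h.InstanceID, ImpairedChecks(h))
}
```

In this sketch an impaired EBS summary alone never produces an actionable result, which matches the reasoning above that the volume-level signal is too coarse to act on.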

NOTE: THIS FEATURE REQUIRES A NEW PERMISSION EC2:DescribeInstanceStatus

If you launch Karpenter without EC2:DescribeInstanceStatus permissions, the following error is logged:

karpenter-6c47d6799-lth2w controller {"level":"ERROR","time":"2026-01-29T15:07:01.234Z","logger":"controller","caller":"controller/controller.go:474","message":"Reconciler error","commit":"3c50d4a-dirty","controller":"interruption","namespace":"","name":"","reconcileID":"2365d24c-500a-4b88-9422-866efc679e4d","aws-error-code":"UnauthorizedOperation","aws-operation-name":"DescribeInstanceStatus","aws-request-id":"032b63a8-52c2-4e42-be18-ff7c3afc6405","aws-service-name":"EC2","aws-status-code":403,"error":"reconciling interruptions, getting instance statusesm failed describing ec2 instance status checks, operation error EC2: DescribeInstanceStatus, https response error StatusCode: 403, RequestID: 032b63a8-52c2-4e42-be18-ff7c3afc6405, api error UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:sts::xxxxxxxxxxxxx:assumed-role/xxxxxxxx-karpenter-dev-karpenter/eks-xxxxxxx-k-karpenter--f33a7ef1-4cf6-40ce-ad50-f3c17999b158 is not authorized to perform: ec2:DescribeInstanceStatus because no identity-based policy allows the ec2:DescribeInstanceStatus action (aws-error-code=UnauthorizedOperation, aws-operation-name=DescribeInstanceStatus, aws-request-id=032b63a8-52c2-4e42-be18-ff7c3afc6405, aws-service-name=EC2, aws-status-code=403)"}

The error does not inhibit the interruption controller if a queue is configured via --interruption-queue.

How was this change tested?

Added suite tests and also performed manual testing.

For manual testing, I had Karpenter create a node, and then I SSH'd to the node and brought down the primary network interface:

> k scale deploy inflate --replicas=2
> ec2-connect i-023383730ebc0d024
   > ip link set dev enp39s0 down

Applicable Logs:

karpenter-6c47d6799-lth2w controller {"level":"INFO","time":"2026-01-29T15:31:05.949Z","logger":"controller","caller":"interruption/controller.go:324","message":"initiating delete from interruption message","commit":"3c50d4a-dirty","controller":"interruption","namespace":"","name":"","reconcileID":"14060d4c-63b6-4841-abac-1f551cdade4f","messageKind":"instance_status_failure","NodeClaim":{"name":"default-nj2pf"},"action":"CordonAndDrain","Node":{"name":"ip-192-168-56-163.us-east-2.compute.internal"}}

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

github-actions bot commented Jan 29, 2026

Preview deployment ready!

Preview URL: https://pr-8899.d18coufmbnnaag.amplifyapp.com

Built from commit 985bf4ad44240c9ec0281ce6a02c56493cb16aef

coveralls commented Jan 29, 2026

Pull Request Test Coverage Report for Build 21762298795

Details

  • 147 of 183 (80.33%) changed or added relevant lines in 8 files are covered.
  • 14 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.1%) to 67.738%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
--- | --- | --- | ---
pkg/providers/sqs/sqs.go | 0 | 1 | 0.0%
pkg/operator/operator.go | 0 | 2 | 0.0%
pkg/fake/ec2api.go | 7 | 10 | 70.0%
pkg/controllers/controllers.go | 0 | 4 | 0.0%
pkg/providers/instancestatus/instancestatus.go | 80 | 84 | 95.24%
pkg/controllers/interruption/controller.go | 49 | 71 | 69.01%

Files with Coverage Reduction | New Missed Lines | %
--- | --- | ---
pkg/controllers/interruption/controller.go | 1 | 72.38%
pkg/fake/iamapi.go | 5 | 57.69%
pkg/providers/instanceprofile/instanceprofile.go | 8 | 81.22%

Totals | Coverage Status
--- | ---
Change from base Build 21760092331: | 0.1%
Covered Lines: | 7968
Relevant Lines: | 11763

💛 - Coveralls

bwagner5 force-pushed the instance-health branch 3 times, most recently from b87717c to 17cc576 on January 29, 2026 19:30

env.EventuallyExpectNotFound(node)
env.EventuallyExpectHealthyPodCount(selector, 1)
})
FIt("should terminate the node when receiving an instance status failure", func() {

nit: F

Development

Successfully merging this pull request may close these issues.

EC2 Simplified Automatic Recovery conflicts with Karpenter's termination behavior

3 participants