feat: support EC2 DescribeInstanceStatus health checks in the interruption controller #8899
Draft
Preview deployment ready! Preview URL: https://pr-8899.d18coufmbnnaag.amplifyapp.com Built from commit
DerekFrank reviewed Feb 23, 2026
    env.EventuallyExpectNotFound(node)
    env.EventuallyExpectHealthyPodCount(selector, 1)
})
FIt("should terminate the node when receiving an instance status failure", func() {
Fixes #8821
Description
Adds support for EC2 Status Checks (i.e., health checks) as an input to the Interruption Controller.
Before this PR, the interruption controller only consumed health checks from EventBridge (AWS Health, Spot, and EC2 Instance State Changes). These events only cover EC2 Scheduled Maintenance Events, Spot Interruptions, and EC2 Terminations/Stops.
EC2:DescribeInstanceStatus is an API that provides the result of EC2 health checks.
The currently vended health checks are:
Instance Status, System Status, and EBS Volume Status are not available via EventBridge. This PR only supports Instance Status and System Status checks for interruption handling.
Scheduled Maintenance Events are already consumed via EventBridge, so we are ignoring them when parsing DescribeInstanceStatus for now. We may want to revisit this later to remove the need for those events for the interruption queue.
Additionally, EBS Volume Status only indicates that at least one volume is unhealthy. That is not enough information for Karpenter to act on, since the unhealthy volume could be a pod's storage attached via a PVC. EBS Volume checking should be completed in a follow-up PR that uses the EBS Status Checks API, which can narrow down which volume is unhealthy.
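To make the scoping above concrete, here is a minimal Go sketch of the filtering decision. It is not the code in this PR: `InstanceHealth` and `ImpairedChecks` are hypothetical names, the struct is a trimmed-down stand-in for one DescribeInstanceStatus entry rather than the AWS SDK type, and treating only the `impaired` summary as a failure is an assumption made for illustration.

```go
package main

import "fmt"

// SummaryStatus holds the summary value EC2 reports for a status check.
type SummaryStatus string

const (
	StatusOK               SummaryStatus = "ok"
	StatusImpaired         SummaryStatus = "impaired"
	StatusInsufficientData SummaryStatus = "insufficient-data"
	StatusInitializing     SummaryStatus = "initializing"
	StatusNotApplicable    SummaryStatus = "not-applicable"
)

// InstanceHealth is a hypothetical, simplified view of one
// DescribeInstanceStatus entry (not the AWS SDK type).
type InstanceHealth struct {
	InstanceID      string
	InstanceStatus  SummaryStatus
	SystemStatus    SummaryStatus
	EBSStatus       SummaryStatus // ignored: cannot tell which volume is unhealthy
	ScheduledEvents []string      // ignored: already consumed via EventBridge
}

// ImpairedChecks returns the names of the checks this sketch treats as
// actionable. Only Instance Status and System Status are considered;
// "impaired" as the sole failure value is an assumption.
func ImpairedChecks(h InstanceHealth) []string {
	var impaired []string
	if h.InstanceStatus == StatusImpaired {
		impaired = append(impaired, "InstanceStatus")
	}
	if h.SystemStatus == StatusImpaired {
		impaired = append(impaired, "SystemStatus")
	}
	return impaired
}

func main() {
	statuses := []InstanceHealth{
		{InstanceID: "i-aaa", InstanceStatus: StatusImpaired, SystemStatus: StatusOK, EBSStatus: StatusOK},
		{InstanceID: "i-bbb", InstanceStatus: StatusOK, SystemStatus: StatusOK, EBSStatus: StatusImpaired},
		{InstanceID: "i-ccc", InstanceStatus: StatusOK, SystemStatus: StatusOK, ScheduledEvents: []string{"system-reboot"}},
	}
	for _, s := range statuses {
		if checks := ImpairedChecks(s); len(checks) > 0 {
			fmt.Printf("%s: actionable failure in %v\n", s.InstanceID, checks)
		} else {
			fmt.Printf("%s: no action\n", s.InstanceID)
		}
	}
}
```

In this sketch an impaired EBS status or a pending scheduled event never produces an action on its own, which mirrors the scoping described above.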
NOTE: THIS FEATURE REQUIRES A NEW PERMISSION EC2:DescribeInstanceStatus
If you launch Karpenter without EC2:DescribeInstanceStatus permissions, the following error is logged:
It does not inhibit the interruption controller if a queue is configured via --interruption-queue.
How was this change tested?
Added suite tests but also performed manual testing:
For manual testing, I had Karpenter create a node, then SSH'd into the node and brought down the primary network interface:
Applicable Logs:
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.