OSD-29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off #441

ratnam915 · 2025-05-12T13:49:39Z

OSD - 29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off

Attached is the successful test case execution.
OSD-29470_test.txt

codecov-commenter · 2025-05-12T14:18:51Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 32.08%. Comparing base (71ed508) to head (4876423).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #441   +/-   ##
=======================================
  Coverage   32.08%   32.08%           
=======================================
  Files          35       35           
  Lines        2425     2425           
=======================================
  Hits          778      778           
  Misses       1587     1587           
  Partials       60       60


bergmannf · 2025-05-12T14:30:39Z

test/e2e/configuration_anomaly_detection_test.go

+			}
+		}
+		if !newLogsFound {
+			fmt.Println("No new service logs found.")


As this is the failure case, I'd love for this to actually fail but still clean up.

Maybe starting all stopped infra nodes could be an AfterEach function in the context?

Hey @bergmannf: The code has been fixed to restart the nodes irrespective of the test status. The test case was also run after the change and it passed.
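A minimal sketch of the pattern discussed in this thread: the check reports a real failure, and the nodes are restarted anyway. It uses only the standard library so it can run on its own. The `cluster` type, `runInfraTest`, and `failingCheck` are hypothetical stand-ins, not the helpers used in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// cluster is a hypothetical stand-in for the infra nodes under test.
type cluster struct {
	running bool
}

func (c *cluster) stopNodes()  { c.running = false }
func (c *cluster) startNodes() { c.running = true }

// errNoLogs models the failure case from the review comment.
var errNoLogs = errors.New("no new service logs found")

// failingCheck is a hypothetical check that always fails.
func failingCheck() error { return errNoLogs }

// runInfraTest stops the nodes, runs the check, and always restarts the
// nodes before returning, whether or not the check failed.
func runInfraTest(c *cluster, check func() error) error {
	c.stopNodes()
	// Registered right after the disruptive step, so every later return
	// path (including a failed check) still restores the nodes.
	defer c.startNodes()

	if err := check(); err != nil {
		return fmt.Errorf("check failed: %w", err)
	}
	return nil
}

func main() {
	c := &cluster{running: true}
	err := runInfraTest(c, failingCheck)
	fmt.Println("error:", err)
	fmt.Println("nodes running after test:", c.running)
}
```

In a Ginkgo suite the same idea is usually expressed with `AfterEach` or `DeferCleanup`, which run even when an assertion in the spec fails.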

RaphaelBut

Otherwise looks great to me. Good stuff!

RaphaelBut · 2025-05-16T12:20:40Z

test/e2e/configuration_anomaly_detection_test.go

+				fmt.Printf("ID: %s\nSummary: %s\nDescription: %s\n\n", log.ID(), log.Summary(), log.Description())
+			}
+		}
+		Expect(newLogsFound).To(BeTrue(), "No new service logs were found after infrastructure node shutdown")


I like how we keep the test flexible by not checking the content of the service logs. On the other hand, could other automations interfere with this test by sending unrelated service logs, which would then make the test pass?

Uh, btw, I just took a look and it seems to be a limited support reason, so maybe it's worth checking whether limited support has been set? (I am not sure if OCM sends a service log when limited support is set, but that seems to be the case, otherwise your test would have failed? :D)
https://github.com/openshift/configuration-anomaly-detection/blob/main/pkg/investigations/chgm/chgm.go#L25

Hi @RaphaelBut: The test now compares the number of service logs before and after the change. New service logs are printed only if they are present, and the test case fails if there are none.

Also, this is the directive we got for this test case:

AWS CCS: cluster has gone missing (infra nodes turned off)

Can be triggered by continuously turning off infrastructure nodes for 20 minutes.
Expectation: [new service log for turned off infra](https://github.com/openshift/configuration-anomaly-detection/blob/179db6ae2797352e6485ce75e9e3c0f256075418/pkg/investigations/chgm/chgm.go#L29) in OCM.
Recovery after the test: start the stopped instances again.

Hence that is what we are looking for. For the other test cases, a Limited Support reason is the expectation, so that is what we check there.
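The concern about unrelated service logs can be illustrated with a small sketch: instead of comparing counts, diff the logs by ID and then keep only the entries whose summary matches what the test expects. The `serviceLog` type and the summary strings are hypothetical placeholders, not the OCM SDK types or the real service log text.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceLog is a hypothetical, simplified view of an OCM service log entry.
type serviceLog struct {
	ID      string
	Summary string
}

// newLogs returns the entries in after whose IDs are not present in before.
func newLogs(before, after []serviceLog) []serviceLog {
	seen := make(map[string]struct{}, len(before))
	for _, l := range before {
		seen[l.ID] = struct{}{}
	}
	var fresh []serviceLog
	for _, l := range after {
		if _, ok := seen[l.ID]; !ok {
			fresh = append(fresh, l)
		}
	}
	return fresh
}

// matching keeps only the logs whose summary contains the expected text.
func matching(logs []serviceLog, want string) []serviceLog {
	var out []serviceLog
	for _, l := range logs {
		if strings.Contains(l.Summary, want) {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	// Placeholder data; the summaries are invented for illustration.
	before := []serviceLog{{ID: "a", Summary: "Cluster upgrade scheduled"}}
	after := []serviceLog{
		{ID: "a", Summary: "Cluster upgrade scheduled"},
		{ID: "b", Summary: "Unrelated maintenance notice"},
		{ID: "c", Summary: "Infrastructure nodes are stopped"},
	}
	fresh := newLogs(before, after)
	fmt.Println("new logs:", len(fresh))
	fmt.Println("relevant:", len(matching(fresh, "nodes are stopped")))
}
```

Filtering on a summary makes the test stricter but also more brittle if the wording changes, which is the trade-off raised in the review.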

ratnam915 · 2025-05-21T15:58:30Z

Below changes have been carried out in the latest commit:

Defer functions have been added to all the test cases to make sure any resource that is stopped is restarted irrespective of the outcome of the test case.
GinkgoRecover has been added to the test cases to make sure that, even if they fail, the test suite runs to completion.
The infrastructure nodes test case has been fixed to check for Limited Support instead of service logs.
generate_incident.go has been added to generate PD incidents; it creates a PD incident for the test cases that need one.
GET and RESOLVE methods have also been added to generate_incident.go; for now they are supporting helpers only.

ratnam915 · 2025-05-22T07:15:17Z

/label tide/merge-method-squash

typeid · 2025-05-23T07:20:15Z

test/e2e/generate_incident.go

+	"github.com/google/uuid"
+)
+
+type PagerDutyClient interface {


Note that this might get confusing, as we already have a PD client in the main code. Maybe calling it TestPagerDutyClient would offer more clarity.

typeid · 2025-05-23T07:23:49Z

test/e2e/generate_incident.go

+		alertMappings: map[string]string{
+			"ClusterHasGoneMissing":                         "cadtest has gone missing",
+			"ClusterProvisioningDelay":                      "ClusterProvisioningDelay -",
+			"ClusterMonitoringErrorBudgetBurnSRE":           "ClusterMonitoringErrorBudgetBurnSRE Critical (1)",
+			"InsightsOperatorDown":                          "InsightsOperatorDown",
+			"MachineHealthCheckUnterminatedShortCircuitSRE": "MachineHealthCheckUnterminatedShortCircuitSRE CRITICAL (1)",
+			"ApiErrorBudgetBurn":                            "api-ErrorBudgetBurn k8sgpt test CRITICAL (1)",
+		},


Let's avoid having to pass raw strings.

Suggestion:

const (
	ClusterHasGoneMissing               = "ClusterHasGoneMissing"
	ClusterProvisioningDelay            = "ClusterProvisioningDelay"
	ClusterMonitoringErrorBudgetBurnSRE = "ClusterMonitoringErrorBudgetBurnSRE"
	// ...
)

func AlertTitle(alertName string) (string, error) {
	switch alertName {
	case ClusterHasGoneMissing:
		return "cadtest has gone missing", nil
	case ClusterProvisioningDelay:
		return "ClusterProvisioningDelay -", nil
	// ...
	default:
		return "", fmt.Errorf("unknown alert name: %s", alertName)
	}
}

The client we are creating here does not need to contain this mapping internally.

typeid · 2025-05-23T07:24:46Z

test/e2e/generate_incident.go

@@ -0,0 +1,201 @@
+package osde2etests


Let's move this into a new /test/e2e/util/ folder.

typeid · 2025-05-23T07:25:21Z

test/e2e/generate_incident.go

+type payload struct {
+	Payload struct {
+		Summary   string            `json:"summary"`
+		Timestamp string            `json:"timestamp"`
+		Severity  string            `json:"severity"`
+		Source    string            `json:"source"`
+		Details   map[string]string `json:"custom_details"`
+	} `json:"payload"`
+	RoutingKey  string `json:"routing_key"`
+	EventAction string `json:"event_action"`
+	DedupKey    string `json:"dedup_key"`
+}


There is a pagerduty go sdk that can be used, so we don't have to manually re-create structs and api calls here and in the other functions: https://github.com/PagerDuty/go-pagerduty

typeid · 2025-05-23T07:26:34Z

test/e2e/generate_incident.go

+		eventsURL:  "https://events.pagerduty.com/v2/enqueue",
+		apiURL:     "https://api.pagerduty.com/incidents",


Why are those part of the client?

typeid · 2025-05-23T07:27:14Z

test/e2e/generate_incident.go

+	DedupKey    string `json:"dedup_key"`
+}
+
+func (c *client) CreateSilentRequest(alertName, clusterID string) (string, error) {


Why Silent? This is a normal incident creation.

typeid · 2025-05-23T07:27:54Z

test/e2e/configuration_anomaly_detection_test.go

@@ -32,6 +34,7 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() {
 		region    string
 		provider  string
 		clusterID string
+		pdClient  PagerDutyClient


Suggested change

pdClient PagerDutyClient

testPdClient TestPagerDutyClient

typeid · 2025-05-23T07:28:26Z

test/e2e/configuration_anomaly_detection_test.go

+		pdRoutingKey := os.Getenv("CAD_PD_TOKEN")
+		pdToken := os.Getenv("CAD_PD_TOKEN")


Assuming these will not yet be set in the e2e pipelines, are you already tracking adding them? I'm also a bit worried about re-using the same production token for e2e. As we're not using those for now (only needed for getting and resolving an incident), let's just skip loading these for now and revisit when we need to utilize the functions.

Hey @typeid: Since this E2E test currently runs in the Stage environment, we use the Stage routing key to generate the incident. The environment variable is named CAD_PD_TOKEN, as in the actual pagerduty.go, to avoid issues in the pipeline.

These tokens are necessary even for incident generation, else the test case would fail.

You can generate an incident indirectly with only the routing key, see https://developer.pagerduty.com/api-reference/368ae3d938c9e-send-an-event-to-pager-duty which is also used in generate_incident.sh.

I'm unsure why PD differentiates between creating incidents and events, when events end up creating incidents.
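To make the routing-key point concrete, here is a rough sketch of sending an Events API v2 "trigger" event with nothing but a routing key. The struct fields mirror the payload shown in the diff earlier in this review. It posts to a local fake server rather than to PagerDuty, and `triggerEvent`, `newFakeEventsServer`, the `cluster_id` detail key, and the dummy values are assumptions made for illustration, not the PR's code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// event mirrors the Events API v2 fields from the payload struct in the diff.
type event struct {
	RoutingKey  string       `json:"routing_key"`
	EventAction string       `json:"event_action"`
	DedupKey    string       `json:"dedup_key,omitempty"`
	Payload     eventPayload `json:"payload"`
}

type eventPayload struct {
	Summary  string            `json:"summary"`
	Source   string            `json:"source"`
	Severity string            `json:"severity"`
	Details  map[string]string `json:"custom_details,omitempty"`
}

// triggerEvent posts a "trigger" event to url. Only the routing key is
// needed to address the service; no REST API token is involved.
func triggerEvent(c *http.Client, url, routingKey, summary, clusterID string) (int, error) {
	body, err := json.Marshal(event{
		RoutingKey:  routingKey,
		EventAction: "trigger",
		Payload: eventPayload{
			Summary:  summary,
			Source:   "cad-e2e", // hypothetical source name
			Severity: "critical",
			Details:  map[string]string{"cluster_id": clusterID}, // assumed key
		},
	})
	if err != nil {
		return 0, err
	}
	resp, err := c.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	_, _ = io.Copy(io.Discard, resp.Body)
	return resp.StatusCode, nil
}

// newFakeEventsServer starts a local stand-in for the events endpoint and
// hands each decoded event back over a channel.
func newFakeEventsServer() (*httptest.Server, <-chan event) {
	events := make(chan event, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var e event
		if err := json.NewDecoder(r.Body).Decode(&e); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		events <- e
		w.WriteHeader(http.StatusAccepted)
	}))
	return srv, events
}

func main() {
	srv, events := newFakeEventsServer()
	defer srv.Close()

	status, err := triggerEvent(srv.Client(), srv.URL, "dummy-routing-key", "cadtest has gone missing", "cluster-123")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	got := <-events
	fmt.Println(status, got.EventAction, got.Payload.Summary)
}
```

With the go-pagerduty SDK suggested elsewhere in this review, the hand-written structs and HTTP call would be replaced by the SDK's own event types.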

@typeid: I've addressed all the comments we have received. Please go through the code and let me know your thoughts.


typeid · 2025-05-23T07:31:05Z

test/e2e/configuration_anomaly_detection_test.go

@@ -266,4 +288,113 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() {
 			fmt.Println("Test completed: All components restored to original replica counts.")
 		}
 	})
-})
+
+	It("AWS CCS: can shutdown and restart infrastructure nodes", Label("aws", "ccs", "infra-nodes", "limited-support"), func(ctx context.Context) {


The test here is not whether or not we can shutdown and restart the infra nodes, but rather whether or not the cluster lands in limited support :)

rolandmkunkel · 2025-05-26T10:39:09Z

test/e2e/configuration_anomaly_detection_test.go

-			Expect(RestoreEgress(ctx, ec2Wrapper, sgID)).To(Succeed(), "Failed to restore egress")
-			ginkgo.GinkgoWriter.Printf("Egress restored\n")
+			// Clean up: restore egress before checking test conditions
+			defer func() {


Could this be moved up, so that there are fewer possible exits between blocking egress and restoring it?

@ratnam915 I think this is still valid

@typeid : This is done

rolandmkunkel · 2025-05-26T10:48:36Z

test/e2e/configuration_anomaly_detection_test.go

+		Expect(err).NotTo(HaveOccurred(), "Failed to create AWS config")
+		ec2Client := ec2.NewFromConfig(awsCfg)
+
+		// Step 1: Get cluster object


Can this unused step be removed and the numbering for the steps changed?

openshift-ci-robot · 2025-05-26T18:18:39Z

@ratnam915: This pull request references OSD-29470 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

OSD - 29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off

Attached is the successful test case execution.
OSD-29470_test.txt

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

typeid · 2025-05-27T06:20:41Z

test/e2e/configuration_anomaly_detection_test.go

@@ -62,6 +66,10 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() {

 		provider, err = k8s.GetProvider(ctx)
 		Expect(err).NotTo(HaveOccurred(), "Could not determine provider")
+
+		pdRoutingKey := os.Getenv("CAD_PD_TOKEN")


Is this supposed to be the following?

pdRoutingKey := os.Getenv("PAGERDUTY_ROUTING_KEY")

Also, would these variables already be set up in the e2e pipelines if we merge this now?
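A small sketch of how the routing key could be read from its own variable and the PagerDuty-dependent steps skipped when the pipeline does not provide it. The variable name `PAGERDUTY_ROUTING_KEY` comes from the question above; `lookupRoutingKey` and the skip message are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// routingKeyEnv is the variable name proposed in the review. Whether the
// e2e pipeline sets it is still an open question in this thread.
const routingKeyEnv = "PAGERDUTY_ROUTING_KEY"

// lookupRoutingKey returns the key and true only when the variable is set
// and non-empty. The lookup function is injected so it can be faked.
func lookupRoutingKey(lookup func(string) (string, bool)) (string, bool) {
	key, ok := lookup(routingKeyEnv)
	if !ok || key == "" {
		return "", false
	}
	return key, true
}

func main() {
	if _, ok := lookupRoutingKey(os.LookupEnv); !ok {
		fmt.Printf("%s is not set; skipping PagerDuty-dependent tests\n", routingKeyEnv)
		return
	}
	fmt.Println("routing key loaded")
}
```

In a Ginkgo suite the early return would typically become a `Skip(...)` call, so the missing variable shows up as a skipped spec rather than a failure.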

typeid · 2025-05-27T06:24:51Z

test/e2e/utils/generate_incident.go

+	DedupKey string `json:"dedup_key"`
+}
+
+func GetAlertSummary(alertName string) (string, error) {


This would be the title, not the summary.

typeid · 2025-05-27T06:25:23Z

test/e2e/utils/generate_incident.go

+	case AlertManagerDown:
+		return "Alert Manager Down", nil


This is not an existing alert, it wouldn't route to anything within CAD.

typeid · 2025-05-27T06:29:28Z

test/e2e/utils/generate_incident.go

+}
+
+// EventResponse represents the response from PagerDuty Events API
+type EventResponse struct {


Already exists in the pagerduty sdk: https://pkg.go.dev/github.com/PagerDuty/go-pagerduty#EventResponse

typeid · 2025-05-27T06:29:37Z

test/e2e/utils/generate_incident.go

+}
+
+// Event represents the complete event structure for PagerDuty Events API
+type Event struct {


As mentioned in a previous comment, let's not re-define everything and actually use the pagerduty sdk.

https://pkg.go.dev/github.com/PagerDuty/go-pagerduty#Event & https://pkg.go.dev/github.com/PagerDuty/go-pagerduty#CreateEvent


typeid · 2025-05-27T06:34:58Z

test/e2e/configuration_anomaly_detection_test.go

@@ -205,13 +226,16 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() {
 			Expect(err).ToNot(HaveOccurred(), "failed to scale down alertmanager")
 			fmt.Printf("Alertmanager scaled down from %d to 0 replicas. Waiting...\n", originalAMReplicas)

-			time.Sleep(20 * time.Minute)
+			_, err = testPdClient.CreateRequest("AlertManagerDown", clusterID)


This test is still for ClusterHasGoneMissing, but we're initiating it with a different misconfiguration.

typeid · 2025-05-27T07:58:54Z

test/e2e/utils/generate_incident.go

+		apiClient:  sdk.NewClient(routingKey),
+	}
+}
+func (c *client) CreateRequest(alertName, clusterID string) (string, error) {


CreateRequest is a bit broad, could we rename this to TriggerIncident?

@typeid : Changes have been made

typeid

/lgtm
/approve

openshift-ci · 2025-05-27T08:33:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ratnam915, typeid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [typeid]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2025-05-27T08:44:33Z

@ratnam915: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

ratnam915 added 3 commits May 12, 2025 19:14

Final changes

a4d54d8

Merge branch 'openshift:main' into feature/OSD-29470

33193e0

Final changes

d3983f7

openshift-ci bot requested review from bng0y and rafael-azevedo May 12, 2025 13:50

bergmannf reviewed May 12, 2025


ratnam915 added 2 commits May 16, 2025 11:15

Made changes as per the comments recieved

b56b542

Merge branch 'openshift:main' into feature/OSD-29470

d73f99a

RaphaelBut reviewed May 16, 2025


ratnam915 added 3 commits May 16, 2025 18:33

Made changes as per the commits

4ea9c58

Merge branch 'openshift:main' into feature/OSD-29470

2bee8ab

Final changes

2e886c2

ratnam915 added 4 commits May 21, 2025 21:29

Changed the PagerDuty token

5451228

Fixed lint issue

60759ed

Fixed lint issue

8df9a97

Fixed lint issue

c250a8e

openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label May 22, 2025

typeid requested changes May 23, 2025


rolandmkunkel reviewed May 26, 2025


ratnam915 added 3 commits May 26, 2025 23:14

Made all the changes as per the comments

b7ef617

Made all the changes as per the comments

402ffa5

Made all the changes as per the comments

322c48b

ratnam915 changed the title ~~OSD - 29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off~~ OSD-29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off May 26, 2025

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 26, 2025

typeid reviewed May 27, 2025


ratnam915 added 2 commits May 27, 2025 13:11

Made changes as per the comments

0904bf1

Made changes as per the comments

d29cbdd

typeid reviewed May 27, 2025


Made changes as per the comments

4876423

typeid approved these changes May 27, 2025


openshift-ci bot assigned typeid May 27, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 27, 2025

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2025

openshift-merge-bot bot merged commit b1691b0 into openshift:main May 27, 2025
5 checks passed
