OSD-29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off #441

Merged (18 commits) on May 27, 2025

ratnam915
Contributor

OSD - 29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off

Attached is the successful test case execution.
OSD-29470_test.txt

@openshift-ci openshift-ci bot requested review from bng0y and rafael-azevedo May 12, 2025 13:50
codecov-commenter commented May 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 32.08%. Comparing base (71ed508) to head (4876423).

@@           Coverage Diff           @@
##             main     #441   +/-   ##
=======================================
  Coverage   32.08%   32.08%           
=======================================
  Files          35       35           
  Lines        2425     2425           
=======================================
  Hits          778      778           
  Misses       1587     1587           
  Partials       60       60           

}
}
if !newLogsFound {
fmt.Println("No new service logs found.")
Contributor

As this is the failure case, I'd love for this to actually fail but still clean up.

Maybe starting all stopped infra nodes could be an AfterEach function in the context?

Contributor Author

Hey @bergmannf: The code above has been fixed to restart the nodes irrespective of the test status. The test case was also run after the change and it passed.

Contributor

@RaphaelBut left a comment

Otherwise looks great to me. Good stuff!

fmt.Printf("ID: %s\nSummary: %s\nDescription: %s\n\n", log.ID(), log.Summary(), log.Description())
}
}
Expect(newLogsFound).To(BeTrue(), "No new service logs were found after infrastructure node shutdown")
Contributor

I like how we keep the test flexible by not checking the content of the service logs. On the other hand, could other automations interfere with this test by sending unrelated service logs, which would then make this test pass?

Contributor

@RaphaelBut May 16, 2025

Uh btw, just took a look: it seems to be a limited support reason, so maybe it's worth checking whether limited support has been set? (I am not sure if OCM sends a service log when limited support is set, but that seems to be the case, otherwise your test would have failed? :D)
https://github.com/openshift/configuration-anomaly-detection/blob/main/pkg/investigations/chgm/chgm.go#L25

Contributor Author

Hi @RaphaelBut: Changes have been made to compare the number of service logs before and after the change. New service logs are printed only if they are present, and the test case fails if there are none.

Also, for this test case, this is the directive that we got:

AWS CCS: cluster has gone missing (infra nodes turned off)

Can be triggered by continuously turning off infrastructure nodes for 20 minutes. 
Expectation: [new service log for turned off infra](https://github.com/openshift/configuration-anomaly-detection/blob/179db6ae2797352e6485ce75e9e3c0f256075418/pkg/investigations/chgm/chgm.go#L29) in OCM
Recovery after the test: start the stopped instances again.

Hence that is what we are looking for. For the other test cases, a Limited Support Reason is the expectation, and hence that is what we check there.
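For illustration, the before/after comparison described above, combined with the reviewer's concern about unrelated service logs, could be sketched roughly as follows. This is only a sketch under assumptions: the `serviceLog` type, the `newLogsMatching` helper, and the summary substring used to filter are hypothetical and are not taken from the PR or from the OCM SDK.

```go
package main

import "strings"

// serviceLog is a hypothetical, simplified stand-in for an OCM service log entry.
type serviceLog struct {
	ID      string
	Summary string
}

// newLogsMatching returns the entries of after whose IDs do not appear in
// before and whose summary contains want. Filtering on the summary keeps
// unrelated service logs sent by other automations from satisfying the test.
func newLogsMatching(before, after []serviceLog, want string) []serviceLog {
	seen := make(map[string]struct{}, len(before))
	for _, l := range before {
		seen[l.ID] = struct{}{}
	}
	var out []serviceLog
	for _, l := range after {
		if _, ok := seen[l.ID]; ok {
			continue // already present before the disruption
		}
		if strings.Contains(l.Summary, want) {
			out = append(out, l)
		}
	}
	return out
}
```

Comparing IDs rather than counts also avoids a false result when one log is added and another is removed during the test window.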

@ratnam915
Contributor Author

The following changes have been made in the latest commit:

  1. Defer functions have been added to all the test cases to make sure any resource that is stopped is restarted irrespective of the outcome of the test case.
  2. Ginkgo Recover has been added to the test cases to make sure that, even if they fail, the test suite runs to completion.
  3. The infrastructure nodes test case has been fixed to check for Limited Support instead of service logs.
  4. A generate_incident.go file has been added to generate PD incidents; it creates a PD incident for the test cases that need one.
  5. GET and RESOLVE methods have been added to generate_incident.go; these are currently there only as support functions.
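As a rough sketch of the idea behind items 1 and 2, assuming nothing about the actual test code: the restore step is registered with `defer` immediately after the disruption succeeds, so it runs whether the check returns an error or panics. The `runWithCleanup` helper and its callback parameters are hypothetical names used only for illustration, and the example uses the standard library rather than Ginkgo.

```go
package main

import "fmt"

// runWithCleanup applies a disruption, runs a check, and always restores.
// The deferred function is registered right after disrupt succeeds, so there
// is no exit path between disrupting and restoring that skips the cleanup.
func runWithCleanup(disrupt, restore, check func() error) (err error) {
	if derr := disrupt(); derr != nil {
		return fmt.Errorf("disrupt: %w", derr)
	}
	defer func() {
		// Convert a panic in check into an error so the caller can continue.
		if r := recover(); r != nil {
			err = fmt.Errorf("check panicked: %v", r)
		}
		// Restore regardless of the outcome; report a restore failure only
		// if the check itself did not already fail.
		if rerr := restore(); rerr != nil && err == nil {
			err = fmt.Errorf("restore: %w", rerr)
		}
	}()
	return check()
}
```

In a Ginkgo v2 suite, `DeferCleanup` or an `AfterEach` block (as suggested earlier in the review) would be the more conventional place for the restore step.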

@ratnam915
Contributor Author

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label May 22, 2025
"github.com/google/uuid"
)

type PagerDutyClient interface {
Member

Note that this might get confusing, as we already have a PD client in the main code. Calling it TestPagerDutyClient might offer more clarity.

Comment on lines 37 to 44
alertMappings: map[string]string{
	"ClusterHasGoneMissing":                         "cadtest has gone missing",
	"ClusterProvisioningDelay":                      "ClusterProvisioningDelay -",
	"ClusterMonitoringErrorBudgetBurnSRE":           "ClusterMonitoringErrorBudgetBurnSRE Critical (1)",
	"InsightsOperatorDown":                          "InsightsOperatorDown",
	"MachineHealthCheckUnterminatedShortCircuitSRE": "MachineHealthCheckUnterminatedShortCircuitSRE CRITICAL (1)",
	"ApiErrorBudgetBurn":                            "api-ErrorBudgetBurn k8sgpt test CRITICAL (1)",
},
Member

@typeid May 23, 2025

Let's avoid having to pass raw strings.

Suggestion:

const (
	ClusterHasGoneMissing                = "ClusterHasGoneMissing"
	ClusterProvisioningDelay             = "ClusterProvisioningDelay"
	ClusterMonitoringErrorBudgetBurnSRE  = "ClusterMonitoringErrorBudgetBurnSRE"
	// ...
)

func AlertTitle(alertName string) (string, error) {
	switch alertName {
	case ClusterHasGoneMissing:
		return "cadtest has gone missing", nil
	case ClusterProvisioningDelay:
		return "ClusterProvisioningDelay -", nil
	// ...
	default:
		return "", fmt.Errorf("unknown alert name: %s", alertName) 
	}
}

The client we are creating here does not need to contain this mapping internally.

@@ -0,0 +1,201 @@
package osde2etests
Member

Let's move this into a new /test/e2e/util/ folder.

Comment on lines 49 to 60
type payload struct {
	Payload struct {
		Summary   string            `json:"summary"`
		Timestamp string            `json:"timestamp"`
		Severity  string            `json:"severity"`
		Source    string            `json:"source"`
		Details   map[string]string `json:"custom_details"`
	} `json:"payload"`
	RoutingKey  string `json:"routing_key"`
	EventAction string `json:"event_action"`
	DedupKey    string `json:"dedup_key"`
}
Member

There is a PagerDuty Go SDK that can be used, so we don't have to manually re-create structs and API calls here and in the other functions: https://github.com/PagerDuty/go-pagerduty

Comment on lines 33 to 34
eventsURL: "https://events.pagerduty.com/v2/enqueue",
apiURL: "https://api.pagerduty.com/incidents",
Member

Why are those part of the client?

DedupKey string `json:"dedup_key"`
}

func (c *client) CreateSilentRequest(alertName, clusterID string) (string, error) {
Member

Why Silent? This is a normal incident creation.

@@ -32,6 +34,7 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() {
region string
provider string
clusterID string
pdClient PagerDutyClient
Member

Suggested change
pdClient PagerDutyClient
testPdClient TestPagerDutyClient

Comment on lines 69 to 70
pdRoutingKey := os.Getenv("CAD_PD_TOKEN")
pdToken := os.Getenv("CAD_PD_TOKEN")
Member

@typeid May 23, 2025

Assuming these will not yet be set in the e2e pipelines, are you already tracking adding them? I'm also a bit worried about re-using the same production token for e2e. As we're not using those for now (only needed for getting and resolving an incident), let's just skip loading these for now and revisit when we need to utilize the functions.

Contributor Author

Hey @typeid: Since this E2E test is currently set to run in the Stage environment, we are using the Stage routing key to generate the incident. The environment variable is named CAD_PD_TOKEN, matching the actual pagerduty.go, to avoid issues in the pipeline.

These tokens are necessary even for incident generation; otherwise the test case would fail.

Member

@typeid May 26, 2025

You can generate an incident indirectly with only the routing key, see https://developer.pagerduty.com/api-reference/368ae3d938c9e-send-an-event-to-pager-duty which is also used in generate_incident.sh.

I'm unsure why PD differentiates between creating incidents and creating events, when events end up creating incidents.

Contributor Author

@typeid: I've addressed all the comments we have received. Please go through the code and let me know your thoughts.

@@ -266,4 +288,113 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() {
fmt.Println("Test completed: All components restored to original replica counts.")
}
})
})

It("AWS CCS: can shutdown and restart infrastructure nodes", Label("aws", "ccs", "infra-nodes", "limited-support"), func(ctx context.Context) {
Member

The test here is not whether we can shut down and restart the infra nodes, but rather whether the cluster lands in limited support :)

Expect(RestoreEgress(ctx, ec2Wrapper, sgID)).To(Succeed(), "Failed to restore egress")
ginkgo.GinkgoWriter.Printf("Egress restored\n")
// Clean up: restore egress before checking test conditions
defer func() {
Contributor

Could this be moved up, so that there are fewer possible exits between blocking egress and restoring it?

Member

@ratnam915 I think this is still valid

Contributor Author

@typeid : This is done

Expect(err).NotTo(HaveOccurred(), "Failed to create AWS config")
ec2Client := ec2.NewFromConfig(awsCfg)

// Step 1: Get cluster object
Contributor

Can this unused step be removed and the numbering for the steps changed?

@ratnam915 ratnam915 changed the title OSD - 29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off OSD-29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off May 26, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 26, 2025
openshift-ci-robot commented May 26, 2025

@ratnam915: This pull request references OSD-29470 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.


@@ -62,6 +66,10 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() {

provider, err = k8s.GetProvider(ctx)
Expect(err).NotTo(HaveOccurred(), "Could not determine provider")

pdRoutingKey := os.Getenv("CAD_PD_TOKEN")
Member

@typeid May 27, 2025

Is this supposed to be the following?

pdRoutingKey := os.Getenv("PAGERDUTY_ROUTING_KEY")

Also, would these variables already be set up in the e2e pipelines if we merge this now?

DedupKey string `json:"dedup_key"`
}

func GetAlertSummary(alertName string) (string, error) {
Member

This would be the title, not the summary.

Comment on lines 67 to 68
case AlertManagerDown:
return "Alert Manager Down", nil
Member

This is not an existing alert; it wouldn't route to anything within CAD.

}

// EventResponse represents the response from PagerDuty Events API
type EventResponse struct {
Member

}

// Event represents the complete event structure for PagerDuty Events API
type Event struct {
Member

As mentioned in a previous comment, let's not re-define everything and instead actually use the PagerDuty SDK.

https://pkg.go.dev/github.com/PagerDuty/go-pagerduty#Event & https://pkg.go.dev/github.com/PagerDuty/go-pagerduty#CreateEvent

@@ -205,13 +226,16 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() {
Expect(err).ToNot(HaveOccurred(), "failed to scale down alertmanager")
fmt.Printf("Alertmanager scaled down from %d to 0 replicas. Waiting...\n", originalAMReplicas)

time.Sleep(20 * time.Minute)
_, err = testPdClient.CreateRequest("AlertManagerDown", clusterID)
Member

This test is still for ClusterHasGoneMissing, but we're initiating it with a different misconfiguration.

apiClient: sdk.NewClient(routingKey),
}
}
func (c *client) CreateRequest(alertName, clusterID string) (string, error) {
Member

CreateRequest is a bit broad. Could we rename this to TriggerIncident?

Contributor Author

@typeid : Changes have been made

Member

@typeid left a comment

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 27, 2025

openshift-ci bot commented May 27, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ratnam915, typeid


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2025

openshift-ci bot commented May 27, 2025

@ratnam915: all tests passed!


@openshift-merge-bot openshift-merge-bot bot merged commit b1691b0 into openshift:main May 27, 2025
5 checks passed
Labels: approved, jira/valid-reference, lgtm, tide/merge-method-squash
7 participants