-
Notifications
You must be signed in to change notification settings - Fork 55
OSD-29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off #441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OSD-29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off #441
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #441 +/- ##
=======================================
Coverage 32.08% 32.08%
=======================================
Files 35 35
Lines 2425 2425
=======================================
Hits 778 778
Misses 1587 1587
Partials 60 60 🚀 New features to boost your workflow:
|
} | ||
} | ||
if !newLogsFound { | ||
fmt.Println("No new service logs found.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is the failure case, I'd love for this to actual Fail
but stil cleanup.
Maybe starting all stopped Infras could be a AfterEach
function in the context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @bergmannf : The above code has been fixed to restart the nodes irrespective the test status, also the test case was run post the change and it was a success
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise looks great to me. Good stuff!
fmt.Printf("ID: %s\nSummary: %s\nDescription: %s\n\n", log.ID(), log.Summary(), log.Description()) | ||
} | ||
} | ||
Expect(newLogsFound).To(BeTrue(), "No new service logs were found after infrastructure node shutdown") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how we keep the test flexible by not checking the content of the servicelogs, but on the other hand, would it be possible for other automations to interfere with this test by sending unrelated servicelogs which would make this test pass then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh btw, just took a look, it seems to be a limited support reason, so maybe its worth to check if limited support has been set? ( I am not sure if ocm sends a servicelog when limited support is set, but seems to be the case other wise you test would have failed? :D)
https://github.com/openshift/configuration-anomaly-detection/blob/main/pkg/investigations/chgm/chgm.go#L25
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @RaphaelBut : Changes have been made to check the number of service logs before and after the change, only if new service logs are present they are printed and the test case fails if new service logs are not present.
Also for this test case this is the directive that we got :
AWS CCS: cluster has gone missing (infra nodes turned off)
Can be triggered by continuously turning off infrastructure nodes for 20 minutes.
Expectation: [new service log for turned off infra](https://github.com/openshift/configuration-anomaly-detection/blob/179db6ae2797352e6485ce75e9e3c0f256075418/pkg/investigations/chgm/chgm.go#L29) in OCM
Recovery after the test: start the stopped instances again.
Hence that is what we are looking for, for the other test cases Limited Support Reason is the expectation and hence we are checking that
Below changes have been carried out in the latest commit:
|
/label tide/merge-method-squash |
test/e2e/generate_incident.go
Outdated
"github.com/google/uuid" | ||
) | ||
|
||
type PagerDutyClient interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To note that this might get confusing, as we already have a pd client in the main code. Maybe calling it TestPagerDutyClient
would offer more clarity.
test/e2e/generate_incident.go
Outdated
alertMappings: map[string]string{ | ||
"ClusterHasGoneMissing": "cadtest has gone missing", | ||
"ClusterProvisioningDelay": "ClusterProvisioningDelay -", | ||
"ClusterMonitoringErrorBudgetBurnSRE": "ClusterMonitoringErrorBudgetBurnSRE Critical (1)", | ||
"InsightsOperatorDown": "InsightsOperatorDown", | ||
"MachineHealthCheckUnterminatedShortCircuitSRE": "MachineHealthCheckUnterminatedShortCircuitSRE CRITICAL (1)", | ||
"ApiErrorBudgetBurn": "api-ErrorBudgetBurn k8sgpt test CRITICAL (1)", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid having to pass raw strings.
Suggestion:
const (
ClusterHasGoneMissing = "ClusterHasGoneMissing"
ClusterProvisioningDelay = "ClusterProvisioningDelay"
ClusterMonitoringErrorBudgetBurnSRE = "ClusterMonitoringErrorBudgetBurnSRE"
// ...
)
func AlertTitle(alertName string) (string, error) {
switch alertName {
case ClusterHasGoneMissing:
return "cadtest has gone missing", nil
case ClusterProvisioningDelay:
return "ClusterProvisioningDelay -", nil
// ...
default:
return "", fmt.Errorf("unknown alert name: %s", alertName)
}
}
The client we are creating here does not need to contain this mapping internally.
test/e2e/generate_incident.go
Outdated
@@ -0,0 +1,201 @@ | |||
package osde2etests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this in a new /test/e2e/util/
. folder.
test/e2e/generate_incident.go
Outdated
type payload struct { | ||
Payload struct { | ||
Summary string `json:"summary"` | ||
Timestamp string `json:"timestamp"` | ||
Severity string `json:"severity"` | ||
Source string `json:"source"` | ||
Details map[string]string `json:"custom_details"` | ||
} `json:"payload"` | ||
RoutingKey string `json:"routing_key"` | ||
EventAction string `json:"event_action"` | ||
DedupKey string `json:"dedup_key"` | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a pagerduty go sdk that can be used, so we don't have to manually re-create structs and api calls here and in the other functions: https://github.com/PagerDuty/go-pagerduty
test/e2e/generate_incident.go
Outdated
eventsURL: "https://events.pagerduty.com/v2/enqueue", | ||
apiURL: "https://api.pagerduty.com/incidents", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are those part of the client?
test/e2e/generate_incident.go
Outdated
DedupKey string `json:"dedup_key"` | ||
} | ||
|
||
func (c *client) CreateSilentRequest(alertName, clusterID string) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why Silent
? This is a normal incident creation.
@@ -32,6 +34,7 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() { | |||
region string | |||
provider string | |||
clusterID string | |||
pdClient PagerDutyClient |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pdClient PagerDutyClient | |
testPdClient TestPagerDutyClient |
pdRoutingKey := os.Getenv("CAD_PD_TOKEN") | ||
pdToken := os.Getenv("CAD_PD_TOKEN") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming these will not yet be set in the e2e pipelines, are you already tracking adding them? I'm also a bit worried about re-using the same production token for e2e. As we're not using those for now (only needed for getting and resolving an incident), let's just skip loading these for now and revisit when we need to utilize the functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @typeid: Currently since this E2E test has been set to run in the Stage environment we are using the Stage routing key to carry out the generation of incident, also the environment variable name has been set as CAD_PD_TOKEN similar to the actual pagerduty.go to avoid issues in the Pipeline.
These tokens are necessary even for incident generation, else the test case would fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can generate an incident indirectly with only the routing key, see https://developer.pagerduty.com/api-reference/368ae3d938c9e-send-an-event-to-pager-duty which is also used in generate_incident.sh
.
I'm unsure why PD differentiates between creating incident and events, when events end up creating incidents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@typeid : I've made changes to all the comments that we have recieved, please go through the code and let me know your thoughts
@@ -266,4 +288,113 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() { | |||
fmt.Println("Test completed: All components restored to original replica counts.") | |||
} | |||
}) | |||
}) | |||
|
|||
It("AWS CCS: can shutdown and restart infrastructure nodes", Label("aws", "ccs", "infra-nodes", "limited-support"), func(ctx context.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test here is not whether or not we can shutdown and restart the infra nodes, but rather whether or not the cluster lands in limited support :)
Expect(RestoreEgress(ctx, ec2Wrapper, sgID)).To(Succeed(), "Failed to restore egress") | ||
ginkgo.GinkgoWriter.Printf("Egress restored\n") | ||
// Clean up: restore egress before checking test conditions | ||
defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be moved up, so that there are less possible exits between restoring egress and blocking it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ratnam915 I think this is still valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@typeid : This is done
Expect(err).NotTo(HaveOccurred(), "Failed to create AWS config") | ||
ec2Client := ec2.NewFromConfig(awsCfg) | ||
|
||
// Step 1: Get cluster object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this unused step be removed and the numbering for the steps changed?
@ratnam915: This pull request references OSD-29470 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@@ -62,6 +66,10 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() { | |||
|
|||
provider, err = k8s.GetProvider(ctx) | |||
Expect(err).NotTo(HaveOccurred(), "Could not determine provider") | |||
|
|||
pdRoutingKey := os.Getenv("CAD_PD_TOKEN") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be the following?
pdRoutingKey := os.Getenv("PAGERDUTY_ROUTING_KEY")
Also, would these variables already be set up in the e2e pipelines if we merge this now?
test/e2e/utils/generate_incident.go
Outdated
DedupKey string `json:"dedup_key"` | ||
} | ||
|
||
func GetAlertSummary(alertName string) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be the title, not the summary.
test/e2e/utils/generate_incident.go
Outdated
case AlertManagerDown: | ||
return "Alert Manager Down", nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an existing alert, it wouldn't route to anything within CAD.
test/e2e/utils/generate_incident.go
Outdated
} | ||
|
||
// EventResponse represents the response from PagerDuty Events API | ||
type EventResponse struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already exists in the pagerduty sdk: https://pkg.go.dev/github.com/PagerDuty/go-pagerduty#EventResponse
test/e2e/utils/generate_incident.go
Outdated
} | ||
|
||
// Event represents the complete event structure for PagerDuty Events API | ||
type Event struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in a previous comment, let's not re-define everything and actually use the pagerduty sdk.
https://pkg.go.dev/github.com/PagerDuty/go-pagerduty#Event & https://pkg.go.dev/github.com/PagerDuty/go-pagerduty#CreateEvent
@@ -205,13 +226,16 @@ var _ = Describe("Configuration Anomaly Detection", Ordered, func() { | |||
Expect(err).ToNot(HaveOccurred(), "failed to scale down alertmanager") | |||
fmt.Printf("Alertmanager scaled down from %d to 0 replicas. Waiting...\n", originalAMReplicas) | |||
|
|||
time.Sleep(20 * time.Minute) | |||
_, err = testPdClient.CreateRequest("AlertManagerDown", clusterID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is still for ClusterHasGoneMissing, but we're initiating it with a different misconfiguration.
test/e2e/utils/generate_incident.go
Outdated
apiClient: sdk.NewClient(routingKey), | ||
} | ||
} | ||
func (c *client) CreateRequest(alertName, clusterID string) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateRequest
is a bit broad, could we rename this to TriggerIncident
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@typeid : Changes have been made
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ratnam915, typeid The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@ratnam915: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
OSD - 29470: To create E2E Tests for CAD - Cluster has gone missing - Infra Nodes turned off
Attached is the successful test case execution.
OSD-29470_test.txt