OSD-18645 - Initial implementation for CannotRetrieveUpdatesSRE #404
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: anispate
Codecov Report
Coverage Diff (main vs. #404)
              main     #404      +/-
Coverage    32.04%   32.01%   -0.03%
Files           35       36       +1
Lines         2431     2505      +74
Hits           779      802      +23
Misses        1593     1643      +50
Partials        59       60       +1
bcb9b3c to f246d1f
c933fc3 to 02d2c44
ac7a009 to ea1ba08
Looks like this is still the boilerplate document from running the bootstrap-new-investigation make target.
Can you revise this to include detailed steps for another SRE to test your work? Additional objects, scripts, etc. can also be added to this testing/ directory.
defer func() {
	deferErr := k8sclient.Cleanup(r.Cluster.ID(), r.OcmClient, remediationName)
	if deferErr != nil {
		logging.Error(deferErr)
		err = errors.Join(err, deferErr)
	}
}()

defer func(r *investigation.Resources) {
	logging.Infof("Cleaning up investigation resources for cluster %s", r.Cluster.ID())
	if cleanupErr := k8sclient.Cleanup(r.Cluster.ID(), r.OcmClient, remediationName); cleanupErr != nil {
		logging.Errorf("Failed to cleanup Kubernetes client: %v", cleanupErr)
	} else {
		logging.Infof("Cleanup completed successfully for cluster %s", r.Cluster.ID())
	}
}(r)
Looks like you're cleaning up twice here
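To illustrate, a minimal sketch of keeping a single deferred cleanup that joins its error into a named return value. The `cleanup` function and the call counter are hypothetical stand-ins for `k8sclient.Cleanup`, not the project's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// cleanupCalls counts invocations so the example can show that cleanup
// runs exactly once. Hypothetical; not part of the PR.
var cleanupCalls int

// cleanup is a hypothetical stand-in for k8sclient.Cleanup that always fails.
func cleanup(clusterID string) error {
	cleanupCalls++
	return fmt.Errorf("cleanup failed for cluster %s", clusterID)
}

// run uses a named error return so the single deferred cleanup can attach
// its own failure to whatever error the body produced.
func run(clusterID string) (err error) {
	defer func() {
		if cleanupErr := cleanup(clusterID); cleanupErr != nil {
			err = errors.Join(err, cleanupErr)
		}
	}()

	return errors.New("investigation failed")
}

func main() {
	err := run("example-cluster")
	fmt.Println("cleanup calls:", cleanupCalls)
	fmt.Println(err)
}
```

With one defer, the cleanup failure is still surfaced to the caller through the joined error, so the second defer that only logs would add nothing.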
logging.Error("Network verifier ran into an error: %s", err.Error())
notes.AppendWarning("NetworkVerifier failed to run:\n\t %s", err.Error())
I don't believe we need to both log and note this. Using notes.AppendWarning should be sufficient to add this message to the logs as well.
err = r.PdClient.AddNote(notes.String())
if err != nil {
	// We do not return as we want the alert to be escalated either no matter what.
	logging.Error("could not add failure reason incident notes")
}
Calling AddNote() here without returning means that the network verifier error response potentially gets added to the incident twice: once here and again below, no?
Would it make more sense to call notes.AppendWarning in this block, and just add those notes at the end?
logging.Infof("ClusterVersion channel: %s", clusterVersion.Spec.Channel)
logging.Infof("ClusterVersion found: %s", clusterVersion.Status.Desired.Version)
logging.Debugf("ClusterVersion conditions: %+v", clusterVersion.Status.Conditions)
Do we need to log this info? Or is it leftover from debugging?
We also log the condition below, so it seems rather repetitive
logging.Warnf("Detected ClusterVersion error: Reason=%s, Message=%s", condition.Reason, condition.Message)
return "", fmt.Sprintf("ClusterVersion error detected: %s. Current version %s not found in channel %s",
	condition.Message, clusterVersion.Status.Desired.Version, clusterVersion.Spec.Channel),
	fmt.Errorf("clusterversion validation failed: %s", condition.Reason)
I find it a tad confusing that we're returning an error here. It's not the state we want to see the clusterversion in, but I'd expect an error return to mean that we couldn't check the clusterversion, not that we were able to check the clusterversion but didn't like the state we saw it in.
74aa572 to 7f8407a
# Testing CannotRetrieveUpdatesSRE Investigation

TODO:
- Add a test script or test objects to this directory for future maintainers to use
- Edit this README file and add detailed instructions on how to use the script/objects to recreate the conditions for the investigation. Be sure to include any assumptions or prerequisites about the environment (disable hive syncsetting, etc)
This README still contains just the default text. Can we populate it with instructions on how to test this investigation?
switch verifierResult {
case networkverifier.Failure:
	logging.Infof("Network verifier reported failure: %s", failureReason)
	// XXX: metrics.Inc(metrics.ServicelogPrepared, investigationName)
Suggested change (remove this line):
// XXX: metrics.Inc(metrics.ServicelogPrepared, investigationName)
I was going to remove it, but it also exists in the cpd.go and insightoperatordown.go files, so I kept it.
I think we should remove it for now 🙂 If/when we add metrics to CAD, we'll have to scan the codebase for instances of servicelogs being sent anyway
} else {
	switch verifierResult {
	case networkverifier.Failure:
		logging.Infof("Network verifier reported failure: %s", failureReason)
logging.Infof("Network verifier reported failure: %s", failureReason)
We probably can exclude this logging line since we're noting the failure below too?
logging.Warnf("Detected ClusterVersion issue: Reason=%s, Message=%s", condition.Reason, condition.Message)
return "", fmt.Sprintf("ClusterVersion issue detected: %s. Current version %s not found in channel %s",
	condition.Message, clusterVersion.Status.Desired.Version, clusterVersion.Spec.Channel),
	fmt.Errorf("clusterversion has undesirable state: %s", condition.Reason)
The way this function is written and called above, returning an error here won't allow the note to be added to the PD incident.
I'd prefer if we didn't return an error at all, honestly, because that should be reserved for cases where we couldn't check the state of the cluster, rather than cases where we did check the state of the cluster, but didn't like what we saw.
for _, condition := range clusterVersion.Status.Conditions {
	if condition.Type == "RetrievedUpdates" && condition.Status == "False" {
		if (condition.Reason == "VersionNotFound" || condition.Reason == "RemoteFailed") &&
			strings.Contains(strings.TrimSpace(condition.Message), "Unable to retrieve available updates") {
			logging.Warnf("Detected ClusterVersion issue: Reason=%s, Message=%s", condition.Reason, condition.Message)
			return "", fmt.Sprintf("ClusterVersion issue detected: %s. Current version %s not found in channel %s",
				condition.Message, clusterVersion.Status.Desired.Version, clusterVersion.Spec.Channel),
				fmt.Errorf("clusterversion has undesirable state: %s", condition.Reason)
		}
	}
}
Suggested change:
for _, condition := range clusterVersion.Status.Conditions {
	if condition.Type == "RetrievedUpdates" {
		note, err := checkRetrievedUpdatesCondition(condition)
		return clusterVersion.Status.Desired.Version, note, err
	}
}
It looks like we're only looking to analyze the state of one condition here, correct? Would it make more sense to break some of this out into its own function?
Something like the following may make the nested conditionals more readable:
func checkRetrievedUpdatesCondition(condition corev1.StatusCondition) (string, error) {
	if condition.Status == corev1.ConditionFalse {
		return ...
	}
	if condition.Reason == ... {
		return ...
	}
}
Additionally, we won't waste CPU cycles continuing to loop through conditions we don't need to analyze once we find the one we're after.
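For reference, a self-contained sketch of how such a helper could look with early returns. `statusCondition` is a hypothetical local type used so the example compiles on its own; the reasons and message substring are taken from the code under review, and the helper here returns only a note to keep the example small:

```go
package main

import (
	"fmt"
	"strings"
)

// statusCondition is a hypothetical stand-in for the real condition type.
type statusCondition struct {
	Type, Status, Reason, Message string
}

// checkRetrievedUpdatesCondition returns a note describing the failure, or
// an empty string if the condition does not match the known failure shape.
func checkRetrievedUpdatesCondition(c statusCondition) string {
	if c.Status != "False" {
		return ""
	}
	if c.Reason != "VersionNotFound" && c.Reason != "RemoteFailed" {
		return ""
	}
	if !strings.Contains(c.Message, "Unable to retrieve available updates") {
		return ""
	}
	return fmt.Sprintf("ClusterVersion issue detected: %s", c.Message)
}

// retrievedUpdatesNote stops at the first RetrievedUpdates condition instead
// of walking the remaining conditions.
func retrievedUpdatesNote(conds []statusCondition) string {
	for _, c := range conds {
		if c.Type == "RetrievedUpdates" {
			return checkRetrievedUpdatesCondition(c)
		}
	}
	return ""
}

func main() {
	conds := []statusCondition{
		{Type: "Available", Status: "True"},
		{
			Type:    "RetrievedUpdates",
			Status:  "False",
			Reason:  "VersionNotFound",
			Message: "Unable to retrieve available updates: version not in channel",
		},
	}
	fmt.Println(retrievedUpdatesNote(conds))
}
```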
@anispate: all tests passed!
func (c *Investigation) Run(r *investigation.Resources) (investigation.InvestigationResult, error) {
	result := investigation.InvestigationResult{}
	notes := notewriter.New("CannotRetrieveUpdatesSRE", logging.RawLogger)
	k8scli, err := k8sclient.New(r.Cluster.ID(), r.OcmClient, remediationName)
issue: I think this will fail, as backplane will look for the metadata.yaml in the path pkg/investigations/CannotRetrieveUpdatesSRE/metadata.yaml, but the path is lowercase.
But I have updated the path to lowercase for the metadata.yaml as part of the PR, so would it still cause an issue?
remediationName uses camel case in your PR, correct? And the path is not camel case in this PR, so it's a mismatch as far as I can tell.
notes.AppendWarning("Alert escalated to on-call primary for review.")
logging.Infof("Escalating incident with notes for cluster %s", r.Cluster.ID())
err = r.PdClient.EscalateIncidentWithNote(notes.String())
if err != nil {
	logging.Errorf("Failed to escalate incident to PagerDuty: %v", err)
	return result, fmt.Errorf("failed to escalate incident: %w", err)
}
logging.Infof("Investigation completed and escalated successfully for cluster %s", r.Cluster.ID())

return result, nil
We don't need to add in the PD note that we escalated the alert to SRE; it should be obvious if the SRE has the alert on their board ;)
There might be a few superfluous logs here as well. The following should suffice:
return result, r.PdClient.EscalateIncidentWithNote(notes.String())
}

// checkClusterVersion retrieves the cluster version
func checkClusterVersion(k8scli client.Client, clusterID string) (version string, note string, err error) {
This function does more than the name suggests:
- it retrieves the cluster version
- checks if the updates could be retrieved
- pre-formats a note
Could we separate the logic here for the functions to be more clear cut on what they are doing?
E.g.
func getClusterVersion()
-> no logs, just gets the cluster version object
func getUpdateRetrievalFailures(clusterversion)
-> looks for update retrieval failures in the clusterversion
Logs and note creation should be outside of this logic, ideally in the main investigation function logic. This way, we could even move the above functions in common packages as they are re-usable.
OSD-18645 - CAD implementation for CannotRetrieveUpdatesSRE
Sample ticket: https://redhat.pagerduty.com/incidents/Q1S45W54TK1QKU#:~:text=%E2%9A%A0%EF%B8%8F%20ClusterVersion%20error%20detected,primary%20for%20review
Updated sample ticket: https://redhat.pagerduty.com/incidents/Q2UVRI8YGLPP3G