OSD-18645 - Initial implementation for CannotRetrieveUpdatesSRE #404


Open

wants to merge 1 commit into base: main

Conversation

Contributor

@anispate anispate commented Apr 4, 2025

Contributor

openshift-ci bot commented Apr 4, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: anispate
Once this PR has been reviewed and has the lgtm label, please assign dustman9000 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


codecov-commenter commented Apr 7, 2025

Codecov Report

Attention: Patch coverage is 28.37838% with 53 lines in your changes missing coverage. Please review.

Project coverage is 32.01%. Comparing base (6ab773b) to head (685883f).

Files with missing lines Patch % Lines
...nnotretrieveupdatessre/cannotretrieveupdatessre.go 28.37% 52 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #404      +/-   ##
==========================================
- Coverage   32.04%   32.01%   -0.03%     
==========================================
  Files          35       36       +1     
  Lines        2431     2505      +74     
==========================================
+ Hits          779      802      +23     
- Misses       1593     1643      +50     
- Partials       59       60       +1     
Files with missing lines Coverage Δ
pkg/investigations/registry.go 0.00% <ø> (ø)
...nnotretrieveupdatessre/cannotretrieveupdatessre.go 28.37% <28.37%> (ø)

... and 1 file with indirect coverage changes


@anispate anispate force-pushed the OSD-18645 branch 4 times, most recently from bcb9b3c to f246d1f Compare April 8, 2025 18:17
@anispate anispate force-pushed the OSD-18645 branch 3 times, most recently from c933fc3 to 02d2c44 Compare April 10, 2025 21:11
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 9, 2025
@anispate anispate force-pushed the OSD-18645 branch 5 times, most recently from ac7a009 to ea1ba08 Compare May 9, 2025 04:09
@anispate anispate changed the title [DRAFT] - OSD-18645 - Initial implementation for CannotRetrieveUpdatesSRE OSD-18645 - Initial implementation for CannotRetrieveUpdatesSRE May 9, 2025
Member

Looks like this is still the boilerplate document from running the bootstrap-new-investigation make target.

Can you revise this to include detailed steps for another SRE to test your work? Additional objects, scripts, etc. can also be added to this testing/ directory.

Comment on lines 33 to 48
defer func() {
	deferErr := k8sclient.Cleanup(r.Cluster.ID(), r.OcmClient, remediationName)
	if deferErr != nil {
		logging.Error(deferErr)
		err = errors.Join(err, deferErr)
	}
}()

defer func(r *investigation.Resources) {
	logging.Infof("Cleaning up investigation resources for cluster %s", r.Cluster.ID())
	if cleanupErr := k8sclient.Cleanup(r.Cluster.ID(), r.OcmClient, remediationName); cleanupErr != nil {
		logging.Errorf("Failed to cleanup Kubernetes client: %v", cleanupErr)
	} else {
		logging.Infof("Cleanup completed successfully for cluster %s", r.Cluster.ID())
	}
}(r)
Member

Looks like you're cleaning up twice here
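The two defers above can collapse into one. Below is a minimal, self-contained sketch of that single-defer pattern; `cleanup`, `run`, the cluster ID, and the error strings are stand-ins invented for illustration, not CAD's real `k8sclient` API:

```go
package main

import (
	"errors"
	"fmt"
)

// cleanup is a hypothetical stand-in for k8sclient.Cleanup.
func cleanup(clusterID string) error {
	return fmt.Errorf("cleanup failed for %s", clusterID)
}

// run sketches a single deferred cleanup whose error is joined into the
// named return, instead of two defers that each clean up separately.
func run(clusterID string) (err error) {
	defer func() {
		if cleanupErr := cleanup(clusterID); cleanupErr != nil {
			err = errors.Join(err, cleanupErr)
		}
	}()
	return errors.New("investigation failed")
}

func main() {
	// The caller sees both the investigation error and the cleanup error.
	fmt.Println(run("cluster-123"))
}
```

With `errors.Join`, neither failure masks the other, and there is exactly one cleanup call per `Run` invocation.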

Comment on lines 52 to 47
logging.Errorf("Network verifier ran into an error: %s", err.Error())
notes.AppendWarning("NetworkVerifier failed to run:\n\t %s", err.Error())
Member

@tnierman tnierman May 12, 2025

I don't believe we need to both log and note this - using notes.AppendWarning should be sufficient to add this message to the logs as well

Comment on lines 55 to 59
err = r.PdClient.AddNote(notes.String())
if err != nil {
	// We do not return, as we want the alert to be escalated no matter what.
	logging.Error("could not add failure reason incident notes")
}
Member

@tnierman tnierman May 12, 2025

Calling AddNote() here without returning will mean that the network verifier error response potentially gets added to the incident twice: once here and again below, no?

Would it make more sense to notes.AppendWarning in this block, and just add those notes at the end?
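The accumulate-then-attach pattern suggested here can be sketched as follows; `noteWriter` and `escalate` are hypothetical stand-ins for CAD's `notewriter` package and `PdClient.EscalateIncidentWithNote`, simplified for a runnable example:

```go
package main

import (
	"fmt"
	"strings"
)

// noteWriter is a minimal stand-in for the notewriter package: it just
// collects warning lines into a buffer.
type noteWriter struct{ b strings.Builder }

func (n *noteWriter) AppendWarning(format string, args ...any) {
	fmt.Fprintf(&n.b, "⚠️ "+format+"\n", args...)
}

func (n *noteWriter) String() string { return n.b.String() }

// escalate is a stand-in for posting the note to PagerDuty once.
func escalate(note string) error {
	fmt.Print(note)
	return nil
}

func main() {
	notes := &noteWriter{}
	// Mid-investigation failures are only recorded, not posted...
	notes.AppendWarning("NetworkVerifier failed to run:\n\t %s", "timeout")
	notes.AppendWarning("ClusterVersion issue detected: %s", "VersionNotFound")
	// ...and everything is attached in a single call at the end.
	_ = escalate(notes.String())
}
```

This way each finding appears in the incident exactly once, regardless of how many checks ran into trouble.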

Comment on lines 130 to 95
logging.Infof("ClusterVersion channel: %s", clusterVersion.Spec.Channel)
logging.Infof("ClusterVersion found: %s", clusterVersion.Status.Desired.Version)
logging.Debugf("ClusterVersion conditions: %+v", clusterVersion.Status.Conditions)
Member

Do we need to log this info? Or is it leftover from debugging?

We also log the condition below, so it seems rather repetitive

logging.Warnf("Detected ClusterVersion error: Reason=%s, Message=%s", condition.Reason, condition.Message)
return "", fmt.Sprintf("ClusterVersion error detected: %s. Current version %s not found in channel %s",
		condition.Message, clusterVersion.Status.Desired.Version, clusterVersion.Spec.Channel),
	fmt.Errorf("clusterversion validation failed: %s", condition.Reason)
Member

I find it a tad bit confusing that we're returning an error here: it's not the state we want to see the clusterversion in, but I'd expect an error return to mean that we couldn't check the clusterversion, not that we were able to check the clusterversion but didn't like the state we saw it in

@anispate anispate force-pushed the OSD-18645 branch 4 times, most recently from 74aa572 to 7f8407a Compare May 12, 2025 21:15
Comment on lines +1 to +5
# Testing CannotRetrieveUpdatesSRE Investigation

TODO:
- Add a test script or test objects to this directory for future maintainers to use
- Edit this README file and add detailed instructions on how to use the script/objects to recreate the conditions for the investigation. Be sure to include any assumptions or prerequisites about the environment (disable hive syncsetting, etc)
Member

This README still contains just the default text. Can we populate it with instructions on how to test this investigation?

switch verifierResult {
case networkverifier.Failure:
	logging.Infof("Network verifier reported failure: %s", failureReason)
	// XXX: metrics.Inc(metrics.ServicelogPrepared, investigationName)
Member

Suggested change
// XXX: metrics.Inc(metrics.ServicelogPrepared, investigationName)

Contributor Author

I was going to remove it, but it exists in the cpd.go and insightoperatordown.go files too, so I kept it.

Member

I think we should remove it for now 🙂 If/when we add metrics to CAD, we'll have to scan the codebase for instances of servicelogs being sent anyway

} else {
	switch verifierResult {
	case networkverifier.Failure:
		logging.Infof("Network verifier reported failure: %s", failureReason)
Member

Suggested change
logging.Infof("Network verifier reported failure: %s", failureReason)

We probably can exclude this logging line since we're noting the failure below too?

logging.Warnf("Detected ClusterVersion issue: Reason=%s, Message=%s", condition.Reason, condition.Message)
return "", fmt.Sprintf("ClusterVersion issue detected: %s. Current version %s not found in channel %s",
		condition.Message, clusterVersion.Status.Desired.Version, clusterVersion.Spec.Channel),
	fmt.Errorf("clusterversion has undesirable state: %s", condition.Reason)
Member

The way this function is written and called above, returning an error here won't allow the note to be added to the PD incident.

I'd prefer if we didn't return an error at all, honestly, because that should be reserved for cases where we couldn't check the state of the cluster, rather than cases where we did check the state of the cluster, but didn't like what we saw.
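The error-semantics split argued for here can be sketched as below. The `condition` struct is a simplified stand-in for the OpenShift condition type, and `checkRetrievedUpdates` is a hypothetical helper, so this illustrates the convention rather than the project's actual code:

```go
package main

import "fmt"

// condition mirrors only the fields this sketch needs from a
// ClusterVersion status condition; field types are simplified.
type condition struct {
	Type, Status, Reason, Message string
}

// checkRetrievedUpdates reserves the error return for "we could not
// check the cluster state". An undesirable-but-observed state is
// reported as a note with a nil error, so the caller can still attach
// it to the PD incident.
func checkRetrievedUpdates(cond condition) (note string, err error) {
	if cond.Type != "RetrievedUpdates" {
		return "", fmt.Errorf("unexpected condition type %q", cond.Type)
	}
	if cond.Status == "False" {
		// Bad state, but the check itself succeeded: note, no error.
		return fmt.Sprintf("ClusterVersion issue detected: %s (%s)", cond.Message, cond.Reason), nil
	}
	return "", nil
}

func main() {
	note, err := checkRetrievedUpdates(condition{
		Type: "RetrievedUpdates", Status: "False",
		Reason: "VersionNotFound", Message: "Unable to retrieve available updates",
	})
	fmt.Println(note, err)
}
```

Callers then treat a non-nil error as "investigation could not proceed" and a non-empty note as "finding to report", which keeps escalation and note-posting on a single path.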

Comment on lines 96 to 105
for _, condition := range clusterVersion.Status.Conditions {
	if condition.Type == "RetrievedUpdates" && condition.Status == "False" {
		if (condition.Reason == "VersionNotFound" || condition.Reason == "RemoteFailed") &&
			strings.Contains(strings.TrimSpace(condition.Message), "Unable to retrieve available updates") {
			logging.Warnf("Detected ClusterVersion issue: Reason=%s, Message=%s", condition.Reason, condition.Message)
			return "", fmt.Sprintf("ClusterVersion issue detected: %s. Current version %s not found in channel %s",
					condition.Message, clusterVersion.Status.Desired.Version, clusterVersion.Spec.Channel),
				fmt.Errorf("clusterversion has undesirable state: %s", condition.Reason)
		}
	}
}
Member

Suggested change
for _, condition := range clusterVersion.Status.Conditions {
	if condition.Type == "RetrievedUpdates" && condition.Status == "False" {
		if (condition.Reason == "VersionNotFound" || condition.Reason == "RemoteFailed") &&
			strings.Contains(strings.TrimSpace(condition.Message), "Unable to retrieve available updates") {
			logging.Warnf("Detected ClusterVersion issue: Reason=%s, Message=%s", condition.Reason, condition.Message)
			return "", fmt.Sprintf("ClusterVersion issue detected: %s. Current version %s not found in channel %s",
					condition.Message, clusterVersion.Status.Desired.Version, clusterVersion.Spec.Channel),
				fmt.Errorf("clusterversion has undesirable state: %s", condition.Reason)
		}
	}
}
for _, condition := range clusterVersion.Status.Conditions {
	if condition.Type == "RetrievedUpdates" {
		note, err := checkRetrievedUpdatesCondition(condition)
		return clusterVersion.Status.Desired.Version, note, err
	}
}

It looks like we're only looking to analyze the state of one condition here, correct? Would it make more sense to break some of this out into its own function?

Something like the following may make the nested conditionals more readable:

func checkRetrievedUpdatesCondition(condition configv1.ClusterOperatorStatusCondition) (string, error) {
    if condition.Status == configv1.ConditionFalse {
        return ...
    }
    if condition.Reason == ... {
        return ...
    }
}

Additionally, we won't waste CPU cycles continuing to loop through conditions we don't need to analyze once we find the one we're after.

Contributor

openshift-ci bot commented May 13, 2025

@anispate: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

func (c *Investigation) Run(r *investigation.Resources) (investigation.InvestigationResult, error) {
	result := investigation.InvestigationResult{}
	notes := notewriter.New("CannotRetrieveUpdatesSRE", logging.RawLogger)
	k8scli, err := k8sclient.New(r.Cluster.ID(), r.OcmClient, remediationName)
Member

issue: I think this will fail as backplane will look for the metadata.yaml in the path pkg/investigations/CannotRetrieveUpdatesSRE/metadata.yaml, but the path is lowercase.

Contributor Author

But I updated the path to lowercase for the metadata.yaml as part of this PR, so would it still cause an issue?

Member

@typeid typeid May 14, 2025

remediationName uses camel case in your PR, correct? And the path is not camel case in this PR, so it's a mismatch as far as I can tell

Comment on lines +74 to +83
notes.AppendWarning("Alert escalated to on-call primary for review.")
logging.Infof("Escalating incident with notes for cluster %s", r.Cluster.ID())
err = r.PdClient.EscalateIncidentWithNote(notes.String())
if err != nil {
	logging.Errorf("Failed to escalate incident to PagerDuty: %v", err)
	return result, fmt.Errorf("failed to escalate incident: %w", err)
}
logging.Infof("Investigation completed and escalated successfully for cluster %s", r.Cluster.ID())

return result, nil
Member

We don't need to add in the PD note that we escalated the alert to SRE; it should be obvious if the SRE has the alert on their board ;)

There might be a few superfluous logs here as well, the following should suffice:

Suggested change
notes.AppendWarning("Alert escalated to on-call primary for review.")
logging.Infof("Escalating incident with notes for cluster %s", r.Cluster.ID())
err = r.PdClient.EscalateIncidentWithNote(notes.String())
if err != nil {
	logging.Errorf("Failed to escalate incident to PagerDuty: %v", err)
	return result, fmt.Errorf("failed to escalate incident: %w", err)
}
logging.Infof("Investigation completed and escalated successfully for cluster %s", r.Cluster.ID())
return result, nil
return result, r.PdClient.EscalateIncidentWithNote(notes.String())

}

// checkClusterVersion retrieves the cluster version
func checkClusterVersion(k8scli client.Client, clusterID string) (version string, note string, err error) {
Member

@typeid typeid May 13, 2025

This function does more than the name suggests:

  • it retrieves the cluster version
  • checks if the updates could be retrieved
  • pre-formats a note

Could we separate the logic here so each function is more clear-cut about what it's doing?

E.g.

func getClusterVersion() -> no logs, just gets the cluster version object
func getUpdateRetrievalFailures(clusterversion) -> looks for update retrieval failures in the clusterversion

Logs and note creation should be outside of this logic, ideally in the main investigation function logic. This way, we could even move the above functions in common packages as they are re-usable.
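The split proposed above could look roughly like the following. The `clusterVersion` and `cvCondition` types are simplified stand-ins for the OpenShift API objects (the real `getClusterVersion` would fetch via a `client.Client`), so this is a sketch of the separation, not the project's code:

```go
package main

import "fmt"

// cvCondition and clusterVersion mirror only the fields the sketch
// needs from the OpenShift ClusterVersion object.
type cvCondition struct {
	Type, Status, Reason, Message string
}

type clusterVersion struct {
	Channel    string
	Desired    string
	Conditions []cvCondition
}

// getUpdateRetrievalFailures only inspects the object: no logging, no
// note formatting, which is what would let it live in a shared package.
func getUpdateRetrievalFailures(cv clusterVersion) []cvCondition {
	var failures []cvCondition
	for _, c := range cv.Conditions {
		if c.Type == "RetrievedUpdates" && c.Status == "False" {
			failures = append(failures, c)
		}
	}
	return failures
}

func main() {
	cv := clusterVersion{
		Channel: "stable-4.16",
		Desired: "4.16.8",
		Conditions: []cvCondition{{
			Type: "RetrievedUpdates", Status: "False",
			Reason: "VersionNotFound", Message: "Unable to retrieve available updates",
		}},
	}
	// Logging and note creation stay in the investigation's Run function.
	for _, f := range getUpdateRetrievalFailures(cv) {
		fmt.Printf("ClusterVersion issue detected: %s. Current version %s not found in channel %s\n",
			f.Message, cv.Desired, cv.Channel)
	}
}
```

Because the helper returns data instead of pre-formatted notes, the caller can choose to log, note, or ignore each failure, and the helper stays trivially unit-testable.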

6 participants