This repository was archived by the owner on Aug 16, 2023. It is now read-only.

Detect when the nightly CI fails because of a cluster shutdown #150

Closed
kpouget opened this issue May 10, 2021 · 3 comments
Labels
area/ci
good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
kind/flake: Categorizes issue or PR as related to a flaky test.

Comments

@kpouget
Collaborator

kpouget commented May 10, 2021

Every once in a while, the nightly testing fails because the cluster becomes unreachable:

roles/capture_environment/tasks/main.yml:8
TASK: capture_environment : Store OpenShift YAML version
----- FAILED ----
msg: non-zero return code

<command> oc version -oyaml > /logs/artifacts/233800__cluster__capture_environment/ocp_version.yml

<stderr> The connection to the server api.ci-op-75fhpdb3-3c6fc.origin-ci-int-aws.dev.rhcloud.com:6443 was refused - did you specify the right host or port?
----- FAILED ----

This kind of failure is independent of the GPU Operator testing, and it should be made clear in the CI-Dashboard (the Prow infrastructure restarts the testing when this happens). An orange dot could do the job, with a label like "cluster issue detected".

To detect this, the must-gather script could simply create a cluster-down file when oc version doesn't work.
The presence of this file would then tell the ci-dashboard to set the orange flag.
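
As an illustration, here is a minimal sketch of what such a check could look like, assuming a bash must-gather script and an artifacts directory available as ARTIFACT_DIR (the variable name and the exact location of the cluster-down marker are assumptions, not the actual script layout):

#!/bin/bash
# Sketch only: leave a marker file when the cluster API server cannot be
# reached, so the ci-dashboard can show the orange "cluster issue detected"
# flag instead of a plain red failure.
if ! oc version > /dev/null 2>&1; then
    touch "${ARTIFACT_DIR}/cluster-down"
fi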

@kpouget added the area/ci, kind/flake and good first issue labels on May 10, 2021
@kpouget
Collaborator Author

kpouget commented Jun 9, 2021

Some steps and thoughts:

a) looking at the test history in our Slack #psap-ci-alerts channel, we can see that sometimes the cluster gets disconnected in the middle of our testing;
see this recent test for an example: the step 002__entitlement__deploy failed because the cluster was unreachable:

2021-06-08 23:39:24,514 p=630 u=psap-ci-runner n=ansible | <stderr> from server for: "STDIN": Get "https://api.ci-op-cw4km932-3a604.origin-ci-int-aws.dev.rhcloud.com:6443/apis/machineconfiguration.openshift.io/v1/machineconfigs/50-rhsm-conf": dial tcp 52.200.204.34:6443: connect: connection refused
2021-06-08 23:39:24,514 p=630 u=psap-ci-runner n=ansible | ----- FAILED ----

b) in the ci-dashboard, when this happens, the test is shown with a :red_jenkins_circle:, even though the cluster went down before the GPU Operator test itself could fail

So, to improve that:

  1. in the nightly tests, add a trap ... ERR handler and test whether the cluster is still reachable (oc version); if it is unreachable, touch $ARTIFACTS/CLUSTER_DISCONNECTED (see the sketch after this list)

  2. in the ci-dashboard, if the test failed, check whether $ARTIFACTS/CLUSTER_DISCONNECTED exists; if it does, indicate one way or another that the test could not complete
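
As a rough sketch of step 1, assuming a bash entrypoint for the nightly tests and an $ARTIFACTS directory exported by the CI environment (the function name and surrounding structure are hypothetical):

#!/bin/bash
# Sketch of the proposed ERR trap: whenever a test command fails, check
# whether the cluster API server is still reachable; if it is not, leave
# a marker file for the ci-dashboard.
mark_cluster_disconnected() {
    if ! oc version > /dev/null 2>&1; then
        touch "$ARTIFACTS/CLUSTER_DISCONNECTED"
    fi
}
trap mark_cluster_disconnected ERR

# ... nightly test steps run here ...

On the ci-dashboard side (step 2), the check boils down to testing for that marker before deciding how to flag the run, e.g. [ -f "$ARTIFACTS/CLUSTER_DISCONNECTED" ] && echo "cluster issue detected".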

@kpouget
Collaborator Author

kpouget commented Nov 29, 2021

/close

done, see https://github.com/openshift-psap/ci-artifacts/blob/master/testing/run#L100

@openshift-ci
Contributor

openshift-ci bot commented Nov 29, 2021

@kpouget: Closing this issue.

In response to this:

/close

done, see https://github.com/openshift-psap/ci-artifacts/blob/master/testing/run#L100

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this as completed Nov 29, 2021