This repository was archived by the owner on Aug 16, 2023. It is now read-only.

Detect when the nightly CI fails because of a cluster shutdown #150

Closed
kpouget opened this issue May 10, 2021 · 3 comments
Labels
area/ci
good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
kind/flake: Categorizes issue or PR as related to a flaky test.

Comments

@kpouget
Collaborator

kpouget commented May 10, 2021

Every once in a while, the nightly testing fails because the cluster becomes unreachable:

roles/capture_environment/tasks/main.yml:8
TASK: capture_environment : Store OpenShift YAML version
----- FAILED ----
msg: non-zero return code

<command> oc version -oyaml > /logs/artifacts/233800__cluster__capture_environment/ocp_version.yml

<stderr> The connection to the server api.ci-op-75fhpdb3-3c6fc.origin-ci-int-aws.dev.rhcloud.com:6443 was refused - did you specify the right host or port?
----- FAILED ----

This kind of failure is independent of the GPU Operator testing, and it should be made clear in the CI-Dashboard (the Prow infrastructure restarts the testing when this happens). An orange dot could do the job, with a label like "cluster issue detected".

To detect this, the must-gather script could simply create a cluster-down file when oc version doesn't work.
The presence of this file would then tell the ci-dashboard to set the orange flag.
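
As an illustration, here is a minimal sketch of what such a check could look like, assuming a bash must-gather script and an artifacts directory available as ARTIFACT_DIR (the variable name and the exact location of the cluster-down marker are assumptions, not the actual script layout):

#!/bin/bash
# Sketch only: leave a marker file when the cluster API server cannot be
# reached, so the ci-dashboard can show the orange "cluster issue detected"
# flag instead of a plain red failure.
if ! oc version > /dev/null 2>&1; then
    touch "${ARTIFACT_DIR}/cluster-down"
fi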

@kpouget added the area/ci, kind/flake and good first issue labels on May 10, 2021
@kpouget
Collaborator Author

kpouget commented Jun 9, 2021

Some steps and thoughts:

a) looking at the test history in our Slack #psap-ci-alerts channel, we can see that sometimes the cluster gets disconnected in the middle of our testing;
see this recent test for an example: the step 002__entitlement__deploy failed because the cluster was unreachable:

2021-06-08 23:39:24,514 p=630 u=psap-ci-runner n=ansible | <stderr> from server for: "STDIN": Get "https://api.ci-op-cw4km932-3a604.origin-ci-int-aws.dev.rhcloud.com:6443/apis/machineconfiguration.openshift.io/v1/machineconfigs/50-rhsm-conf": dial tcp 52.200.204.34:6443: connect: connection refused
2021-06-08 23:39:24,514 p=630 u=psap-ci-runner n=ansible | ----- FAILED ----

b) in the ci-dashboard, when this happens, the test is shown with a :red_jenkins_circle:, even though the cluster went down before the GPU Operator test itself could fail

So, to improve that:

  1. in the nightly tests, add a trap ... ERR handler and test whether the cluster is still reachable (oc version); if it is unreachable, touch $ARTIFACTS/CLUSTER_DISCONNECTED (see the sketch after this list)

  2. in the ci-dashboard, if the test failed, check whether $ARTIFACTS/CLUSTER_DISCONNECTED exists; if it does, indicate one way or another that the test could not complete
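
As a rough sketch of step 1, assuming a bash entrypoint for the nightly tests and an $ARTIFACTS directory exported by the CI environment (the function name and surrounding structure are hypothetical):

#!/bin/bash
# Sketch of the proposed ERR trap: whenever a test command fails, check
# whether the cluster API server is still reachable; if it is not, leave
# a marker file for the ci-dashboard.
mark_cluster_disconnected() {
    if ! oc version > /dev/null 2>&1; then
        touch "$ARTIFACTS/CLUSTER_DISCONNECTED"
    fi
}
trap mark_cluster_disconnected ERR

# ... nightly test steps run here ...

On the ci-dashboard side (step 2), the check boils down to testing for that marker before deciding how to flag the run, e.g. [ -f "$ARTIFACTS/CLUSTER_DISCONNECTED" ] && echo "cluster issue detected".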

@kpouget
Collaborator Author

kpouget commented Nov 29, 2021

/close

done, see https://github.com/openshift-psap/ci-artifacts/blob/master/testing/run#L100

@openshift-ci
Contributor

openshift-ci bot commented Nov 29, 2021

@kpouget: Closing this issue.

In response to this:

/close

done, see https://github.com/openshift-psap/ci-artifacts/blob/master/testing/run#L100

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this as completed Nov 29, 2021