janitor: track when cleanup fails repeatedly for the same resource #15

ixdy · 2020-05-29T00:50:14Z

Originally filed as kubernetes/test-infra#15866

Due to programming errors, the janitor may continuously fail to clean up a resource. Two examples I just discovered:

possibly an order-of-deletion issue:

{"error":"exit status 1","level":"info","msg":"failed to clean up project kube-gke-upg-1-2-1-3-upg-clu-n, error info: Activated service account credentials for: [[email protected]]\nERROR: (gcloud.compute.networks.delete) Could not fetch resource:\n - The network resource 'projects/kube-gke-upg-1-2-1-3-upg-clu-n/global/networks/jenkins-e2e' is already being used by 'projects/kube-gke-upg-1-2-1-3-upg-clu-n/global/routes/default-route-92807148d5aa60d1'\n\nError try to delete resources networks: CalledProcessError()\n[=== Start Janitor on project 'kube-gke-upg-1-2-1-3-upg-clu-n' ===]\n[=== Activating service_account /etc/service-account/service-account.json ===]\n[=== Finish Janitor on project 'kube-gke-upg-1-2-1-3-upg-clu-n' with status 1 ===]\n","time":"2020-01-10T21:03:14Z"}

likely incorrect flags (gcloud changed but we didn't?):

{"error":"exit status 1","level":"info","msg":"failed to clean up project k8s-jkns-e2e-gke-ci-canary, error info: Activated service account credentials for: [[email protected]]\nERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global \n\nTo search the help text of gcloud commands, run:\n  gcloud help -- SEARCH_TERMS\nError try to delete resources disks: CalledProcessError()\nERROR: (gcloud.compute.disks.delete) unrecognized arguments: --region=https://www.googleapis.com/compute/v1/projects/k8s-jkns-e2e-gke-ci-canary/regions/us-central1 \n\nTo search the help text of gcloud commands, run:\n  gcloud help -- SEARCH_TERMS\nError try to delete resources disks: CalledProcessError()\n[=== Start Janitor on project 'k8s-jkns-e2e-gke-ci-canary' ===]\n[=== Activating service_account /etc/service-account/service-account.json ===]\n[=== Finish Janitor on project 'k8s-jkns-e2e-gke-ci-canary' with status 1 ===]\n","time":"2020-01-10T21:18:55Z"}

It'd be good to have some way of detecting when we're repeatedly failing to clean up a resource.
Not sure yet what the best way would be to track that.
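
One possible starting point, as a minimal sketch only: count failures per project straight from the janitor's log output, assuming entries keep the JSON shape and the "failed to clean up project NAME," wording shown above. The names `count_failures`, `repeat_offenders`, and `threshold` are hypothetical and do not exist in boskos.

```python
import json
import re
from collections import Counter

# Assumes messages keep the "failed to clean up project NAME," wording
# seen in the log entries above.
PROJECT_RE = re.compile(r"failed to clean up project ([^\s,]+)")


def count_failures(log_lines):
    """Return a Counter of cleanup failures keyed by project name."""
    failures = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON log entries
        if not isinstance(entry, dict):
            continue
        match = PROJECT_RE.search(str(entry.get("msg", "")))
        if match:
            failures[match.group(1)] += 1
    return failures


def repeat_offenders(log_lines, threshold=3):
    """Projects with at least `threshold` failures (hypothetical knob)."""
    counts = count_failures(log_lines)
    return sorted(name for name, n in counts.items() if n >= threshold)
```

Counting from logs needs no change to the janitor itself, but it only sees whatever log window is fed in. A counter kept alongside the resource state would be the more durable option.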


ixdy · 2020-08-26T00:16:26Z

Two motivations for this feature:

Sometimes resources get into a weird/broken state where they can't be fixed by the janitor and require a human to intervene. Examples were given in the original description.
Sometimes, a roll-out of new GCP features or APIs (e.g. issue #37, "GCP janitor failing when trying to clean up logging sinks") may cause the janitor to start failing on some resources before others. Recognizing repeat failures on a subset of the resources may help us fix the issue before the janitor starts failing universally.
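
To illustrate the second motivation, here is a rough sketch that groups failures by the gcloud command named in the error text and reports what fraction of projects each command is failing on. It assumes errors keep the `ERROR: (gcloud....)` prefix seen in the original description. `failing_commands`, `spread`, and their parameters are hypothetical names, not part of the janitor.

```python
import re
from collections import defaultdict

# Assumes gcloud errors keep the "ERROR: (gcloud.x.y.z)" prefix seen in
# the log entries from the original description.
COMMAND_RE = re.compile(r"ERROR: \((gcloud(?:\.[\w-]+)+)\)")


def failing_commands(messages_by_project):
    """Map each failing gcloud command to the set of projects reporting it."""
    affected = defaultdict(set)
    for project, message in messages_by_project.items():
        for command in COMMAND_RE.findall(message):
            affected[command].add(project)
    return dict(affected)


def spread(messages_by_project, total_projects):
    """Fraction of all projects on which each gcloud command is failing."""
    if total_projects <= 0:
        raise ValueError("total_projects must be positive")
    affected = failing_commands(messages_by_project)
    return {cmd: len(projects) / total_projects
            for cmd, projects in affected.items()}
```

A command failing on a small but growing fraction of projects would suggest an API or flag change rolling out, while a single project failing repeatedly on its own points more toward a resource stuck in a broken state.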

fejta-bot · 2020-11-24T01:13:59Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

ixdy · 2020-11-24T18:46:36Z

/remove-lifecycle stale

fejta-bot · 2021-02-22T19:43:50Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

fejta-bot · 2021-03-24T20:29:36Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

fejta-bot · 2021-04-23T20:31:53Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

k8s-ci-robot · 2021-04-23T20:31:58Z

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ixdy · 2021-05-25T18:05:29Z

/reopen
/remove-lifecycle rotten

k8s-ci-robot · 2021-05-25T18:05:34Z

@ixdy: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2021-11-15T23:19:33Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2021-12-16T00:14:21Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2022-01-15T00:24:27Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2022-01-15T00:24:38Z

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen

Mark this issue or PR as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ixdy added the kind/feature label May 29, 2020

ixdy mentioned this issue May 29, 2020 in kubernetes/test-infra#15866 (boskos/janitor: track when cleanup fails repeatedly for the same resource; closed)

k8s-ci-robot added the lifecycle/stale label Nov 24, 2020

k8s-ci-robot removed the lifecycle/stale label Nov 24, 2020

k8s-ci-robot added the lifecycle/stale label Feb 22, 2021

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 24, 2021

k8s-ci-robot closed this as completed Apr 23, 2021

k8s-ci-robot reopened this May 25, 2021

k8s-ci-robot removed the lifecycle/rotten label May 25, 2021

spiffxp added the sig/testing label Aug 17, 2021

k8s-ci-robot added the lifecycle/stale label Nov 15, 2021

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 16, 2021

k8s-ci-robot closed this as completed Jan 15, 2022
