Clearing Old Conditions #1011
Comments
here's example output, i'll add a PR shortly
removed 120k+ conditions over 20k+ nodes using the linked PR's code just this morning. this is in addition to the 4k-6k conditions i removed while testing this 3 weeks back.
NPD treats node conditions as permanent issues of the node. If there is a remedy system that fixes the issue, that system should also be responsible for cleaning up the conditions. Why would we need NPD to do such cleanup?
Is this mostly for dev and test purposes instead of for production use cases?
the solution presented in the attached PR is for a situation where a condition has been removed from NPD's config, but still exists on nodes in your fleet in whatever state the config would have applied. say, for example, you're running NPD as a daemonset and you've configured a journald monitor to apply conditions as follows:
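(the exact contents don't matter much; here's a made-up sketch of the kind of journald monitor config i mean, where the source, condition type, reason, and pattern are placeholders rather than our real ones:)

```json
{
  "plugin": "journald",
  "pluginConfig": {
    "source": "dockerd"
  },
  "logPath": "/var/log/journal",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "docker-monitor",
  "conditions": [
    {
      "type": "DockerdHung",
      "reason": "DockerdIsHealthy",
      "message": "dockerd is responding"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "DockerdHung",
      "reason": "DockerdUnresponsive",
      "pattern": "Error while dialing unix /var/run/docker\\.sock: connect: connection refused"
    }
  ]
}
```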
and 5 out of 100 nodes get this condition applied. you then go and change the config to look for a broader error pattern, maybe it looks like this:
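(again a placeholder sketch; the point is only that the old condition type is gone from the config and the rule now matches a wider set of log lines:)

```json
{
  "plugin": "journald",
  "pluginConfig": {
    "source": "dockerd"
  },
  "logPath": "/var/log/journal",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "docker-monitor",
  "conditions": [
    {
      "type": "ContainerRuntimeUnhealthy",
      "reason": "ContainerRuntimeIsHealthy",
      "message": "container runtime is responding"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "ContainerRuntimeUnhealthy",
      "reason": "ContainerRuntimeUnresponsive",
      "pattern": ".*docker\\.sock.*(connection refused|no such file or directory).*"
    }
  ]
}
```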
after the pods come back up, all 100 nodes will still have the old condition. the issue is, we are no longer watching for the pattern which presented the condition, so nothing will ever update or clear it. we are using this at my workplace; it is actively in production today.
I see. So the use case is adding additional APIs to the config, so that you can declare which conditions were previously added by NPD but are no longer needed, and have them cleaned up. Then during NPD startup, those get removed. How often do you need this, btw? We had assumed that the config would be stable in general. Why don't you need those deprecated conditions any more?
in some cases, log lines change and the language for the condition name may need to follow (this can happen on kernel upgrades, driver upgrades, etc). sometimes we change condition names or patterns; in some cases we have taken a single pattern and split it into multiple, more specific patterns. in all cases we don't want "unmanaged" conditions lying around on our fleet, so we remove them with these flags. i wrote this feature late last year (this issue and PR are quite old now ;) ) and we have updated the list of deprecated conditions twice by now.
The feature of clearing the condition is very useful. We have recently encountered a similar problem. Has this PR been merged into the master branch? How should I use it?
we have observed that when changing our system log monitor configurations to omit a previously watched condition, the condition persists on the node object.
I have added a bool flag `--delete-deprecated-conditions` and a stringSliceFlag `--deprecated-condition-types`, plus a handler in the `k8sexporter` that will delete conditions from a node object on exporter initialization. Would this community be interested in a PR that supplies this feature?
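If it helps to picture the intended usage: assuming the flags land as described above, invoking NPD would look roughly like this (the binary path, monitor config path, and condition type names are illustrative; only the two new flags come from the proposal):

```
/node-problem-detector \
  --config.system-log-monitor=/config/docker-monitor.json \
  --delete-deprecated-conditions \
  --deprecated-condition-types=DockerdHung,SomeOtherOldCondition
```

On startup, the k8sexporter would then remove any of the listed condition types from the node's status before normal monitoring begins.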