Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clearing Old Conditions #1011

Open
daveoy opened this issue Jan 15, 2025 · 7 comments · May be fixed by #1022
Open

Clearing Old Conditions #1011

daveoy opened this issue Jan 15, 2025 · 7 comments · May be fixed by #1022

Comments

@daveoy
Copy link
Contributor

daveoy commented Jan 15, 2025

we have observed that when changing our system log monitor configurations to omit a previously watched condition, the condition persists on the node object.

I have added a bool flag --delete-deprecated-conditions and stringSliceFlag --deprecated-condition-types, plus a handler into the k8sexporter that will delete conditions from a node object on exporter initialization.

would this community be interested in a PR that supplies this feature?

@daveoy
Copy link
Contributor Author

daveoy commented Feb 4, 2025

here's example output, i'll add a PR shortly

I0204 17:20:21.902108       7 problem_client.go:127] Deleting deprecated conditions [GPUMMUErrorXid31 JournaldCGroupOOMKilling JournaldGPUApplicationError JournaldGPUECCUncorrectableError JournaldGPUFallenOffBus JournaldGPUFault JournaldGPUGSPTimeoutXid119 JournaldGPUInvalidPushBuffer JournaldGPURowRemapFailure JournaldGPUWantsReset JournaldHardwareErrorCorrected JournaldHardwareErrorFatal JournaldHardwareErrorInfo JournaldHardwareErrorInterruptCPU JournaldHardwareErrorInterruptMemory JournaldHardwareErrorInterruptPCIe JournaldHardwareErrorInterruptUnknown JournaldHardwareErrorRecoverable JournaldKernelDeadlock JournaldKernelFailedToGetEntry JournaldKernelFailedToGetNextEntry JournaldKernelHardlock JournaldKernelOops JournaldKernelWatchLoopStarted JournaldLocalDiskErrors JournaldNFSStorageFault JournaldNVSwitchFailure JournaldNVSXidNonFatal JournaldPCIAER JournaldPersistentStorageFault JournaldReadonlyFilesystem JournaldSystemOOMKilling JournaldTaskHung JournaldUnregisterNetDevice] (if present)...
I0204 17:20:21.911443       7 problem_client.go:140] Deleting deprecated condition JournaldGPUApplicationError
I0204 17:20:21.911461       7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptUnknown
I0204 17:20:21.911465       7 problem_client.go:140] Deleting deprecated condition JournaldGPUFallenOffBus
I0204 17:20:21.911467       7 problem_client.go:140] Deleting deprecated condition JournaldPersistentStorageFault
I0204 17:20:21.911470       7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptMemory
I0204 17:20:21.911472       7 problem_client.go:140] Deleting deprecated condition JournaldReadonlyFilesystem
I0204 17:20:21.911474       7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorFatal
I0204 17:20:21.911477       7 problem_client.go:140] Deleting deprecated condition JournaldKernelDeadlock
I0204 17:20:21.911480       7 problem_client.go:140] Deleting deprecated condition JournaldLocalDiskErrors
I0204 17:20:21.911483       7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptPCIe
I0204 17:20:21.911486       7 problem_client.go:140] Deleting deprecated condition JournaldKernelHardlock
I0204 17:20:21.911490       7 problem_client.go:140] Deleting deprecated condition JournaldHardwareErrorInterruptCPU
I0204 17:20:21.911492       7 problem_client.go:140] Deleting deprecated condition JournaldGPURowRemapFailure
I0204 17:20:21.911494       7 problem_client.go:140] Deleting deprecated condition JournaldGPUECCUncorrectableError
I0204 17:20:21.911496       7 problem_client.go:140] Deleting deprecated condition JournaldGPUWantsReset
I0204 17:20:21.911498       7 problem_client.go:140] Deleting deprecated condition JournaldGPUFault
I0204 17:20:21.911500       7 problem_client.go:140] Deleting deprecated condition JournaldGPUGSPTimeoutXid119
I0204 17:20:21.911502       7 problem_client.go:140] Deleting deprecated condition JournaldGPUInvalidPushBuffer

@daveoy
Copy link
Contributor Author

daveoy commented Feb 4, 2025

removed 120k+ conditions over 20k+ nodes using the linked PR's code just this morning. this is in addition to 4k-6k worth of conditions i removed while testing this 3 weeks back

@wangzhen127
Copy link
Member

NPD treat node condition as permanent issue of the node. If there is any remedy system that fixes the issue, that should also be responsible for cleaning up the conditions. Why would we need NPD to do such cleanup?

when changing our system log monitor configurations

Is this mostly for dev and test purpose instead of for production use cases?

@daveoy
Copy link
Contributor Author

daveoy commented Mar 11, 2025

the solution presented in the attached PR is for a situation where a condition has been removed from NPD's config, but still exists in whatever state the. config would have applied it on nodes in your fleet.

say, for example, you're running NPD as a daemonset and you've configured a journald monitor to apply conditions as follows:

{
  "plugin": "journald",
  "pluginConfig": {
        "source": "kubelet"
  },
  "lookback": "5m",
  "logPath": "/var/log/journal",
  "bufferSize": 10,
  "conditions": [
    {
      "type": "RunContainerError",
      "reason": "NoRunContainerError",
      "message": "No RunContainerErrors present"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "RunContainerError",
      "reason": "ContextDeadlineExceeded",
      "pattern": ".*rror syncing pod.*RunContainerError.*context deadline exceeded.*"
    }
  ]
}

and 5 out of 100 nodes get this condition applied.

you then go and change the config to look for a more broad error pattern, maybe it looks like this:

{
  "plugin": "journald",
  "pluginConfig": {
        "source": "kubelet"
  },
  "lookback": "5m",
  "logPath": "/var/log/journal",
  "bufferSize": 10,
  "conditions": [
    {
      "type": "ErrorSyncingPod",
      "reason": "NoErrorSyncingPod",
      "message": "No ErrorSyncingPod present"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "ErrorSyncingPod",
      "reason": "RunContainerError",
      "pattern": ".*rror syncing pod.*"
    }
  ]
}

after the pods come back up, all 100 nodes will still have the condition RunContainerError present (either true or false), and will also now have ErrorSyncingPod present.

the issue is, we are no longer watching for the pattern which presented the RunContainerError condition, which will leave the 5 nodes with True status on the condition lingering, as well as all nodes will have a lingering condition that isn't being monitored by anything.

using --delete-deprecated-conditions --deprecated-condition-types="RunContainerError" will remove that condition from nodes using the supplied solution.

at my workplace, this is actively in production today.

@wangzhen127
Copy link
Member

I see. So the use case is adding additional APIs to the config, so that you can declare which conditions were previously added by NPD, but now no longer needed and want to have them cleaned up. Then during NPD startup time, those get removed.

How often do you need this btw? We had the assumption that the config would be stable in general. Why don't you need those deprecated conditions any more?

@daveoy
Copy link
Contributor Author

daveoy commented Mar 11, 2025

in some cases, log lines change and language for the condition name may need to follow (this can happen on kernel upgrades, driver upgrades etc).

sometimes we change permanent conditions to temporary conditions, and we want to remove the permanent ones from the nodes

in some cases we have taken a single pattern and split it into multiple, more specific patterns.

in all cases we don't want "unmanaged" conditions laying around on our fleet, so we remove them with these flags.

i wrote this feature late last year (this issue and PR are quite old now ;) ) and we have updated the list of deprecated conditions twice by now.

@617406160
Copy link

The feature of clearing the condition is very useful. We have recently encountered a similar problem. If this PR has been merged into the master branch? How should I use it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants