node-feature-discovery sends excessive LIST requests to the API server #1891
Comments
@jslouisyou thank you for reporting the issue. I think you're hitting the issue that #1811 (and #1810, #1815) address. Those will be part of the upcoming v0.17 release of NFD. A possible workaround for NFD v0.16 could be to run NFD with
Looks like the need for v0.17 is urgent.
Thanks @marquiz for updating this issue! It seems that
But I'm not sure whether this feature is required from
Thanks!
What happened: node-feature-discovery of gpu-operator sends excessive LIST requests to the API server
What you expected to happen:
Recently I received several alerts from my K8s cluster reporting that the API server took a very long time to serve a LIST request from gpu-operator. Here's the alert and rule that I'm using:
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{subresource!~"(log|exec|portforward|proxy)",verb!~"^(?:CONNECT|WATCHLIST|WATCH)$"}[10m])) WITHOUT (instance)) > 10
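To make the rule above easier to reason about, here is a minimal Python sketch of how a 0.99 quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general approach behind histogram_quantile. The bucket bounds and counts in sample_buckets are made-up illustration values, not data from this cluster, and the function only approximates the behaviour of the real PromQL function.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Sketch of linear interpolation inside the bucket that contains the
    target rank; it is not a full reimplementation of the PromQL function.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The rank falls into the +Inf bucket: report the highest
                # finite upper bound instead of infinity.
                return buckets[i - 1][0] if i > 0 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative buckets: (upper bound in seconds, request count).
sample_buckets = [(0.1, 50), (1, 90), (5, 96), (10, 98), (30, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, sample_buckets)
print(f"p99 latency estimate: {p99:.1f}s, alert fires: {p99 > 10}")
```

With these sample values, only a few slow requests in the 10 s to 30 s bucket are enough to push the estimate above the 10 s threshold, so a small number of slow LIST calls can trigger the alert.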
I also found that all gpu-operator-node-feature-discovery-worker pods send requests with the GET verb to the API server to query the nodefeatures resource (I assume these pods need information about node labels). Here's part of the audit log:
I find it strange that it takes this long to process LIST requests when my k8s cluster only has 300 GPU nodes, and that the node-feature-discovery-worker pods send a GET request every minute.
Do you have any information about this problem?
If there are any parameters that can be changed or if you could provide any ideas, I would be very grateful.
Thanks!
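For anyone who wants to check the same thing on their own cluster, here is a rough Python sketch that tallies audit events for the nodefeatures resource per user and verb. It assumes the API server writes audit events as JSON lines using the audit.k8s.io/v1 Event fields stage, verb, user.username and objectRef.resource. The service account name and the sample events are hypothetical and are not taken from the audit log mentioned above.

```python
import json
from collections import Counter


def count_requests(lines, resource="nodefeatures"):
    """Count completed requests against `resource`, keyed by (username, verb)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        # One request can be logged at several stages; count it only once.
        if event.get("stage") != "ResponseComplete":
            continue
        ref = event.get("objectRef") or {}
        if ref.get("resource") != resource:
            continue
        user = (event.get("user") or {}).get("username", "unknown")
        counts[(user, event.get("verb", "unknown"))] += 1
    return counts


# Hypothetical service account name and events, for illustration only.
SA = "system:serviceaccount:gpu-operator:node-feature-discovery"
sample_lines = [
    json.dumps({"stage": "RequestReceived", "verb": "list",
                "user": {"username": SA}, "objectRef": {"resource": "nodefeatures"}}),
    json.dumps({"stage": "ResponseComplete", "verb": "list",
                "user": {"username": SA}, "objectRef": {"resource": "nodefeatures"}}),
    json.dumps({"stage": "ResponseComplete", "verb": "get",
                "user": {"username": SA}, "objectRef": {"resource": "nodefeatures"}}),
    json.dumps({"stage": "ResponseComplete", "verb": "list",
                "user": {"username": "admin"}, "objectRef": {"resource": "pods"}}),
]

for (user, verb), n in sorted(count_requests(sample_lines).items()):
    print(f"{user} {verb} {n}")
```

In practice the lines would come from the audit log file (for example, iterating over an open file object), and grouping by minute would show whether every worker really issues one request per minute.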
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
node-feature-discovery was deployed during installation with gpu-operator from NVIDIA. I used gpu-operator v23.3.2.
Environment:
Kubernetes version (kubectl version): k8s v1.21.6, v1.29.5
OS (cat /etc/os-release): Ubuntu 20.04.4 LTS
Kernel (uname -a): 5.4.0-113-generic
Install tools: nfd was deployed by gpu-operator from NVIDIA
Network plugin: calico