Mode where nfd-worker updates the labels #2022

Open
ivelichkovich opened this issue Jan 15, 2025 · 7 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@ivelichkovich
Contributor

What would you like to be added:

A mode where nfd-worker can update labels without needing to run nfd-master with an informer cache of all nodefeatures, maybe keeping only a subset map of nodefeatures for GC if #2021 is implemented, i.e. just storing nodefeatures by name/node.

Why is this needed:

nodefeature CRs can be a footgun if users list them, e.g. by running k get nodefeatures in a high-scale environment.

nfd-master also uses a ton of memory at scale.

If nfd-worker just handled the labels for its own node, it would alleviate a lot of scale concerns.

@ivelichkovich ivelichkovich added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 15, 2025
@marquiz
Contributor

marquiz commented Jan 22, 2025

One big issue with the nodefeature objects currently is their size. Quickly thinking, I can see two big culprits adding to that. One is the "managed fields metadata": basically every feature (like kernel.enabledmodule.e1000) is listed there. Not sure how we could avoid listing every possible member of the CRD there, or alternatively filter out this metadata in listers. The second one is the kernel.config feature, which lists every kernel config option, most of which nobody is interested in, and there's a ton of those. We could start building a deny list for filtering out the uninteresting ones or something.
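To illustrate the deny-list idea, here is a minimal sketch in Go. It is not NFD code: the function names, the trailing-`*` prefix syntax, and the sample kernel config options are all assumptions made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// isDenied reports whether key matches any deny-list entry.
// Assumed syntax (hypothetical, not an NFD format): an entry ending
// in "*" matches by prefix, any other entry must match exactly.
func isDenied(key string, deny []string) bool {
	for _, d := range deny {
		if strings.HasSuffix(d, "*") {
			if strings.HasPrefix(key, strings.TrimSuffix(d, "*")) {
				return true
			}
		} else if key == d {
			return true
		}
	}
	return false
}

// filterAttributes returns a copy of attrs without the denied keys.
// The input map is left untouched.
func filterAttributes(attrs map[string]string, deny []string) map[string]string {
	out := make(map[string]string, len(attrs))
	for k, v := range attrs {
		if !isDenied(k, deny) {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Made-up sample of kernel config options.
	kconfig := map[string]string{
		"NO_HZ":         "y",
		"PREEMPT":       "y",
		"SND_HDA_INTEL": "m",
		"SND_SOC":       "m",
	}
	filtered := filterAttributes(kconfig, []string{"SND_*"})
	fmt.Printf("kept %d of %d options\n", len(filtered), len(kconfig))
}
```

Filtering like this would have to happen before the nodefeature object is written for it to reduce the object size.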

A second improvement, which I think we really need (and which I have thought about for a long time), would be sharding of nfd-master, i.e. distributing the nodes across multiple instances of nfd-master. E.g. calculate a checksum of the node name and take it modulo the number of shards to get the instance (shard) which is responsible for that node.
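The checksum-mod-shards mapping described above could look roughly like this. This is only a sketch: the FNV-1a hash and the function name are assumptions, since the comment does not say which checksum to use.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardForNode maps a node name to a shard index in [0, numShards).
// FNV-1a is an arbitrary choice for this sketch; any stable hash works.
// numShards must be greater than zero, otherwise the modulo panics.
func shardForNode(nodeName string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(nodeName)) // Write on a hash.Hash never returns an error
	return h.Sum32() % numShards
}

func main() {
	const numShards = 3 // hypothetical number of nfd-master instances
	for _, n := range []string{"node-a", "node-b", "node-c", "node-d"} {
		fmt.Printf("%s -> shard %d\n", n, shardForNode(n, numShards))
	}
}
```

Each instance would then handle only the nodes whose index equals its own shard number. Note that with plain modulo, changing the number of shards reassigns most nodes to a different instance.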

Splitting the functionality into two daemons is deliberate, e.g. for security reasons.

Thoughts?

@ivelichkovich
Contributor Author

ivelichkovich commented Jan 22, 2025

Yeah, I think the deny list as suggested in #2026 is a great idea and a smaller change than the other suggestions. This would be a great place to start asap if possible.

Sharding is an interesting idea for sure, would love to discuss that further. The problem though is that paginating the list calls or sharding the master helps automation work smoothly, but it still leaves a footgun if a user gets curious and calls k get nodefeatures, although this is a bit mitigated by the allow/deny list. I wonder if the apiserver ListWatch/ListStreaming work helps here though, maybe?

Just for my understanding though, what are the security concerns for the split? Just giving the daemonset permissions to edit nodes?

@marquiz
Contributor

marquiz commented Jan 23, 2025

> Yeah, I think the deny list as suggested in #2026 is a great idea and a smaller change than the other suggestions. This would be a great place to start asap if possible.

We can start there for sure

> Sharding is an interesting idea for sure, would love to discuss that further. The problem though is that paginating the list calls or sharding the master helps automation work smoothly, but it still leaves a footgun if a user gets curious and calls k get nodefeatures, although this is a bit mitigated by the allow/deny list. I wonder if the apiserver ListWatch/ListStreaming work helps here though, maybe?

That might help, I need to follow that work more closely. One thing would be to try to make kubectl smarter and support pagination there, too. Another thing we could do is add documentation in NFD that warns about the consequences in big clusters.

> Just for my understanding though, what are the security concerns for the split? Just giving the daemonset permissions to edit nodes?

Yes. Run nfd-worker with the smallest possible privileges.

We could also explore the NodeFeature-less nfd-worker-standalone option as an alternative operating mode. But then nfd-worker would need to replicate all the functionality of nfd-master (NodeFeatureRules, NodeFeatureGroups etc).

@marquiz
Contributor

marquiz commented Jan 23, 2025

We could/should create a separate issue about sharding.

@ivelichkovich
Contributor Author

Yeah that sounds good, do you want me to create specific issues for the items discussed and close this one?

  1. Filtering nodefeatures not just labels
  2. nfd-worker standalone

@marquiz
Contributor

marquiz commented Feb 5, 2025

> Yeah that sounds good, do you want me to create specific issues for the items discussed and close this one?

IMO separate issues about filtering and sharding.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 6, 2025