Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

neonvm-controller: make IPAM code production-ready #1278

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

Omrigan
Copy link
Contributor

@Omrigan Omrigan commented Feb 21, 2025

Split into commits, to be merged with rebase.

Part of #1281

Copy link

github-actions bot commented Feb 21, 2025

No changes to the coverage.

HTML Report

Click to open

@cicdteam
Copy link
Contributor

I haven't gone into the changes in depth, but I just want to ask what issues are being addressed?

@Omrigan
Copy link
Contributor Author

Omrigan commented Feb 23, 2025

I haven't gone into the changes in depth, but I just want to ask what issues are being addressed?

Key issues are:

  1. Leader election times out when the rate of allocations is significant.
  2. When a release of an IP fails, it is not retried, so we have 13k stale allocations.
  3. k8s client is recreated on every request.

Plus a few other things.

@Omrigan Omrigan changed the title neonvm-controller: overhaul IPAM neonvm-controller: make IPAM code production-ready Feb 23, 2025
Base automatically changed from oleg/migration-tests to main February 23, 2025 21:42
Leader election is too heavy to run it on every IP allocation.

We can achieve the same with a leader election on neonvm-controller +
a local mutex.

Signed-off-by: Oleg Vasilev <[email protected]>
@Omrigan Omrigan marked this pull request as ready for review February 25, 2025 12:04
@Omrigan Omrigan force-pushed the oleg/ipam-fixes branch 3 times, most recently from b76f43e to 1f6080e Compare February 25, 2025 14:24
We will know if it is broken when we try to use it.

Signed-off-by: Oleg Vasilev <[email protected]>
Non-functional change.

Signed-off-by: Oleg Vasilev <[email protected]>
Two things:
1. Use NamespacedName.
2. Make the loop less verbose

Signed-off-by: Oleg Vasilev <[email protected]>
This supresses the following warning, which would be introduced by the
next commit:
 exitAfterDefer: os.Exit will exit, and `defer ipam.Close()` will not run

Signed-off-by: Oleg Vasilev <[email protected]>
There is no need to create it for every request.

Signed-off-by: Oleg Vasilev <[email protected]>
Non-functional change.

Preparation to have a cleanup action.

Signed-off-by: Oleg Vasilev <[email protected]>
We need it to cleanup stale allocations.

Signed-off-by: Oleg Vasilev <[email protected]>
The previous version is marked deprecated.

Signed-off-by: Oleg Vasilev <[email protected]>
@Omrigan Omrigan force-pushed the oleg/ipam-fixes branch 2 times, most recently from e675b7b to ffed348 Compare February 25, 2025 16:14
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review, just skimming through -- broadly looks good, need to dig a little deeper into the new core of the ipam actions.

One thing that I'm not sure about: What happens if we get a lot of reconcile operations simultaneously waiting on ipam operations? Should we have some mechanism to immediately bail if there's too many other threads also waiting? (so that, under load, other operations don't completely grind to a halt as well)


// NOTE: THE CONTROLLER MUST IMMEDIATELY EXIT AFTER RUNNING THE MANAGER.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (maybe blocking?): This is no longer upheld, right? IIRC it's required due to releasing the leader election lease immediately when the manager exits

return err
log.Error(err, "Failed to acquire overlay IP", "VirtualMachine", vm.Name)
r.Recorder.Event(vm, "Warning", "OverlayNet", "Failed to acquire overlay IP")
// Don't return error, we will keep running without an overlay IP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (blocking): IIUC this means that if we have some transient issue where we fail to initially acquire the overlay IP, we'll never get one? Is that accurate, and is that the behavior we want?

// read network-attachment-definition from Kubernetes
nad, err := kClient.nadClient.K8sCniCncfIoV1().NetworkAttachmentDefinitions(nadNamespace).Get(ctx, nadName, metav1.GetOptions{})
nad, err := kClient.NADClient.K8sCniCncfIoV1().NetworkAttachmentDefinitions(nadNamespace).Get(context.Background(), nadName, metav1.GetOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: We should not make a network call without a timeout, otherwise this can potentially block forever

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants