-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
neonvm-controller: make IPAM code production-ready #1278
base: main
Are you sure you want to change the base?
Conversation
No changes to the coverage.
HTML Report |
I haven't gone into the changes in depth, but I just want to ask what issues are being addressed? |
Key issues are:
Plus a few other things. |
40bd3bb
to
49ef878
Compare
Signed-off-by: Oleg Vasilev <[email protected]>
Leader election is too heavy to run it on every IP allocation. We can achieve the same with a leader election on neonvm-controller + a local mutex. Signed-off-by: Oleg Vasilev <[email protected]>
49ef878
to
382bcfa
Compare
b76f43e
to
1f6080e
Compare
We will know if it is broken when we try to use it. Signed-off-by: Oleg Vasilev <[email protected]>
Non-functional change. Signed-off-by: Oleg Vasilev <[email protected]>
Two things: 1. Use NamespacedName. 2. Make the loop less verbose Signed-off-by: Oleg Vasilev <[email protected]>
This supresses the following warning, which would be introduced by the next commit: exitAfterDefer: os.Exit will exit, and `defer ipam.Close()` will not run Signed-off-by: Oleg Vasilev <[email protected]>
There is no need to create it for every request. Signed-off-by: Oleg Vasilev <[email protected]>
Non-functional change. Preparation to have a cleanup action. Signed-off-by: Oleg Vasilev <[email protected]>
We need it to cleanup stale allocations. Signed-off-by: Oleg Vasilev <[email protected]>
Signed-off-by: Oleg Vasilev <[email protected]>
The previous version is marked deprecated. Signed-off-by: Oleg Vasilev <[email protected]>
e675b7b
to
ffed348
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review, just skimming through -- broadly looks good, need to dig a little deeper into the new core of the ipam actions.
One thing that I'm not sure about: What happens if we get a lot of reconcile operations simultaneously waiting on ipam operations? Should we have some mechanism to immediately bail if there's too many other threads also waiting? (so that, under load, other operations don't completely grind to a halt as well)
|
||
// NOTE: THE CONTROLLER MUST IMMEDIATELY EXIT AFTER RUNNING THE MANAGER. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (maybe blocking?): This is no longer upheld, right? IIRC it's required due to releasing the leader election lease immediately when the manager exits
return err | ||
log.Error(err, "Failed to acquire overlay IP", "VirtualMachine", vm.Name) | ||
r.Recorder.Event(vm, "Warning", "OverlayNet", "Failed to acquire overlay IP") | ||
// Don't return error, we will keep running without an overlay IP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (blocking): IIUC this means that if we have some transient issue where we fail to initially acquire the overlay IP, we'll never get one? Is that accurate, and is that the behavior we want?
// read network-attachment-definition from Kubernetes | ||
nad, err := kClient.nadClient.K8sCniCncfIoV1().NetworkAttachmentDefinitions(nadNamespace).Get(ctx, nadName, metav1.GetOptions{}) | ||
nad, err := kClient.NADClient.K8sCniCncfIoV1().NetworkAttachmentDefinitions(nadNamespace).Get(context.Background(), nadName, metav1.GetOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: We should not make a network call without a timeout, otherwise this can potentially block forever
Split into commits, to be merged with rebase.
Part of #1281