
Use VMCache instead of VMICache to judge if the NAD is in use #131

Open · wants to merge 1 commit into base: master
Conversation

mingshuoqiu
Contributor

The VMI status will be none if the VM is turned off from the GUI, so the VMI information cannot be relied on to determine whether a particular NAD is attached to any VM. Use the VMCache instead, to prevent deleting the NAD while the VM is turned off.

Problem:
Deleting an Off VM fails if its network was deleted. The VM network can currently be deleted while the attached VM is off, which becomes a problem when the user tries to turn the VM back on.

Solution:
Do not allow removing a VM network while a VM that references it still exists, even if the VM is OFF.

Related Issue:
harvester/harvester#6961

Test plan:

  1. Create a VM network.
  2. Create a VM and attach it to the created VM network.
  3. Turn the VM from Running to Off.
  4. Remove the VM network. This should not be allowed.
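The proposed check can be sketched roughly as follows. This is a self-contained illustration only: the simplified `VM`/`NAD` structs, the `nadInUse` name, and the slice standing in for the VMCache are all assumptions, not the actual webhook code.

```go
package main

import "fmt"

// Simplified stand-ins for the real KubeVirt/NAD types (illustration only).
type Network struct {
	NadName string // "namespace/name" of the referenced NAD; "" for the pod network
}

type VM struct {
	Namespace, Name string
	Networks        []Network
}

type NAD struct {
	Namespace, Name string
}

// nadInUse walks the VMs from a cache (here a plain slice) and reports
// whether any VM still references the NAD. Because it inspects VM specs
// rather than VMIs, a stopped VM still counts and blocks deletion.
func nadInUse(vms []VM, nad NAD) (bool, []string) {
	ref := nad.Namespace + "/" + nad.Name
	var users []string
	for _, vm := range vms {
		for _, n := range vm.Networks {
			if n.NadName == ref {
				users = append(users, vm.Namespace+"/"+vm.Name)
				break
			}
		}
	}
	return len(users) > 0, users
}

func main() {
	vms := []VM{
		{Namespace: "default", Name: "vm-off", Networks: []Network{{NadName: "default/vlan100"}}},
		{Namespace: "default", Name: "vm-other", Networks: []Network{{NadName: "default/vlan200"}}},
	}
	inUse, users := nadInUse(vms, NAD{Namespace: "default", Name: "vlan100"})
	fmt.Println(inUse, users)
}
```

In the real webhook the VM list would come from a VM cache rather than a slice, and the deletion validator would return an error naming the blocking VMs.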

@rrajendran17
Contributor

@mingshuoqiu can you also add steps to the test plan for verifying deletion of the vlan config with a stopped VM.

@mingshuoqiu mingshuoqiu force-pushed the issue_6961 branch 3 times, most recently from bec79e5 to 1477278 on December 9, 2024 06:26
for _, vm := range vms {
if vm.Namespace != nad.Namespace {
continue
}
Contributor

@rrajendran17 rrajendran17 Dec 17, 2024

Is this check needed?

if err := v.checkVmi(vc, nodes); err != nil {
return fmt.Errorf(deleteErr, vc.Name, err)
}

nads, err := v.nadCache.List(util.HarvesterSystemNamespaceName, labels.Set(map[string]string{
utils.KeyClusterNetworkLabel: vc.Spec.ClusterNetwork,
}).AsSelector())
Contributor

The namespace must be "" or metav1.NamespaceAll, as we need to fetch all NADs from all namespaces; HarvesterSystemNamespaceName will fetch NADs only from the harvester-system namespace.

@rrajendran17
Contributor

Thanks @mingshuoqiu for taking care of the comments.
Overall LGTM, except for a few minor comments. Can you please validate the code changes for the NAD deletion and vlan config deletion scenarios with stopped VMs, and add the steps to the test plan as well. Thanks.

pkg/utils/vmi.go Outdated

for _, vm := range vmsTmp {
if vm.Namespace != nad.Namespace {
continue
Contributor

I am not sure about this existing check and whether it's needed: we can still attach a NAD from any namespace to a VM in any namespace, and this operation is currently supported. With this check we might miss VMs that attached a NAD from another namespace, and the NAD could be deleted while the VM is still present.

Member

@starbops starbops left a comment

I have a general question regarding the fix: Do we still need the VmiGetter? Under what circumstances do we rely on VMIs to check whether an action is allowed or not? Maybe the logic in the WhoUseNad for VMI and VM is redundant, and we can completely replace VMI with VM. Please advise. Thank you.

@mingshuoqiu
Contributor Author

Need to re-implement WhoUseNad, since indexeres.VMByNetworkIndex has been removed from Harvester v1.4. Will replace WhoUseNad with different logic.

@mingshuoqiu
Contributor Author

I have a general question regarding the fix: Do we still need the VmiGetter? Under what circumstances do we rely on VMIs to check whether an action is allowed or not? Maybe the logic in the WhoUseNad for VMI and VM is redundant, and we can completely replace VMI with VM. Please advise. Thank you.

I tried to replace all the VMI handlers with VM, but I get the following log from the webhook pod. How should I add the resource to the access list? CRD? @starbops

W0116 09:04:53.736063       1 reflector.go:547] github.com/rancher/lasso/pkg/cache/cache.go:145: failed to list *v1.VirtualMachine: virtualmachines.kubevirt.io is forbidden: User "system:serviceaccount:harvester-system:harvester-network-webhook" cannot list resource "virtualmachines" in API group "kubevirt.io" at the cluster scope
E0116 09:04:53.736170       1 reflector.go:150] github.com/rancher/lasso/pkg/cache/cache.go:145: Failed to watch *v1.VirtualMachine: failed to list *v1.VirtualMachine: virtualmachines.kubevirt.io is forbidden: User "system:serviceaccount:harvester-system:harvester-network-webhook" cannot list resource "virtualmachines" in API group "kubevirt.io" at the cluster scope
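The error says the `harvester-network-webhook` service account lacks cluster-scope list/watch permission on `virtualmachines.kubevirt.io`. One likely direction (a sketch only, not the exact chart change; the ClusterRole name here is made up) is to grant that service account read access via RBAC, along these lines:

```yaml
# Sketch: grant the webhook service account read access to VMs.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: harvester-network-webhook-vm-reader   # hypothetical name
rules:
  - apiGroups: ["kubevirt.io"]
    resources: ["virtualmachines"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: harvester-network-webhook-vm-reader   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: harvester-network-webhook-vm-reader
subjects:
  - kind: ServiceAccount
    name: harvester-network-webhook
    namespace: harvester-system
```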

@rrajendran17
Contributor

@mingshuoqiu Also please add a step to the test case for the NAD update scenario; it should not be allowed on non-running VMs either. Thanks.

@w13915984028
Member

@mingshuoqiu What's the current status & plan of this PR? Will it be merged in v1.5.0? Thanks.

#149 is better placed on top of this PR, so it can have a unified check on NAD, VM, VMI, and storagenetwork.


@mingshuoqiu
Contributor Author

I'm verifying with the updated RBAC. Will update when verification is done.

@@ -126,18 +126,18 @@ func (v *Validator) checkRoute(config string) error {
}

func (v *Validator) checkVmi(nad *cniv1.NetworkAttachmentDefinition) error {
Member

@w13915984028 w13915984028 Feb 25, 2025

This fix will break the NAD update feature, as checkVmi is also used in func (v *Validator) Update(_ *admission.Request, oldObj, newObj runtime.Object).

When the user changes a NAD, all VMIs referring to this NAD need to be stopped. If it is a storagenetwork NAD, the user must first clear the storagenetwork or set it to another storagenetwork NAD.

When the user deletes a NAD, all VMs referring to this NAD need at least to be changed to remove the network references, or the related VMs must be deleted. If it is a storagenetwork NAD, the user must first clear the storagenetwork or set it to another storagenetwork NAD.

And checkVmi is still needed: some VMs may be changed while their VMIs are not rebooted (this is allowed from the Harvester UI), so you can't skip the VMI check.

An additional new function checkVM is needed for the NAD deletion webhook.

Other changes need to be reviewed similarly.

Contributor

@w13915984028 For NAD updates, the idea is to prevent users from making any such updates to a NAD that is already referenced by VMs, so I guess the error message should be changed accordingly. As you said, deleting/re-adding the VMs on an already planned infrastructure is not a good option, but the idea is to stop users from updating a NAD on VMs that are already using it.

But IMO, we should allow such updates on NADs which are not referenced by any of the VMs.

Member

@rrajendran17 Changing the Vlan, HairpinMode, PromiscMode, and some other NAD params is reasonable; stopping & starting the VM is okay. It is not a good idea to force the user to delete the VM or the VM's related interface.

type NetConf struct {
	cniv1.NetConf
	BrName       string `json:"bridge"`
	IsGW         bool   `json:"isGateway"`
	IsDefaultGW  bool   `json:"isDefaultGateway"`
	ForceAddress bool   `json:"forceAddress"`
	IPMasq       bool   `json:"ipMasq"`
	MTU          int    `json:"mtu"`
	HairpinMode  bool   `json:"hairpinMode"`
	PromiscMode  bool   `json:"promiscMode"`
	Vlan         int    `json:"vlan"`
}

Changing BrName (clusternetwork) is rare, or should be denied.
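That policy could be expressed as a small update check. This is a hypothetical sketch: the trimmed-down `NetConf` keeps only a few of the fields quoted above (the real type also embeds cniv1.NetConf), and `checkNetConfUpdate` is an invented name, not the actual validator.

```go
package main

import "fmt"

// Trimmed-down NetConf with only the fields relevant here (illustration only).
type NetConf struct {
	BrName      string `json:"bridge"`
	Vlan        int    `json:"vlan"`
	HairpinMode bool   `json:"hairpinMode"`
	PromiscMode bool   `json:"promiscMode"`
}

// checkNetConfUpdate allows tuning fields (Vlan, HairpinMode, PromiscMode)
// to change, but denies changing BrName, which maps to the clusternetwork.
func checkNetConfUpdate(oldConf, newConf NetConf) error {
	if oldConf.BrName != newConf.BrName {
		return fmt.Errorf("changing bridge (clusternetwork) from %q to %q is not allowed",
			oldConf.BrName, newConf.BrName)
	}
	return nil
}

func main() {
	oldConf := NetConf{BrName: "mgmt-br", Vlan: 100}
	fmt.Println(checkNetConfUpdate(oldConf, NetConf{BrName: "mgmt-br", Vlan: 200}))
	fmt.Println(checkNetConfUpdate(oldConf, NetConf{BrName: "other-br", Vlan: 100}))
}
```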

@w13915984028
Member

w13915984028 commented Feb 25, 2025

BTW:

In this commit 9f06c54#diff-900397cb9002ed992e8565ae266dc4ec231a187b1d3cf016f5fa0cd6ebf82d04, I added two additional checks to the NAD webhook:

checkClusterNetwork: a NAD can't refer to a non-existent clusternetwork

checkClusterNetworkUnchanged: a NAD can't change its referring clusternetwork

In this commit 7dfe2d6#diff-900397cb9002ed992e8565ae266dc4ec231a187b1d3cf016f5fa0cd6ebf82d04

checkStorageNetwork: a storagenetwork NAD can't be changed or deleted by the user
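The first two checks could look roughly like this. This is an illustration only: the `nad` struct is a stand-in for cniv1.NetworkAttachmentDefinition, and the label key is an assumption meant to correspond to the utils.KeyClusterNetworkLabel seen earlier in the thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-in for a NAD, carrying only its labels (illustration only).
type nad struct {
	Labels map[string]string
}

// Assumed label key; the real constant is utils.KeyClusterNetworkLabel.
const clusterNetworkLabel = "network.harvesterhci.io/clusternetwork"

// checkClusterNetwork rejects a NAD whose clusternetwork label does not
// name an existing clusternetwork (here a set, a cache in the real webhook).
func checkClusterNetwork(n nad, existing map[string]bool) error {
	cn := n.Labels[clusterNetworkLabel]
	if !existing[cn] {
		return fmt.Errorf("clusternetwork %q does not exist", cn)
	}
	return nil
}

// checkClusterNetworkUnchanged rejects an update that moves a NAD to a
// different clusternetwork.
func checkClusterNetworkUnchanged(oldNad, newNad nad) error {
	if oldNad.Labels[clusterNetworkLabel] != newNad.Labels[clusterNetworkLabel] {
		return errors.New("the clusternetwork of a nad cannot be changed")
	}
	return nil
}

func main() {
	existing := map[string]bool{"mgmt": true}
	a := nad{Labels: map[string]string{clusterNetworkLabel: "mgmt"}}
	b := nad{Labels: map[string]string{clusterNetworkLabel: "vlan"}}
	fmt.Println(checkClusterNetwork(a, existing))
	fmt.Println(checkClusterNetworkUnchanged(a, b))
}
```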

@w13915984028
Member

w13915984028 commented Feb 25, 2025

More about the scenarios:

At the current stage, the most important param of a NAD is the VID.

(1) Currently, a NAD is allowed to be changed, as long as all attached VMIs are stopped.

If we force the VM to also remove its references to the NAD before the NAD change, it is a breaking change. Broken scenarios include:

(a) User migrates a group of VMs from VID 100 to VID 200 due to a network re-plan, or changes the L3 IP subnets for this VID.

(b) User creates a dedicated NAD for network testing, and creates a couple of VMs attached to this NAD. User plans to open a new VID 300, stop this group of VMs, change the NAD, start the VMs, then test the new VID. User can use this group of VMs to test any existing VIDs at any time.

(c) ...

(2) Change the BrName (clusternetwork) of a NAD.

This is a bug in Harvester now: it can be done if (1) is met, but the related VMs are not updated to reflect it. Or maybe I missed something and Harvester has code to handle it; please remind me if you know.

Luckily, we have not received related bug reports so far, if it is a real bug.

We need to:

(i) Enhance (1) via this PR #131: add an additional check on the VM when a NAD is deleted.

Be very careful not to introduce a breaking change; the VM is the core feature of Harvester.

(ii) Enhance (2) via issue harvester/harvester#7701 and the related PR: deny changes to BrName (clusternetwork).


But IMO, we should allow such updates on NADs which are not referenced by any of the VMs.

@rrajendran17 There are race cases: creating a VM & changing a NAD can happen in parallel, which may still cause some issues.

When a NAD is attached to the wrong clusternetwork (very rare), delete & recreate is not so unacceptable. We may assume this is a corner case and add this limitation to the documentation.

If some user really needs this, we can try to support it later.

@w13915984028
Member

w13915984028 commented Feb 27, 2025

FYI:

When deleting a NAD, along with the VMI, the VM is also checked. This is done via 97eadd5 to cooperate with other functions. The test code is also added.

harvester/charts#339 is also filed to update rbac.

The VMI status will be none if the VM is turned off from the GUI.
The VMI information cannot be relied on to determine whether a
particular NAD can be deleted. Use the VMCache instead, to prevent
deleting the NAD while a VM is attached.

Signed-off-by: Chris Chiu <[email protected]>