✨ Fast Deploy Part Deux (Experimental) #851

Merged
merged 1 commit into from
Jan 23, 2025

Conversation

akutz
Collaborator

@akutz akutz commented Jan 7, 2025

What does this PR do, and why is it needed?

This patch adds support for the Fast Deploy Direct and Linked features, i.e. the ability to cache images per-datastore and quickly provision a VM from these caches, either directly or as a linked clone. This is an experimental feature that must be enabled manually. There are many things about this feature that may change prior to it being ready for production.

The patch notes below are broken down into several sections:

  • Goals -- What is currently supported
  • Non-goals -- What is not on the table right now
  • Architecture
    • Activation -- How to enable this experimental feature
    • Placement -- Request datastore recommendations
    • Image cache -- A general-purpose VM image cache
    • Create VM -- Create directly from cached disk

Goals

The following goals are in scope for this experimental feature at this time. A capability that is not listed here may still be added before the feature is made generally available:

  • Support all VM images that are OVFs
  • Support multiple zones
  • Support workload-domain isolation
  • Support all datastore types, including host-local and vSAN
  • Support for configuring a default fast-deploy mode
  • Support picking the fast-deploy mode per VM (direct, linked)
  • Support disabling fast-deploy per VM
  • Support VM encryption for VMs deployed with fast deploy direct
  • Support backup/restore for VMs deployed with fast deploy direct
  • Support site replication for VMs deployed with fast deploy direct
  • Support datastore maintenance/migration for VMs deployed with fast deploy direct

Non-goals

The following non-goals are out of scope at this time, although most of them should be revisited before this feature graduates to production:

  • Support VM images that are VM templates (VMTX)

    The architecture behind Fast Deploy makes it trivial to support deploying VM images that point to VM templates. While not in scope at this time, this will likely become part of the feature before it graduates to production-ready.

Architecture

The architecture is broken down into the following sections:

  • Activation -- How to enable this experimental feature
  • Placement -- Request datastore recommendations
  • Image cache -- A general-purpose VM image cache
  • Create VM -- Create directly from cached disk

Activation

Enabling the experimental Fast Deploy feature requires setting the environment variable FSS_WCP_VMSERVICE_FAST_DEPLOY to true in the VM Operator deployment. The environment variable FAST_DEPLOY_MODE may be set to one of the following values to configure the default mode for the fast-deploy feature:

  • direct -- VMs are deployed using cached disks
  • linked -- VMs are deployed as a linked clone
  • the value is empty -- direct mode is used
  • the value is anything else -- fast deploy is disabled

It is possible to override the default mode per-VM by setting the annotation vmoperator.vmware.com/fast-deploy. The values of this annotation follow the same rules described above.

Please note, setting the environment variable FAST_DEPLOY_MODE or the annotation vmoperator.vmware.com/fast-deploy has no effect if the feature is not enabled.
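The activation rules above can be sketched as a small resolver. This is only an illustration, not the code in this PR: the names `Mode`, `parseMode`, and `EffectiveMode` are hypothetical, values are assumed to be matched exactly (case-sensitive), and an annotation that is absent is assumed to fall back to the default mode.

```go
package main

// Mode is a hypothetical type naming the effective fast-deploy mode for a VM.
type Mode string

const (
	ModeDirect   Mode = "direct"
	ModeLinked   Mode = "linked"
	ModeDisabled Mode = "disabled" // hypothetical marker: fast deploy is not used
)

// annotationKey is the per-VM override annotation named in the description.
const annotationKey = "vmoperator.vmware.com/fast-deploy"

// parseMode applies the documented rules to a single value:
// "direct" and the empty value select direct mode, "linked" selects
// linked mode, and anything else disables fast deploy.
func parseMode(v string) Mode {
	switch v {
	case "", "direct":
		return ModeDirect
	case "linked":
		return ModeLinked
	default:
		return ModeDisabled
	}
}

// EffectiveMode resolves the mode for one VM. featureEnabled stands for
// FSS_WCP_VMSERVICE_FAST_DEPLOY=true and envMode for FAST_DEPLOY_MODE.
// Neither the default nor the annotation has any effect when the feature
// is not enabled.
func EffectiveMode(featureEnabled bool, envMode string, annotations map[string]string) Mode {
	if !featureEnabled {
		return ModeDisabled
	}
	// Assumption: only a present annotation overrides the default.
	if v, ok := annotations[annotationKey]; ok {
		return parseMode(v)
	}
	return parseMode(envMode)
}
```

For example, `EffectiveMode(true, "linked", map[string]string{annotationKey: "direct"})` yields direct mode for that VM even though the default is linked.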

Placement

Please refer to PR #823 for information on placement as the logic from that change has stayed the same in this one.

Image cache

The way the images/disks are cached has completely changed since PR #823. There is now a new API named VirtualMachineImageCache that is:

  • not visible to DevOps users
  • a namespace-scoped resource that only exists in the same namespace as the VM Operator controller pod
  • used to cache the OVF and an image's disks

A VirtualMachineImageCache resource is created per unique library item resource. That means even if there are 20,000 VMI resources spread across a multitude of namespaces or at the cluster scope, if they all point to the same underlying library item, then for all those VMI resources there will be a single VirtualMachineImageCache resource in the VM Operator namespace.
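To illustrate the one-cache-per-library-item relationship, here is a minimal sketch. The `VMI` struct and `CacheKeys` function are hypothetical stand-ins, not the types from this PR; the sketch only assumes that each image exposes the ID of its underlying library item.

```go
package main

import "sort"

// VMI is a hypothetical, pared-down stand-in for a namespaced or
// cluster-scoped VM image resource.
type VMI struct {
	Namespace     string // empty for cluster-scoped images
	Name          string
	LibraryItemID string
}

// CacheKeys returns one key per unique library item referenced by the
// given images, sorted for stable output. Each key corresponds to a
// single cache object, no matter how many images point at the item.
func CacheKeys(images []VMI) []string {
	seen := map[string]struct{}{}
	for _, img := range images {
		if img.LibraryItemID == "" {
			continue // nothing to cache without a backing library item
		}
		seen[img.LibraryItemID] = struct{}{}
	}
	keys := make([]string, 0, len(seen))
	for k := range seen {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```

However many images are passed in, the number of keys returned equals the number of distinct library items they reference.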

The VirtualMachineImageCache controller caches the OVF for the image in a ConfigMap resource in the VM Operator namespace. This completely obviates the need to maintain a bespoke, in-memory OVF cache.

The VirtualMachineImageCache resource caches the image's disks on specified datastores by setting spec.locations with entries that map to unique datacenter/datastore IDs. The resource's status reveals the location(s) of the cached disk(s).
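A rough sketch of how a caller might keep the entries in `spec.locations` unique follows. The `Location` struct and its field names are assumptions made for illustration; consult the API types added by this PR for the real shape.

```go
package main

// Location is a hypothetical stand-in for one entry in spec.locations,
// identifying a datastore within a datacenter.
type Location struct {
	DatacenterID string
	DatastoreID  string
}

// AddLocation returns the list with loc appended unless an identical
// datacenter/datastore pair is already present. The second result
// reports whether the list changed, which a caller could use to decide
// whether the resource needs to be updated.
func AddLocation(locations []Location, loc Location) ([]Location, bool) {
	for _, l := range locations {
		if l == loc {
			return locations, false
		}
	}
	return append(locations, loc), true
}
```

Requesting the same datastore twice leaves the list unchanged, so several VMs placed on one datastore share a single cached copy of the disks.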

For a more in-depth look at how the disks are actually cached, please refer to PR #823.

Create VM

If the VirtualMachineImageCache object does not yet have the cached OVF or disks ready, the VM is re-enqueued once the VirtualMachineImageCache is ready. Please note, while placement is required to know where to cache the disks, additional placement calls are not issued while a VM is actively awaiting a VirtualMachineImageCache resource. Beyond that, the create VM workflow depends on the fast-deploy mode:

Direct

  1. The cached disks are copied into the VM's folder.

  2. The ConfigSpec is updated to reference the disks.

     Please note, if the VM is encrypted, the disks are not encrypted as part of the create call. This is because it is not possible to change the encryption state of disks when adding them to a VM. Thus the disks are encrypted after the VM is created, before it is powered on.

  3. The CreateVM_Task VMODL1 API is used to create the VM.
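The ordering of the direct-mode workflow, including the deferred disk encryption, can be summarized with a small sketch. The function `DirectPlan` and its step strings are purely illustrative, and the final power-on step assumes the VM's desired state is powered on.

```go
package main

// DirectPlan returns the ordered steps of the direct-mode workflow as
// plain descriptions. It is a hypothetical illustration of the ordering
// only; it does not call any vSphere API.
func DirectPlan(encrypted bool) []string {
	steps := []string{
		"copy cached disks into the VM's folder",
		"update ConfigSpec to reference the copied disks",
		"create VM (CreateVM_Task)",
	}
	if encrypted {
		// The encryption state of a disk cannot be changed while it is
		// being added to a VM, so encryption happens after creation.
		steps = append(steps, "encrypt disks")
	}
	// Assumption: the VM is powered on after creation.
	return append(steps, "power on")
}
```

The point of the sketch is that, for an encrypted VM, the encryption step sits between VM creation and power-on rather than inside the create call.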

Linked

  1. The VirtualDisk devices in the ConfigSpec used to create the VM are updated with VirtualDiskFlatVer2BackingInfo backings that specify a parent backing which refers to the cached, base disk from above.

    The path to each of the VM's disks is constructed based on the index of the disk, ex.: [<DATASTORE>] <KUBE_VM_OBJ_UUID>/<KUBE_VM_NAME>-<DISK_INDEX>.vmdk.

  2. The CreateVM_Task VMODL1 API is used to create the VM. Because the VM's disks have parent backings, this new VM is effectively a linked clone.
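As a sketch of the linked-mode disk layout, the following builds the per-disk path from the documented pattern and attaches the cached disk as a parent. `DiskBacking` is a simplified, hypothetical stand-in for `VirtualDiskFlatVer2BackingInfo` (a file name plus an optional parent), and starting the disk index at 0 is an assumption.

```go
package main

import "fmt"

// DiskBacking is a hypothetical, simplified stand-in for
// VirtualDiskFlatVer2BackingInfo: a file name and an optional parent.
type DiskBacking struct {
	FileName string
	Parent   *DiskBacking
}

// LinkedDiskPath builds a path following the pattern
// "[<DATASTORE>] <KUBE_VM_OBJ_UUID>/<KUBE_VM_NAME>-<DISK_INDEX>.vmdk".
func LinkedDiskPath(datastore, vmUID, vmName string, index int) string {
	return fmt.Sprintf("[%s] %s/%s-%d.vmdk", datastore, vmUID, vmName, index)
}

// LinkedBackings returns one child backing per cached base disk. Each
// child refers to its cached disk as the parent, which is what makes
// the resulting VM behave like a linked clone.
func LinkedBackings(datastore, vmUID, vmName string, cachedDisks []string) []DiskBacking {
	backings := make([]DiskBacking, len(cachedDisks))
	for i, cached := range cachedDisks {
		backings[i] = DiskBacking{
			FileName: LinkedDiskPath(datastore, vmUID, vmName, i), // assumption: index starts at 0
			Parent:   &DiskBacking{FileName: cached},
		}
	}
	return backings
}
```

Only the child disks live in the VM's folder; the cached base disks stay where the image cache placed them and are referenced through the parent backing.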

Which issue(s) is/are addressed by this PR? (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Fixes NA

Are there any special notes for your reviewer:

Please add a release note if necessary:

Support Fast Deploy direct and linked modes, both deploying VMs from a per-datastore image cache

@github-actions github-actions bot added the size/XXL Denotes a PR that changes 1000+ lines. label Jan 7, 2025
@akutz akutz force-pushed the feature/fast-deploy-direct branch 28 times, most recently from b55a286 to 76b9698 Compare January 10, 2025 18:52
@akutz akutz force-pushed the feature/fast-deploy-direct branch from 8b2f127 to adb3412 Compare January 14, 2025 14:18
@akutz
Collaborator Author

akutz commented Jan 14, 2025

> Awesome, lgtm @akutz, just some minor comments.

Thanks @dougm! I addressed all of the feedback.

@akutz akutz force-pushed the feature/fast-deploy-direct branch 9 times, most recently from 10913d7 to 99087f0 Compare January 15, 2025 20:19
@akutz akutz force-pushed the feature/fast-deploy-direct branch 3 times, most recently from da24b49 to da124f5 Compare January 22, 2025 19:27
api/v1alpha3/virtualmachineimagecache_types.go Outdated Show resolved Hide resolved
controllers/contentlibrary/utils/controller_builder.go Outdated Show resolved Hide resolved
pkg/config/config.go Outdated Show resolved Hide resolved
pkg/providers/vsphere/vmlifecycle/create_fastdeploy.go Outdated Show resolved Hide resolved
pkg/util/vmopv1/image.go Outdated Show resolved Hide resolved
@bryanv bryanv self-assigned this Jan 22, 2025
@akutz akutz force-pushed the feature/fast-deploy-direct branch from da124f5 to c8f82aa Compare January 23, 2025 14:47
@akutz akutz force-pushed the feature/fast-deploy-direct branch from c8f82aa to 27bb711 Compare January 23, 2025 14:48

Code Coverage

Package Line Rate Health
github.com/vmware-tanzu/vm-operator/controllers/contentlibrary/clustercontentlibraryitem 100%
github.com/vmware-tanzu/vm-operator/controllers/contentlibrary/contentlibraryitem 100%
github.com/vmware-tanzu/vm-operator/controllers/contentlibrary/utils 88%
github.com/vmware-tanzu/vm-operator/controllers/infra/capability/configmap 86%
github.com/vmware-tanzu/vm-operator/controllers/infra/capability/crd 93%
github.com/vmware-tanzu/vm-operator/controllers/infra/configmap 71%
github.com/vmware-tanzu/vm-operator/controllers/infra/node 77%
github.com/vmware-tanzu/vm-operator/controllers/infra/secret 77%
github.com/vmware-tanzu/vm-operator/controllers/infra/validatingwebhookconfiguration 85%
github.com/vmware-tanzu/vm-operator/controllers/infra/zone 73%
github.com/vmware-tanzu/vm-operator/controllers/storageclass 95%
github.com/vmware-tanzu/vm-operator/controllers/storagepolicyquota 97%
github.com/vmware-tanzu/vm-operator/controllers/util/encoding 73%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachine/storagepolicyusage 99%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachine/virtualmachine 69%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachine/volume 87%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachineclass 75%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachineimagecache 89%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinepublishrequest 81%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinereplicaset 68%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachineservice 83%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachineservice/providers 92%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinesetresourcepolicy 82%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest 72%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest/v1alpha1 72%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest/v1alpha1/conditions 88%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest/v1alpha1/patch 78%
github.com/vmware-tanzu/vm-operator/pkg/bitmask 100%
github.com/vmware-tanzu/vm-operator/pkg/builder 95%
github.com/vmware-tanzu/vm-operator/pkg/conditions 90%
github.com/vmware-tanzu/vm-operator/pkg/config 100%
github.com/vmware-tanzu/vm-operator/pkg/config/capabilities 100%
github.com/vmware-tanzu/vm-operator/pkg/config/env 100%
github.com/vmware-tanzu/vm-operator/pkg/context/generic 100%
github.com/vmware-tanzu/vm-operator/pkg/context/operation 100%
github.com/vmware-tanzu/vm-operator/pkg/errors 100%
github.com/vmware-tanzu/vm-operator/pkg/mem 100%
github.com/vmware-tanzu/vm-operator/pkg/patch 78%
github.com/vmware-tanzu/vm-operator/pkg/prober 91%
github.com/vmware-tanzu/vm-operator/pkg/prober/probe 90%
github.com/vmware-tanzu/vm-operator/pkg/prober/worker 77%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere 75%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/client 80%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/clustermodules 71%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/config 89%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/contentlibrary 72%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/credentials 100%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/network 80%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/placement 80%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/session 71%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/storage 44%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/sysprep 100%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/vcenter 81%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/virtualmachine 84%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/vmlifecycle 61%
github.com/vmware-tanzu/vm-operator/pkg/record 87%
github.com/vmware-tanzu/vm-operator/pkg/topology 91%
github.com/vmware-tanzu/vm-operator/pkg/util 88%
github.com/vmware-tanzu/vm-operator/pkg/util/cloudinit 89%
github.com/vmware-tanzu/vm-operator/pkg/util/cloudinit/validate 91%
github.com/vmware-tanzu/vm-operator/pkg/util/image 100%
github.com/vmware-tanzu/vm-operator/pkg/util/kube 89%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/cource 100%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/internal 100%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/proxyaddr 73%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/spq 100%
github.com/vmware-tanzu/vm-operator/pkg/util/netplan 100%
github.com/vmware-tanzu/vm-operator/pkg/util/ovfcache 75%
github.com/vmware-tanzu/vm-operator/pkg/util/ovfcache/internal 100%
github.com/vmware-tanzu/vm-operator/pkg/util/paused 100%
github.com/vmware-tanzu/vm-operator/pkg/util/ptr 100%
github.com/vmware-tanzu/vm-operator/pkg/util/resize 97%
github.com/vmware-tanzu/vm-operator/pkg/util/vmopv1 81%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/client 64%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/library 100%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/vm 79%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/watcher 88%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig 95%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/crypto 91%
github.com/vmware-tanzu/vm-operator/pkg/webconsolevalidation 100%
github.com/vmware-tanzu/vm-operator/services/vm-watcher 93%
github.com/vmware-tanzu/vm-operator/webhooks/common 100%
github.com/vmware-tanzu/vm-operator/webhooks/persistentvolumeclaim/validation 95%
github.com/vmware-tanzu/vm-operator/webhooks/unifiedstoragequota/validation 89%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachine/mutation 87%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachine/validation 95%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineclass/mutation 62%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineclass/validation 89%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinepublishrequest/validation 92%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinereplicaset/validation 90%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineservice/mutation 67%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineservice/validation 92%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinesetresourcepolicy/validation 89%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinewebconsolerequest/v1alpha1/validation 92%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinewebconsolerequest/validation 92%
Summary 82% (11285 / 13718)

Minimum allowed line rate is 79%

@akutz akutz merged commit 823a044 into vmware-tanzu:main Jan 23, 2025
9 checks passed
@akutz akutz deleted the feature/fast-deploy-direct branch January 23, 2025 16:31
akutz added a commit to akutz/vm-operator that referenced this pull request Jan 28, 2025
This patch fixes an issue where the labels on the CVMI resources
from the CCLI resources were missing. The labels were added via
vmware-tanzu#406 but were
accidentally removed via
vmware-tanzu#851.

When adding them back, there are now also tests to validate the
logic works as there were no tests in the original PR.
bryanv pushed a commit to bryanv/vm-operator that referenced this pull request Jan 29, 2025
This patch fixes an issue where the labels on the CVMI resources
from the CCLI resources were missing. The labels were added via
vmware-tanzu#406 but were
accidentally removed via
vmware-tanzu#851.

When adding them back, there are now also tests to validate the
logic works as there were no tests in the original PR.
Labels
cla-not-required size/XXL Denotes a PR that changes 1000+ lines.

4 participants