Releases: neondatabase/autoscaling

v0.8.0

24 May 18:57
01b8884
This release contains bugfixes, a new component, minor public-facing API
changes, and significant changes to the deployed services, but no
inter-component API changes.

Breaking API changes:

- NeonVM: restart policy no longer applies directly to the pod (#293)

Features:

- Add patch for cluster-autoscaler compatibility with VMs (#232)
- NeonVM: implement RestartPolicy (#293)
- NeonVM security and networking redesign (#245)
  - Runner pod no longer has Privileged: true
  - QEMU in the runner pod runs under its own user
  - Adapted generic-device-plugin for NeonVM, to give access to /dev/kvm
    and /dev/vhost-*
  - Switch from neonvm-vxlan-ipam to Whereabouts CNI
    -> Allows using overlay IP addresses in normal pods as well as VMs
  - Reconcile cycles improved
- NeonVM/vm-builder: Add --enable-file-cache flag (default: off) (#265)
- NeonVM: user RBAC roles (#284):
  - neonvm-virtualmachine-viewer-role
  - neonvm-virtualmachine-editor-role
  - neonvm-virtualmachinemigration-viewer-role
  - neonvm-virtualmachinemigration-editor-role
- More logs for autoscaler-agent (#290, #291)
- More autoscaler-agent metrics:
  - autoscaling_agent_runner_starts   (#273)
  - autoscaling_agent_runner_restarts (#273)
  - autoscaling_agent_runner_fatal_errors_total (#274)
  - autoscaling_errored_vm_runners_current      (#274)

Fixes:

- NeonVM/vm-builder: Fix command passthrough (#263)
- NeonVM/vm-builder: Fix cgexec being ignored (#281)
- NeonVM/vm-builder: Build without cgo (#255)
  - This removes the dependency on a dynamically loaded libc.
- informant: Fix cgroup memory.high throttling (#223)
- agent: Various logs fixes (#242, #267, #271, #272)
- agent: Restart panicked/errored runners (#273)
- agent/billing: Don't count VMs that aren't running (#278)
- agent, sched: Add ports to pod spec for metrics (#282)
- agent, sched: Fix logging of MilliCPU (#261)
- sched: Don't output command help on error (#253)
- plugin: Handle completed pods as if deleted (#260)

No protocol changes.

Other changes:

- Many unused RBAC (and other) items removed:
  - Namespace autoscaler-config (#245)
  - ClusterRole vm-view (#284)
  - ClusterRole vm-patcher (#284)
  - ClusterRoleBinding kube-system/autoscaler-vm-view (#284)
  - ClusterRoleBinding kube-system/autoscale-scheduler-as-vm-patcher (#284)
  - Role kube-system/autoscale-scheduler-config-reader (#284)
  - RoleBinding kube-system/autoscale-scheduler-config-reader (#284)
- NeonVM: Rename 'runner' container to 'neonvm-runner' (#277)
- agent: Network error metrics include root cause (#287)

Upgrade path from v0.7.2:

- No ordering requirements.
- You may wish to remove old items as mentioned above.

v0.7.3-alpha3

22 May 20:24
a008e8e
This is a pre-release just for building and distributing images.
Do not deploy anything from this release.

v0.7.2

08 May 03:31
3eeeeee
This is a hotfix release that reverts a change in behavior from v0.7.0:
Alongside the change to allow fractional CPU, #172 changed the billing
value type to a float. This was incorrect, fixed by #244.

v0.7.1

08 May 01:39
af3fa22
This is a hotfix release that fixes a bug with v0.7.0: On Kubernetes
nodes with cgroups v1, the NeonVM runner was failing to read cgroup CPU
information due to a bad path. This, in turn, prevented any successful
reconciling for VMs on these nodes, which - among other things -
prevented autoscaling from functioning for these VMs.

v0.7.0

07 May 18:09
13a351e
This release contains bugfixes, new features, major public-facing API
changes, *and* inter-component API changes.

Live-upgrading is possible but must be done carefully. Read the "Upgrade
path from v0.6.0" section at the end for more info.

Breaking API changes:

- Upgraded to Kubernetes 1.24 (#132)
- VMs may have fractional CPU values (#172)

Features:

- Improve scaling bounds validation (#190)
- Make api.ScalingBounds (for scaling annotations) public (#181)
- informant: Respect max file cache size (#182)
- agent: Add runner panics metrics (#180)
- agent: Rework (improve!) scaling algorithm (#195)
  - In general, scaling should be much smoother now. There's still some
    work to do in this area (particularly around downscaling), but
    overall this step should be fairly impactful.
- agent->informant health checks (#203)
- Support for fractional CPU (#172)
  - !!!
- NeonVM: Add current usage annotation to runner pod (#231)
- NeonVM: Allow disabling service links (#235)

Fixes:

- VirtualMachineSpec.PodResources now sets the pod's resources (#138)
- autoscaler-agents no longer produce logs about VM updates that aren't
  on their node (#186)
- Fix NeonVM CRD still including VirtualMachineSpec.ServiceAccountName (#188)
- plugin: Fix Unreserve verdict format string in logs (#206)
- agent: Stop informant server when context canceled (#214)
  - This was the cause of a pretty notable goroutine leak that should
    now be fixed. See #196
- agent: Fix log for /unregister response (#224)
- agent: Fix inverted 'ErrServerClosed' check (#225)
  - This may have been causing spurious error logs and silencing actual
    errors.
- Add node affinity to NeonVM's kube-multus-ds DaemonSet (#236)
- agent: Fix deadlock on invalid plugin response (#237)

Protocol changes:

- agent->informant health checks are now supported, but not required (#203)
- NeonVM CRD now supports fractional CPU - all of min/use/max. (#172)
- NeonVM controller -> runner makes requests to /cpu_current and
  /cpu_change endpoints to get/set fractional CPU via the runner's
  cgroup manipulations. (#172)
- agent->plugin resource requests can now request fractional CPU (#172)
- plugin->agent permits can now return fractional CPU (#172)
  - note: plugin does not return fractional CPU unless the agent
    supports it. This makes it possible to do upgrades without
    significant downtime. (#238)

Other changes:

- Upgraded to Go 1.20 (#130)
- agent/metrics: Make request error labels self-consistent (#193)
- Mark scheduler with `priorityClassName: system-cluster-critical` (#227)

Upgrade path from v0.6.0:

  note: each step produces a "valid" state - the system will operate
  successfully. It is not recommended to stay in a partial upgrade for
  long, because partially upgraded states have not been tested as much.

1. Upgrade NeonVM controllers v0.6.0 -> v0.7.0
2. Upgrade autoscale-scheduler v0.6.0 -> v0.7.0
  - note: it is ok to change to a compute unit with fractional CPU at
    this step! Old autoscaler-agents will be given a multiplied CU so
    that they still see an integer number of CPUs.
3. Upgrade autoscaler-agent v0.6.0 -> v0.7.0

  note: Upgrading the vm-informant can be done at any point. Its
  protocol changes are opt-in.
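
The "multiplied CU" mentioned in step 2 can be illustrated with some arithmetic. The sketch below is an assumption about how such a multiple could be computed, not the scheduler's actual implementation: `integerCPUMultiplier` is a hypothetical name, and CPU is expressed in thousandths of a CPU purely for the example. It finds the smallest multiple of a fractional compute unit that comes out to a whole number of CPUs.

```go
package main

import "fmt"

// gcd returns the greatest common divisor of a and b.
func gcd(a, b uint32) uint32 {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

// integerCPUMultiplier returns the smallest n such that n compute units add
// up to a whole number of CPUs. cuMilliCPU is the CPU part of one compute
// unit in thousandths of a CPU and must be greater than zero.
// (Hypothetical helper, for illustration only.)
func integerCPUMultiplier(cuMilliCPU uint32) uint32 {
	return 1000 / gcd(cuMilliCPU, 1000)
}

func main() {
	for _, cu := range []uint32{250, 500, 1000, 1500} {
		n := integerCPUMultiplier(cu)
		fmt.Printf("CU of %dm CPU: multiply by %d to get %d whole CPU(s)\n", cu, n, n*cu/1000)
	}
}
```

For example, a compute unit with 0.25 CPU would need to be multiplied by 4 before an agent that only understands integer CPU counts could work with it.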

v0.6.0

15 Apr 20:56
7374810
This release contains bugfixes, new features, and minor public-facing
API changes, but no inter-component API changes.

Breaking API changes:

- NeonVM: Removed VirtualMachineSpec.ServiceAccountName (#140)
- NeonVM: Make vm-builder specific to Neon, with new vm-builder-generic
  for general-purpose use. vm-builder-generic is *almost* the same as
  the previous vm-builder, but it does not include vector by default (#133)
- Require label "autoscaling.neon.tech/enabled=true" for autoscaling to
  be enabled (#38)
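
As a rough illustration of the new opt-in label, the check below tests a label map for the key and value quoted above. It is a minimal sketch, not the project's code: `autoscalingEnabled` is a hypothetical helper, and the labels are modeled as a plain `map[string]string` so the example needs no Kubernetes client library.

```go
package main

import "fmt"

// Label key and value as quoted in the release notes.
const (
	enabledLabel = "autoscaling.neon.tech/enabled"
	enabledValue = "true"
)

// autoscalingEnabled reports whether the labels opt in to autoscaling.
// A missing label, or any value other than "true", counts as disabled.
// (Hypothetical helper, for illustration only.)
func autoscalingEnabled(labels map[string]string) bool {
	return labels[enabledLabel] == enabledValue
}

func main() {
	withLabel := map[string]string{enabledLabel: "true"}
	withoutLabel := map[string]string{"app": "example"}

	fmt.Println("labelled VM enabled:", autoscalingEnabled(withLabel))
	fmt.Println("unlabelled VM enabled:", autoscalingEnabled(withoutLabel))
}
```

Under this reading, objects created before v0.6.0 that lack the label would stop being autoscaled until the label is added.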

Features:

- Allow annotation "autoscaling.neon.tech/bounds=..." to override
  scaling bounds (#128)
- NeonVM: add --quiet flag to vm-builder[-generic], which is off by
  default. Builds are more verbose without it. (#169)
- agent, plugin: Add prometheus metrics (#92, #174, #175)
- agent: Better config validation (#177)

Fixes:

- agent: always log informant register errors (#165)
- agent: fix runner log prefix (#159)
- NeonVM: fix ENTRYPOINT, CMD handling when there are multiple strings (#184)

No protocol changes.

Upgrade path from v0.5.2:

- No ordering requirements.

v0.5.2

09 Apr 20:46
5bd75b7
This release incorporates a handful of bugfixes and some new features.
It is entirely inter-compatible with v0.5.1, with the exception of a
minor change in the scheduler's "dump state" output.

Features:

- agent, plugin: Reimplement migration under load. (#112)
  - Note: The overlay network that allows VMs to preserve their IP
    addresses is not currently functional.

Fixes:

- plugin: Don't reject resource requests that aren't a multiple of the
  compute unit if the VM's resources are constrained to make satisfying
  that requirement impossible. (#108)
- plugin: Fix missing JSON tags for Buffer and CapacityPressure in
  podResourceState. (#107)
  - Note: this changes the "dump state" JSON output
- agent: Don't return from /suspend until NeonVM requests finished. This
  helps avoid possibilities of multiple autoscaler-agents acting at the
  same time.
- agent/billing: panic if VM store unexpectedly stopped (#110)

No protocol changes.

Upgrade path from v0.5.1:

- No ordering requirements.

v0.5.1

29 Mar 08:34
14bbb97
Hotfix release, fixes a panic on autoscaler-agent dump-state requests.

v0.5.0

29 Mar 07:51
a7d5ec0
This release marks the first release where NeonVM has been merged into
the same repository. It was last at v0.4.6, so we've bumped to v0.5.0 as
a kind of clean slate.

Features:

- Added "dump state" endpoints to autoscaler-agent and scheduler plugin.
  Refer to https://github.com/neondatabase/autoscaling/pull/76 for more
  information. (The endpoints are enabled by default).

No protocol changes.

There have been significant changes to testing - everything is run by
the Makefile now. Refer to https://github.com/neondatabase/autoscaling/pull/91
for more information.

Upgrade path from v0.1.17 / v0.4.6:

- No ordering requirements.

v0.1.17

21 Mar 23:10
bbc3bdd
No new features.

Fixes:

- agent/billing: consumption event duplication fixed (#94)

No protocol changes.

Upgrade path from v0.1.16:

- No ordering requirements.