Conversation

@richardcase
Member

What this PR does / why we need it:

This adds documentation that details the contract for providers when implementing an infrastructure machine pool.

This has been created retrospectively from looking at a number of providers and the MachinePool controller.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #12799

/area machinepool

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/machinepool Issues or PRs related to machinepools labels Nov 7, 2025
@linux-foundation-easycla
linux-foundation-easycla bot commented Nov 7, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Nov 7, 2025
@k8s-ci-robot k8s-ci-robot requested a review from elmiko November 7, 2025 14:35
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 7, 2025
@k8s-ci-robot k8s-ci-robot requested a review from sivchari November 7, 2025 14:35
@richardcase richardcase force-pushed the machinepool_contract_doc branch from d1d60f8 to 271c8df Compare November 7, 2025 14:40
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 7, 2025
@richardcase richardcase force-pushed the machinepool_contract_doc branch from 271c8df to 5cc7517 Compare November 7, 2025 15:00
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 7, 2025
This adds documentation that details the contract for providers when
implementing an infrastructure machine pool.

This has been created retrospectively from looking at a number of
providers and the MachinePool controller.

Signed-off-by: Richard Case <[email protected]>
@richardcase richardcase force-pushed the machinepool_contract_doc branch from 5cc7517 to 53b1241 Compare November 7, 2025 15:02
@richardcase richardcase changed the title [WIP] 📖 docs: machinepool contract spec 📖 docs: machinepool contract spec Nov 7, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 7, 2025
Member

@fabriziopandini fabriziopandini left a comment

Thanks for this PR!
This really helps folks like me to step up knowledge about MachinePools!


The value from this field is surfaced via the MachinePool's `status.replicas` field.

### InfraMachinePool: terminal failures
Member

I think we should align ASAP with other CAPI resources with respect to terminal failures (deprecated, no longer relevant for the controller).

Member Author

I agree and slightly changed the wording.


By opting into MachinePool Machines, it is the responsibility of the provider to create an instance of an InfraMachine for every replica and manage their lifecycle.

### InfraMachinePool: instances
Member

I would suggest adding a definition of what an instance is in the context of MachinePools.

Also, if this field is not used by CAPI let's make it clear.
e.g.

Please note that this field is not used by CAPI. Nevertheless, it is documented in this contract to foster design choices that will ensure a consistent user experience across all the MachinePool implementations.

Member Author

Updated this section.
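
For readers following this thread, a minimal Go sketch of what a per-instance status entry could look like, and how a provider might derive a replica count and provider ID list from it. The `InstanceStatus` type, its field names, and the `Summarize` helper are hypothetical and are not taken from the contract document. Counting only instances that already have a provider ID is also an assumption of this sketch.

```go
package main

import "fmt"

// InstanceStatus is a hypothetical entry in an InfraMachinePool's
// status.instances list. The field names are illustrative only.
type InstanceStatus struct {
	ProviderID string // cloud provider ID of the instance, empty until assigned
	Ready      bool   // whether the provider considers the instance usable
}

// Summarize derives a replica count and a provider ID list from the observed
// instances. Counting only instances that already have a provider ID is an
// assumption of this sketch, not a rule taken from the contract.
func Summarize(instances []InstanceStatus) (replicas int32, providerIDs []string) {
	providerIDs = make([]string, 0, len(instances))
	for _, inst := range instances {
		if inst.ProviderID == "" {
			continue // instance is still being provisioned
		}
		providerIDs = append(providerIDs, inst.ProviderID)
	}
	return int32(len(providerIDs)), providerIDs
}

func main() {
	observed := []InstanceStatus{
		{ProviderID: "example:///zone-a/instance-1", Ready: true},
		{ProviderID: "example:///zone-a/instance-2", Ready: false},
		{ProviderID: ""}, // no provider ID yet
	}
	replicas, ids := Summarize(observed)
	fmt.Println(replicas, ids)
}
```

A provider would typically run something like this at the end of a reconcile loop, after querying its scaling service, and write the results to its own status and spec fields.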

@sbueringer
Member

/assign

Would like to review after Fabrizio lgtm and before merge

@richardcase
Member Author

Thanks for your feedback, @fabriziopandini. I will make updates based on this.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from sbueringer. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Changes after the first review by Fabrizio.

Signed-off-by: Richard Case <[email protected]>
@richardcase richardcase force-pushed the machinepool_contract_doc branch from cbf9bc7 to 22f729e Compare November 10, 2025 15:44
@richardcase
Member Author

@fabriziopandini - I have updated the doc based on your feedback.


The goal of an InfraMachinePool is to manage the lifecycle of a provider-specific pool of machines using a provider-specific service (such as Auto Scaling groups in AWS and Virtual Machine Scale Sets in Azure).

The machines in the pool may be physical or virtual instances (although most likely virtual), and they represent the infrastructure for Kubernetes nodes.
Contributor

Suggested change
- The machines in the pool may be physical or virtual instances (although most likely virtual), and they represent the infrastructure for Kubernetes nodes.
+ The machines in the pool may be physical or virtual instances, and they represent the infrastructure for Kubernetes nodes.

(comment is not relevant for the contract)

Member Author

There are a few lines in this document that are not directly relevant to the contract but instead provide background or hints, so I'm inclined to keep this as is.


The [MachinePool's controller](../../core/controllers/machine-pool.md) is responsible for coordinating operations of the InfraMachinePool, and the interaction between the MachinePool's controller and the InfraMachinePool is based on the contract rules defined on this page.

Once contract rules are satisfied by an InfraMachinePool implementation, other implementation details
Contributor

What does this sentence mean? Can it be removed? Should we explicitly mention optional features such as single machine deletion here?

Member Author

This sentence needs to stay as it's stating that the MachinePool controller coordinates with the InfraMachinePool and that this interaction is governed by the contract.

It's also consistent with the contract docs for InfraMachine: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/book/src/developer/providers/contracts/infra-machine.md?plain=1#L10

Member Author

> Should we explicitly mention optional features such as single machine deletion here?

Is there something that is exposed via the contract that is relevant to this feature? There are general features of MachinePools that are not covered here (and should be covered in the general MachinePool docs IMHO).

@bnallapeta
Contributor

@richardcase are we not going to document anything on the upgrades? My understanding is that the status would change based on the update strategy. For example:

Atomic updates:

  • status.replicas temporarily drops to 0
  • All providerIDs disappear then reappear
  • InfrastructureReady may flip to False

Rolling updates:

  • status.replicas stays >= desired (with surge) or slightly below
  • providerIDs change gradually (old removed, new added)
  • InfrastructureReady stays True

Should providers declare their update strategy in the contract so CAPI can adapt behavior accordingly?
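
To make the two patterns above concrete, here is a small Go sketch that compares two observations of a provider ID list and reports what appeared and what disappeared. The `DiffProviderIDs` helper and the sample IDs are hypothetical. This only illustrates the observable difference between a gradual and an all-at-once replacement and is not something the MachinePool controller is known to do.

```go
package main

import (
	"fmt"
	"sort"
)

// toSet converts a slice of IDs into a set for membership checks.
func toSet(ids []string) map[string]struct{} {
	set := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		set[id] = struct{}{}
	}
	return set
}

// DiffProviderIDs reports which provider IDs appeared and which disappeared
// between two observations of a provider ID list. Results are sorted so the
// output is deterministic.
func DiffProviderIDs(before, after []string) (added, removed []string) {
	prev, next := toSet(before), toSet(after)
	for id := range next {
		if _, ok := prev[id]; !ok {
			added = append(added, id)
		}
	}
	for id := range prev {
		if _, ok := next[id]; !ok {
			removed = append(removed, id)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

func main() {
	before := []string{"id-1", "id-2", "id-3"}

	// Rolling-style step: one old instance removed, one new instance added.
	added, removed := DiffProviderIDs(before, []string{"id-2", "id-3", "id-4"})
	fmt.Println("rolling step:", added, removed)

	// Atomic-style replacement: every provider ID disappears at once.
	added, removed = DiffProviderIDs(before, nil)
	fmt.Println("atomic replacement:", added, removed)
}
```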

Some updates after an additional review by Andreas.

Signed-off-by: Richard Case <[email protected]>
@richardcase
Member Author

> Should providers declare their update strategy in the contract so CAPI can adapt behavior accordingly?

@bnallapeta - I think we should follow up on this, as we are trying to document the current state of the contract. Providers declaring an "update strategy" would be a change to the current contract.

We are going to need to update the MachinePool controller document based on the introduction of this contract doc. Perhaps we should cover this kind of behaviour when we do that?

@bnallapeta
Contributor

@richardcase quoting from #10496,

> Furthermore I don't know if there is a documented contract as of today for MachinePools how BootstrapConfigs are supposed to be rolled out. I also don't know if MachinePools behave the same across providers today.

I think we should talk about this in the contract. A few questions on this:

Should providers watch for bootstrap config changes and trigger updates? If yes, what's the signal? Hash of the secret data? ConfigRef version?
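
As an illustration of the "hash of the secret data" option raised above, a provider could compute a deterministic digest over the bootstrap secret's key/value pairs and compare it with the digest from the previous reconcile. This is a sketch only: the `HashSecretData` helper, the sample keys, and the idea of storing the digest for comparison are assumptions, and nothing here is defined by the contract.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// HashSecretData returns a deterministic SHA-256 digest of a Secret-like
// data map. Keys are sorted first because Go map iteration order is random.
func HashSecretData(data map[string][]byte) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// Length-prefix the key and value so different key/value splits
		// cannot produce the same byte stream.
		fmt.Fprintf(h, "%d:%s=%d:", len(k), k, len(data[k]))
		h.Write(data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Hypothetical bootstrap secret contents.
	oldData := map[string][]byte{"format": []byte("cloud-config"), "value": []byte("#cloud-config v1")}
	newData := map[string][]byte{"format": []byte("cloud-config"), "value": []byte("#cloud-config v2")}

	if HashSecretData(oldData) != HashSecretData(newData) {
		fmt.Println("bootstrap data changed; a provider could trigger an update here")
	}
}
```

Whether a digest like this, a ConfigRef version, or some other signal is the right trigger is exactly the open question in this comment.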

@richardcase
Member Author

Note for the maintainers: I will squash the commits when we are all happy with the doc.

@richardcase
Member Author

> Should providers watch for bootstrap config changes and trigger updates? If yes, what's the signal? Hash of the secret data? ConfigRef version?

I agree, and I do think we need to document this in the MachinePool documentation. However, I don't think this belongs in the contract document. Let's add this elsewhere, perhaps where I suggested earlier, https://cluster-api.sigs.k8s.io/developer/core/controllers/machine-pool, when we make changes to that.

@chrischdi
Member

> Note for the maintainers, I will squash the commits when we are all happy with the doc.

no need to

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Nov 11, 2025

Labels

area/machinepool Issues or PRs related to machinepools cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Development

Successfully merging this pull request may close these issues.

Document contract for machine pools

7 participants