
Definition of the HostClaim resource #408

Open: wants to merge 5 commits into base: main

Conversation

pierrecregut

The HostClaim resource is introduced to address multi-tenancy in Metal3 and the definition of hybrid clusters with cluster-api.

It introduces four scenarios representing the three main use cases and how pivoting is performed.

The design itself will be presented in another merge request.

@metal3-io-bot
Contributor

Hi @pierrecregut. Thanks for your PR.

I'm waiting for a metal3-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@metal3-io-bot metal3-io-bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 3, 2024
@dtantsur
Member

dtantsur commented Apr 3, 2024

/ok-to-test

@metal3-io-bot metal3-io-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 3, 2024
Member

@lentzi90 lentzi90 left a comment

Thanks for driving this!
I see many things that I like. My main concern now is that we need to clarify the interaction between HostClaims and BareMetalHosts. Where each lives and which controllers would handle them. This is something I feel we must focus on before we can think about potential other resources (non-baremetal).


There is another model where a single management cluster is used to create and
manage several clusters across a set of bare-metal servers. This is the focus
<!-- cSpell:ignore Sylva Schiff -->
Member

Could you please add these to the cSpell config instead?

Comment on lines 38 to 43
So far, the primary use case of cluster-api-baremetal is the creation of a
single target cluster from a temporary management cluster. The pivot process
transfers the resources describing the target cluster from the management
cluster to the target cluster. Once the pivot process is complete, the target
cluster takes over all the servers. It can scale based on its workload but it
cannot share its servers with other clusters.
Member

I'm not sure we can say this is the primary use case. It is just the bootstrap process to get a "self-managed" management cluster. Once you have a management cluster (no matter if it is self-managed or not), it can be used to create multiple other clusters. These can be created from BareMetalHosts in the same or other namespaces.

What I think we should bring up here instead is the following.
The Metal3 infrastructure provider for Cluster API, as of today, does not implement the multi-tenancy contract. While some hacks and workarounds exist (e.g. running multiple namespaced instances), they are far from ideal and do not address all use cases.

Author

I am not sure the reference to the multi-tenancy contract is the right one, because this contract was designed for disposable compute resources, not recyclable ones. With disposable compute resources, you have an API to create them, usually protected by credentials, and the problem is just to isolate each set of credentials and make it usable, for example, in a single namespace. With recyclable resources, the set of credentials is attached to a long-lived object.

You could trade tenant credentials to gain access to the real object. This is what was done in Kanod with the baremetalpool approach. It is powerful but also very complex, and we depended on Redfish.

Here we rather hide the credentials and use the HostClaim controller to exchange control and information between the BareMetalHost in a protected namespace and a Host in the tenant namespace. It is true that we need limits on the binding performed:

  • the BareMetalHost should be able to limit the namespaces of the HostClaims that can be associated with it;
  • at some point, a mechanism to limit the number of compute resources a namespace can acquire is needed.

I think the second point can be addressed later. It can be implemented through a quota mechanism (see https://gitlab.com/Orange-OpenSource/kanod/host-quota in the scope of Kanod).
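For illustration, such a quota object could look roughly like this (a hypothetical sketch: the `HostClaimQuota` kind and its fields are assumptions loosely modeled on the Kanod host-quota prototype, not an existing API):

```yaml
# Hypothetical resource limiting how many HostClaims a tenant
# namespace may hold at once; names and fields are illustrative.
apiVersion: metal3.io/v1alpha1
kind: HostClaimQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  # maximum number of HostClaims that may exist in this namespace
  maxClaims: 10
```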

Author

I have pushed a new version of the motivations. It also resolves the previous remark, as I have removed the explicit mentions of Sylva and das Schiff.

Member

> I am not sure the reference to the multi-tenancy contract is the right one because this contract has been designed for disposable compute resources not recyclable ones. With disposable compute resource, you have an API to create them usually protected by credentials and the problem is just to be able to isolate each set of credentials and make it usable for example in a single namespace. With recyclable resources the set of credentials is attached to a long living object.

My goal would be to achieve the exact same process for both disposable and recyclable resources. The API for creating them is the Kubernetes API of the cluster holding the BareMetalHosts. The credentials are the credentials for accessing that API. In this way we achieve the same way of working as the other infrastructure providers, which simplifies a lot and also gives us multi-tenancy.

The proposal for how to handle the interaction between namespaces (with HostClaims in one namespace and BareMetalHosts in another) worries me. It is generally not a good idea to allow cross-namespace references. If we have to do this, I think it must be one way only. Then all BareMetalHosts would be in a special namespace and no namespace reference is needed from the HostClaim side. This would be similar to ClusterIssuers from cert-manager.

If I understand correctly, the need to have the HostClaim in a separate namespace is so that we could keep it together with the Metal3Machine. I don't understand why this is needed though. Would it not make more sense to keep the HostClaim with the BareMetalHost and propagate the status back to the Metal3Machine? This is much closer to other providers (CAPA/CAPO, etc.). The user has no insight at all into the actual cloud resources outside of what is propagated to the InfrastructureMachine (unless they check it through a separate API). I think propagating more of the BareMetalHost status back to the Metal3Machine makes perfect sense and can hopefully simplify this a lot.

Author

> If I understand correctly, the need to have the HostClaim in a separate namespace is so that we could keep it together with the Metal3Machine. I don't understand why this is needed though. Would it not make more sense to keep the HostClaim with the BareMetalHost and propagate the status back to the Metal3Machine. This is much closer to other providers (CAPA/CAPO, etc). The user has no insights at all into the actual cloud resources outside of what is propagated to the InfrastructureMachine (unless they check it through a separate API). I think propagating more of the BareMetalHost status back to the Metal3Machine makes perfect sense and can hopefully simplify this a lot.

Except that if you do this, you have:

  • a Cluster API-only solution (i.e. I cannot use it for a workload that does not define a node);
  • a solution that will only work with BMHs and cannot easily be extended to other resources (even if this can be challenged: see the other comment).

Comment on lines 221 to 222
associated with the cluster. The associated BareMetalHost is not in the same
namespace as the HostClaim. The exact definition of the BareMetalHost remains
Member

This is something we should clarify, I think.
How should the BareMetalHost be picked? Across all namespaces? Should the HostClaim specify where they are? We should be able to have the BareMetalHosts in a completely different cluster than the Metal3Machines, IMO. Where should the HostClaims be then?

My idea would be to add an identityRef to the Metal3Cluster. It references a Secret with credentials that would be used for accessing the BareMetalHosts or HostClaims. The same identityRef field could be added to the Metal3MachineTemplate also. These could be propagated to the HostClaims so we get something like this:

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: my-host
spec:
  identityRef:
    name: my-credentials
```

Then the HostClaim can be in the same namespace as the Metal3Machine. However, it means that the user must be in possession of credentials that can access and modify the BareMetalHosts! This is because the controller (CAPM3) would need to use the credentials of the HostClaim to reach out to the BareMetalHost and modify it.

The other option is that the HostClaim itself lives next to the BareMetalHost. We then set the identityRef on the Metal3Cluster and Metal3MachineTemplate. Then CAPM3 creates HostClaims with the help of the provided credentials. It does not require access to modify the BareMetalHosts directly. This option seems more reasonable to me, but it has the drawback that the user cannot easily inspect the HostClaim.
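For the second option, the identityRef on the Metal3Cluster could look roughly like this (a sketch of the idea above; the field does not exist in CAPM3 today and its exact placement is an assumption):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Cluster
metadata:
  name: my-cluster
spec:
  # Hypothetical: Secret holding credentials that CAPM3 would use
  # to create HostClaims next to the BareMetalHosts
  identityRef:
    name: my-credentials
```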

Author

This is another approach. In your approach, a HostClaim represents a right to use a specific host and is the private part in the BMH realm. It is associated with a set of credentials representing the cluster. Then you will need a mechanism to create new HostClaims. I believe that your proposal is not so far from the Kanod BareMetalPool approach, except that you never give access to the real credentials of the BMH: you only expose the standard API offered by the BMH.

I think there are different objectives that can be addressed in different ways in both solutions:

  • Having the BareMetalHosts in a different cluster. It is a legitimate need, as hardware is often handled by different teams, but I did not put it in the objectives of this document. It can be addressed later, as experimented in Kanod. We use the capability of HostClaims to target different kinds of compute resources (hybrid) to implement a virtual HostClaim synchronized with a remote HostClaim in another cluster (https://gitlab.com/Orange-OpenSource/kanod/host-remote-operator). There is an equivalent of the identityRef in that solution. You have HostClaims on both sides, and they must be synchronized. The HostClaim in the cluster namespace has its spec copied to the remote HostClaim and inherits the status from the remote, and yes, the contract is tricky on metadata (labels and annotations).
  • Ensuring multi-tenancy isolation. This is done by the fact that HostClaims are not in the same namespace as BareMetalHosts and only offer an API with the credentials hidden. The user and the Metal3Machine controller can completely inspect the HostClaim object. When there is a remote HostClaim, the one on the BMH cluster could be in the same namespace, but it may be harder to avoid name collisions between objects, and if we want to use quotas, namespaces are also useful to distinguish tenants.

Your solution will not address the first use case (use without CAPM3), because your version of the HostClaim is invisible to the end user and you rely on the CAPM3 controller to talk directly with the service hiding the BareMetalHost.

Author

@lentzi90 I think the main point now is to decide whether having BMHs in different clusters is a goal that needs a scenario, or a non-goal. If it is a goal, then I also need to bring the remote HostClaim into the scope of this document; otherwise we keep it as a non-goal and mention that it will be addressed later.

Member

> We use the capability of HostClaim to target different kind of compute resources (hybrid) to implement virtual HostClaim synchronized with a remote HostClaim in another cluster

This does not make sense to me and sounds too complicated. Why not just propagate the status back to the Metal3Machine instead? Then there is no need for the remote HostClaim.

> Your solution will not address the first use case (use without capm3) because your version of HostClaim is invisible to the end user and you rely on capm3 controller to talk directly with the service hiding the BareMetalHost.

Can you explain this a bit more? The way I see it, we can propagate the status of the HostClaim back to the Metal3Machine. Then the user will get the information. If the user wants to manually handle HostClaims and skip CAPM3 then they already have access to the HostClaim and can see the status there.

> I think the main point now is to decide if having BMH in different clusters is a goal and needs a scenario or a non goal. If it is a goal, then I also need to bring the remote hostclaim in the scope of this document, or we keep it as a non goal and mention that it will be addressed later.

I think it must be a goal because it is ultimately a consequence of multi-tenancy. It is possible to implement it so that it only works in one cluster, but it should be trivial to support a remote cluster if the local cluster already works. Then I do not see a reason to limit it.

Author

@pierrecregut pierrecregut Apr 17, 2024

First, a HostClaim as I defined it is very similar to a Metal3Machine. In fact, the only difference in the spec is providerID (only in the M3M) and online (only in the Host), but in practice it is a little more than that, as there are a few things handled through annotations that are specific to cluster-api in the Metal3Machine controller. HostClaims are a way to decouple what is specific to cluster-api from what is generic to a compute resource.

Multi-tenancy for BareMetalHosts is not specific to the use with cluster-api. I would like to be able to launch workloads on servers that are shared with other users of the infrastructure even if the workload is not a node of a cluster. So I think that the implementation of multi-tenancy should be between HostClaim and BareMetalHost, i.e. in the HostClaim controller.

To implement multi-tenancy in a single cluster, I just need to make sure that I hide the credentials but provide sufficient feedback on the BareMetalHost. If I want to use it without CAPM3, this feedback cannot be propagated to the Metal3Machine, which does not exist in that scenario.

I can extend the solution to work with BareMetalHosts on a remote cluster. As in your proposal, it involves having a set of credentials for a remote API that gives access to the BMH. Again, I could implement it directly in the HostClaim controller, but if I want to support hybrid clusters, and if I need the same mechanism to access KubeVirt VMs on a remote cluster, then I prefer an implementation that lets me control any kind of HostClaim on a remote cluster. That is exactly what remote HostClaims do.

Your solution trivially supports remote clusters because it uses the mechanisms to access a remote cluster even when they are not needed.

On the other hand, we could also decide that the real controllers are always implemented by a service implementing the HostClaim API. The notion of kind we have could disappear: you would simply target another endpoint. Technically this would be almost the same implementation, but the HostClaim on the remote cluster would only exist "on the wire" and not as a resource. This means that the service implementation must be able to rebuild the model (internally you need it) from the compute resource alone (through labels and annotations added to mark how they are associated with the source HostClaim resources).

This is probably closer to what you had in mind, but the resource would still be in the cluster (tenant) namespace, not in the BareMetalHost namespace.

The only drawback from my point of view is the complexity of this server when it is not needed (single cluster). Especially with metadata, the flow of information between clusters is not easy, and having a single implementation may ensure more coherence. We also lose the ability to implement quotas for HostClaims, as this must be done on a resource existing on the target cluster, not on the tenant cluster.
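To make the comparison with Metal3Machine concrete, a minimal HostClaim as discussed in this thread might look like this (a sketch: only the online field and a host selector are taken from the discussion; the remaining names are illustrative assumptions):

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: my-workload-host
  namespace: tenant-a
spec:
  # requested power state of the underlying compute resource
  online: true
  # selector used to pick a matching BareMetalHost (or VM)
  hostSelector:
    matchLabels:
      cpu-arch: x86_64
```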

@pierrecregut
Author

@lentzi90 I think I need to write down the alternatives/risks sections right now. I have a draft but it is not ready yet (pierrecregut@5ed5f5f). I will push it here when it is ready, and I will try to faithfully incorporate your approach.

@metal3-io-bot metal3-io-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 12, 2024
@pierrecregut
Author

I have added the 'alternative solutions' part and 'risks and mitigations', so that it is clearer how HostClaims compare with alternative solutions. I think I have captured @lentzi90's proposal in "HostClaims as a right to consume BareMetalHosts", but I probably missed part of it. If some obvious alternatives are also missing, it is time to add them. The advantages/drawbacks are biased and I am open to improvements. Unfortunately, the size of the PR is now clearly XL.

* security risks that must be addressed
* alternatives for the different use cases (multi tenancy or hybrid).

Co-authored-by: Pierre Crégut <[email protected]>
Co-authored-by: Laurent Roussarie <[email protected]>

Signed-off-by: Pierre Crégut <[email protected]>
Member

@lentzi90 lentzi90 left a comment

Thank you again for the work on this!

As mentioned in the comments above, my main motivation for a HostClaim is to achieve multi-tenancy the way CAPI defines it. I think we need to shape it so that we get a similar flow as other infrastructure providers. This means that CAPM3 would deal with disposable resources (HostClaims) through an API (the Kubernetes API of the cluster holding the BareMetalHosts) protected by credentials. This was the original goal of the architecture, but then CAPM3 grew to become completely coupled to BMO, and this is something I hope to fix now.

One thing that is not clear to me is how you intend things to work with VMs instead of BareMetalHosts. Are you expecting to create a separate controller instead of BMO?

@pierrecregut
Author

> This means that CAPM3 would deal with disposable resource (HostClaims) through an API (the Kubernetes API of the cluster holding the BareMetalHosts) protected by credentials. This was the original goal of the architecture but then CAPM3 grew to become completely coupled to BMO and this is something I hope to fix now.

I completely agree with this shift from recyclable to disposable resources. The main question is where we expose the API, what is kept and manipulated as a resource, and what is just a message over a link.

> One thing that is not clear to me is how you intend things to work with VMs instead of BareMetalHosts? Are you expecting to create a separate controller instead of BMO?

Yes, and we have already done it with the Host resource in Kanod (which is a HostClaim with a few defects that we intend to correct for the final version). The implementation for KubeVirt VMs is https://gitlab.com/Orange-OpenSource/kanod/host-kubevirt, and the version for having hosts on a remote cluster is https://gitlab.com/Orange-OpenSource/kanod/host-remote-operator. The implementation for BMH is in https://gitlab.com/Orange-OpenSource/kanod/host-operator, but mixing the generic definition of the custom resource (with the quota webhook) and the specific implementation for BMH was not necessarily the best idea.

@lentzi90
Member

Ok I think I understand now how you want the HostClaim to work. Let me see if I can suggest something that would work both for the hybrid, non-CAPI case and for CAPI without making things too complicated.

This is what I want to avoid. It is too complicated, and I see no reason (from CAPI/CAPM3 perspective) to have the LocalClaim.

                
```
┌────────┐
│Machine │
├────────┴┐
│M3Machine│
├─────────┴┐
│LocalClaim│
└──────────┘

┌─────────────┐
│RemoteClaim  │
├─────────────┤
├─────────────┤
│BareMetalHost│
└─────────────┘
```

Previously I suggested something like this, where the M3Machine fills the same need as the LocalClaim. I understand that this is not ideal for your use-case because there is no M3Machine then.

```
┌────────┐
│Machine │
├────────┴┐
│M3Machine│
└─────────┘

┌─────────────┐
│RemoteClaim  │
├─────────────┤
├─────────────┤
│BareMetalHost│
└─────────────┘
```

Now I suggest instead this:

```
┌────────┐
│Machine │
├────────┴┐         ┌─────────┐        ┌─────────┐
│M3Machine│         │HostClaim│        │HostClaim│
└────┬────┘         └────┬────┘        └────┬────┘
     │                   │                  │
     │                   │                  │
┌────▼────────┐     ┌────▼────────┐    ┌────▼─────────┐
│BMHClaim     │     │BMHClaim     │    │VMClaim       │
├─────────────┤     ├─────────────┤    ├──────────────┤
├─────────────┤     ├─────────────┤    ├──────────────┤
│BareMetalHost│     │BareMetalHost│    │VirtualMachine│
└─────────────┘     └─────────────┘    └──────────────┘
```

This way you can implement the HostClaim controller separately and get all the functionality you want. For Metal3, we would add the BareMetalHostClaim to BMO. No separate controller is needed for this: BMO can reconcile them and match them with BareMetalHosts.
CAPM3 would need some parts similar to the HostClaim controller, but only for BareMetalHostClaims. I don't see that we would make it work with VMs, for example.

What do you think about this?
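For illustration, a BareMetalHostClaim in this picture might look roughly like this (the kind and field names are assumptions, not a settled API):

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHostClaim
metadata:
  name: claim-1
  namespace: bmh-namespace
spec:
  # selector restricting which BareMetalHosts can be matched
  hostSelector:
    matchLabels:
      pool: workers
status:
  # filled in by BMO once a BareMetalHost is bound to the claim
  hostRef:
    name: bmh-2
```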

@pierrecregut
Author

pierrecregut commented Apr 18, 2024

> This is what I want to avoid. It is too complicated, and I see no reason (from CAPI/CAPM3 perspective) to have the LocalClaim.

The LocalClaim serves two purposes:

  • It avoids duplication of work between an M3M controller and what you call a HostClaim controller in the last figure.
  • It freezes the contract offered by the API accessing the BMH side, which is important if we want alternative implementations.

As an additional bonus point, I think it would give node-reuse a cleaner implementation, because it means that we keep the host binding rather than creating a new one.

> CAPM3 would need some parts similar to the HostClaim controller but only for BareMetalHostClaims. I don't see that we would make it work with VMs for example.

The whole point of supporting hybrid is to make CAPM3 work with VMs. And if the contract exposed by the HostClaim is clear, this is exactly what we have. With our PoC, we can create workload clusters with either the control plane or some of the machine deployments on KubeVirt VMs (we use Multus and a bridge on each node of the management cluster to attach VMs to the network of the bare-metal part of the target cluster). Then you can even change the compute resource kind of machine deployments between upgrades seamlessly.

If we are not sharing the CR on compute resources, I would rather avoid the BMHClaim/VMClaim parts. The only thing you need to know is, for each user, a mapping from LocalClaim ids to BMHs (credentials being associated with the user). The reason we have a claim on the remote part is that we distinguish the implementation of the wire protocol (and with it the notion of users) from the handling of compute resources.

In our implementation of remote HostClaims, we have a very crude model of user (just a token representing more or less a namespace). A production-grade version would probably distinguish a declarative part (which cluster, which namespace) from the authenticated part (the user) and would provide an interface to identity management frameworks (Keycloak and others) that handle the definition of users and credential protection. We have not looked seriously at that part, and one of the advantages of our approach is that it was very easy to replace our local HostClaim <-> remote HostClaim controller... or not use any when the BMHs and the clusters are in the same management cluster.

The compromise I suggested yesterday was rather the following, where circles represent services, not resources. The notion of user is not represented. The important part is that the dialog between the HostClaim controllers and the BMH/VM servers is standardized.

```mermaid
graph TD;
   m3m1[Metal3Machine] --> hc1[HostClaim 1];
   m3m2[Metal3Machine] --> hc2[HostClaim 2];
   hc1 --> bms((BMH Server));
   hc3[HostClaim 3] --> bms;
   hc2 --> vms((VM server));
   hc4[HostClaim 4] --> vms;
   subgraph BareMetal Domain
   bms --hc1--> bmh1[BMH 1];
   bms --hc3--> bmh2[BMH 2];
   bms --> bmh3[BMH 3];
   end
   subgraph KubeVirt Domain
   vms --hc2--> vm1[VM 1];
   vms --hc4--> vm2[VM 2];
   end
```

A last warning: the pivot of the bootstrap cluster, which is still simple when we have the M3M, Host and BMH in the same namespace during initialization, becomes much more complex if we always have a service that must migrate.

@lentzi90
Member

I'm not convinced that we need a separate HostClaim controller. That is why I see no issue with duplicating the work of that controller in CAPM3.

If the whole point is to get CAPM3 to work with VMs, and especially KubeVirt, what about this? It seems like a very tempting alternative to me. We just add support for BMCs in KubeVirt and the whole Metal3 stack can work with it directly.
This is what I hope to use for our CI in the future.

The multi-tenancy / decoupling of CAPM3 from the BMHs is still very much needed, but for that I see no need for the LocalClaim. The Metal3Machines are supposed to fill that same function, just like OpenStackMachines do in CAPO. They are the local resources that the user can inspect.

@pierrecregut
Author

pierrecregut commented Apr 19, 2024

KubeVirt is just an example. You can add a BMC to most VMs. This has existed in OpenStack for testing purposes for a long time. The Sylva project has already developed an equivalent of the mentioned blueprint for KubeVirt for their CI, but going directly to the libvirt layer: https://gitlab.com/sylva-projects/sylva-elements/container-images/libvirt-metal#libvirt-metal. Unfortunately, Redfish/IPMI manages hardware, it does not create it. So we completely lose the advantages of being a disposable resource, or we again need a notion of pools synchronized with cluster size. Things quickly become very complex again, and that is probably why nobody has pushed it further. Networking and VM definitions are really mixed in KubeVirt, and as networking must be reconfigured for each cluster, pre-definition of VMs is also harder.

@lentzi90
Member

Ok, makes sense. So you will need some code in CAPM3 to make it possible. (Yes, it could be in the HostClaim controller instead, but I hope you see why we may want to include that in CAPM3/BMO directly.)

I think the core issue remaining here is how the HostClaim is bound to the BareMetalHost (or VM). If I understand correctly, you would like a HostClaim controller (HCC) that is independent of BMO and CAPM3. It would bind the HostClaims to BMHs or VMs and propagate the status back. I think that should be part of BMO directly. It would be such an integral part of how to work with BMO.

I'll be back with more comments on how to bind the HostClaim later.

@pierrecregut
Author

> If I understand correctly you would like a HostClaim controller (HCC) that is independent of BMO and CAPM3. It would bind the HostClaims to BMHs or VMs and propagate status back.

In the proposal, there is no piece of code that handles both BMHs and VMs at the same time. I will try to summarize what I think our mental models are. Please correct whatever I misrepresent. If we forget about the single-cluster case, where my proposal did not need an intermediate protocol, the core of what we agree on is:

```mermaid
graph TD;
  Metal3Machine --> i1[...];
  i1 -->p[/REST API for Host: ↓ selector + workload -  ↑ resource status + most metadata/];
  p --> c((server));
  c --> i2[...];
  i2 --> BareMetalHost
```

Rectangles are custom resources, parallelograms are APIs, and circles are HTTP endpoints implementing the API.

My initial proposal was:

```mermaid
graph TD;
   Metal3Machine --> hc0[HostClaim - remote];
   hc0 --> p[/REST API/];
   p --> c((HostClaim server));
   c --> hc1[HostClaim - baremetal];
   c --> hc2[HostClaim - kubevirt];
   hc1 --> BareMetalHost;
   hc2 --> vm[Kubevirt VM];
```

The word after "HostClaim" is something that can act as a compute-kind selector for controllers. To do the selection on the K8s API server side, a label should be used, but that is just a code optimization. We have as many controllers as there are selectors.
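For example, the compute-kind selection could be carried by a label on the HostClaim (the label key here is an assumption for illustration):

```yaml
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: my-host
  labels:
    # hypothetical label letting each controller watch only the
    # compute kind it handles (baremetal, kubevirt, ...)
    hostclaim.metal3.io/compute-kind: kubevirt
```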

If I understand correctly, your proposal is rather something like:

```mermaid
graph TD;
  Metal3Machine --> p[/REST API/];
  p --> c((BMH server));
  c --> BareMetalHost;
```

You have a resource between the BMH server and the BareMetalHost, but if I am not wrong, it only either uniquely identifies a user or creates a unique password for the BareMetalHost, depending on the way you restrict the access. Note that you probably want a notion of user to limit the scope of "watch" (something you need to get a single channel to be notified of status changes). With the user approach, I would just use a custom resource representing the user as the consumerRef and use the name of the BMH as the id (I don't think it reveals anything secret to the other end).
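The existing consumerRef field on the BareMetalHost could carry that user reference, e.g. (the TenantUser kind is a hypothetical illustration; only a fragment of the BareMetalHost spec is shown):

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: bmh-1
spec:
  # consumerRef is an existing BareMetalHost field; here it points
  # at a hypothetical custom resource representing the tenant/user
  consumerRef:
    kind: TenantUser
    name: tenant-a
    namespace: tenants
```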

A custom resource hides both an API and the K8S apiserver that implements this API and delegates actions to the resource controller watching the resource.

So, what I am not sure about is whether, in the above drawing, the protocol/server pair is the Kubernetes apiserver (you mentioned directly using a kubeconfig at some point) or whether, as recommended in the Kubernetes documentation, you implement your own server. With Kubernetes RBAC, it may be difficult to restrict a given user's access to only the BMHs it owns unless you have separate namespaces.
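As a concrete illustration of the separate-namespaces case, standard RBAC is enough when each tenant gets its own namespace; a sketch, with the tenant and service-account names invented for the example:

```yaml
# Per-namespace restriction with plain Kubernetes RBAC: the tenant's
# service account can only manage HostClaims in its own namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hostclaim-editor
  namespace: tenant-a
rules:
  - apiGroups: ["metal3.io"]
    resources: ["hostclaims", "hostclaims/status"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hostclaim-editor
  namespace: tenant-a
subjects:
  - kind: ServiceAccount
    name: capm3-tenant-a
    namespace: tenant-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hostclaim-editor
```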

If we agree on a public, versioned API, I think we can regain more or less what is needed for hybrid clusters and non-k8s workloads with:

graph TD;
  Metal3Machine --> p[/REST API/];
  Metal3Host --> p;
  p --> c((BMH server));
  c --> BareMetalHost;
  p --> c2((VM server));
  c2 --> vm[Kubevirt VM];

There is no real difference between Metal3Host and HostClaim - remote from a previous drawing (I have used the name given in a previous design proposal).

@lentzi90
Member

Sorry for my slow response. There are too many other things ongoing.
I think we can get away with something simpler still.

With Kubernetes RBAC, it may be difficult to restrict a given user's access to only the BMHs it owns unless you have separate namespaces.

Yes, to get proper isolation we would need separate namespaces. However, that does not mean we will necessarily be limited to separate BMH pools. I suggest BMO would be responsible for associating BMHs to HostClaims. This means that we can give the power to control how that happens to the same entity that creates the BMHs.

  • CAPM3 user needs access to create HostClaims in the cluster where the BMHs are
  • BMO owner creates BMHs and policies for what HostClaims they should be associated with
flowchart LR;
  CAPM3 --creates--> c[HostClaim];
  BMO--associates--> c;
  c --associated with--> BareMetalHost;
  c --approved by--> BareMetalHostPolicy;

This way, the BMO owner can write a policy to approve or deny HostClaim associations. (I stole this idea from cert-manager approver policies.) They can allow associations from multiple namespaces to get a common pool, or they can allow only one specific namespace. We could even make the policies more fine-grained if needed.
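A policy resource along these lines could look like the sketch below; the kind name and schema are purely hypothetical, modeled on cert-manager approver policies as mentioned above:

```yaml
# Hypothetical policy: HostClaims from the listed namespaces may be
# associated with BareMetalHosts living in the bmh-pool namespace.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHostPolicy
metadata:
  name: shared-pool
  namespace: bmh-pool
spec:
  allowedNamespaces:
    - tenant-a
    - tenant-b
  # a wildcard entry could express the "free for all" case
```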

@pierrecregut
Author

Yes, to get proper isolation we would need separate namespaces. However, that does not mean we will necessarily be limited to separate BMH pools. I suggest BMO would be responsible for associating BMHs to HostClaims. This means that we can give the power to control how that happens to the same entity that creates the BMHs.

If we have separate namespaces and the HostClaim resource contains the requirements and the selectors, I think that is exactly the HostClaim resource as it is described. The only question where your suggested implementation and ours differ is the location of the controller. I totally agree with putting the controller in the BMO project. The only reason we did it differently is that we tried to limit the places where we forked existing metal3 projects in the scope of the PoC.

@pierrecregut
Author

  • CAPM3 user needs access to create HostClaims in the cluster where the BMHs are

I think that is the real point where we differed. In the case of cross cluster, you assume that the user will somehow give a kubeconfig with a restricted scope in a secret to the capm3 controller. I guess that if we are in the same cluster, we could just assume that the HostClaim must be created in the same namespace as the M3M and in that case we are again back to the existing proposal.

We differ because:

  • we implemented the cross-cluster case as a separate operator so that we could use it directly when capm3 is not involved. There is no real problem pushing that code inside capm3, except that we lose the ability to share it with other custom resources. This is less of a priority if we already have HostClaim as a usable resource on a given cluster.
  • we do not rely on kubeconfig. We want to automate user and cluster creation. We do not want to have an entity that can create a role over any namespace and a user. This is equivalent to having a controller with full admin power. But my implementation is not great in the sense that it reimplements a lot of the apiserver in a hacky way.

A colleague suggested implementing a proxy. That would be more or less transparent to the capm3 side and would solve our security concern. The proxy would have full privileges on hostclaim resources in any namespace. It would replace user identities with its own after verifying that the user is entitled to do the request it asks for. That way we are sure that we do not create a security hole that goes beyond the scope of Host and BMH management.

I will modify the proposal to introduce the kubeconfig credential that will be used in place of the standard service account. It could use the namespace defined in the context as target namespace. We will keep the implementation of a proxy in mind to avoid giving an agent too many rights when we want to automate the management of users.
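The kubeconfig credential could be carried the usual way, as a Secret referenced from the claiming side; a sketch under the assumption that the context's namespace selects where HostClaims are created (all names illustrative):

```yaml
# Illustrative Secret holding a scoped kubeconfig for the cluster that
# hosts the BareMetalHosts; the context namespace is the target namespace.
apiVersion: v1
kind: Secret
metadata:
  name: bmh-cluster-credentials
  namespace: user1-ns
type: Opaque
stringData:
  kubeconfig: |
    apiVersion: v1
    kind: Config
    current-context: hostclaims
    contexts:
      - name: hostclaims
        context:
          cluster: bmh-cluster
          namespace: user1-ns   # target namespace for HostClaims
          user: tenant-user
    # clusters and users entries elided in this sketch
```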

@pierrecregut
Author

This way, the BMO owner can write a policy to approve or deny HostClaim associations. (I stole this idea from cert-manager approver policies.) They can allow associations from multiple namespaces to get a common pool, or they can allow only one specific namespace. We could even make the policies more fine-grained if needed.

BMO policies are closer to quotas on pods than to cert-manager policies: quantitative and clearly associated with a user/namespace. That is why Laurent first restricted HostQuotas to namespaces, using the k8s quota implementation as a reference (https://gitlab.com/Orange-OpenSource/kanod/host-quota). OpenShift has a notion of cluster-wide quotas for projects, but I believe it expresses constraints on each namespace.

@lentzi90
Member

I think that is the real point where we differed. In the case of cross cluster, you assume that the user will somehow give a kubeconfig with a restricted scope in a secret to the capm3 controller. I guess that if we are in the same cluster, we could just assume that the HostClaim must be created in the same namespace as the M3M and in that case we are again back to the existing proposal.

We can default to use "in-cluster credentials" (i.e. service account and RBAC). It would not have to be in the same namespace but it would probably make sense for single cluster scenarios anyway.

we do not rely on kubeconfig. We want to automate user and cluster creation. We do not want to have an entity that can create a role over any namespace and a user. This is equivalent to having a controller with full admin power. But my implementation is not great in the sense that it reimplements a lot of the apiserver in a hacky way.

I struggle to understand this. If you create the cluster, you are usually cluster admin anyway? An entity that manages the users does not need to be coupled to the Kubernetes API and RBAC, depending on what you use for authentication. You could handle the users separately and use OIDC or JWT with user groups to get the correct privileges in the cluster.

I don't mean that we have to use kubeconfigs directly, but considering that the format is well tested and contains everything we need, I think it is the obvious choice. We can consider other alternatives of course, but BMO provides a Kubernetes API, and then kubeconfig files make a lot of sense.

A colleague suggested implementing a proxy. That would be more or less transparent to the capm3 side and would solve our security concern. The proxy would have full privileges on hostclaim resources in any namespace. It would replace user identities with its own after verifying that the user is entitled to do the request it asks for. That way we are sure that we do not create a security hole that goes beyond the scope of Host and BMH management.

Not sure I understand this. To me it sounds like the proxy is the security issue here, with full access to all BMHs through HostClaims. Who would control this proxy? Where would it run? How would authentication work, and is there a reason to do that in the proxy instead of in the Kubernetes API server?

I will modify the proposal to introduce the kubeconfig credential that will be used in place of the standard service account. It could use the namespace defined in the context as target namespace.

Sounds good!

BMO policies are closer to quotas on pods than to cert-manager policies: quantitative and clearly associated with a user/namespace. That is why Laurent first restricted HostQuotas to namespaces, using the k8s quota implementation as a reference (https://gitlab.com/Orange-OpenSource/kanod/host-quota). OpenShift has a notion of cluster-wide quotas for projects, but I believe it expresses constraints on each namespace.

I'm not after limiting the amount or type of hosts at this point. I do see the usefulness of that and we should definitely add it. However, I think we must first make sure that authorization is in place. Not "you cannot use so much", rather "you have no access here".

I think there are two scenarios, with one special case:

  1. Full isolation: BMHs are grouped per user in separate namespaces or even clusters. BMO/BMHs potentially configured to only associate HostClaims in the same namespace. If they are in separate clusters there is no need for that.
  2. Common pool(s): A pool of BMHs in one namespace are shared by multiple users, but not all. A policy is needed to tell BMO what HostClaims are allowed to be associated. E.g. BMHs in namespace A can be associated with HostClaims in namespaces X, Y and Z. No other HostClaims are allowed.
  3. Free for all: All BMHs are in a single namespace with no limitations on what HostClaims can be associated. This is a special case of scenario 2.

@pierrecregut
Author

The case I want to address is when you have an infrastructure-wide system for managing users and identities. You create a new user X that should be able to describe a new cluster in management cluster A using BareMetalHosts in cluster B. How do I automate the creation of a namespace and the associated kubeconfig in B without a process that has full admin rights in B, which should not be strictly necessary?

The idea was to have just a proxy in cluster B that has only rights on HostClaims (so more or less what a HostClaim controller already has) and that uses the centralized authentication/authorization management system to check whether the credentials it has received can be used to perform an action in a specific namespace. We still need some glue around it to create and distribute such credentials, but at least there is a path where we do not create a new Kubernetes user and a RoleBinding (to a ClusterRole granting control over HostClaims).

The issue can be left open for the moment as it can be addressed separately.

@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zaneb for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pierrecregut
Author

I have pushed a new version with the following modifications:

  • state clearly that network isolation is outside the scope of the proposal
  • two small modifications to make clear why all clusters would end up in the same namespace (it is where the baremetalhosts are) and that the workload mentioned is being a node in the target cluster.

Member

@zaneb zaneb left a comment


Hi!
It looks like you may be unaware that I have a longstanding proposal for decomposing the BMH API in a way that supports multitenancy.

Comparing the two proposals, the major differences I see are:

  • I propose a resource type (BareMetalInventory) that allows the user to get a list of available Hosts that match a selector, rather than request one and not know whether it will be fulfilled or not. I've observed that users often care deeply about which host is which in a data center, even though you'd think we'd be past the point where that matters.
  • You propose to make a Host available in multiple different namespaces, and allow any one of them to claim it. While my proposal could easily accommodate that, I started from the assumption that Hosts would be pre-allocated to individual namespaces by some other process or administrative action. It's not clear to me what the right thing to do here is - being able to allocate dynamically from a pool of hosts is appealing, but if this is done within Metal³ then there will be pressure to have Metal³ automatically reconfigure the network. Whereas if the allocation process is externalized, the network reconfiguration can be as well. In general, users tend to want to utilize all of their hosts; it's rare to see bare metal left sitting idle. It's also rare for private clouds to implement chargeback schemes that give tenants an incentive not to just grab all of the available hosts in the pool; more common would be to allocate a fixed amount of capacity to a tenant based on their contribution to a capex budget.
  • I propose a first-class API (BareMetalAllocation) for specifying which namespace(s) can provision a host. You should do this too.
  • My proposal incorporates a detailed breakdown of all the fields in the BMH and which ones should be configurable by which actor (including the ability for an administrator to set defaults that can be overridden by the user), rather than just the 3 you care about (Image URL, User data, and Network data). This is more flexible (especially, but not only, for non-CAPI use cases) and could eventually lead to a future version of the API where a BMH resource is more like just a "BMC" resource, and not just collecting every field used by every actor together in one CRD as it has historically done.

I'd be interested to hear your comments.

recycled and are bound to new HostClaims, potentially belonging to other
clusters.

#### Hybrid Clusters
Member

This seems clearly out of scope for Metal³ to me. If you want to mix infrastructure types within a cluster, then the other types should be handled by their respective CAPI infrastructure providers. Dealing with multiple infrastructure types in a cluster is CAPI's job, and if improvements in CAPI or in individual infrastructure providers are needed they should be driven there, not by creating an inner-platform inside Metal³.

Member

Yeah, I had the same concern. It's clearly something CAPI should provide. If they don't want to provide that, is it our job to basically invent a substitute CAPI feature? We need to keep in mind that the Metal3 team is rather small. We already struggle with reviews on some projects. We need to make sure we're focused on the primary mission - bringing bare-metal provisioning to kubernetes - even if it requires making some tough calls.

Author

@zaneb I do not think that the proposal fit the inner platform pattern:

  • It does not add new core functionality, there is nothing new in the proposal that fits the notion of "universal interpreter" to embed in Metal3
  • It is simply putting APIs (contracts) at new places so that the framework can be used differently and other projects can use parts of its functionality instead of reimplementing the same thing

On the other hand, trying to pretend that everything has a BMC and pre-starting/configuring VMs so that they can be used by Metal3 looks closer to an anti-pattern to me.

@dtantsur, I perfectly understand your fear of having to address new feature requests with a limited number of developers, but HostClaim does not add new needs for capm3 and it should help expand the community. Metal3 must not implement the other back-ends for disposable compute resources (kubevirt, openstack, vmware or other). It is up to the HostClaim controller implementations to respect the HostClaim contract in the same way as they accept the CAPI contract; if that is not possible, they can always stay as standard infra providers.

Why is the hybrid cluster use case not handled at the CAPI level? I think there are at least three reasons:

  • The initial design did not consider it. This is not a good reason, but the separate evolution of the infra providers makes it harder to reconsider the initial design choices, and the next two factors are probably the most important.
  • Most providers handle networking. For the hybrid case, the usual "one size fits all" approach to networking that the providers implement just does not work. Metal3 does not address networking and should not.
  • Providers for public clouds and providers for bare-metal/private clouds do not necessarily access the same kind of API. For those, it is not clear that the HostClaim approach would be easy or wise to adopt. But that does not mean that putting a branching point at the level of the HostClaim rather than at the level of the machine, for all the use cases that support it, is not the right software engineering choice.

A lot of infra providers implement exactly the same functions in different ways. Not only does this mean developers redoing and maintaining different implementations, but for end users, and especially frameworks orchestrating cluster creation, it means slightly different semantics and completely different ways of passing parameters. Two examples:

  • the way for a node to tell it is up uses an ssh connection with kubevirt (see the open issue complaining about it). The capm3 implementation goes through the cluster API. That code would be shared and implemented once with HostClaims
  • the feature to access compute resources on another cluster, implemented with either CAPI Multi-tenancy contract proposal for CAPM3 #429 or the remote HostClaim, would also be useful for the kubevirt provider to be able to create VM-based clusters over other clusters. Unfortunately, as long as they are completely different providers, they would have to reimplement it.

Author

@zaneb unfortunately the google document is private, so I cannot read it. But from your explanation I see a few drawbacks:

  • knowing the tenant does not mean knowing the network configuration of nodes, and I do not see why it should be so static. Not all cluster nodes need the same connectivity.
  • we are looking for ways to dynamically share infrastructure for clusters that have different activity profiles and reduce hardware over provisioning.

It is possible to reconfigure the network dynamically, either from a centralized manager (the approach we chose in Kanod) or, better, from the clusters themselves with host-based routing. On the other hand, I think we should always distinguish the provisioning network seen by Ironic and OS image repositories, which is star-shaped (servers can only communicate with Ironic), from the network underlay for the workload, which is the one that is reconfigured. On those networks, capm3, potentially not in the same network as Ironic, needs access to the k8s endpoint (the reverse is not true).

Member

CAPI provides an abstraction layer over different platforms' ways of provisioning machines.
This proposal provides an abstraction layer over different platforms' ways of provisioning machines, inside one of the CAPI platform providers.
It's exactly the inner platform effect. It's the most canonical example I've seen since people started writing document layout programs in JavaScript.

It would be naive to think that it will end with adding this one field. If anybody uses this, then we will inevitably be called upon to reimplement all of the features of CAPI.

From the fact that we are still speculating about the reasons that CAPI doesn't support this, I would infer that you have not actually approached the CAPI project and explained your requirements to them. I certainly understand why taking a more direct route to implementing the thing you want is appealing for you, but it rarely ends there and this is why open source projects must be protective of their scope.

The API decomposition doc is shared with the metal3-development mailing list, so if you join the project mailing list you'll be able to read and comment.

Author

Since you pointed to the Wikipedia article, I cite this explanation: "The inner-platform effect occurs when this library expands to include general purpose functions that duplicate functionality already available as part of the programming language or platform." And you also said: "we will inevitably be called upon to reimplement all of the features of CAPI."

I really fail to see where the proposal reimplements anything from CAPI. It is just a different branching point allowing more code reuse than was previously possible.

Member

On the CAPI topic: have you heard a definite "no, we won't do this" from them?

The solution is to enforce that BareMetalHost that can be bound to a
HostClaim have a label (proposed name: ``hostclaims.metal3.io/namespaces``)
restricting authorized HostClaims to specific namespaces. The value could be
either ``*`` for no constraint, or a comma separated list of namespace names.
Member

Why not make a proper API for this? There's no reason it needs to be so janky.

Author

Yes, a proper field would be better. But I do not think that a complex authorization framework is a good idea.
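For concreteness, the label discussed above would sit on the BareMetalHost like this (server details invented for the example):

```yaml
# BareMetalHost opting in to HostClaim binding from two namespaces
# via the proposed hostclaims.metal3.io/namespaces label.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-42
  namespace: bmh-pool
  labels:
    hostclaims.metal3.io/namespaces: "user1-ns,user2-ns"
spec:
  online: true
  bootMACAddress: "52:54:00:00:00:01"
  bmc:
    address: redfish://192.0.2.10/redfish/v1/Systems/1
    credentialsName: server-42-bmc-secret
```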

Another unrelated problem is that Cluster-API has been designed
to define clusters using homogeneous compute resources: it is challenging to
define a cluster with both bare-metal servers and virtual machines in a private
or public cloud.
Member

What's the reasoning the CAPI developers give for not supporting heterogeneous clusters? Are they aware of the use cases?

use hostSelectors. Controllers for a "bare-metal as a service" service
may use selectors.

#### Compute Resources in Another Cluster
Member

This looks like a feature that could be implemented separately from the larger HostClaim discussion.

Author

@pierrecregut pierrecregut Jul 5, 2024

I think it will be #429 and be included in the capm3 code. For HostClaim, you need to add the resource to the list of tracked remote resources (BMH in the proposal). I suggested using another hostclaim controller, but it was rejected. An example implementation along #429 principles is here (for hostclaims only, not bmh).

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 6, 2024
@metal3-io-bot metal3-io-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 23, 2024
@pierrecregut
Author

There is a strong consensus in the metal3 team against the hybrid use case so I am removing it. I think that the relevant multi-tenancy part is still readable. The proposal is renamed to reflect the reduction of scope.

Consensus against from Metal3 dev team.

Signed-off-by: Pierre Crégut <[email protected]>
@metal3-io-bot metal3-io-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 23, 2024
@pierrecregut
Author

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2024
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2025
@pierrecregut
Author

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2025
name: host-claim-yyyy
namespace: user1-ns
spec:
kind: baremetalhost
Member

Let's make sure to default kind to baremetalhost: most users will want to provision bare-metal only

Author

Oops, I should have removed all references to hybrid clusters (even if I think it is the right approach, there was a consensus, which I do not understand, against it). The mention of a kubevirt VM a few lines above must be removed too.

Member

The good thing is: we can discuss hybrid clusters separately once we fix the pressing problem.

namespace: user1-ns
spec:
kind: baremetalhost
credentials: bmh-cluster-credentials
Member

Let's call it identityRef for similarity with CAPM3 resources

spec:
kind: baremetalhost
credentials: bmh-cluster-credentials
userData:
Member

Let's group userData, image and co under something like template for the UX benefit (similarity to Deployments, etc)
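Taken together with the identityRef suggestion above, the reshaped spec could read as follows; this is only an illustration of the suggested grouping, not a finalized schema, and the image and secret values are invented:

```yaml
# Illustrative HostClaim spec after the review suggestions:
# identityRef instead of credentials, provisioning inputs under template.
apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: host-claim-yyyy
  namespace: user1-ns
spec:
  kind: baremetalhost
  identityRef:
    name: bmh-cluster-credentials
  template:
    image:
      url: https://example.com/images/node.qcow2
      checksum: https://example.com/images/node.qcow2.sha256sum
    userData:
      name: node-user-data
```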

Metal3Machine. As in the original workflow, it is used to choose the BareMetalHost
associated with the cluster, but the associated BareMetalHost is not in the same
namespace as the HostClaim. The exact definition of the BareMetalHost remains
hidden from the cluster user; only parts of its status and metadata are copied
Member

Let's mention what we discuss about showing HardwareDetails as a way for users to know which hosts are available to them.

Author

Yes, that is the new proposal I have to write. We called it HardwareDetail during the meeting, as in Zane's proposal, but there is already HardwareData with its hardwareDetails field. Do we want something else? It is just promoting HardwareData from an internal tool role to first-class use. Then, at some point, the BMH does not even need to create the status.hardware field, the inspect.metal3.io/hardwaredetails annotation becomes useless, and you can have your Netbox -> HardwareData automatic provisioner if you want (inspect.metal3.io: "" on the other hand could become an official boolean field).

But if we only see HardwareData, there is still the problem I mentioned: how do we access fields for resolving fromAnnotations and fromLabels?

  • Do we copy them from the BareMetalHost?
  • Let the user create the HardwareData object (populated by inspection or other means) and put whatever metadata they want?

That could be a point of friction when switching from BMH to Hostclaims.
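For context, this is roughly how CAPM3 data templates consume host labels today; the field names below are shown from memory of the Metal3DataTemplate API (check the current CAPM3 reference for the exact schema), and the label key is invented. The open question above is which object such lookups would resolve against once HostClaims hide the BareMetalHost:

```yaml
# Metal3DataTemplate metadata lookup against host labels; with HostClaims
# the question is whether the source object is a metadata copy on the
# claim or the HardwareData resource.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3DataTemplate
metadata:
  name: nodepool-template
  namespace: user1-ns
spec:
  clusterName: my-cluster
  metaData:
    fromLabels:
      - key: rack                     # key in the generated metadata
        object: baremetalhost         # object carrying the label
        label: infra.example.com/rack # label to read
```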

Member

I thought you meant to use HardwareData (the existing one) as a sort of public inventory? Did I misunderstand?

But if we only see HardwareData, there is still the problem I mentioned: how do we access fields for resolving fromAnnotations and fromLabels?

I agree. That's why I initially assumed that there would be a new CRD representing inventory available for a user. I guess that's what Zane called BareMetalAllocation?

Author

Yes, I think we should use HardwareData and not invent something new. Adding yet another CRD for just the metadata seems, on the other hand, heavyweight.

  • One solution to avoid the need for a CRD is to use HardwareData metadata as the source of metadata for templates. It means that users can create them, and even if the spec is populated by the BMH controller, the resource itself is kept (a createOrUpdate behavior that does not try to change metadata). One of the drawbacks is that it is another disruptive change, as hardware admins used the BMH resource for that (but creating yet another CRD is worse).

  • Another solution is to dynamically synchronize BMH metadata and the HardwareData resource. There are two options:

    • synchronize with the HardwareData metadata, but then there is a risk of collision. Advantage: it could be an optional behaviour, and the default during migration from the previous behaviour to the solution described above. The drawback is that we need to be careful if we want to let other sources define some metadata on HardwareData for other purposes (we end up with reserved annotations keeping track of what was synchronized from the BMH, and this is not clean). If we view it as a temporary migration behaviour, we can choose full synchronization (metadata put directly on HardwareData would be destroyed).
    • copy to HardwareData.status.metadata: doing this is much simpler. On the other hand, there is no path to solution 1, where the hardware manager uses HardwareData directly as the source of metadata.

    Both sub-options mean copying data.

back to the HostClaim resource. With this information,
the data template controller has enough details to compute the different
secrets (userData, metaData and networkData) associated to the Metal3Machine.
Those secrets are linked to the HostClaim and, ultimately, to the
Member:

Note: they'll need to be copied over to the BMH namespace.

delete them when they disappear on the BareMetalHost.

Some specific annotations must be synchronized from the HostClaim to the
BareMetalHost (reboot, refresh inspection). A dedicated domain must be used.
Member:

Please provide details on these 2 paragraphs

Author:

This part may be modified following the evolution discussed during the metal3 team meetup.
If there is a read-only resource visible alongside the BMH that holds the hardware details, it should also hold the labels and annotations that can be used to build templates (i.e., all the extra information provided by hardware managers rather than discovered through inspection).
That leaves the life-cycle annotations controlling reboot. We can refine the mechanism so that both the hardware team and the user have a way to shut down a provisioned host for as long as they need (at least, the end user should not be able to cancel a shutdown imposed by the hardware team): reboot.metal3.io/{key} at the level of HostClaims could become reboot.metal3.io/hostclaim-{key} on the BareMetalHost. Refreshing inspection is not within the scope of the cluster user; I will remove the mention.


Choosing between HostClaims and BareMetalHost is done at the level of
the Metal3Machine controller through a configuration flag. When the HostClaim
mode is activated, all clusters are deployed with HostClaim resources.
Member:

Is there any reason not to use HostClaims always?

Author:

In theory, no! But this only holds if the implementation provides exact feature parity from day 1 and there is a clear path to introduce the intermediate HostClaim on CAPM3 deployments that were not using it. Pivot with HostClaims is also still a strange process: it works well if the HostClaims and BareMetalHosts are all in the same namespace; otherwise, pivoting transfers control of the deployed cluster between two management clusters, but not control of the underlying servers.


For the server administrator, the solution is to enforce that BareMetalHosts
that can be bound to a HostClaim carry the field ``hostClaimNamespaces``,
restricting authorized HostClaims to specific namespaces:
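The excerpt above stops where the proposal presumably shows the field; a hedged sketch of what such a BareMetalHost could look like (the field name comes from the text, the rest of the schema and all values are assumptions):

```yaml
# Sketch only: hostClaimNamespaces is the proposed field; its placement
# and type are not final.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0
  namespace: hardware-pool
spec:
  online: true
  hostClaimNamespaces:   # only HostClaims from these namespaces may bind
    - tenant-a
    - tenant-b
```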
Member:

This should be a new resource (e.g. HostDeployPolicy) for these reasons:

  1. To avoid further cluttering BMH
  2. Users who can edit policies are not necessarily users who own hardware
  3. We may want to make these resources world-viewable so that users know the policies that apply to them
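A purely illustrative sketch of the suggested resource; neither the HostDeployPolicy kind nor any field shown here exists yet, and every name is an assumption:

```yaml
# Hypothetical HostDeployPolicy: lives next to the BareMetalHosts it
# governs, owned by the policy admin rather than the hardware owner,
# and could be made world-viewable.
apiVersion: metal3.io/v1alpha1
kind: HostDeployPolicy
metadata:
  name: tenant-access
  namespace: hardware-pool
spec:
  hostSelector:                  # which BareMetalHosts the policy covers
    matchLabels:
      pool: general
  allowedHostClaimNamespaces:    # who may claim them
    - tenant-a
    - tenant-b
```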
