
Define Linux Network Devices #1271

Open: aojea wants to merge 1 commit into main from network-devices

Conversation

aojea

@aojea aojea commented Nov 7, 2024

The proposed "netdevices" field provides a declarative way to specify which host network devices should be moved into a container's network namespace.

This approach is similar to the existing "devices" field used for block devices but uses a dictionary keyed by the interface name instead.

The proposed scheme is based on the existing representation of a network device by `struct net_device`:
https://docs.kernel.org/networking/netdevices.html.

This proposal focuses solely on moving existing network devices into the container namespace. It does not cover the complexities of network configuration or network interface creation, emphasizing the separation of device management and network configuration.
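
For illustration, here is a minimal config.json sketch of the proposed field, assuming the dictionary-keyed-by-host-interface-name shape described above; the interface names are examples only:

    {
      "linux": {
        "netDevices": {
          "eth1": {
            "name": "net1"
          }
        }
      }
    }

In this sketch the host device eth1 would be moved into the container's network namespace and renamed to net1; an empty entry object would keep the host name.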

Real use cases that justify this proposal include:

  1. Pre-Configuring Physical Devices:

    • Scenario: A container requires a specific physical network interface with a complex IP configuration, RDMA or SR-IOV
    • Implementation:
      • Configure the physical interface on the host with the desired IP addresses, routing, and other settings. In Kubernetes this can be done with DRA or Device Plugins.
      • Use netDevices to move the pre-configured interface into the container.
  2. Creating and Moving Virtual Interfaces:

    • Scenario: A container needs to have its own unique MAC address on an existing physical network, without bridging.
    • Implementation:
      • Create the macvlan interface on the host, based on an existing physical interface.
      • Use netDevices to move the MACVLAN interface into the container.
  3. Network Function Containers:

    • Scenario: A container acts as a network router or firewall.
    • Implementation:
      • Use netDevices to move multiple physical or virtual interfaces into the container.
      • The container's processes manage the network configuration, routing, and firewall rules.

References

Fixes: #1239

@aojea
Author

aojea commented Nov 7, 2024

/assign @samuelkarp

@AkihiroSuda
Member

@aojea aojea force-pushed the network-devices branch 2 times, most recently from 51e5104 to 3a666eb on November 12, 2024 12:26
@aojea
Author

aojea commented Nov 12, 2024

https://github.com/opencontainers/runtime-spec/blob/main/features.md should be updated too

updated and addressed the comments

@aojea
Author

aojea commented Nov 12, 2024

Action item for @aojea: document the cleanup and destruction of the network interfaces.

@samuelkarp
Member

From the in-person discussion today:

  • Net device lifecycle should follow the network namespace lifecycle
  • @aojea will follow up to determine whether any cleanup actions need to be taken by the OCI runtime on a container being deleted
  • @kad was concerned about restarts and error handling
  • Should we prohibit the new netdev addition to an existing netns? IOW only allow this for containers where a new netns is created? What about containers where the root netns is used?

config-linux.md Outdated

This schema focuses solely on moving existing network devices identified by name into the container namespace. It does not cover the complexities of network device creation or network configuration, such as IP address assignment, routing, and DNS setup.

**`netDevices`** (object, OPTIONAL) set of network devices that MUST be available in the container. The runtime is responsible for providing these devices; the underlying mechanism is implementation-defined.
Member

This spec says "MUST", but I think a rootless container can't do this because it doesn't have CAP_NET_ADMIN, right?

Member

I'm not sure we should take care of the rootless container.

Member

Could be an error in the case of a rootless container, if the runtime is not able to satisfy the MUST condition.

Member

Could be an error in the case of a rootless container, if the runtime is not able to satisfy the MUST condition.

+1, but it'd be better to clarify it in the spec.

Author

Added more explanation about the runtime and network device lifecycle and the runtime checks, PTAL

@aojea
Author

aojea commented Nov 19, 2024

From the in-person discussion today:

  • Net device lifecycle should follow the network namespace lifecycle
  • @aojea will follow up to determine whether any cleanup actions need to be taken by the OCI runtime on a container being deleted
  • @kad was concerned about restarts and error handling
  • Should we prohibit the new netdev addition to an existing netns? IOW only allow this for containers where a new netns is created? What about containers where the root netns is used?

Pushed a new commit addressing those comments; the changelog is:

  • the network namespace lifecycle will move migratable network devices and destroy virtual devices; the runtime MAY decide to perform cleanup actions
  • the runtime MUST check that the container has enough privileges and an associated network namespace, and fail if the check fails
  • removed the Mask field and use the Address field with CIDR notation (IP/Prefix) to deal with both IPv4 and IPv6 addresses. Only one IP is allowed to be specified, on purpose, to simplify the operations and reduce risks
  • added a HardwareAddress field for use cases that require setting a specific MAC or InfiniBand address (see the sketch after this list)
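
For illustration only, a sketch of what an entry might have looked like at that revision, assuming the dictionary shape from the PR description; the exact field names and casing (address, hardwareAddress) are an assumption here, not taken from the diff:

    {
      "linux": {
        "netDevices": {
          "eth1": {
            "name": "net1",
            "address": "192.0.2.10/24",
            "hardwareAddress": "8a:14:59:c2:7e:01"
          }
        }
      }
    }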

@kad
Contributor

kad commented Feb 11, 2025

for things that are easily available via a few netlink API calls, I don't see any big reason why it would be too complex to implement in the runtime

The problem in my mind is where do you draw the line? At what point does "a few netlink API calls" become "a few too many" and now it's no longer acceptable?

My "border line" is about features and fields provided by kernel netlink api. If something is done with it, I don't see big difference between call "move interface xyz to namespace abc" vs. "set property foo to value bar on interface xyz in namespace abc". Or to what it worth, "put value 123 into cgroup foobar".
Things like DHCP that require more than plain kernel api call are clearly outside of the boundaries that can be implemented in runtime.

@kolyshkin
Contributor

for things that are easily available via a few netlink API calls, I don't see any big reason why it would be too complex to implement in the runtime

The problem in my mind is where do you draw the line? At what point does "a few netlink API calls" become "a few too many" and now it's no longer acceptable?

My "border line" is about the features and fields provided by the kernel netlink API. If something can be done with it, I don't see a big difference between the call "move interface xyz to namespace abc" vs. "set property foo to value bar on interface xyz in namespace abc". Or, for what it's worth, "put value 123 into cgroup foobar". Things like DHCP that require more than a plain kernel API call are clearly outside the boundaries of what can be implemented in a runtime.

To me, this sounds like a runtime should implement most of what ip(8) is capable of. Which is a lot.

If that's the case, here's a crazy idea.

Considering that ip(8) is a de facto standard on Linux, perhaps such configuration can be achieved by supplying a sequence of ip command arguments in runtime-spec. A runtime can then use the ip binary on the host to get to the needed configuration.

Something like

"linux": {
        "netDevices": [
            "name": "enp34s0u2u1u2",
            "ct_name": "eth0",
            "config": [
                "ip addr add 10.2.3.4/25 dev $IF",
                "ip link set $IF mtu 1350",
                "ip route add 10.10.0.0/16 via 10.2.3.99 dev $IF proto static metric 100"
                "ip route add default dev $IF"
            ]
        ]
    }

This would result in the runtime running the host's ip binary in the container's network namespace (after moving the device into the container's namespace and giving it a new name). It could even be a single ip invocation if -b (batch mode) is supported.

The upsides are:

  • this uses a well known, de facto standard "domain language" for network-related configuration;
  • this allows for a quite complex configuration to be performed (limited to whatever ip understands);
  • this seems trivial to implement for any runtime;
  • this doesn't impose any specific container image requirements (no need for any extra binaries etc.);
  • this should be secure (as ip is running inside a netns);
  • some advanced runtimes can implement a subset of this internally (i.e. without the call to ip);

The biggest downside, I guess, is the scope is not well defined, as ip is evolving.

@aojea
Author

aojea commented Feb 12, 2025

I don't like the idea of exec'ing commands, especially in Go implementations, which are problematic with goroutines and namespaces.
I think that once we have just the network interfaces in the runtimes, we'll get much better feedback and signal from the implementations to evaluate the next steps and to know whether something else is needed and what exactly.

I still think that being able to prevent failures before runtime is desirable. You can argue that you should not add duplicates, but bugs exist, and if somebody accidentally adds a duplicate interface name it will cause a problem in production that we could have avoided by simply not allowing it to be defined.

@kad
Contributor

kad commented Feb 14, 2025

Considering that ip(8) is a de facto standard on Linux, perhaps such configuration can be achieved by supplying a sequence of ip command arguments in runtime-spec. A runtime can then use the ip binary on the host to get to the needed configuration.

I believe this is a really bad idea compared to having a limited set of config options implemented. exec() is always slower and less reliable than calling native kernel APIs. It opens the door to various injection vulnerabilities and other potential misuses and scope creep compared to strict usage of an API subset. And, as Antonio mentioned, proper error handling.

To provide more insight into our use case, here is a snippet of the code that we want to get rid of, currently implemented as an OCI runtime hook (btw, it already includes an unreliable call to ip): https://github.com/HabanaAI/habana-container-runtime/blob/main/cmd/habana-container-cli/netdevices.go#L131-L182

@AkihiroSuda
Member

Let me move the milestone from vNext (v1.2.1?) to vNextNext (v1.3.0?)

@aojea
Author

aojea commented Feb 20, 2025

Do you have an approximate estimate of how long 1.3.0 will take? I have some dependencies on this feature and it would be nice to be able to account for that time.

@samuelkarp
Member

@AkihiroSuda It looks like 1.2.1 got tagged yesterday: https://github.com/opencontainers/runtime-spec/releases/tag/v1.2.1. Is there anything blocking the merge of this PR into main and setting it up for the 1.3.0 release?

@tianon
Member

tianon commented Mar 1, 2025

I still don't think it's appropriate to expect the runtime to set up the network interfaces and believe we're opening a can of worms here, but not strongly enough to block it outright (especially with Kir approving it, and thus the implied maintenance of it in runc). ❤️

@aojea
Author

aojea commented Mar 1, 2025

I still don't think it's appropriate to expect the runtime to set up the network interfaces and believe we're opening a can of worms here,

I don't see a risk as long as we stick to network interfaces; the moment we leak networks, we start to create runtime dependencies, and then I fully agree with you. That is also the reason why the last proposal removed the IP initialization bits entirely (#1271 (comment)). I have some ideas on how to solve this problem without modifying the spec, but my priority now is to solve the existing problem in the ecosystem.

The problem this solves is that today there is hardware that needs to use network interfaces, mainly GPUs (but there are other cases). Since there is no way to declaratively move these interfaces to the container, everyone solves the problem from a different angle, by hooking into the Pod/Container network namespace creation using:

  • CNI plugins
  • OCI hooks
  • out of band channels using the Kubernetes API directly
  • ...

All these solutions are brittle, require high privileges that expose security problems, and are hard to troubleshoot since they run in different places during container creation, causing fragmentation in the ecosystem and a bad user experience.

The goal here is that developers can just patch the OCI spec to say "move eth2 into this container along with /dev/gpu0"; this is simple to do with CDI and NRI and needs no extra privileges (an illustrative sketch of such a patched config follows) ... I personally don't feel that, as long as we stick to this definition, we will open that Pandora's box that is the network (which none of us want).
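
To make that concrete, here is a minimal sketch (not taken from the PR) of the OCI config fragment a CDI/NRI plugin might produce, combining the existing linux.devices field with the proposed netDevices; the device numbers and interface name are illustrative assumptions:

    {
      "linux": {
        "devices": [
          {
            "path": "/dev/gpu0",
            "type": "c",
            "major": 195,
            "minor": 0
          }
        ],
        "netDevices": {
          "eth2": {}
        }
      }
    }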

Member

@rata rata left a comment

I'm just taking a look, I tried to go through all the 137 comments. But sorry if I'm missing something.

I guess most concerns are solved now that there is no reference to the IP address and friends? Is there any unaddressed concern still?

The idea LGTM. I've also quickly checked the runc implementation; it seemed clean and nothing caught my attention (like no weird magic with IPs or anything, that is just not touched at all).

@aojea I'm not familiar with this, but how does this work in practice? Is it expected that the host network interface will be configured by the host (i.e. IP address, MTU, etc.) and then moved into the container? Does all of the configuration "just work" when moved into the container? Or will CNI or some other entity need to do something?

config-linux.md Outdated
The name of the network device is the entry key.
Entry values are objects with the following properties:

* **`name`** *(string, OPTIONAL)* - the name of the network device inside the container namespace. If not specified, the host name is used. The network device name is unique per network namespace; if a network device with the same name already exists, the rename operation will fail. The runtime MAY check that the name is unique before the rename operation.
Member

Just curious, as I'm not very familiar with NRI and I don't know if this concern makes sense, please let me know. How can NRI plugins using this decide on the container interface name to use? I mean, choose one that won't clash with the names set by other plugins? Can they see what has been done so far by previous plugins? Or is this not an issue at all (in that case, can you explain briefly why? I'm curious :))

Author

In Kubernetes, for both main runtimes, containerd and CRI-O, the name of the interface inside the container is always eth0, so for 95% of the cases in Kubernetes the problem is easy to solve.
There are cases where people add additional interfaces with out-of-band mechanisms as in #1271 (comment); in that case, there are several options:

  • add a randomly generated name with enough entropy
  • inspect the network namespace and check for duplicates
  • fail with a name collision error

Member

Exactly, but you can't inspect the netns because it hasn't been created yet. So, how can those tools, before choosing a name for the interface inside the container, check which names were used by others? E.g. if NRI has several plugins and more than one adds an interface, how can the second plugin know eth1 is added and avoid using that name?

A randomly generated name would be an option, but it would be nice to understand whether that is needed or whether people can just choose names that avoid collisions.

Author

In Kubernetes the network namespace is created by the runtime and there will be only an eth0 interface.
If there are more interfaces, it is because some component is adding them via an out-of-band process, which will have exactly the same problem. This works today because cluster administrators only set up one component to add additional interfaces.

This reinforces my point in #1271 (comment): using a well-defined specification will help multiple implementations synchronize, and we need this primitive to standardize these behaviors and to build higher-level APIs ... we are already doing it for Network Status (kubernetes/enhancements#4817), and we need to do it for configuration based on this.

Member

@rata rata Mar 4, 2025

I feel we are talking about different things. Let's assume this PR is in, implemented, etc. How does an NRI plugin choose a network interface name that does not collide with a network interface added by another plugin?

Author

I'm not into the internal details of CDI or NRI, but I think those modify the OCI spec, so any plugin will be able to check in the OCI spec the transformations of all the other plugins, including the interface names.

Member

Right, it could work fine. However, if we don't allow interface renames we can just forget about these problems too.

@aojea
Author

aojea commented Mar 3, 2025

@aojea I'm not familiar with this, but how does this work in practice? Is it expected that the host network interface will be configured by the host (i.e. IP address, MTU, etc.) and then moved into the container? Does all of the configuration "just work" when moved into the container, or will CNI or some other entity need to do something?

The goal is to decouple the interface lifecycle and configuration from the OCI runtime; that is the part that SHOULD NOT be handled by the OCI runtimes or other actors in the pod/container creation. I think there are two scenarios:

  • the interface admin preconfigures the interface before moving it, and runc moves it as is. This is already the case most of the time, since interfaces are usually part of the host. For cases where interfaces are hotpluggable or dynamically created, the existing preprovisioning hooks in DRA NodePrepareResources can do that; then it is just a matter of telling the OCI runtime to move it inside the container.
  • the container application configures the interface. We are already doing this with VMs (see cloud-init and network configuration); there are also use cases where the interface is going to be consumed directly by the application and exposed as configurable to the users, see telco use cases that run router- or firewall-like apps in the container.

@rata
Member

rata commented Mar 3, 2025

Okay, we were talking over slack and there are two things that we think we still need to answer:

  1. Is the configuration (address, mtu, etc.) lost when moved to the container netns?
  2. For the cases that we need to configure the interface, how can we expect the configuration to happen?

I ask 2 for two reasons: a) to be clear that it is the concern of another part of the stack; b) to understand how this will be used and whether anything else is missing here (i.e. if CNI is expected to handle it, make sure it is not running too late and that it has all the info to realize there is an extra interface to configure; if another component is expected to handle it, see that it can, etc.)

@aojea
Author

aojea commented Mar 3, 2025

Thanks @rata for your help today, I updated the PR to implement it in runc with integration tests that show the behavior

  1. Is the configuration (address, mtu, etc.) lost when moved to the container netns?
  2. For the cases that we need to configure the interface, how can we expect the configuration to happen?

The interface configuration is preserved, so users can set the interface down in the host namespace, configure it (IP address, MTU, hardware address), and the runtime will move it to the network namespace maintaining that configuration. This removes the need to include network configuration in the runtime and allows implementations to use the preparation of the device to configure it without risks (the Kubernetes use case).

Users can still decide to use a process inside the container to handle the network configuration, use DHCP, or use some sort of bootstrap à la cloud-init.

@aojea aojea force-pushed the network-devices branch from 0ce07ae to af8cc6f on March 3, 2025 23:21
@aojea
Author

aojea commented Mar 3, 2025

rebased and added this last requirement to preserve the network config

diff --git a/config-linux.md b/config-linux.md
index 6682e16..1f0e808 100644
--- a/config-linux.md
+++ b/config-linux.md
@@ -201,6 +201,8 @@ This schema focuses solely on moving existing network devices identified by name
 
 The runtime MUST check that it is possible to move the network interface to the container namespace and MUST [generate an error](runtime.md#errors) if the check fails.
 
+The runtime MUST preserve the existing network interface attributes, like MTU, MAC and IP addresses, enabling users to preconfigure the interfaces.
+
 The runtime MUST set the network device state to "up" after moving it to the network namespace to allow the container to send and receive network traffic through that device.
 
 

@pfl

pfl commented Mar 4, 2025

the interface configuration is preserved, so users can set down the interface in the host namespace, configure the interface (ip address, mtu, hw address)

Link attributes are preserved, but not IP addresses or routes:

    ip netns add dummy &&
    ip link add dev dummy0 mtu 8000 type dummy &&
    ip addr add 192.168.200.10/24 dev dummy0 &&
    ip link set dev dummy0 netns dummy &&
    ip netns exec dummy ip addr show dev dummy0

gives, unless there is some address-preserving option I now totally missed:

    1: dummy0: <BROADCAST,NOARP> mtu 8000 qdisc noop state DOWN group default qlen 1000
        link/ether ae:3a:ec:00:d4:94 brd ff:ff:ff:ff:ff:ff

As IP address configuration needs to happen after network namespace creation, that leaves only runtime hooks, address configuration options in the spec, or running the IP configuration locally? Local IP configuration will have some timing issues, though.

@rata
Member

rata commented Mar 4, 2025

@pfl which timing issues do you mean?

@pfl

pfl commented Mar 4, 2025

@pfl which timing issues do you mean?

Just that the containerized IP configuration needs to happen before the workload runs, unless the workload properly waits for the interfaces to come online. I think the order is clear when there is a CNI setting up addresses, less so if this or another container needs to do the configuration more or less in parallel.

@aojea
Author

aojea commented Mar 5, 2025

Link attributes are preserved, but not IP addresses or routes,

That is the iproute2 implementation; I didn't find any place that indicates it has to happen that way, and you can see it working in opencontainers/runc#4538, so I think it is fair to implement it this way to solve the preprovisioning problem.

Just that the containerized IP configuration needs to happen before the workload runs, unless the workload properly waits for the interfaces to come online. I think the order is clear when there is a CNI setting up addresses, less so if this or another container needs to do the configuration more or less in parallel.

I think we are derailing the discussion; this is about network interfaces, not IP configuration of containers, which already works today with CNI or libnetwork and is completely orthogonal to this.

The applications that need additional network interfaces fall into two groups, but let's keep in mind that the container already has an interface and an IP:

  • the interface needs to retain the attributes it had before the container is created; hosts and virtual machines already have the interfaces configured and attached. This is the particular case of AI/ML workloads, at least in all the scenarios I've seen
  • the application consumes the interfaces entirely and will configure them later; this is the particular case of telco and router applications that are containerized

@aojea
Author

aojea commented Mar 5, 2025

Let me share the big picture: https://docs.google.com/presentation/d/16eT_EYVbm75UvqKVg8L55VtRuGJ1_dr463OpbrRN2gg/edit?usp=sharing. CNI does one job and does it well, but it struggles to handle more complex scenarios that require a more declarative approach and that will allow high-level APIs to build this complexity of network configuration.
The runtime just needs to move network interfaces, nothing else, and not worry about the applications; that is handled at another level, since these are always additional interfaces, never related to the default interface and network configuration.

@pfl

pfl commented Mar 7, 2025

Fair enough, PR #4538 indeed reads IP addresses and writes them to the interface in the new namespace, very much differently to iproute2.

The runtime just needs to move network interfaces, nothing else, and not worry about the applications; that is handled at another level, since these are always additional interfaces, never related to the default interface and network configuration.

With IP addresses copied to the new network interface, couldn't the addresses be as easily expressed in the spec? They are after all set to a specific value as far as runc is concerned. When DRA and NRI are involved as in the big picture presentation, doesn't the spec at that point need IP address information for the NRI plugin to set some of them?

@aojea
Author

aojea commented Mar 7, 2025

With IP addresses copied to the new network interface, couldn't the addresses be as easily expressed in the spec? They are after all set to a specific value as far as runc is concerned

There is a reasonable concern about scope creep and about this becoming a magnet for "network things":
#1271 (comment)

When DRA and NRI are involved as in the big picture presentation, doesn't the spec at that point need IP address information for the NRI plugin to set some of them?

With DRA and NRI or CDI, there is a "driver" entity. The "driver" may receive the network details via the Kubernetes APIs or decide to use its own; this driver will preprovision the interface (any sort of operation: if it is virtual, create it; if it is a VLAN or an SR-IOV VF, ... whatever, and in this step it can also apply the network configuration). The Kubernetes kubelet will stop at this preprovisioning hook before sending the data to the container runtime. So at this point we just need to modify the spec to attach the preprovisioned interface (see the sketch below) ... Bear in mind that there are also applications that can configure the IPs, routes and everything else directly in CNI, NRI or OCI hooks; that is still another possibility. So this is a very good starting point that unblocks 90% of the problems we have today ...
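
As a minimal sketch (the SR-IOV VF name is purely illustrative, and the dictionary shape is the one proposed in this PR), the spec modification to attach a preprovisioned interface while keeping its host name could be as small as:

    {
      "linux": {
        "netDevices": {
          "enp7s0f0v3": {}
        }
      }
    }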

Member

@rata rata left a comment

I think this looks mostly fine; I added some comments, mostly on the interface rename. I think that is the part that needs a little bit of ironing out (or I need to understand better why it is not a problem, maybe I'm missing something :))

You mention the scenario of the runtime not participating in the container cleanup. In that case, I wonder what would happen if:
scenario A:

  • host has two interfaces: rata and antonio.
  • Container is created and interface rata is moved to the container
  • The interface name in the container is antonio.
  • The container crashes and the kernel moves the interface back into the host network namespace. What will it be named?

scenario B:

  • host has two interfaces: rata and antonio.
  • Container is created and interface rata is moved to the container
  • The interface name in the container is eth2.
  • The container crashes and the kernel moves the interface back into the host network namespace. Now the host has: antonio, eth2.
  • If the container is created again, how will it recognize that the interface it is interested in is now called eth2?

I think if we can find a way out for these cases, this looks good to go for me. Well, these scenarios and their combinations (like scenario A: how does a new container find that the interface it is looking for is now called antonioX, and not rata, if the kernel did that rename to avoid clashes?).

One option might be the alias that I suggested inline (and just require that interfaces to be moved are not named eth0 on the node); another might be to not support renames at all. But if those scenarios are not good, there might be some other way out. Maybe I'm thinking something is a problem when it isn't.

@aojea Let me know what you think or what you can find out of what the kernel does for scenario A :)

config-linux.md Outdated

The runtime MUST check that it is possible to move the network interface to the container namespace and MUST [generate an error](runtime.md#errors) if the check fails.

The runtime MUST preserve the existing network interface attributes, like MTU, MAC and IP addresses, enabling users to preconfigure the interfaces.
Member

@rata rata Mar 17, 2025

I think the wording is a little vague. I'd have an exhaustive list of things the runtime must preserve, to avoid different runtimes doing something different by mistake.

Also, if the kernel preserves the MTU and MAC already, I'd just remove those things? What do others think?

The kernel has quite a strict policy against breaking user space, so I don't think this behavior will change; we can depend on it. It doesn't seem wrong to depend on it either, IMHO. And I don't think there is any way other than netlink to change the namespace of an iface, so it seems all runtimes will use that.

Author

agree, need to change this

config-linux.md Outdated
Entry values are objects with the following properties:

* **`name`** *(string, OPTIONAL)* - the name of the network device inside the container namespace. If not specified, the host name is used. The network device name is unique per network namespace; if a network device with the same name already exists, the rename operation will fail. The runtime MAY check that the name is unique before the rename operation.
The runtime, when participating in the container termination, must revert to the original name to guarantee the idempotence of operations, so a container that moves an interface and renames it can be created and destroyed multiple times with the same result.
Member

Right, so if the container crashes this idempotency will be broken, right? Like if a container crashes and is later created again on this node, it will fail (the node interface is not named as expected). Right?

Would it be a problem if we didn't allow interface renames, to avoid this issue?

If not supporting that is an issue, maybe we can add aliases? With ip we can do it like this:

ip link property add dev <device_name> altname rata

Will an alias be enough?

In the worst case we will be leaking aliases; I'm not sure whether that can have undesired effects, like learning who used this before from the iface name or something.

IMHO, if we are not sure we need to change the interface name, my preference would be to not support changing the name at all for now, we can add the alias or something else later if needed.

Cases where the host and the container default interface have the same name are an issue, but we can easily change the host interface name. And for ages the default has not been eth0, so it seems this idea could fly.

What do you think?

Author

The kernel implemented "move and rename" in the same operation because name conflicts are a known issue in containerized environments, especially with systemd rules and more complex stuff; see the related discussion at https://lore.kernel.org/r/netdev/[email protected]/T/. There are also comments about altname being more problematic, since it increases the risk of collision; see https://gist.github.com/aojea/a5371456177ae85765714fd52db55fdf


@aojea aojea force-pushed the network-devices branch from af8cc6f to bea0f2e on March 22, 2025 16:24
The proposed "netdevices" field provides a declarative way to
specify which host network devices should be moved into a container's
network namespace.

This approach is similar to the existing "devices" field used for block
devices but uses a dictionary keyed by the interface name instead.

The proposed scheme is based on the existing representation of a network
device by the `struct net_device`
https://docs.kernel.org/networking/netdevices.html.

This proposal focuses solely on moving existing network devices into
the container namespace. It does not cover the complexities of
network configuration or network interface creation, emphasizing the
separation of device management and network configuration.

Signed-off-by: Antonio Ojea <[email protected]>
@aojea aojea force-pushed the network-devices branch from bea0f2e to 208f77e on March 22, 2025 16:27
@aojea
Author

aojea commented Mar 22, 2025

Pushed the last proposal; the most important change is to not require the runtime to handle the interface lifecycle and not to return the interface during container deletion. This is based on the following two premises:

  • the namespace may not be owned by the runtime, and the interface is associated with the namespace and not the process
  • the runtime, even if it owns the namespace, will not be able to bring back the interface if the process crashes

I've added much more explanation of all the decisions in the text, as well as suggestions on how to handle certain situations, like interface renames or interactions with systemd.

You can find technical research that explains all the relations between namespaces and interfaces at https://gist.github.com/aojea/a5371456177ae85765714fd52db55fdf

I've also updated the runc PR in opencontainers/runc#4538 with the current proposal and e2e tests based on some real use cases that can serve as guidance for developers.

Real use cases that justify this proposal include:

  1. Pre-Configuring Physical Devices:

    • Scenario: A container requires a specific physical network interface with a complex IP configuration, RDMA or SR-IOV
    • Implementation:
      • Configure the physical interface on the host with the desired IP addresses, routing, and other settings. In Kubernetes this can be done with DRA or Device Plugins.
      • Use netDevices to move the pre-configured interface into the container.
  2. Creating and Moving Virtual Interfaces:

    • Scenario: A container needs to have its own unique MAC address on an existing physical network, without bridging.
    • Implementation:
      • Create the macvlan interface on the host, based on an existing physical interface.
      • Use netDevices to move the MACVLAN interface into the container.
  3. Network Function Containers:

    • Scenario: A container acts as a network router or firewall.
    • Implementation:
      • Use netDevices to move multiple physical or virtual interfaces into the container.
      • The container's processes manage the network configuration, routing, and firewall rules.

Successfully merging this pull request may close these issues.

Proposal: Network Devices