Commit bea0f2e

Define Linux Network Devices
The proposed "netDevices" field provides a declarative way to specify which host network devices should be moved into a container's network namespace.

This approach is similar to the existing "devices" field used for block devices, but it uses a dictionary keyed by the interface name instead.

The proposed scheme is based on the existing representation of network devices by `struct net_device` (https://docs.kernel.org/networking/netdevices.html).

This proposal focuses solely on moving existing network devices into the container namespace. It does not cover the complexities of network configuration or network interface creation, emphasizing the separation of device management and network configuration.

Signed-off-by: Antonio Ojea <[email protected]>
1 parent ea38318 commit bea0f2e

10 files changed: 192 additions, 0 deletions

Diff for: config-linux.md (+109 lines)

@@ -189,6 +189,112 @@ In addition to any devices configured with this setting, the runtime MUST also s
 * [`/dev/ptmx`][pts.4].
   A [bind-mount or symlink of the container's `/dev/pts/ptmx`][devpts].
 
+## <a name="configLinuxNetworkDevices" />Network Devices
+
+Linux network devices are entities that send and receive data packets. They are
+not represented as files in the `/dev` directory. Instead, they are represented
+by the [`net_device`][net_device] data structure in the Linux kernel. Network
+devices can belong to only one network namespace and use a set of operations
+distinct from regular file operations. Network devices can be categorized as
+**physical** or **virtual**:
+
+* **Physical network devices** correspond to hardware interfaces, such as
+  Ethernet cards (e.g., `eth0`, `enp0s3`). They are directly associated with
+  physical network hardware.
+* **Virtual network devices** are software-defined interfaces, such as loopback
+  devices (`lo`), virtual Ethernet pairs (`veth`), bridges (`br0`), VLANs, and
+  MACVLANs. They are created and managed by the kernel and do not correspond
+  to physical hardware.
+
+This schema focuses solely on moving existing network devices identified by name
+from the host network namespace into the container network namespace. It does
+not cover the complexities of network device creation or network configuration,
+such as IP address assignment, routing, and DNS setup.
+
+**`netDevices`** (object, OPTIONAL) - A set of network devices that MUST be made
+available in the container. The runtime is responsible for moving these devices;
+the underlying mechanism is implementation-defined.
+
+The name of the network device is the entry key. Entry values are objects with
+the following properties:
+
+* **`name`** *(string, OPTIONAL)* - the name of the network device inside the
+  container namespace. If not specified, the host name is used.
+
+The runtime MUST check if moving the network interface to the container
+namespace is possible. If a network device with the specified name already
+exists in the container namespace, the runtime MUST [generate an error](runtime.md#errors),
+unless the user has provided a template by appending
+`%d` to the new name. In that case, the runtime MUST allow the move, and the
+kernel will generate a unique name for the interface within the container's
+network namespace.
+
+The runtime MUST preserve the existing network interface attributes, as defined
+by the kernel, including IP addresses, enabling users to preconfigure the
+interfaces.
+
+The runtime MUST set the network device state to "up" after moving it to the
+network namespace to allow the container to send and receive network traffic
+through that device.
+
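The naming rules in this section can be sketched in Go. This is a minimal illustration under stated assumptions, not a reference implementation: `resolveName` and `checkMove` are hypothetical helpers, `NetDevice` is a local stand-in for an entry value, and `existing` stands in for the set of interface names already present in the container's network namespace. The actual move and the "up" transition are implementation-defined and left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// NetDevice is a local stand-in for a netDevices entry value.
type NetDevice struct {
	Name string `json:"name,omitempty"`
}

// errNameExists is a hypothetical error for the conflict case.
var errNameExists = errors.New("network device name already exists in container namespace")

// resolveName returns the name requested inside the container: the "name"
// property if it is set, otherwise the host name (the entry key).
func resolveName(hostName string, dev NetDevice) string {
	if dev.Name != "" {
		return dev.Name
	}
	return hostName
}

// checkMove applies the conflict rule: a name that already exists is an
// error unless the requested name ends in the "%d" template, in which case
// the kernel is expected to pick a unique name.
func checkMove(target string, existing map[string]bool) error {
	if strings.HasSuffix(target, "%d") {
		return nil
	}
	if existing[target] {
		return fmt.Errorf("%q: %w", target, errNameExists)
	}
	return nil
}

func main() {
	existing := map[string]bool{"lo": true, "eth0": true}

	fmt.Println(resolveName("ens4", NetDevice{}))                       // ens4
	fmt.Println(resolveName("eth0", NetDevice{Name: "container_eth0"})) // container_eth0
	fmt.Println(checkMove("eth0", existing) != nil)                     // true
	fmt.Println(checkMove("eth%d", existing) == nil)                    // true
}
```

Keeping name resolution separate from the conflict check means the check can run before any device is touched, so a failure leaves the host namespace unchanged.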
+### Namespace Lifecycle and Container Termination
+
+The runtime MUST NOT actively manage the interface's lifecycle and configuration
+*within* the container's network namespace. This is because network interfaces
+are inherently tied to the network namespace itself, and their lifecycle is
+therefore managed by the owner of the network namespace. Typically, this
+ownership and management are handled by higher-level container runtime
+orchestrators, rather than the processes running directly within the container.
+
+The runtime **MUST NOT** attempt to move the interface out of the namespace
+before deletion. This design decision is based on the following:
+
+* **Namespace Ownership:** Network interfaces are tied to the network namespace,
+  which may not always be directly managed by the runtime.
+* **Abrupt Termination:** Even when the runtime manages the namespace, it cannot
+  reliably participate in its deletion if the container's processes terminate
+  abruptly (e.g., due to a crash).
+
+During network namespace deletion, the kernel's built-in namespace cleanup
+mechanisms take over, as described in [network_namespaces(7)][net_namespaces.7]:
+"When a network namespace is freed (i.e., when the last process in the namespace
+terminates), its physical network devices are moved back to the initial network
+namespace." All physical network devices that can be migrated between network
+namespaces are moved to the default network namespace, while virtual devices
+(veth, macvlan, ...) are destroyed.
+
+If users require custom handling of interface lifecycle during namespace
+deletion, they can utilize existing features within the namespace orchestrator
+or employ post-stop hooks.
+
+**Physical Interface Renaming and Systemd**
+
+When a physical interface is renamed within a container and the container's
+network namespace is later deleted, the kernel will move the interface back to
+the root namespace with its renamed name. In case of a name conflict in the root
+namespace, the kernel will rename it to `dev%d`. To ensure predictable interface
+names in the root namespace, users can utilize systemd's `udevd` and `networkd`
+rules. Refer to [systemd Predictable Network Interface Names][predictable-network-interfaces-names]
+for more information on configuring predictable names.
+
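The teardown behaviour described in this section can be modelled in a few lines of Go. This is a toy simulation, not kernel code: the `device` type, its `physical` flag, and the `teardown` function are hypothetical, and choosing the lowest free index for `dev%d` is an assumption made for the example.

```go
package main

import "fmt"

// device is a hypothetical model of an interface in the container namespace.
type device struct {
	name     string
	physical bool
}

// teardown models network namespace deletion. Physical devices return to the
// root namespace under their current name, or under dev%d if that name is
// taken; virtual devices are destroyed. It records the returned names in root
// and reports them in order.
func teardown(root map[string]bool, devs []device) []string {
	var returned []string
	for _, d := range devs {
		if !d.physical {
			continue // virtual devices are destroyed with the namespace
		}
		name := d.name
		if root[name] {
			// Name conflict: fall back to the dev%d template.
			for i := 0; ; i++ {
				candidate := fmt.Sprintf("dev%d", i)
				if !root[candidate] {
					name = candidate
					break
				}
			}
		}
		root[name] = true
		returned = append(returned, name)
	}
	return returned
}

func main() {
	root := map[string]bool{"lo": true, "eth0": true}
	devs := []device{
		{name: "eth0", physical: true},           // conflicts with the host's eth0
		{name: "container_eth1", physical: true}, // keeps its renamed name
		{name: "veth0", physical: false},         // destroyed
	}
	fmt.Println(teardown(root, devs)) // [dev0 container_eth1]
}
```

The second device shows why the text recommends udev or networkd rules: the interface comes back as `container_eth1`, not under whatever name it had on the host before the move.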
+### Example
+
+#### Moving a device with a renamed interface inside the container:
+
+```json
+"netDevices": {
+    "eth0" : {
+        "name": "container_eth0"
+    }
+}
+```
 ## <a name="configLinuxControlGroups" />Control groups
 
 Also known as cgroups, they are used to restrict resource usage for a container and handle device access.
@@ -975,6 +1081,9 @@ subset of the available options.
 [mknod.1]: https://man7.org/linux/man-pages/man1/mknod.1.html
 [mknod.2]: https://man7.org/linux/man-pages/man2/mknod.2.html
 [namespaces.7_2]: https://man7.org/linux/man-pages/man7/namespaces.7.html
+[net_device]: https://docs.kernel.org/networking/netdevices.html
+[net_namespaces.7]: https://man7.org/linux/man-pages/man7/network_namespaces.7.html
+[predictable-network-interfaces-names]: https://systemd.io/PREDICTABLE_INTERFACE_NAMES
 [null.4]: https://man7.org/linux/man-pages/man4/null.4.html
 [personality.2]: https://man7.org/linux/man-pages/man2/personality.2.html
 [pts.4]: https://man7.org/linux/man-pages/man4/pts.4.html

Diff for: features-linux.md (+14 lines)

@@ -228,3 +228,17 @@ Irrelevant to the availability of Intel RDT on the host operating system.
   }
 }
 ```
+
+## <a name="linuxFeaturesNetDevices" />NetDevices
+
+**`netDevices`** (object, OPTIONAL) represents the runtime's implementation status of Linux network devices.
+
+* **`enabled`** (bool, OPTIONAL) represents whether the runtime supports the capability to move Linux network devices into the container's network namespace.
+
+### Example
+
+```json
+"netDevices": {
+  "enabled": true
+}
+```

Diff for: schema/config-linux.json (+6 lines)

@@ -9,6 +9,12 @@
                     "$ref": "defs-linux.json#/definitions/Device"
                 }
             },
+            "netDevices": {
+                "type": "object",
+                "additionalProperties": {
+                    "$ref": "defs-linux.json#/definitions/NetDevice"
+                }
+            },
             "uidMappings": {
                 "type": "array",
                 "items": {

Diff for: schema/defs-linux.json (+8 lines)

@@ -189,6 +189,14 @@
                 }
             }
         },
+        "NetDevice": {
+            "type": "object",
+            "properties": {
+                "name": {
+                    "type": "string"
+                }
+            }
+        },
         "weight": {
             "$ref": "defs.json#/definitions/uint16"
         },

Diff for: schema/features-linux.json (+8 lines)

@@ -110,6 +110,14 @@
                         }
                     }
                 }
+            },
+            "netDevices": {
+                "type": "object",
+                "properties": {
+                    "enabled": {
+                        "type": "boolean"
+                    }
+                }
             }
         }
     }

Diff for: schema/test/config/bad/linux-netdevice.json (+13 lines)

@@ -0,0 +1,13 @@
+{
+    "ociVersion": "1.0.0",
+    "root": {
+        "path": "rootfs"
+    },
+    "linux": {
+        "netDevices": {
+            "eth0": {
+                "name": 23
+            }
+        }
+    }
+}

Diff for: schema/test/config/good/linux-netdevice.json (+15 lines)

@@ -0,0 +1,15 @@
+{
+    "ociVersion": "1.0.0",
+    "root": {
+        "path": "rootfs"
+    },
+    "linux": {
+        "netDevices": {
+            "eth0": {
+                "name": "container_eth0"
+            },
+            "ens4": {},
+            "ens5": {}
+        }
+    }
+}
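To see why the two fixtures above are classified as bad and good, the following Go sketch decodes trimmed copies of them with `encoding/json`. It is only an illustration: the `spec` and `linux` wrapper types and the `parse` helper are assumptions made for the example, `LinuxNetDevice` is a local copy of the type this commit adds, and the schema tests themselves validate against the JSON Schema rather than Go types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LinuxNetDevice is a local copy of the type added in specs-go/config.go.
type LinuxNetDevice struct {
	Name string `json:"name,omitempty"`
}

// linux and spec are trimmed, hypothetical wrappers that keep only the
// fields needed to reach netDevices.
type linux struct {
	NetDevices map[string]LinuxNetDevice `json:"netDevices,omitempty"`
}

type spec struct {
	Linux *linux `json:"linux,omitempty"`
}

// parse decodes a config document and returns its netDevices map.
func parse(doc string) (map[string]LinuxNetDevice, error) {
	var s spec
	if err := json.Unmarshal([]byte(doc), &s); err != nil {
		return nil, err
	}
	if s.Linux == nil {
		return nil, nil
	}
	return s.Linux.NetDevices, nil
}

// Trimmed versions of the good and bad fixtures.
const good = `{"linux": {"netDevices": {"eth0": {"name": "container_eth0"}, "ens4": {}, "ens5": {}}}}`
const bad = `{"linux": {"netDevices": {"eth0": {"name": 23}}}}`

func main() {
	devs, err := parse(good)
	fmt.Println(len(devs), devs["eth0"].Name, err) // 3 container_eth0 <nil>

	_, err = parse(bad)
	fmt.Println(err != nil) // true: a JSON number cannot be decoded into a string field
}
```

In the good fixture, `ens4` and `ens5` decode to a zero-value `LinuxNetDevice`, which corresponds to keeping the host name inside the container.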

Diff for: schema/test/features/good/runc.json (+3 lines)

@@ -182,6 +182,9 @@
         },
         "selinux": {
             "enabled": true
+        },
+        "netDevices": {
+            "enabled": true
         }
     },
     "annotations": {

Diff for: specs-go/config.go (+8 lines)

@@ -236,6 +236,8 @@ type Linux struct {
 	Namespaces []LinuxNamespace `json:"namespaces,omitempty"`
 	// Devices are a list of device nodes that are created for the container
 	Devices []LinuxDevice `json:"devices,omitempty"`
+	// NetDevices are key-value pairs, keyed by network device name on the host, moved to the container's network namespace.
+	NetDevices map[string]LinuxNetDevice `json:"netDevices,omitempty"`
 	// Seccomp specifies the seccomp security settings for the container.
 	Seccomp *LinuxSeccomp `json:"seccomp,omitempty"`
 	// RootfsPropagation is the rootfs mount propagation mode for the container.
@@ -491,6 +493,12 @@ type LinuxDevice struct {
 	GID *uint32 `json:"gid,omitempty"`
 }
 
+// LinuxNetDevice represents a single network device to be added to the container's network namespace
+type LinuxNetDevice struct {
+	// Name of the device in the container namespace
+	Name string `json:"name,omitempty"`
+}
+
 // LinuxDeviceCgroup represents a device rule for the devices specified to
 // the device controller
 type LinuxDeviceCgroup struct {
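As a quick check of how the new field serializes, the sketch below builds a `netDevices` map in Go and marshals it. The types are local copies trimmed to the new field, and `encode` is a hypothetical helper, so treat this as an illustration of the `omitempty` behaviour rather than as code from the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LinuxNetDevice is a local copy of the type added in this diff.
type LinuxNetDevice struct {
	Name string `json:"name,omitempty"`
}

// Linux is trimmed to the single new field for the example.
type Linux struct {
	NetDevices map[string]LinuxNetDevice `json:"netDevices,omitempty"`
}

// encode is a hypothetical helper that renders the linux section as JSON.
func encode(l Linux) (string, error) {
	b, err := json.Marshal(l)
	return string(b), err
}

func main() {
	out, err := encode(Linux{NetDevices: map[string]LinuxNetDevice{
		"eth0": {Name: "container_eth0"}, // renamed inside the container
		"ens4": {},                       // keeps its host name
	}})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"netDevices":{"ens4":{},"eth0":{"name":"container_eth0"}}}
}
```

A device that keeps its host name serializes as an empty object, which matches the `"ens4": {}` form used in the good schema fixture.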

Diff for: specs-go/features/features.go (+8 lines)

@@ -48,6 +48,7 @@ type Linux struct {
 	Selinux         *Selinux         `json:"selinux,omitempty"`
 	IntelRdt        *IntelRdt        `json:"intelRdt,omitempty"`
 	MountExtensions *MountExtensions `json:"mountExtensions,omitempty"`
+	NetDevices      *NetDevices      `json:"netDevices,omitempty"`
 }
 
 // Cgroup represents the "cgroup" field.
@@ -143,3 +144,10 @@ type IDMap struct {
 	// Nil value means "unknown", not "false".
 	Enabled *bool `json:"enabled,omitempty"`
 }
+
+// NetDevices represents the "netDevices" field.
+type NetDevices struct {
+	// Enabled is true if network devices support is compiled in.
+	// Nil value means "unknown", not "false".
+	Enabled *bool `json:"enabled,omitempty"`
+}
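Because `Enabled` is a pointer, a consumer of the features document has three cases to handle rather than two. A minimal sketch, assuming a hypothetical `supportsNetDevices` helper and a local copy of the `NetDevices` type:

```go
package main

import "fmt"

// NetDevices is a local copy of the features type added in this diff.
type NetDevices struct {
	Enabled *bool `json:"enabled,omitempty"`
}

// supportsNetDevices is a hypothetical helper. It reports whether support is
// enabled and whether the runtime stated it at all: a nil struct or a nil
// Enabled pointer means "unknown", not "false".
func supportsNetDevices(nd *NetDevices) (supported, known bool) {
	if nd == nil || nd.Enabled == nil {
		return false, false
	}
	return *nd.Enabled, true
}

func main() {
	yes := true
	fmt.Println(supportsNetDevices(nil))                        // false false
	fmt.Println(supportsNetDevices(&NetDevices{}))              // false false
	fmt.Println(supportsNetDevices(&NetDevices{Enabled: &yes})) // true true
}
```

Returning a separate `known` flag keeps callers from treating a runtime that omits the field as one that explicitly lacks support.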
