vz: reconnect socket_vmnet on bridge failure by loncharles · Pull Request #5056 · lima-vm/lima

loncharles · 2026-05-27T18:45:44Z

Summary

Add reconnect logic to VZ network forwarding for socket_vmnet connection failures
On error, the socket_vmnet connection is re-established with exponential backoff (100ms–30s) while keeping the VZ DGRAM socketpair alive
Replaces io.Copy with copyPackets() that distinguishes read vs write errors for fatal/recoverable classification

Motivation

When the socket_vmnet daemon restarts or the UNIX socket connection breaks, forwardPackets() exits permanently and VM networking is dead until a full VM restart. The VZ-side socketpair (bound to VZFileHandleNetworkDeviceAttachment at VM creation) cannot be replaced, but the socket_vmnet connection is stateless and can be re-dialed.

Fixes #3020

Performance

Tested on macOS 26 (Apple M1, VZ backend, 1GbE NIC). 3-run averages for each variant:

	Baseline (2.1.1)	Patched (32K buf)	Patched (64K buf)
Throughput (VM → LAN host)	944 Mbits/sec	944 Mbits/sec	944 Mbits/sec
Throughput (VM → macOS host, loopback)	2.16 Gbits/sec	2.18 Gbits/sec	2.15 Gbits/sec
limactl CPU during iperf3 (VM → LAN)	~40%	~41%	~42%
limactl CPU during iperf3 (loopback)	~145%	~145%	~140%
Ping latency (100 pings)	0.67ms	0.86ms	0.98ms

No measurable throughput or CPU impact. Buffer size (32K matching io.Copy vs 64K) makes no difference on DGRAM sockets where each read returns one packet.

Reconnect behavior

Tested by killing socket_vmnet (SIGKILL) and immediately restarting:

Lima detects connection loss within one packet relay cycle
Exponential backoff: 100ms → 200ms → 400ms → ... → 30s max
socket_vmnet handles stale socket files on restart (no manual cleanup needed)
With immediate daemon restart: zero packet loss, reconnect in <100ms
With delayed restart: packets buffer in socketpair kernel buffer (4MB), full recovery when daemon returns
Docker containers and k3s agent survive the outage

Goroutine teardown tradeoff

Closing the socket_vmnet connection unblocks the goroutine doing socket_vmnet I/O immediately. The goroutine blocked on vzConn.Read() waits until the guest sends its next packet (ARP, NDP, DHCP — typically within seconds). SetReadDeadline was considered but rejected: arming/disarming deadlines on the DGRAM socket across reconnect cycles adds complexity and risks false-positive timeouts during normal forwarding.

Test plan

Unit tests pass (TestDialQemu, TestForwardPacketsReconnect)
TestForwardPacketsReconnect: kills fake server, verifies reconnect to new server, packets resume
No regression in normal operation (ping, iperf3, docker, k3s)
Throughput benchmarked before/after (iperf3, 3 runs averaged, 3 variants)
Reconnect tested: SIGKILL socket_vmnet + immediate restart (2 runs, 0% packet loss)
Reconnect tested: SIGKILL socket_vmnet + delayed restart (backoff to 30s, full recovery)
Docker containers survive outage
k3s agent survives outage
Soak testing on production host

Add reconnect logic to VZ network forwarding so that when the socket_vmnet connection breaks, Lima re-establishes it without requiring a VM restart. The VZ-side DGRAM socketpair is kept alive across reconnects; only the socket_vmnet stream connection is replaced. On error, forwardUntilError() closes the socket_vmnet connection to force both copy goroutines to exit, then returns whether the failure was on the socket_vmnet side (recoverable) or VZ side (fatal). A custom copyPackets() replaces io.Copy to distinguish read errors from write errors, which io.Copy does not expose. The outer loop re-dials with exponential backoff (100ms initial, 2x, 30s max). Goroutine teardown: closing the socket_vmnet connection unblocks one goroutine immediately; the other waits on vzConn.Read() until the guest sends its next packet. SetReadDeadline was considered but rejected to avoid false-positive timeouts during normal forwarding. In testing on macOS 26 (M1 Mac Mini, VZ, 1GbE NIC), the patch adds no measurable overhead to the forwarding path: Throughput (VM → LAN, 3 runs avg): 944 Mbits/sec (before and after) limactl CPU during iperf3: ~40% (before and after) Ping latency (100 pings): ~0.67ms (before and after) Reconnect tested by killing and immediately restarting socket_vmnet: zero packet loss with immediate restart, full recovery with delayed restart after exponential backoff reaches the daemon. Fixes lima-vm#3020 Signed-off-by: Lon C. Lundgren <lon@ocelot.net>

loncharles · 2026-05-27T18:47:20Z

@nirs would you mind reviewing this? It addresses the reconnect approach you described in #3020.

nirs

Nice!

nirs · 2026-05-27T18:58:13Z


+	const (
+		initialBackoff = 100 * time.Millisecond
+		maxBackoff     = 30 * time.Second


2 seconds should be good enough maximum. We do a very cheap reconnect attempt and checking every 2 seconds is negligible.

Done — maxBackoff is now 2s.

nirs · 2026-05-27T18:59:48Z

+			var err error
+			qemuConn, err = dialQemuConn(ctx, unixSock)
+			if err != nil {
+				logrus.WithError(err).Warn("Failed to reconnect to VMNET")


This is not really a warning but expected condition when we reconnect. A debug message is enough. We already logged a warning when the connection was broken.

Done — downgraded to Debug.

nirs · 2026-05-27T19:00:48Z

+	const (
+		initialBackoff = 100 * time.Millisecond
+		maxBackoff     = 30 * time.Second
+		backoffFactor  = 2.0


We can use integer.

Done — removed the float, using integer multiplication with an if-cap.

nirs · 2026-05-27T19:01:32Z

+			qemuConn, err = dialQemuConn(ctx, unixSock)
+			if err != nil {
+				logrus.WithError(err).Warn("Failed to reconnect to VMNET")
+				backoff = time.Duration(math.Min(float64(backoff)*backoffFactor, float64(maxBackoff)))


Why do we need floating point to compute the time?

Done — backoff *= 2 with if backoff > maxBackoff, removed math import.

nirs · 2026-05-27T19:05:40Z

+			case <-ctx.Done():
+				return
+			default:
+			}


I think this has the same effect:

if ctx.Err() != nil { return }

Done — using ctx.Err() != nil.

nirs · 2026-05-27T19:07:09Z

+			default:
+			}
+
+			logrus.Warnf("VMNET connection lost, reconnecting in %v", backoff)


Why we log here and not after getting an error from forwardUntilError()?

Done — moved to right after forwardUntilError returns, logged once per outage.

nirs · 2026-05-27T19:12:23Z

+// The copy loop is the same read-buf-write-buf loop io.Copy uses (neither
+// qemuPacketConn nor packetConn implements ReaderFrom/WriterTo).
+func copyPackets(dst io.Writer, src io.Reader) (readErr bool, err error) {
+	buf := make([]byte, 32*1024)


We write single packet (1514 bytes) to/from the datagram socket, so we don't need 32k buffer. We also write single size prefixed packets to socket_vmnet socket, so maximum size is 1518 bytes.

Done — added a maxPacketSize = 1518 constant and sized the buffer to it. Also added a bounds check in qemuPacketConn.Read to return an error instead of panicking on an oversized length prefix.

Using exact buffer size is little bit fragile is we did not calculate the size correctly, and we don't have the info in this layer. socket_vmnet get the maximum package size from vment but but does not expose this value. I think standard buffer size like 2k or 4k bytes will be better.

nirs · 2026-05-27T19:21:07Z

+		}
+		if er != nil {
+			return true, er
+		}


It will be more clear to handle errors first for read.

But It is not clear to me why we cannot use io.Copy, and why we need to handle read and write errors differently. The same copy loop is used in both directions, so read error may error reading from the datagram socket or error reading from the stream socket, and the same for the other direction.

I think we have 2 possible errors when socket_vment socket is closed:

read from unix stream socket returns 0 bytes or err != nil

write to unix stream socket return connection reset/connection refused

So error handling should not be in the copy loop but inside the unix connection.

We can implement the reconnect inside the unix connection: recoverable read or write errors will go into reconnect loop, completing the read/write then the connection is established. The copy loop from unix to datagram can remain the same.

Addressed the first seven comments and pushed a revision.

On moving reconnect into qemuPacketConn — the question is fair: why differentiate read and write errors at all? If the connection handles its own reconnect transparently, io.Copy works as-is and copyPackets goes away.

The issue is that reconnect requires coordinating both goroutines. When the stream socket breaks, the Read goroutine and Write goroutine both hit errors at the same time. With a connection-internal approach, both call into the reconnect path concurrently — you need a mutex or generation gate so only one dials while the other waits. But even with that, both goroutines need to restart their forwarding loops after the swap (the Read goroutine may have partial framing state from qemuPacketConn.Read's two-phase header+payload read). So the internal approach still needs to: signal both goroutines, wait for both to drain, swap the connection, restart both. That's the outer loop reimplemented inside the struct with more machinery, not less.

The current design keeps that lifecycle explicit: both goroutines exit on error, the single-threaded outer loop reconnects, fresh goroutines start with a clean connection. The read/write error distinction lets us avoid restarting when the failure is fatal (vzConn side).

We can reconsider the connection-internal route if you think the tradeoff is worth the added complexity.

Good point about synchronizing the copy goroutines, but we still need to tell the difference between recoverable errors and fatal errors, so the difference cannot be based on the direction. I think the unix connection should detect and return new recoverable error and the copy loop can try to reconnect only if we get such errors.

Regarding synchronization, depending on getting traffic from guest is not great. It will be better to trigger the reconnect immediately on the first error without having to wait.

I think we will detect the error in 2 cases:

read error from unix socket - should fail immediately when the socket is closed. At this point.

write error to unix socket - should fail on the next write after the socket is closed. But when the socket is closed we may be blocked reading from the datagram socket.

If we handle connections inside the unix connection, Write() can block on a condition variable when we are in in a connection retry. Once connection is finished, the Write() can complete. Same for Read() - it can block on the same condition variable.

So we can have a third goroutine doing the reconnect and signaling the other goroutines when the connection is established.

Makes sense, I like this approach better than the initial suggestion. The condvar + third goroutine gives clean coordination without the racing issues. A few things I want to confirm before implementing:

1/ Fatal vs recoverable errors: With io.Copy, we lose visibility into which side produced the error. I'm thinking qemuPacketConn wraps its errors in a recoverable sentinel type, and anything else is treated as fatal (no reconnect). That way a vzConn failure doesn't trigger a spurious reconnect attempt.

2/ Reconnect goroutine lifecycle: Is this a long-lived goroutine that blocks waiting for a "reconnect needed" signal, or spawned on demand when the first error comes in? Either way it needs to respect context cancellation to unblock the condvar waiters on shutdown.

3/ Write atomicity: qemuPacketConn.Write does header + payload as two syscalls on the stream. The lock needs to be held across both so the connection isn't swapped between them. That means a reconnect blocks until any in-flight Write completes. Should be fine in practice (writes are fast, connection is local) but wanted to flag it.

4/ Per-packet locking: Read and Write will both need to acquire a lock on every call. This probably won't affect throughput at 1GbE but we can perf test on a box with a 5GbE NIC to measure the actual impact.

Will submit another update if we're aligned.

Makes sense, I like this approach better than the initial suggestion. The condvar + third goroutine gives clean coordination without the racing issues. A few things I want to confirm before implementing:

1/ Fatal vs recoverable errors: With io.Copy, we lose visibility into which side produced the error. I'm thinking qemuPacketConn wraps its errors in a recoverable sentinel type, and anything else is treated as fatal (no reconnect). That way a vzConn failure doesn't trigger a spurious reconnect attempt.

If the reconnect mechanism is inside the unix socket, or even better - in the underlying connection, the error remain internal to the reconnecting connection so io.Copy is not in the loop.

We can use this pipeline:

[ datagram conn ] <-> 2x io.Copy <-> [qemu con] <-> [reconnecting con] <-> unix con

We create the qemuPacketConn with a real Con - we can create it with a reconnectingCon - current code will not be affected. But since we depend on reading the size from the socket and only then reading the payload, I think we will have to keep the reconnect mahanism inside the packetCon, so we can drop a package if we get a size but the closed right after we got the size.

2/ Reconnect goroutine lifecycle: Is this a long-lived goroutine that blocks waiting for a "reconnect needed" signal, or spawned on demand when the first error comes in? Either way it needs to respect context cancellation to unblock the condvar waiters on shutdown.

Both options should work, long living grouting will probably be simpler. Loop waiting for reconnect request, reconnecting and signaling waiters.

3/ Write atomicity: qemuPacketConn.Write does header + payload as two syscalls on the stream. The lock needs to be held across both so the connection isn't swapped between them. That means a reconnect blocks until any in-flight Write completes. Should be fine in practice (writes are fast, connection is local) but wanted to flag it.

I don't think you need to hold a lock during the read/write, only to check if you can read or write. If the socket is closed the read/write will fail. Also you cannot hold the lock during read - read is blocking and can block for long time.

4/ Per-packet locking: Read and Write will both need to acquire a lock on every call. This probably won't affect throughput at 1GbE but we can perf test on a box with a 5GbE NIC to measure the actual impact.

socket_vment is pretty slow, maximum throughput is 3-4 Gbits/s - I don't think a lock will make a difference, but it will be easy to test with iperf3.

@nirs, revision is in if you want to take a look.

- Reduce maxBackoff from 30s to 2s (cheap reconnect attempt) - Replace float backoffFactor with integer multiplication - Use ctx.Err() instead of select/default for context check - Log connection loss once after forwardUntilError, not per retry - Downgrade retry failure log from Warn to Debug - Size relay buffer to maxPacketSize (1518) instead of 32KB - Add bounds check in qemuPacketConn.Read for oversized frames - Trim doc comments to match project style Signed-off-by: Lon C. Lundgren <lon@ocelot.net>

Restructure the VZ network reconnect so that qemuPacketConn handles reconnection internally. Read and Write snapshot the current connection via getConn(), perform I/O on the local copy, and on error signal a dedicated reconnect goroutine that re-dials socket_vmnet with exponential backoff. Callers (forwardPackets, io.Copy) never see recoverable errors. This replaces the previous forwardUntilError/copyPackets approach where the outer loop distinguished read vs write errors to classify fatal vs recoverable failures. The new design uses a generation counter and sync.Cond for coordination, with no per-packet locking on the hot path (getConn acquires the mutex only to read two fields). readPacket and writePacket are extracted as standalone functions, used by both production code and the test echo server. Also fixes resource leak in DialQemu error paths (conn not closed if createSockPair fails). Tested on macOS 26 (M1 Mac Mini, VZ, 1GbE NIC). 3-run averages, 18 CPU samples per metric: Throughput VM -> LAN: 940 Mbits/sec (baseline 944) Throughput VM -> host: 2.17 Gbits/sec (baseline 2.16) limactl CPU (LAN): 41.4% (baseline ~40%) limactl CPU (loopback): 139.9% (baseline ~145%) Ping latency: ~1.1ms (matches previous binary under same conditions) Reconnect tested: SIGKILL socket_vmnet, retries with backoff, full recovery on daemon restart, no VM restart needed. Signed-off-by: Lon C. Lundgren <lon@ocelot.net>

loncharles · 2026-05-28T04:45:30Z

Pushed a revision that moves the reconnect logic into qemuPacketConn per your suggestion. The outer loop and copyPackets are gone. forwardPackets is now just two io.Copy goroutines, and reconnection is transparent to callers.

Design:

qemuPacketConn.Read/Write snapshot the current connection via getConn() (mutex acquire to read two fields), do I/O on the local copy, and on error call waitReconnect() which signals a dedicated reconnect goroutine and blocks until a new connection is ready.
Generation counter prevents double-reconnect. Buffered(1) channel with non-blocking send deduplicates concurrent signals from Read and Write goroutines.
A broadcast-on-context-cancel goroutine ensures condvar waiters always wake on shutdown.
readPacket/writePacket extracted as standalone functions so the test echo server can use them without constructing a full qemuPacketConn.

Tested on macOS 26 (M1 Mac Mini, VZ, 1GbE NIC). 3-run averages, 18 CPU samples per metric:

	Baseline (2.1.1)	Refactored
Throughput VM -> LAN	944 Mbits/sec	940 Mbits/sec
Throughput VM -> host (loopback)	2.16 Gbits/sec	2.17 Gbits/sec
limactl CPU (LAN)	~40%	41.4% (40.8-42.3)
limactl CPU (loopback)	~145%	139.9% (136.2-142.2)

Reconnect tested by killing socket_vmnet (SIGKILL). Backoff retries at 2s cap, full recovery on daemon restart, no VM restart needed.

Signed-off-by: Lon C. Lundgren <lon@ocelot.net>

nirs reviewed May 27, 2026

View reviewed changes

loncharles added 2 commits May 27, 2026 12:52

vz: fix errcheck lint for io.Copy in forwardPackets

5f2fa6a

Signed-off-by: Lon C. Lundgren <lon@ocelot.net>

Conversation

loncharles commented May 27, 2026

Summary

Motivation

Performance

Reconnect behavior

Goroutine teardown tradeoff

Test plan

Uh oh!

loncharles commented May 27, 2026

Uh oh!

nirs left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

loncharles commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants