vz: reconnect socket_vmnet on bridge failure#5056
Conversation
Add reconnect logic to VZ network forwarding so that when the socket_vmnet connection breaks, Lima re-establishes it without requiring a VM restart. The VZ-side DGRAM socketpair is kept alive across reconnects; only the socket_vmnet stream connection is replaced. On error, forwardUntilError() closes the socket_vmnet connection to force both copy goroutines to exit, then returns whether the failure was on the socket_vmnet side (recoverable) or VZ side (fatal). A custom copyPackets() replaces io.Copy to distinguish read errors from write errors, which io.Copy does not expose. The outer loop re-dials with exponential backoff (100ms initial, 2x, 30s max). Goroutine teardown: closing the socket_vmnet connection unblocks one goroutine immediately; the other waits on vzConn.Read() until the guest sends its next packet. SetReadDeadline was considered but rejected to avoid false-positive timeouts during normal forwarding. In testing on macOS 26 (M1 Mac Mini, VZ, 1GbE NIC), the patch adds no measurable overhead to the forwarding path: Throughput (VM → LAN, 3 runs avg): 944 Mbits/sec (before and after) limactl CPU during iperf3: ~40% (before and after) Ping latency (100 pings): ~0.67ms (before and after) Reconnect tested by killing and immediately restarting socket_vmnet: zero packet loss with immediate restart, full recovery with delayed restart after exponential backoff reaches the daemon. Fixes lima-vm#3020 Signed-off-by: Lon C. Lundgren <lon@ocelot.net>
|
|
||
| const ( | ||
| initialBackoff = 100 * time.Millisecond | ||
| maxBackoff = 30 * time.Second |
There was a problem hiding this comment.
2 seconds should be good enough maximum. We do a very cheap reconnect attempt and checking every 2 seconds is negligible.
There was a problem hiding this comment.
Done — maxBackoff is now 2s.
| var err error | ||
| qemuConn, err = dialQemuConn(ctx, unixSock) | ||
| if err != nil { | ||
| logrus.WithError(err).Warn("Failed to reconnect to VMNET") |
There was a problem hiding this comment.
This is not really a warning but expected condition when we reconnect. A debug message is enough. We already logged a warning when the connection was broken.
| const ( | ||
| initialBackoff = 100 * time.Millisecond | ||
| maxBackoff = 30 * time.Second | ||
| backoffFactor = 2.0 |
There was a problem hiding this comment.
Done — removed the float, using integer multiplication with an if-cap.
| qemuConn, err = dialQemuConn(ctx, unixSock) | ||
| if err != nil { | ||
| logrus.WithError(err).Warn("Failed to reconnect to VMNET") | ||
| backoff = time.Duration(math.Min(float64(backoff)*backoffFactor, float64(maxBackoff))) |
There was a problem hiding this comment.
Why do we need floating point to compute the time?
There was a problem hiding this comment.
Done — backoff *= 2 with if backoff > maxBackoff, removed math import.
| case <-ctx.Done(): | ||
| return | ||
| default: | ||
| } |
There was a problem hiding this comment.
I think this has the same effect:
if ctx.Err() != nil {
return
}There was a problem hiding this comment.
Done — using ctx.Err() != nil.
| default: | ||
| } | ||
|
|
||
| logrus.Warnf("VMNET connection lost, reconnecting in %v", backoff) |
There was a problem hiding this comment.
Why we log here and not after getting an error from forwardUntilError()?
There was a problem hiding this comment.
Done — moved to right after forwardUntilError returns, logged once per outage.
| // The copy loop is the same read-buf-write-buf loop io.Copy uses (neither | ||
| // qemuPacketConn nor packetConn implements ReaderFrom/WriterTo). | ||
| func copyPackets(dst io.Writer, src io.Reader) (readErr bool, err error) { | ||
| buf := make([]byte, 32*1024) |
There was a problem hiding this comment.
We write single packet (1514 bytes) to/from the datagram socket, so we don't need 32k buffer. We also write single size prefixed packets to socket_vmnet socket, so maximum size is 1518 bytes.
There was a problem hiding this comment.
Done — added a maxPacketSize = 1518 constant and sized the buffer to it. Also added a bounds check in qemuPacketConn.Read to return an error instead of panicking on an oversized length prefix.
There was a problem hiding this comment.
Using exact buffer size is little bit fragile is we did not calculate the size correctly, and we don't have the info in this layer. socket_vmnet get the maximum package size from vment but but does not expose this value. I think standard buffer size like 2k or 4k bytes will be better.
| } | ||
| if er != nil { | ||
| return true, er | ||
| } |
There was a problem hiding this comment.
It will be more clear to handle errors first for read.
But It is not clear to me why we cannot use io.Copy, and why we need to handle read and write errors differently. The same copy loop is used in both directions, so read error may error reading from the datagram socket or error reading from the stream socket, and the same for the other direction.
I think we have 2 possible errors when socket_vment socket is closed:
- read from unix stream socket returns 0 bytes or err != nil
- write to unix stream socket return connection reset/connection refused
So error handling should not be in the copy loop but inside the unix connection.
We can implement the reconnect inside the unix connection: recoverable read or write errors will go into reconnect loop, completing the read/write then the connection is established. The copy loop from unix to datagram can remain the same.
There was a problem hiding this comment.
Addressed the first seven comments and pushed a revision.
On moving reconnect into qemuPacketConn — the question is fair: why differentiate read and write errors at all? If the connection handles its own reconnect transparently, io.Copy works as-is and copyPackets goes away.
The issue is that reconnect requires coordinating both goroutines. When the stream socket breaks, the Read goroutine and Write goroutine both hit errors at the same time. With a connection-internal approach, both call into the reconnect path concurrently — you need a mutex or generation gate so only one dials while the other waits. But even with that, both goroutines need to restart their forwarding loops after the swap (the Read goroutine may have partial framing state from qemuPacketConn.Read's two-phase header+payload read). So the internal approach still needs to: signal both goroutines, wait for both to drain, swap the connection, restart both. That's the outer loop reimplemented inside the struct with more machinery, not less.
The current design keeps that lifecycle explicit: both goroutines exit on error, the single-threaded outer loop reconnects, fresh goroutines start with a clean connection. The read/write error distinction lets us avoid restarting when the failure is fatal (vzConn side).
We can reconsider the connection-internal route if you think the tradeoff is worth the added complexity.
There was a problem hiding this comment.
Good point about synchronizing the copy goroutines, but we still need to tell the difference between recoverable errors and fatal errors, so the difference cannot be based on the direction. I think the unix connection should detect and return new recoverable error and the copy loop can try to reconnect only if we get such errors.
Regarding synchronization, depending on getting traffic from guest is not great. It will be better to trigger the reconnect immediately on the first error without having to wait.
I think we will detect the error in 2 cases:
- read error from unix socket - should fail immediately when the socket is closed. At this point.
- write error to unix socket - should fail on the next write after the socket is closed. But when the socket is closed we may be blocked reading from the datagram socket.
If we handle connections inside the unix connection, Write() can block on a condition variable when we are in in a connection retry. Once connection is finished, the Write() can complete. Same for Read() - it can block on the same condition variable.
So we can have a third goroutine doing the reconnect and signaling the other goroutines when the connection is established.
There was a problem hiding this comment.
Makes sense, I like this approach better than the initial suggestion. The condvar + third goroutine gives clean coordination without the racing issues. A few things I want to confirm before implementing:
1/ Fatal vs recoverable errors: With io.Copy, we lose visibility into which side produced the error. I'm thinking qemuPacketConn wraps its errors in a recoverable sentinel type, and anything else is treated as fatal (no reconnect). That way a vzConn failure doesn't trigger a spurious reconnect attempt.
2/ Reconnect goroutine lifecycle: Is this a long-lived goroutine that blocks waiting for a "reconnect needed" signal, or spawned on demand when the first error comes in? Either way it needs to respect context cancellation to unblock the condvar waiters on shutdown.
3/ Write atomicity: qemuPacketConn.Write does header + payload as two syscalls on the stream. The lock needs to be held across both so the connection isn't swapped between them. That means a reconnect blocks until any in-flight Write completes. Should be fine in practice (writes are fast, connection is local) but wanted to flag it.
4/ Per-packet locking: Read and Write will both need to acquire a lock on every call. This probably won't affect throughput at 1GbE but we can perf test on a box with a 5GbE NIC to measure the actual impact.
Will submit another update if we're aligned.
There was a problem hiding this comment.
Makes sense, I like this approach better than the initial suggestion. The condvar + third goroutine gives clean coordination without the racing issues. A few things I want to confirm before implementing:
1/ Fatal vs recoverable errors: With io.Copy, we lose visibility into which side produced the error. I'm thinking qemuPacketConn wraps its errors in a recoverable sentinel type, and anything else is treated as fatal (no reconnect). That way a vzConn failure doesn't trigger a spurious reconnect attempt.
If the reconnect mechanism is inside the unix socket, or even better - in the underlying connection, the error remain internal to the reconnecting connection so io.Copy is not in the loop.
We can use this pipeline:
[ datagram conn ] <-> 2x io.Copy <-> [qemu con] <-> [reconnecting con] <-> unix con
We create the qemuPacketConn with a real Con - we can create it with a reconnectingCon - current code will not be affected. But since we depend on reading the size from the socket and only then reading the payload, I think we will have to keep the reconnect mahanism inside the packetCon, so we can drop a package if we get a size but the closed right after we got the size.
2/ Reconnect goroutine lifecycle: Is this a long-lived goroutine that blocks waiting for a "reconnect needed" signal, or spawned on demand when the first error comes in? Either way it needs to respect context cancellation to unblock the condvar waiters on shutdown.
Both options should work, long living grouting will probably be simpler. Loop waiting for reconnect request, reconnecting and signaling waiters.
3/ Write atomicity: qemuPacketConn.Write does header + payload as two syscalls on the stream. The lock needs to be held across both so the connection isn't swapped between them. That means a reconnect blocks until any in-flight Write completes. Should be fine in practice (writes are fast, connection is local) but wanted to flag it.
I don't think you need to hold a lock during the read/write, only to check if you can read or write. If the socket is closed the read/write will fail. Also you cannot hold the lock during read - read is blocking and can block for long time.
4/ Per-packet locking: Read and Write will both need to acquire a lock on every call. This probably won't affect throughput at 1GbE but we can perf test on a box with a 5GbE NIC to measure the actual impact.
socket_vment is pretty slow, maximum throughput is 3-4 Gbits/s - I don't think a lock will make a difference, but it will be easy to test with iperf3.
- Reduce maxBackoff from 30s to 2s (cheap reconnect attempt) - Replace float backoffFactor with integer multiplication - Use ctx.Err() instead of select/default for context check - Log connection loss once after forwardUntilError, not per retry - Downgrade retry failure log from Warn to Debug - Size relay buffer to maxPacketSize (1518) instead of 32KB - Add bounds check in qemuPacketConn.Read for oversized frames - Trim doc comments to match project style Signed-off-by: Lon C. Lundgren <lon@ocelot.net>
Restructure the VZ network reconnect so that qemuPacketConn handles reconnection internally. Read and Write snapshot the current connection via getConn(), perform I/O on the local copy, and on error signal a dedicated reconnect goroutine that re-dials socket_vmnet with exponential backoff. Callers (forwardPackets, io.Copy) never see recoverable errors. This replaces the previous forwardUntilError/copyPackets approach where the outer loop distinguished read vs write errors to classify fatal vs recoverable failures. The new design uses a generation counter and sync.Cond for coordination, with no per-packet locking on the hot path (getConn acquires the mutex only to read two fields). readPacket and writePacket are extracted as standalone functions, used by both production code and the test echo server. Also fixes resource leak in DialQemu error paths (conn not closed if createSockPair fails). Tested on macOS 26 (M1 Mac Mini, VZ, 1GbE NIC). 3-run averages, 18 CPU samples per metric: Throughput VM -> LAN: 940 Mbits/sec (baseline 944) Throughput VM -> host: 2.17 Gbits/sec (baseline 2.16) limactl CPU (LAN): 41.4% (baseline ~40%) limactl CPU (loopback): 139.9% (baseline ~145%) Ping latency: ~1.1ms (matches previous binary under same conditions) Reconnect tested: SIGKILL socket_vmnet, retries with backoff, full recovery on daemon restart, no VM restart needed. Signed-off-by: Lon C. Lundgren <lon@ocelot.net>
|
Pushed a revision that moves the reconnect logic into qemuPacketConn per your suggestion. The outer loop and copyPackets are gone. forwardPackets is now just two io.Copy goroutines, and reconnection is transparent to callers. Design:
Tested on macOS 26 (M1 Mac Mini, VZ, 1GbE NIC). 3-run averages, 18 CPU samples per metric:
Reconnect tested by killing socket_vmnet (SIGKILL). Backoff retries at 2s cap, full recovery on daemon restart, no VM restart needed. |
Signed-off-by: Lon C. Lundgren <lon@ocelot.net>
Summary
io.CopywithcopyPackets()that distinguishes read vs write errors for fatal/recoverable classificationMotivation
When the socket_vmnet daemon restarts or the UNIX socket connection breaks,
forwardPackets()exits permanently and VM networking is dead until a full VM restart. The VZ-side socketpair (bound toVZFileHandleNetworkDeviceAttachmentat VM creation) cannot be replaced, but the socket_vmnet connection is stateless and can be re-dialed.Fixes #3020
Performance
Tested on macOS 26 (Apple M1, VZ backend, 1GbE NIC). 3-run averages for each variant:
No measurable throughput or CPU impact. Buffer size (32K matching io.Copy vs 64K) makes no difference on DGRAM sockets where each read returns one packet.
Reconnect behavior
Tested by killing socket_vmnet (SIGKILL) and immediately restarting:
Goroutine teardown tradeoff
Closing the socket_vmnet connection unblocks the goroutine doing socket_vmnet I/O immediately. The goroutine blocked on
vzConn.Read()waits until the guest sends its next packet (ARP, NDP, DHCP — typically within seconds).SetReadDeadlinewas considered but rejected: arming/disarming deadlines on the DGRAM socket across reconnect cycles adds complexity and risks false-positive timeouts during normal forwarding.Test plan
TestDialQemu,TestForwardPacketsReconnect)TestForwardPacketsReconnect: kills fake server, verifies reconnect to new server, packets resume