README.md: New combined Transmit & Receive design #9

Merged
merged 2 commits into master on May 30, 2018

Conversation

@lukego (Owner) commented May 28, 2018

Here is an idea for a new combined Transmit & Receive interface, based on feedback and discussions on the issues.

  • A single unified buffer design for transmit and receive.
  • No register reads in the main transmit/receive loop (to avoid the CPU blocking on PCIe reads).
  • The Ethernet FCS is handled transparently on both transmit and receive (frames with a bad FCS are dropped at ingress).

What do we reckon?
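To make it concrete, here is a rough sketch of how I imagine the transmit side could look: a plain byte ring in host memory with a 2-byte in-band length prefix per frame, and both cursors living in host memory. All the names, the prefix format, and the ring layout are placeholders for discussion, not the final README wording:

```c
#include <stdint.h>

#define RING_SIZE (1u << 20)          /* power of two: cheap wrap, free-running cursors */

struct tx_ring {
    uint8_t  *base;                   /* DMA-able buffer of RING_SIZE bytes */
    uint32_t  write;                  /* driver-owned write cursor (byte count) */
    volatile uint32_t *read;          /* NIC-owned read cursor, updated via DMA */
};

/* Copy one frame into the ring as a 2-byte length prefix plus payload.
 * Returns 0 on success, -1 if the ring is currently full.  Nothing here
 * reads a device register: the only synchronization is through the two
 * cursors, both of which live in host memory. */
static int tx_put(struct tx_ring *r, const void *frame, uint16_t len)
{
    uint32_t used = r->write - *r->read;            /* unsigned wrap is fine */
    if (used + 2u + len > RING_SIZE)
        return -1;                                  /* full: try again later */

    uint32_t off = r->write % RING_SIZE;
    r->base[off]                   = len & 0xff;    /* in-band length,  */
    r->base[(off + 1) % RING_SIZE] = len >> 8;      /* little-endian    */
    for (uint32_t i = 0; i < len; i++)              /* byte copy handles wrap */
        r->base[(off + 2 + i) % RING_SIZE] = ((const uint8_t *)frame)[i];

    /* A real driver needs a write barrier here.  The NIC can learn about new
     * data either by polling this cursor or via a doorbell register write --
     * a posted write, so still no register reads on the hot path. */
    r->write += 2u + len;
    return 0;
}
```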

@emmericp

Is having no alignment at all a good idea?

We've found that aligning packets by cache line helps a lot on modern CPUs (only 3% space overhead in a real-world trace).

However, our benchmark is completely synthetic (except for packet sizes) and it's just a simple pcap filter matching on various header fields.

https://pam2018.inet.berlin/wp-content/uploads/2018/03/pam18poster-paper8.pdf
https://github.com/emmericp/align-microbenchmark
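Roughly speaking, the space overhead of cache-line alignment is just the padding needed to round each packet up to the next 64-byte boundary. For example (made-up packet sizes, not the trace from the paper):

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative packet sizes only, not the real-world trace. */
    const unsigned sizes[] = { 60, 64, 72, 433, 590, 1514, 1514, 1514 };
    unsigned long raw = 0, padded = 0;

    for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        raw    += sizes[i];
        padded += (sizes[i] + 63u) & ~63u;   /* round up to a 64-byte boundary */
    }
    printf("space overhead: %.1f%%\n", 100.0 * (padded - raw) / raw);
    return 0;
}
```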

@lukego (Owner, Author) commented May 29, 2018

Great link @emmericp! I love the work that you guys are doing :).

This actually touches on a fundamental goal of the EasyNIC approach that I have not properly articulated yet. I want to focus on optimizing for the general case and to resist optimizing for a collection of special cases.

I want to build network equipment with robust performance that is not too sensitive to configurations and workloads. I'm willing to sacrifice peak performance on a few specific workloads to achieve this goal.

Here is the kind of hazard that I see:

  1. Application X is shown to benefit from having packets aligned on a 64-byte boundary. The effect is due to locality, i.e. in the benchmarked configuration the application tends to be interested in the first 64 bytes of the payload, so this alignment tends to reduce the number of cache loads per packet.
  2. This presents an opportunity to the hardware makers. Vendor A releases a NIC supporting cache-line-aligned receive buffers. They sell like hot cakes: "27% faster packet analysis!"
  3. The other vendors all follow suit to satisfy user demand.

Then a few years later the workload changes. Now your packets are all VXLAN encapsulated and the window of payload you care about has shifted to include the next cache line. 64 bytes is not the optimal alignment anymore. Oh no! What do we do?

  1. Vendor A supports arbitrary alignment so that the application can deliberately misalign the data to hit the new best case (as in 100G Packet Rates: Per-CPU vs Per-Port snabbco/snabb#1013 (comment)).
  2. Vendor B supports VXLAN decapsulation in hardware. Now the NIC can strip off the encapsulation header so that the 64-byte buffer alignment is still optimal.
  3. Vendor C supports an eBPF virtual machine in the NIC. Now the application can customize the NIC to specially align and/or decapsulate the packets before they hit host memory.

Phew! So we managed to defend the advertised performance of our application, but in the process we had to make all of the NICs more complicated, and we also had to make the application and drivers more complicated to support the diverse approaches taken on the NICs.

Then a few years later the workload changes again. Now we are deploying on a mobile network and we need GTP-U encap. Oh no! This market is not important enough for the hardware vendors to support in silicon. Now we need to rearchitect our application to break away from that precious assumption about how our buffers are aligned... or tell our users to buy extra hardware before enabling GTP-U in their config file.

This is the point where we might wish that we could turn back time and optimize for the most general case from the beginning. Then we would not have needed all these complicated hardware and driver and software features. So the next step would be to design a simplified hardware interface that eschews special cases, and that is what leads us to EasyNIC :).

I like the way Juho Snellman framed this in his "lessons learned in production" talk:

Every time we depend on a hardware feature we end up regretting it. They can never be used to save on development effort, because next month there will be new requirements that the hardware feature isn't flexible enough to handle. You always need to implement a pure software fallback that's fast enough to handle production loads. And if you've already got a good enough software implementation, why go through the bother of doing a parallel hardware implementation? The only thing that'll happen is that you'll get inconsistent performance between use cases that get handled in hardware vs. use cases that get handled in software.

@lukego (Owner, Author) commented May 29, 2018

One more example from the real world is OpenStack Networking in the Atlanta/Paris era. The kernel people are all showing slides with 10Gbps throughput and 3% CPU usage. The users are all showing slides with 1.3Gbps throughput and 100% CPU usage. What's the difference?

The kernel people have all optimized for the case where everything is offloaded onto the NIC, but the users have all enabled a feature that the current generation of NICs doesn't support (VXLAN). So the users are all hating life and waiting for the next generation of NICs to save them. Meanwhile the kernel hackers don't even realize this is a problem, and a couple of years later, when they appreciate what's happening, they say "oh, dude, we could just do the LRO/TSO with a software fallback and then we'd get great performance with any card. Sorry, I was too busy messing with offloads to understand the problem you were having."

@corsix commented May 29, 2018

My natural inclination is that some alignment would be a good thing. One-byte alignment (i.e. no alignment) makes the specification very simple, but potentially at the cost of making both software (i.e. drivers) and hardware (i.e. the NIC) more complex. Some natural considerations for alignment are:

  • 2-byte alignment. This ensures that the length field is always in two consecutive bytes (whereas with no alignment, the length field might occasionally straddle the first and last bytes of the ring, which makes it more complex for software to read, and for hardware to write).
  • 4-byte alignment. At the PCI-E layer, all transactions have base addresses and lengths which are multiples of 4 (with byte-enable masks used to give the impression of smaller reads and writes). Not having at least 4-byte alignment means you're wasting a little bit of PCI-E bandwidth, and forces the PCI-E logic in the hardware to be more complex than it otherwise needs to be.
  • 64-byte alignment. This gives nice properties around whole cache-lines being transferred at a time. I don't have any concrete evidence to point at, but it certainly feels nice to allow the CPU to write entire cache-lines at a time (that being the point of write-combining buffers), and nice to allow the hardware to write entire cache-lines at a time (which can then, at least in theory, be delivered directly to the CPU's caches, without the CPU having to fetch the whiskers of the cache-line from DRAM).
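To make the trade-off concrete, here is a tiny sketch of rounding each record's start up to a chosen alignment. The 2-byte length prefix and all names are assumptions for illustration, not anything taken from the README:

```c
#include <stdint.h>

#define ALIGN 64u    /* candidate values: 1, 2, 4, 64 -- then measure */

/* Round x up to the next multiple of a (a must be a power of two). */
static inline uint32_t align_up(uint32_t x, uint32_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* Advance a free-running cursor past one record (2-byte length prefix
 * plus len payload bytes).  As long as ALIGN >= 2 and ALIGN divides the
 * ring size, the length prefix can never straddle the ring's wrap-around
 * point, and with ALIGN == 64 every record starts on its own cache line. */
static inline uint32_t next_record(uint32_t cursor, uint16_t len)
{
    return align_up(cursor + 2u + len, ALIGN);
}
```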

@emmericp

Yeah, as I've said, the benchmark is completely synthetic and very specific to that use case. I'm also not saying that cache line alignment is necessarily a good idea -- I was just surprised by how big the effect was and how small the space overhead on real traffic was.
(My initial guess and implementation was 8+2 alignment, to make IP addresses 4-byte aligned.)
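(To spell out the 8+2 arithmetic: with a 14-byte Ethernet header and no VLAN tag, a frame starting at offset 8n+2 puts the IP header, and the IPv4 addresses inside it, on 4-byte boundaries. A throwaway check, not driver code:)

```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
    for (uint32_t n = 0; n < 1000; n++) {
        uint32_t frame = 8 * n + 2;     /* frame starts at an 8+2 offset        */
        uint32_t iphdr = frame + 14;    /* 14-byte Ethernet header, no VLAN tag */
        assert(iphdr % 4 == 0);         /* IP header is 4-byte aligned          */
        assert((iphdr + 12) % 4 == 0);  /* source IPv4 address                  */
        assert((iphdr + 16) % 4 == 0);  /* destination IPv4 address             */
    }
    return 0;
}
```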

Also, I strongly believe in benchmarking stuff before optimizing stuff.
Both "no alignment at all" and "align by cache line" strike me as a premature optimization.

@lukego (Owner, Author) commented May 30, 2018

Thanks @corsix and @emmericp for the detailed feedback.

I pushed commit 9b50d34 to articulate that this transmit/receive interface is essentially a high-speed serial port. The NIC acts like a modem, translating a continuous stream of ones and zeros between host memory and the network. The framing information is plain in-band data and gets no special treatment from a DMA perspective, e.g. the device could always do 4KB PCIe transactions and only needs to make sure that the packet cursor never points into a partially-written packet.
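Under that model the receive path I have in mind looks roughly like this; the struct layout and names are placeholders for discussion, not the committed design:

```c
#include <stdint.h>

#define RING_SIZE (1u << 20)

struct rx_ring {
    const uint8_t *base;              /* ring filled by the NIC via DMA */
    uint32_t read;                    /* driver-owned read cursor */
    volatile const uint32_t *write;   /* NIC's cursor, also in host memory */
};

/* Hand each complete frame to handler().  The loop reads the NIC's cursor
 * once per poll and otherwise touches only host memory; there are no PCIe
 * register reads on this path. */
static void rx_poll(struct rx_ring *r,
                    void (*handler)(const uint8_t *frame, uint16_t len))
{
    uint32_t limit = *r->write;       /* never points into a partial frame */
    while (r->read != limit) {
        uint32_t off = r->read % RING_SIZE;
        uint16_t len = r->base[off]
                     | (uint16_t)(r->base[(off + 1) % RING_SIZE]) << 8;
        /* FCS was already verified and stripped by the NIC; bad-FCS frames
         * never reach this ring.  (Payload wrap-around is ignored here for
         * brevity.) */
        handler(&r->base[(off + 2) % RING_SIZE], len);
        r->read += 2u + len;
    }
}
```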

I reckon that the lack of alignment options makes sense in the specific context of this "high-speed serial port" design. However, the points you guys raise may be reasons to doubt that this model is the right one.

I am keen to let multiple CPUs cooperate on processing traffic without relying on the NIC to do sharding. In the serial port design this will require some thought about alignment and the MESIF state machine. For example, on transmit this design might make it overly complicated to prevent two cores from writing to the same cache line and then having to synchronously ping-pong that line between their L1 caches.
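For instance, the obvious mitigation would be to give each core its own cursor (or its own region of the ring) on a separate cache line; something like this purely hypothetical C11 layout:

```c
#include <stdalign.h>
#include <stdint.h>

/* If several cores share the transmit side, keep each core's cursor on its
 * own 64-byte cache line so that publishing a packet never dirties a line
 * another core is writing to (avoiding the L1 ping-pong). */
struct per_core_cursor {
    alignas(64) uint32_t write;       /* sizeof == 64: one line per core */
};

struct tx_shared {
    struct per_core_cursor cursor[16];    /* e.g. up to 16 cooperating cores */
};
```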

So this branch's interface is really the beginning and not the end... and if we want to support e.g. low latency, I reckon that will need to be a separate interface too.

@lukego (Owner, Author) commented May 30, 2018

Linking #11 about likely feature creep for efficiently supporting multiple CPUs.

@lukego (Owner, Author) commented May 30, 2018

I'll merge this one now and we can use new PRs to discuss adding alignment rules. I have some ideas for accommodating this, but one step at a time.

@lukego merged commit 03a0832 into master on May 30, 2018