README.md: New combined Transmit & Receive design #9

Merged
merged 2 commits into master on May 30, 2018

Conversation

@lukego (Owner) commented May 28, 2018

Here is an idea for a new combined Transmit & Receive interface, based on feedback and discussions on the issues.

  • A single unified buffer design for transmit and receive.
  • No register reads in the main transmit/receive loop (to avoid the CPU blocking on PCIe reads).
  • The Ethernet FCS is handled transparently on both transmit and receive (frames with a bad FCS are dropped at ingress).

What do we reckon?
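To make it concrete, here is a rough sketch of how I imagine the transmit side could look: a plain byte ring in host memory with a 2-byte in-band length prefix per frame, and both cursors living in host memory. All the names, the prefix format, and the ring layout are placeholders for discussion, not the final README wording:

```c
#include <stdint.h>

#define RING_SIZE (1u << 20)          /* power of two: cheap wrap, free-running cursors */

struct tx_ring {
    uint8_t  *base;                   /* DMA-able buffer of RING_SIZE bytes */
    uint32_t  write;                  /* driver-owned write cursor (byte count) */
    volatile uint32_t *read;          /* NIC-owned read cursor, updated via DMA */
};

/* Copy one frame into the ring as a 2-byte length prefix plus payload.
 * Returns 0 on success, -1 if the ring is currently full.  Nothing here
 * reads a device register: the only synchronization is through the two
 * cursors, both of which live in host memory. */
static int tx_put(struct tx_ring *r, const void *frame, uint16_t len)
{
    uint32_t used = r->write - *r->read;            /* unsigned wrap is fine */
    if (used + 2u + len > RING_SIZE)
        return -1;                                  /* full: try again later */

    uint32_t off = r->write % RING_SIZE;
    r->base[off]                   = len & 0xff;    /* in-band length,  */
    r->base[(off + 1) % RING_SIZE] = len >> 8;      /* little-endian    */
    for (uint32_t i = 0; i < len; i++)              /* byte copy handles wrap */
        r->base[(off + 2 + i) % RING_SIZE] = ((const uint8_t *)frame)[i];

    /* A real driver needs a write barrier here.  The NIC can learn about new
     * data either by polling this cursor or via a doorbell register write --
     * a posted write, so still no register reads on the hot path. */
    r->write += 2u + len;
    return 0;
}
```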

@emmericp

Is having no alignment at all a good idea?

We've found that aligning packets by cache line helps a lot on modern CPUs (only 3% space overhead in a real-world trace).

However, our benchmark is completely synthetic (except for packet sizes) and it's just a simple pcap filter matching on various header fields.

https://pam2018.inet.berlin/wp-content/uploads/2018/03/pam18poster-paper8.pdf
https://github.com/emmericp/align-microbenchmark
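Roughly speaking, the space overhead of cache-line alignment is just the padding needed to round each packet up to the next 64-byte boundary. For example (made-up packet sizes, not the trace from the paper):

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative packet sizes only, not the real-world trace. */
    const unsigned sizes[] = { 60, 64, 72, 433, 590, 1514, 1514, 1514 };
    unsigned long raw = 0, padded = 0;

    for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        raw    += sizes[i];
        padded += (sizes[i] + 63u) & ~63u;   /* round up to a 64-byte boundary */
    }
    printf("space overhead: %.1f%%\n", 100.0 * (padded - raw) / raw);
    return 0;
}
```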

@lukego (Owner, Author) commented May 29, 2018

Great link @emmericp! I love the work that you guys are doing :).

This actually touches on a fundamental goal of the EasyNIC approach that I have not properly articulated yet. I want to focus on optimizing for the general case and to resist optimizing for a collection of special cases.

I want to build network equipment with robust performance that is not too sensitive to configurations and workloads. I'm willing to sacrifice peak performance on a few specific workloads to achieve this goal.

Here is the kind of hazard that I see:

  1. Application X is shown to benefit from having packets aligned on a 64-byte boundary. The effect is due to locality, i.e. in the benchmarked configuration the application tends to be interested in the first 64 bytes of the payload, so this alignment tends to reduce the number of cache loads per packet.
  2. This presents an opportunity to the hardware makers. Vendor A releases a NIC supporting cache-line-aligned receive buffers. They sell like hot cakes: "27% faster packet analysis!"
  3. The other vendors all follow suit to satisfy user demand.

Then a few years later the workload changes. Now your packets are all VXLAN encapsulated and the window of payload you care about has shifted to include the next cache line. 64 bytes is not the optimal alignment anymore. Oh no! What do we do?

  1. Vendor A supports arbitrary alignment so that the application can deliberately misalign the data to hit the new best case (as in 100G Packet Rates: Per-CPU vs Per-Port snabbco/snabb#1013 (comment)).
  2. Vendor B supports VXLAN decapsulation in hardware. Now the NIC can strip off the encapsulation header so that the 64-byte buffer alignment is still optimal.
  3. Vendor C supports an eBPF virtual machine in the NIC. Now the application can customize the NIC to specially align and/or decapsulate the packets before they hit host memory.

Phew! So we managed to defend the advertised performance of our application, but in the process we had to make all of the NICs more complicated, and we also had to make the application and drivers more complicated to support the diverse approaches taken on the NICs.

Then a few years later the workload changes again. Now we are deploying on a mobile network and we need GTP-U encap. Oh no! This market is not important enough for the hardware vendors to support in silicon. Now we need to rearchitect our application to break away from that precious assumption about how our buffers are aligned... or tell our users to buy extra hardware before enabling GTP-U in their config file.

This is the point where we might wish that we could turn back time and optimize for the most general case from the beginning. Then we would not have needed all these complicated hardware and driver and software features. So the next step would be to design a simplified hardware interface that eschews special cases, and that is what leads us to EasyNIC :).

I like the way Juho Snellman framed this in his "lessons learned in production" talk:

Every time we depend on a hardware feature we end up regretting it. They can never be used to save on development effort, because next month there will be new requirements that the hardware feature isn't flexible enough to handle. You always need to implement a pure software fallback that's fast enough to handle production loads. And if you've already got a good enough software implementation, why go through the bother of doing a parallel hardware implementation? The only thing that'll happen is that you'll get inconsistent performance between use cases that get handled in hardware vs. use cases that get handled in software.

@lukego (Owner, Author) commented May 29, 2018

One more example from the real world is OpenStack Networking in the Atlanta/Paris era. The kernel people are all showing slides with 10Gbps throughput and 3% CPU usage. The users are all showing slides with 1.3Gbps throughput and 100% CPU usage. What's the difference?

The kernel people have all optimized for the case where everything is offloaded onto the NIC, but the users have all enabled a feature that the current generation of NICs doesn't support (VXLAN). So the users are all hating life and waiting for the next generation of NICs to save them. Meanwhile the kernel hackers don't even realize this is a problem, and a couple of years later, when they appreciate what's happening, they say "oh, dude, we could just do the LRO/TSO with a software fallback and then we'd get great performance with any card. Sorry, I was too busy messing with offloads to understand the problem you were having."

@corsix commented May 29, 2018

My natural inclination is that some alignment would be a good thing. One-byte alignment (i.e. no alignment) makes the specification very simple, but potentially at the cost of making both software (i.e. drivers) and hardware (i.e. the NIC) more complex. Some natural considerations for alignment are:

  • 2-byte alignment. This ensures that the length field is always in two consecutive bytes (whereas with no alignment, the length field might occasionally straddle the first and last bytes of the ring, which makes it more complex for software to read, and for hardware to write).
  • 4-byte alignment. At the PCI-E layer, all transactions have base addresses and lengths which are multiples of 4 (with byte-enable masks used to give the impression of smaller reads and writes). Not having at least 4-byte alignment means you're wasting a little bit of PCI-E bandwidth, and forces the PCI-E logic in the hardware to be more complex than it otherwise needs to be.
  • 64-byte alignment. This gives nice properties around whole cache-lines being transferred at a time. I don't have any concrete evidence to point at, but it certainly feels nice to allow the CPU to write entire cache-lines at a time (that being the point of write-combining buffers), and nice to allow the hardware to write entire cache-lines at a time (which can then, at least in theory, be delivered directly to the CPU's caches, without the CPU having to fetch the whiskers of the cache-line from DRAM).
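To make the trade-off concrete, here is a tiny sketch of rounding each record's start up to a chosen alignment. The 2-byte length prefix and all names are assumptions for illustration, not anything taken from the README:

```c
#include <stdint.h>

#define ALIGN 64u    /* candidate values: 1, 2, 4, 64 -- then measure */

/* Round x up to the next multiple of a (a must be a power of two). */
static inline uint32_t align_up(uint32_t x, uint32_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* Advance a free-running cursor past one record (2-byte length prefix
 * plus len payload bytes).  As long as ALIGN >= 2 and ALIGN divides the
 * ring size, the length prefix can never straddle the ring's wrap-around
 * point, and with ALIGN == 64 every record starts on its own cache line. */
static inline uint32_t next_record(uint32_t cursor, uint16_t len)
{
    return align_up(cursor + 2u + len, ALIGN);
}
```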

@emmericp

Yeah, as I've said, the benchmark is completely synthetic and very specific to that use case. I'm also not saying that cache line alignment is necessarily a good idea -- I was just surprised by how big the effect was and how small the space overhead on real traffic was.
(My initial guess and implementation was 8+2 alignment, to make IP addresses 4-byte aligned.)
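(To spell out the 8+2 arithmetic: with a 14-byte Ethernet header and no VLAN tag, a frame starting at offset 8n+2 puts the IP header, and the IPv4 addresses inside it, on 4-byte boundaries. A throwaway check, not driver code:)

```c
#include <assert.h>
#include <stdint.h>

int main(void)
{
    for (uint32_t n = 0; n < 1000; n++) {
        uint32_t frame = 8 * n + 2;     /* frame starts at an 8+2 offset        */
        uint32_t iphdr = frame + 14;    /* 14-byte Ethernet header, no VLAN tag */
        assert(iphdr % 4 == 0);         /* IP header is 4-byte aligned          */
        assert((iphdr + 12) % 4 == 0);  /* source IPv4 address                  */
        assert((iphdr + 16) % 4 == 0);  /* destination IPv4 address             */
    }
    return 0;
}
```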

Also, I strongly believe in benchmarking stuff before optimizing stuff.
Both "no alignment at all" and "align by cache line" strike me as a premature optimization.

@lukego (Owner, Author) commented May 30, 2018

Thanks @corsix and @emmericp for the detailed feedback.

I pushed commit 9b50d34 to articulate that this transmit/receive interface is essentially a high-speed serial port. The NIC acts like a modem, translating a continuous stream of ones and zeros between host memory and the network. The framing information is plain in-band data and gets no special treatment from a DMA perspective, e.g. the device could always do 4KB PCIe transactions and only needs to make sure that the packet cursor never points into a partially-written packet.
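Under that model the receive path I have in mind looks roughly like this; the struct layout and names are placeholders for discussion, not the committed design:

```c
#include <stdint.h>

#define RING_SIZE (1u << 20)

struct rx_ring {
    const uint8_t *base;              /* ring filled by the NIC via DMA */
    uint32_t read;                    /* driver-owned read cursor */
    volatile const uint32_t *write;   /* NIC's cursor, also in host memory */
};

/* Hand each complete frame to handler().  The loop reads the NIC's cursor
 * once per poll and otherwise touches only host memory; there are no PCIe
 * register reads on this path. */
static void rx_poll(struct rx_ring *r,
                    void (*handler)(const uint8_t *frame, uint16_t len))
{
    uint32_t limit = *r->write;       /* never points into a partial frame */
    while (r->read != limit) {
        uint32_t off = r->read % RING_SIZE;
        uint16_t len = r->base[off]
                     | (uint16_t)(r->base[(off + 1) % RING_SIZE]) << 8;
        /* FCS was already verified and stripped by the NIC; bad-FCS frames
         * never reach this ring.  (Payload wrap-around is ignored here for
         * brevity.) */
        handler(&r->base[(off + 2) % RING_SIZE], len);
        r->read += 2u + len;
    }
}
```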

I reckon that the lack of alignment options makes sense in the specific context of this "high-speed serial port" design. However, the points you guys raise may be reasons to doubt that this model is the right one.

I am keen to let multiple CPUs cooperate on processing traffic without relying on the NIC to do sharding. In the serial port design this will require some thought about alignment and the MESIF state machine. For example, on transmit this design might make it overly complicated to prevent two cores from writing to the same cache line and then having to synchronously ping-pong that line between their L1 caches.
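For instance, the obvious mitigation would be to give each core its own cursor (or its own region of the ring) on a separate cache line; something like this purely hypothetical C11 layout:

```c
#include <stdalign.h>
#include <stdint.h>

/* If several cores share the transmit side, keep each core's cursor on its
 * own 64-byte cache line so that publishing a packet never dirties a line
 * another core is writing to (avoiding the L1 ping-pong). */
struct per_core_cursor {
    alignas(64) uint32_t write;       /* sizeof == 64: one line per core */
};

struct tx_shared {
    struct per_core_cursor cursor[16];    /* e.g. up to 16 cooperating cores */
};
```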

So this branch's interface is really the beginning and not the end... and if we want to support e.g. low latency, I reckon that will need to be a separate interface too.

@lukego (Owner, Author) commented May 30, 2018

Linking #11 about likely feature creep for efficiently supporting multiple CPUs.

@lukego (Owner, Author) commented May 30, 2018

I'll merge this one now and we can use new PRs to discuss adding alignment rules. I have some ideas for accommodating this, but one step at a time.

@lukego merged commit 03a0832 into master on May 30, 2018