100G Packet Rates: Per-CPU vs Per-Port #1013
Nice description of the problem, and I agree with your conclusions. For most real applications you should be able to reduce b) to c) at the cost of additional ports. Exceptions? Artificial constraints such as "dragster-race" competitions (Internet2 land-speed records) or unrealistic customer expectations ("we only ever buy kit that does line rate even with 64-byte christmas-tree-packet workloads"). Cost of additional ports may be a problem, but that needs to be weighed against development costs as well. (You can formulate that as a time-to-market argument where you have the choice of either getting a working system now and upgrading it to the desired throughput once additional ports have gotten cheaper, or waiting until a "more efficient" system is developed that can do the same work with just one port :-)
Relatedly: Nathan Owens pointed out to me via Twitter that the sexy Broadcom Tomahawk 32x100G switches only do line-rate with >= 250B packets. Seems to be confirmed on ipspace.net.
As far as other switches go, Mellanox Spectrum can do line-rate at all packet sizes. I haven't seen a number for Cavium Xpliant.
I don't think you should go out of your way to support what seems to be a bad NIC, i.e. if it requires you to move packets to certain places and thus decreases performance in Snabb it's a bad move. I want to be able to get wirespeed performance out of this by asking a vendor to produce a fast NIC and then just throwing more cores at it. If someone doesn't need wirespeed they can buy a bad/cheaper NIC (seemingly like this Mellanox) and use fewer cores. Most importantly the decision on pps/bps should be with the end-user :) The first 10G NICs I used didn't do much more than 5Gbps. I think it's too early in the life of 100G NICs to draw conclusions on general trends.
Here are some public performance numbers from Mellanox: https://www.mellanox.com/blog/2016/06/performance-beyond-numbers-stephen-curry-style-server-io/ The headlines there are line-rate 64B with 25G NIC and 74.4 Mpps max on 100G. (I am told they have squeezed a bit more than this on the 100G but I haven't found a published account of that.) Note that there are two different ASICs: "ConnectX-4" (100G) and "ConnectX-4 Lx" (10G/25G/40G/50G). If you needed more silicon horsepower per 100G, for example to do line-rate with 64B packets, maybe combining 4x25G NICs would be a viable option? (Is that likely to cause interop issues with 100G ports on switches/routers in practice?) |
Interesting graph. Is it some fixed buffer size that leads to the plateaus? |
Good question. It looks like the size of each packet is effectively being rounded up to a multiple of 64. I wonder what would cause this? Suspects to eliminate:
DMA/PCIe
I would really like to extend our PMU support to also track "uncore" counters like PCIe/RAM/NUMA activity. This way we could include all of those values in the data sets. Meanwhile I created a little table by hand. This shows the PCIe activity on both sides of the first four distinct plateaus.
This is based on me snipping bits from the output of the Intel Performance Counter Monitor tool. I found some discussion of its output here. Here is a very preliminary idea of how I am interpreting these columns:
How to interpret this? In principle it seems tempting to blame the "64B-wide plateau" issue on DMA if it is fetching data in 64B cache lines. Trouble is that then I would expect to see the same level of PCIe traffic for both sides of the plateau -- and probably with PCIe bandwidth maxed out at 128Gbps (PCIe 3.0 x16 slot). However, in most cases it seems like PCIe bandwidth is not maxed out and the right-hand side of the plateau is transferring more data. So: no smoking gun from looking at PCIe performance counters.
Ethernet MAC/PHY
I have never really looked closely at the internals of Layer-2 and Layer-1 on Ethernet. Just a few observations from reading wikipedia though:
So as a wild guess it seems possible that 100GbE would show some effects at 32-byte granularity (64 bit * 4 channel) based on the physical transport. However, this would only be 1/2 of the plateau size, and I don't know whether this 64-bit/4-channel grouping is visible in the MAC layer or just an internal detail of the physical layer. I am running a test on a 10G ConnectX-4 NIC now just out of curiosity. If this showed plateaus with 1/4 the width then it may be reasonable to wonder if the issue is PHY related (10GbE also uses 64b/66b but via only one channel). |
Probably want to look at the L3 miss rate and/or DDR read counters as it steps. Gen3 x16 will max out around ~112Gbps after the encoding overhead in practice.
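For reference, here is a rough back-of-envelope sketch (my own arithmetic, not a quoted figure) of where a ~112Gbps practical ceiling for a Gen3 x16 slot could come from, assuming 128b/130b line coding and a typical 256B max-payload TLP with roughly 24B of framing per TLP:

```python
# Rough Gen3 x16 ceiling: 128b/130b line coding, then per-TLP framing.
lanes       = 16
gt_per_lane = 8e9          # 8 GT/s per lane (Gen3)
encoding    = 128 / 130    # 128b/130b line coding

raw_gbps = lanes * gt_per_lane * encoding / 1e9
print("after encoding:      %.1f Gbps" % raw_gbps)                 # ~126.0

tlp_payload  = 256         # assumed max payload per TLP (bytes)
tlp_overhead = 24          # start + sequence + header + LCRC, roughly
efficiency   = tlp_payload / (tlp_payload + tlp_overhead)
print("after TLP overhead: ~%.0f Gbps" % (raw_gbps * efficiency))   # ~115
# DLLP acks and flow-control credits shave off a little more,
# which lands close to the ~112Gbps practical figure quoted above.
```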
I don't think 64b/66b has anything to do with this. That's just avoiding certain bit patterns on the wire and happens really close to the wire, nor do I think it's related to the AUI interface (which I assume you are referring to). Doesn't the NIC copy packets from RAM to some little circular transmit buffer just before it sends them out? Is that buffer carved up in 64 byte slices?
Yeah, 64/66 encoding is not connected to this at all; there's absolutely no flow control when you're at that level, it's 103.125Gbps or 0 Gbps with nothing in between. There should be some wide FIFOs before transferring to the CMAC and down the wire, but even then it should be at least a 64B-wide (read: 512-bit) interface, which means 512b x say 250MHz -> 128Gbps. More importantly that will affect the PPS rate, which even at 128B packets would clock in at 125Mpps (2 clocks @ 250MHz). My money is on L3/LLC or Uncore or QPI cache/request counts.
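To spell out that datapath arithmetic (assuming, purely for illustration, a 512-bit interface clocked at 250MHz rather than any particular NIC's real design):

```python
width_bits = 512           # 64B-wide internal interface (assumed)
clock_hz   = 250e6         # 250MHz clock (assumed)

print("datapath bandwidth: %.0f Gbps" % (width_bits * clock_hz / 1e9))   # 128

pkt_bytes = 128
beats     = -(-pkt_bytes // (width_bits // 8))   # ceil(128 / 64) = 2 clocks per packet
print("128B packets: %.0f Mpps" % (clock_hz / beats / 1e6))              # 125
```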
@fmadio Yes, this sounds like the most promising line of inquiry now: Can we explain the performance here, including the plateau every 64B, in terms of the way the memory subsystem is serving DMA requests. And if the memory subsystem is the bottleneck then can we improve its performance e.g. by serving more requests from L3 cache rather than DRAM. Time for me to read the Intel Uncore Performance Monitoring Reference Manual... |
Yup, it's probably QPI / L3 / DDR somewhere, somehow. Assuming the Tx payloads are unique memory locations the plateau is the PCIe requestor hitting a 64B line somewhere, and the drop is the additional latency to fetch the next line, probably Uncore -> QPI -> LLC/L3. Note that the PCIe EP on the Uncore does not do any prefetching such as the CPU's DCU streamer, thus it's a cold hard miss... back to the fun days of CPUs with no cache! If you really want to dig into it I suggest getting a PCIe sniffer, but those things are damn expensive :(
@fmadio Great thoughts, keep 'em coming :). I added a couple of modeled lines based on your suggestions:
Here is how it looks (graph omitted). This looks in line with the theory of a memory/uncore bottleneck:
One more interesting perspective is to change the Y-axis from Mpps to % of line rate: Looks to me like:
So the next step is to work out how to keep the PCIe pipe full with cache lines and break the 80G bottleneck. |
Cool, one thing I totally forgot is that 112Gbps is PCIe Posted Write bandwidth. As our capture device is focused on writes to DDR I have not tested what the max DDR read bandwidth would be; it's quite possible the system runs out of PCIe Tags, at which point peak read bandwidth would suffer. Probably the only way to prefetch data into the L3 is via the CPU, but that assumes the problem is an L3 / DDR miss and not something else. Would be interested if you limit the Tx buffer addresses to be < total L3 size, e.g. is the problem an L3 -> DDR miss or something else?
Also, for the Max PCIe/MLX4 green line: looks like you're off by one 64B line somehow?
This is an absolutely fascinating problem. Can't put it down :). @fmadio Great info! So on the receive path the NIC uses "posted" (fire and forget) PCIe operations to write packet data to memory but on the transmit path it uses "non-posted" (request/reply) operations to read packet data from memory. So the receive path is like UDP but the transmit path is more like TCP where performance can be constrained by protocol issues (analogous to window size, etc). I am particularly intrigued by the idea of "running out of PCIe tags." If I understand correctly the number of PCIe tags determines the maximum number of parallel requests. I found one PCIe primer saying that the typical number of PCIe tags is 32 (but can be extended up to 2048). Now I am thinking about bandwidth delay products. If we know how much PCIe bandwidth we have (~220M 64B cache-lines per second for 112Gbps) and we know how many requests we can make in parallel (32 cache lines) then we can calculate the point at which latency will impact PCIe throughput:
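Here is that back-of-envelope calculation under the stated assumptions (~112Gbps of PCIe read bandwidth, 64B cache-line reads, 32 outstanding tags, and the measured ~6.6ns-per-cache-line slope described below):

```python
pcie_gbps  = 112           # practical PCIe read bandwidth (assumed)
line_bytes = 64            # one cache line per read
tags       = 32            # outstanding non-posted requests (assumed)

lines_per_sec = pcie_gbps * 1e9 / (line_bytes * 8)        # ~218.75M reads/s
max_latency_s = tags / lines_per_sec                      # tolerable average latency
print("max average read latency: %.0f ns" % (max_latency_s * 1e9))    # ~146 ns

# The measured slope below is ~6.6ns per extra cache line; with 32 reads
# in flight that implies an actual average latency of:
actual_latency_s = 6.6e-9 * tags
print("implied actual latency:   %.0f ns" % (actual_latency_s * 1e9))  # ~211 ns

print("expected fraction of PCIe rate: %.0f%%"
      % (100 * max_latency_s / actual_latency_s))                      # ~69%, i.e. roughly 70%
```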
So the maximum (average) latency we could tolerate for PCIe-rate would be 146 nanoseconds per cache line under these assumptions. Could this be the truth? (Perhaps with slightly tweaked constants?) Is there a way to check without a hardware PCIe sniffer? I made a related visualization. This shows nanoseconds per packet (Y-axis) based on payload-only packet size in cache lines (X-axis). The black line is the actual measurements (same data set as before). The blue line is a linear model that seems to fit the data very well. The slope of the line says that each extra cache line of data costs an extra 6.6 nanoseconds. If we assumed that 32 reads are being made in parallel then the actual latency would be 211 nanoseconds. Comparing this with the calculated limit of 146 nanoseconds for PCIe line rate we would expect to achieve around 70% of PCIe line rate. This is a fairly elaborate model but it seems worth investigating because the numbers all seem to align fairly well to me. If this were the case then it would have major implications i.e. that the reason for all this fussing about L3 cache and DDIO is due to under-dimensioned PCIe protocol resources on the NIC creating artificially tight latency bounds on the CPU. (Relatedly: The Intel FM10K uses two PCIe x8 slots ("bifurcation") instead of one PCIe x16 slot. This seemed awkward to me initially but now I wonder if it was sound engineering to provision additional PCIe protocol/silicon resources that are needed to achieve 100G line rate in practice? This would put things into a different light.) |
Do I understand correctly that these are full duplex receive and transmit tests, and that they are being limited by the transmit side because of the non-posted semantics of the way the NIC is using the PCIe bus? |
No and maybe, in that order ;-). This is transmit-only (packetblaster) and this root cause is not confirmed yet, just idea de jour. Some more details of the setup over in #1007. |
The NIC probably has to use non-posted requests here - a read request needs a reply to get the data - but maybe it needs to make twice as many requests in parallel to achieve robust performance.
@wingo A direct analogy here is if your computer couldn't take advantage of your fast internet connection because it had an old operating system that advertises a small TCP window. Then you would only reach the advertised speed if latency is low e.g. downloading from a mirror at your ISP. Over longer distances there would not be enough packets in flight to keep the pipe full. Anyway, just a theory, fun if it were true... |
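To put illustrative numbers on that analogy (a classic un-scaled 64KB TCP window over a 1Gbps link; the figures are for illustration only, not from a measured system):

```python
window_bytes = 64 * 1024   # classic un-scaled TCP receive window
link_bps     = 1e9         # 1 Gbps access link

for rtt_ms in (1, 10, 50):
    throughput = min(link_bps, window_bytes * 8 / (rtt_ms / 1e3))
    print("RTT %2d ms -> at most %6.1f Mbps" % (rtt_ms, throughput / 1e6))
# 1ms -> ~524 Mbps, 10ms -> ~52 Mbps, 50ms -> ~10 Mbps: the longer the
# pipe, the more data (TCP window, or PCIe tags) you need in flight.
```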
A few things.
For 100G packet capture we don't care about latency much, just maximum throughput, thus we haven't dug around there much. We'll add full nanosecond-accurate 100G line rate PCAP replay in a few months, at which point latency and maximum non-posted read bandwidth become important.
@fmadio Thanks for the info! I am struck that "networks are networks" and all these PCIe knobs seem to have direct analogies in TCP. "Extended tags" is window scaling, "credits" is advertised window, bandwidth*delay=parallel constraint is the same. Sensible defaults change over time too e.g. you probably don't want to use Windows XP default TCP settings for a 1 Gbps home internet connection. (Sorry, I am always overdoing analogies.) So based on the info from @fmadio it sounds like my theory from the weekend may not actually fit the constants but let's go down the PCIe rabbit hole and find out anyway. I have poked around the PCIe specification and found a bunch of tunables but no performance breakthrough yet. Turns out that
Observations:
I have tried a few different settings (e.g. …). I would like to check if we are running out of "credits" and whether that is limiting parallelism. I suppose this depends on how much buffer space the processor uncore is making available to the device. Guess the first place to look is the CPU datasheet. I suppose that it would be handy to have a PCIe sniffer at this point. Then we could simply observe the number of parallel requests that are in flight. I wonder if there is open source Verilog/VHDL code for a PCIe sniffer? I could easily be tempted to buy a generic FPGA for this kind of activity but a single-purpose PCIe sniffer seems like overkill. Anyway - I reckon we will be able to extract enough information from the uncore performance counters in practice. BTW lspci continues with more parameters that may also include something relevant:
Just a note about the server that I am using for testing here (
Could be that we could learn interesting things by testing both 100G ports in parallel. Just briefly I tested with 60B and 1500B packets. In both cases the traffic is split around 50/50 between ports. On 1500B I see aggregate 10.75 Mpps (well above single-port rate of ~6.3 Mpps) and on 60B I see aggregate 76.2 Mpps (only modestly above the single-port rate of 68 Mpps). |
HDL these days is almost entirely packet based; all the flow control and processing inside those fancy ASICs is mostly packet based. So all the same algos are there, with different names and formats, but a packet is still a packet regardless of whether it contains a TCP header or a QPI header. Surprised the device shows up as x16 - means you've got a PLX chip there somewhere acting as a bridge. It should be 2 separate and distinct PCIe devices. You can't just make a PCIe sniffer; a bridge would be easier. You realize an oscilloscope capable of sampling PCIe3 signals will cost $100-$500K? Those things are damn expensive. A PCIe sniffer will "only" cost a meager $100K+ USD. On the FPGA side monitoring the credits is pretty trivial. Forget if the Intel PCM kit has anything about PCIe credits or monitoring Uncore PCIe FIFO sizes. It's probably there somewhere, so if you find something it would be very cool to share.
@fmadio ucevent has a mouth-watering list of events and metrics. I am working out which ones are actually supported on my processor and trying to untangle CPU/QPI/PCIe ambiguities. Do any happen to catch your eye? This area is obscure enough that googling doesn't yield much :-).
@fozog Great thoughts! I would love to see a packet processing framework designed around back-to-back contiguous packets. This would be a very interesting design and potentially in harmony with CPUs in many ways. This would also be quite aligned with the physical link i.e. ethernet is essentially a serial port of sorts. However, my suspicion is that this may be a short-term band-aid rather than the way forward. Seems like it is straightforward to do 40G/100G 64B line-rate using multiple 10G adapters and only a challenge with 40G/100G adapters. So what is the difference? Seems to me like the CPU, L3 cache, RAM, PCIe lanes, etc, all have enough capacity and the limitation is somewhere in the NIC. I would bet a cold beverage that the problem for high-packet-rate 40G/100G today is the PCIe endpoints in the NICs being under-dimensioned. Likely the PCIe endpoint is a generic IP core licensed from a third party who is now racing to fix the performance in time for the PCIe 4.0 refresh. Would you take that bet? :-) |
@lukego Thanks. A DMA transaction needs a completion. So we have the current timing: this means roughly 30M DMA transactions per second... If I am not mistaken, early completions are sent when the DMA is received at the uBOX, but because of delays you can't have too many early completions, hence the credit mechanism. In addition this theoretical reasoning assumes that all packets fit in L3, which may be possible for an L2 switch that just operates on MACs, but typically not possible for complex firewalls or OpenFlow switches with a decent number of rules... If we double DMA capacity for PCIe gen4 and use x16 ports, it may be possible to support 100Gbps line rate, but I think it is weak: it depends on a perfect distribution of packets across queues and limits the functionality that can be applied. In addition, the talks I have had with people designing for 400Gbps conclude that moving one packet at a time is a huge waste while moving packets in batches allows more time budget for packet handling (it is an intuitive truth, don't you think?). One scarce resource of data centers is square meters, or more precisely volume. On a 1U server you can squeeze in the CPU power to handle 200Gbps of data (I made the test with a simple web server built on a commercial TCP/IP stack leveraging DPDK) but only two NICs.... So using 10G adapters instead of 40G or 100G is not really efficient... Did I win the cold beverage?
@fozog I think you make an excellent case. I will buy you that cold beverage when our paths cross one day :). I would quite like to play with this design in Snabb. I have been thinking related thoughts on other topics like "Packet Copies: Cheap or Expensive?" (#648). I have a couple of objections in practical terms to adoption though: First is that we cannot implement this design using existing NICs (e.g. Intel) on the receive path, right? So the implementation benefits need to outweigh the compatibility issues. (Thinking VHS / BETA.) Second is that line-rate 64B is not actually an important workload. 64B packets are tiny and nobody really wants to send millions of them across their networks back-to-back. The real reason that people focus on 64B line-rate is to avoid creating a bottleneck by introducing a new device that can't keep up with the existing ones. If you bought an expensive router that does 64B line-rate then you will not want to connect that to other devices that can't keep up e.g. a switch or a monitoring probe because then you are "wasting" your capacity. On the other hand if you already bought a switch that only does 256B line-rate (e.g. 100G Tomahawk) then the other devices only need to keep up with that - there is no value in being faster because the bottleneck does not change. So, tongue only half in cheek, I would propose to solve this problem by specifying a minimum packet size of 256B for the network and enforcing this with policing in the core. This will be more than adequate for real-world internet traffic and it will save a lot of unnecessary engineering! |
You can use Chelsio NICs that implement such coalescing scheme since 2012 (not sure about the date, I mean this is not new) with their T4 chip. The kernel driver wraps SKBs around packets placed in the buffers managed by T5. It is unclear whether the DPDK driver supports T4 and T5 chips. The key challenge is to coalesce packets on the send side because it is the software that has to do that. According to statistics I gathered from Internet Exchanges a few years ago, 64 byte packets account for 30% of packets observed on a 10Gbps link (TCP SYN, SYN-ACK, ACK, RST; some UDP...). Being line rate at 256 bytes is the typical metric of telecom operator RFPs so I could agree with you that this is the practical line rate target. That said, when you have DDoS (TCP SYN or DNS amplified) then 64 byte packets are representing the vast majority of packets. So the 64 byte packet rate behavior becomes the metric for attack resilience. For a DPI or a security box, you want 64 byte line rate. For other nodes in the network, 256 line rate is good enough. |
I have an observation to make based on the "core box does 256B so we don't need to care about more" argument. There is a difference here between that core box and Snabb. The core device has tens or hundreds of 100G ports and sees all types of traffic in the network, of which the average packet size must be higher than 256B. Snabb is typically used to implement a specific application, so it will not see the network's average packet size - it will see the packet size for this specific application. If Snabb is a single application and it gets a DDoS attack of 200Gbps at 64B packets it will have to process all of those. For my core box the influx of 200Gbps of 64B packets will only marginally lower the average packet size since it is already shuffling around say 10Tbps of other traffic with a much higher average packet size.
@plajjan Good point! @fozog I really like this idea of storing packets contiguously in memory. So instead of having a ring containing pointers to ~100 packets we could have a ring containing ~10kb of serial packet data. Then if we want to transform the packets (e.g. add/remove an 802.1Q header) we make an efficient streaming copy with the transformation applied on the way through. I think this would be most interesting if the application code used a streaming API for accessing the packets rather than pretending they are scattered in memory (even if that may be necessary for compatibility in certain cases.) This way the streaming would be an optimization for both the CPU and the PCIe (rather than improving PCIe efficiency at the expense of the CPU.) This would be a very interesting experiment to play with :) even purely on the CPU side to see if the streaming programming model is convenient and efficient in practice. I would love to have a hackable NIC in the spirit of lowRISC to be able to experiment and productionize new features on the hardware side too... cc @fmadio :-)
@lukego I'm still game for this, except the HDL can not be open source. Could do source access with an NDA if you want to hack on it tho. Driver and other stuff can be full GPL. You're starting to realize some of the cool things that can be done if you throw out the existing ring/descriptor paradigm - the architecture starts looking pretty different :)
Just thinking that a simple way to experiment on the contiguous packet idea would be to process pcap files. These already have packets stored back-to-back with a modest amount of per-packet metadata (16B for timestamp + length.) Challenge could be along the lines of:
Could be that this would work fantastically well and we would want to update the whole of Snabb to operate on contiguous packet data instead of isolated buffers. Or could be disappointing. dunno :). |
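To make the pcap idea above concrete, here is a minimal sketch (assuming a classic little-endian, non-pcapng capture; the file name is hypothetical) that walks the back-to-back records the same way a contiguous packet ring would be walked:

```python
import struct

def scan_pcap(path):
    """Count packets/bytes by walking the back-to-back records of a
    classic pcap file (24B global header, 16B per-record header).
    Assumes a little-endian, non-pcapng capture."""
    packets = byte_total = 0
    with open(path, "rb") as f:
        f.read(24)                      # global header: magic, version, snaplen, linktype
        while True:
            rec = f.read(16)            # ts_sec, ts_usec, incl_len, orig_len
            if len(rec) < 16:
                break
            _, _, incl_len, orig_len = struct.unpack("<IIII", rec)
            f.read(incl_len)            # packet bytes sit contiguously before the next record
            packets += 1
            byte_total += orig_len
    return packets, byte_total

# Hypothetical usage:
# print(scan_pcap("capture.pcap"))
```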
@fmadio Have you seen lowRISC? I think this is an interesting model. They have sat down and built a base hardware implementation (CPU) completely from scratch. Huge investment of effort presumably. Now they want to start producing ASICs at regular intervals so that any contributions that are upstreamed will become silicon automatically. This could then create a feedback loop where lots of users want to improve the upstream repository so that they will have better hardware available for their future projects. Quite a bootstrapping challenge. If it works out for CPUs then maybe somebody can try it for NICs. |
@lukego Yes, saw lowRISC; it's non-profit and grant (pseudo VC) funded. We are obviously for-profit and bootstrapped, almost the complete opposite. The PCAP header is horribly inefficient and slow. We do use a 16B header except it's much better packed and optimized for HDL. Try working with us - maybe being fully open will be less important to you once we get a rapid feedback loop running. Could do some pretty cool things with our NIC + Snabb I think.
@fmadio I would love to play with you guys' stuff. Just now though I am not involved in any projects that need 64B line-rate or that can use non-COTS hardware. So I am limited to experimenting with software architectures and blue-sky dreams for future projects with more exotic hardware :-). |
@fozog I am really interested in the "store packets in contiguous serial memory" idea. I have been pondering it a little more. Current feeling is that it makes sense in niche applications - e.g. 100G 64B-line-rate packet capture using NICs that exist in 2017 - but that discrete packets are likely more suitable for general application development (e.g. main Snabb/DPDK use cases.) Few reasons:
So for the moment I am going to switch my daydreaming over to the topic of array programming in #1099 :-) |
@lukego I am aware of some silicon forwarding implementations that rely on this "aggregated frames" concept across the backplane switch fabric. As I recall, the packet processors were sending/receiving them to/from fab deaggregated because of an intermediate step in fabric tx/rx (which I reckon had more to do with locality breeding easier interconnect wrt deaggregated frames and packet processors). The way packets were brought across fabric to prevent HOL blocking in this case was to arbitrate frames in VoQ fashion. This assumes, I think, that one can A) model the output queues and B) determine or, perhaps, even simply predict the output device before bringing it in. If you have tighter control of the PCI interaction and expose a set of "VoQs" as (example) a separate credit recipient from PCI perspective, would this let you stop HOL blocking and potentially just drop in HW? Would the complexities of modeling some sorts of "predictive" queueing outweigh the benefits? And more importantly, is this idea really just too device-specific, situation-specific and overall just a pain to handle? Hoping this will breed some other ideas too, perhaps without the need for such complicated predictions in our COTS NIC HW. |
I was looking into the DPDK site's list of NICs (imagine how cool it would be if such a long list existed for snabb ...) for those that support 100G, and I think one more recent interesting addition - based on cost considerations - is obviously the entry "15. QEDE Poll Mode Driver" that refers to the Qlogic FastLinQ QL45000 series (resold by some major server vendors): http://dpdk.org/doc/guides/nics/qede.html The Qlogic site at http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/SearchByProduct.aspx?ProductCategory=325&Product=1263&Os=26 shows not just the drivers, but also allows you to download detailed documentation dated late 2016: the chipset register specification (approx. 350 pages) and a developer's guide (approx. 130 pages). The overall offering of real 100G NIC chips will probably keep on broadening - hopefully the industry reshuffling will also drive more players to open up their documentation. Qlogic's direction after the takeover by Cavium certainly helps. I am also curious about the outcome for another group, as Emulex was swallowed by Avago, which later took over Broadcom but retained the Broadcom brand name for the whole group (the stock ticker is still AVGO). We should expect a pretty competitive 100G NIC from a price/performance point of view from them. A quite open question here is whether Broadcom will remain rather locked up wrt. documentation, as this is how they managed it in the past. Certainly Qlogic/Cavium, Mellanox and others have embraced the idea of maximising their usefulness in the context of FOSS projects more than some others... the official website still shows no 100G NIC product, but IMHO it really can't take very much longer until they come out of stealth mode with something like that.
@lukego 5.9 cycles per RAM! That was a striking number when I mentally compared it to the DRAM timing characteristics I am aware of. The site says 5.9ns. Well, that seemed more realistic, but if you read on top it says "RAM Latency = 42 cycles + 51 ns" which is what I would have thought. I am afraid the 5.9ns calculation is close to the following reasoning: "with nine women we can get a baby in one month". But you are making very good points on the future of PCI and L3. I think the fundamental benefit of coalesced packets is the cache protection. By having 2K aligned packet buffers, I suspect that packets are not that well distributed and tend to hit particular areas of the L3 cache (no concrete measurements made) and the number of cache ways may not suffice to absorb the pressure. By contrast, consecutive packets should dispatch cachelines to all L3 slots.
@fozog You are right about 2KB aligned packet buffers. We have found that all data structures used in the traffic plane need to have a lot of entropy in the lower bits (mod 4096) of their addresses. Otherwise we see cache conflict misses that impact performance in a big way. This has bitten us on packet buffers (once upon a time 4096-aligned) and also on everything allocated as file-backed shared memory (kernel allocates on a 4096 byte page boundary.) I think the "RAM delivers one random-access read every 5.9ns" claim is following the reasoning that "with nine women we can produce nine babies in nine months" i.e. latency is 9 months but average throughput is one baby per month due to parallelism. In our code we are often performing the latency-sensitive operations on many packets in a row (like loading the first cache line of payload or making a lookup in a table) and so we should be well positioned to exploit instruction-level parallelism. (Compare with e.g. Linux kernel that tends to process one packet at a time and will need to find a way to overlap computation with individual high-latency operations like loading from memory and taking locks.) |
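A toy model of the alignment effect described above (a plain set-associative cache with 64B lines; real Xeon L3s hash addresses across slices, so this is only a cartoon of the conflict-miss mechanism):

```python
line = 64
sets = 64          # e.g. a 32KB 8-way L1: 32768 / (64 * 8) = 64 sets

def set_index(addr):
    # With 64B lines the set index comes from the bits just above bit 6,
    # and 4096-aligned addresses all share their low 12 bits.
    return (addr // line) % sets

aligned = [n * 4096 for n in range(16)]              # page-aligned packet buffers
spread  = [n * 4096 + n * 320 for n in range(16)]    # same buffers with low-bit entropy added

print(sorted({set_index(a) for a in aligned}))   # [0]: everything lands in one set
print(sorted({set_index(a) for a in spread}))    # 16 different sets
```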
Re: head-of-line blocking, I would expect it's reasonable to just drop packets on any egress port whose FIFO is full?
@fmadio You are right. If I am using contiguous buffers from both input and output then I will need to make a data copy to transfer from ingress queue memory to egress queue memory. I could make these copies in FIFO order. Once that copy is done then the ingress memory can be freed because the DMA is referencing the egress memory. If no egress memory is available then the packet is dropped (never held pinned in the ingress memory.) I have been imagining a zero-copy design where packet data stays in the same place i.e. the same memory is referenced by DMA descriptors for both ingress and egress. However that would seem to be the wrong design for contiguous packet buffers because then you have to somehow "garbage collect" memory that can be referenced by multiple DMA requests with indefinite lifetimes. Seems like the right alternative would be to accept the copies and aggressively optimize them. Is that how you are thinking too? |
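A minimal sketch of that copy-and-drop scheme (names and sizes are made up; the point is just to make the control flow concrete):

```python
class ContiguousRing:
    """Packets stored back-to-back in one buffer (hypothetical sketch)."""
    def __init__(self, size):
        self.buf  = bytearray(size)
        self.used = 0                  # bytes currently queued for egress DMA

    def append(self, pkt):
        if self.used + len(pkt) > len(self.buf):
            return False               # egress full: caller drops the packet
        self.buf[self.used:self.used + len(pkt)] = pkt
        self.used += len(pkt)
        return True

def forward(ingress_packets, egress):
    """Copy packets in FIFO order; ingress memory is reclaimable either way."""
    dropped = 0
    for pkt in ingress_packets:
        if not egress.append(pkt):
            dropped += 1               # never hold the packet pinned in ingress memory
    return dropped

egress = ContiguousRing(8192)
print(forward([b"\x00" * 300 for _ in range(40)], egress))   # 27 fit, 13 dropped
```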
(Could also perhaps implement the zero-copy mode with reference counters on the memory blocks. Then you increment the reference count when you create a DMA descriptor and you decrement it when the DMA completes. Memory is freed when count reaches zero. This is actually what we did in Snabb back in the distant past but our experience lead us to prefer a simpler design with fewer code paths to optimize.) |
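And for comparison, a minimal sketch of the reference-counted variant described in the parenthesis above (hypothetical API, just to show the lifecycle):

```python
class Block:
    """Reference-counted packet memory (hypothetical sketch)."""
    def __init__(self):
        self.refs  = 0
        self.freed = False

    def retain(self):          # a DMA descriptor starts referencing this memory
        self.refs += 1

    def release(self):         # that DMA completed
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # stand-in for returning the memory to the pool

blk = Block()
blk.retain(); blk.retain()     # referenced by an ingress and an egress descriptor
blk.release()
blk.release()
print(blk.freed)               # True: freed only after both DMAs complete
```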
@fmadio I totally agree that 100G sustained comes with costs but I just want to point out that Napatech actually does 100G sustained for all packet sizes ;) @lukego This is quite an interesting discussion that touches several good topics. One of the things is "contiguous serial memory" and in that regard I just want to stress that this approach from a PCIe perspective has the advantage that you avoid lots of overhead, enabling you to get sustained 100G with packet sizes < 128B. It's going to be a while before PCIe Gen4 is to be found in any Xeon systems, so right now if you really want to have 100G sustained my take is that "contiguous serial memory" is the only approach that works, and then you must RSS-scale to multiple cores to hide the overhead of the potentially involved memcpy().
@mlilja01 cool, when did you achieve sustained line rates? Last time I checked your 100G boards were burst capture only and limited to 40Gbps of host bandwidth? |
@fmadio I think we have had that since last summer :) |
@mlilja01 ok, guess I can no longer mouth off about it :) tho you need to tell the marketing folks. |
This issue is spinning off into an effort to design a "simple and pleases both hardware and software people" host-device interface called EasyNIC. Hope to chat with you guys on issues over there too :). |
Just a heads up that we looked at this again recently with a PCIe gen4 system and ConnectX-5 cards: #1471
I am pondering how to think about packet rates in the 100G era. How should we be designing and optimizing our software?
Consider these potential performance targets:
I have a whole mix of questions about these in my mind:
Raw brain dump...
So how would you optimize for each? In every case you would surely use multiple cores with RSS or equivalent traffic dispatching. Beyond that...
So which would be hardest to achieve? and why?
The one I have a bad feeling about is "B". Historically we are used to NICs that can do line-rate with 64B packets. However, those days may be behind us. If you read Intel datasheets then the smallest packet size at which they guarantee line rate is 64B for 10G (82599), 128B for 40G (XL710), and 256B for 100G (FM10K). (This is lower even than performance "A".) If our performance targets for the NICs are above what they are designed for then we are probably headed for trouble. I think if we want to support really high per-port packet rates then it will take a lot of work and we will be very constrained in which hardware we can choose (both vendor and enabled features).
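For context, here is what those guaranteed minimum packet sizes translate to in packets per second at line rate, counting the 20B of per-packet wire overhead (preamble + SFD + interframe gap):

```python
overhead = 20   # preamble (7B) + SFD (1B) + interframe gap (12B) per frame

def line_rate_mpps(link_gbps, frame_bytes):
    return link_gbps * 1e9 / ((frame_bytes + overhead) * 8) / 1e6

print("10G  @  64B: %6.1f Mpps" % line_rate_mpps(10, 64))    #  ~14.9 (82599)
print("40G  @ 128B: %6.1f Mpps" % line_rate_mpps(40, 128))   #  ~33.8 (XL710)
print("100G @ 256B: %6.1f Mpps" % line_rate_mpps(100, 256))  #  ~45.3 (FM10K)
print("100G @  64B: %6.1f Mpps" % line_rate_mpps(100, 64))   # ~148.8 (64B line rate)
```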
So, stumbling back towards the development de jour, I am tempted to initially accept the 64 Mpps per port limit observed in #1007 and focus on supporting "A" and "C". In practical terms this means spending my efforts on writing simple and CPU-efficient transmit/receive routines rather than looking for complex and CPU-expensive ways to squeeze more packets down a single card. We can always revisit the problem of squeezing the maximum packet rate out of a card in the context of specific applications (e.g. packetblaster and firehose) and there we may be able to "cheat" in some useful application-specific ways.
Early days anyway... next step is to see how the ConnectX-4 performs with simultaneous transmit+receive using generic routines. Can't take the 64 Mpps figure to the bank quite yet.
Thoughts?