100G Packet Rates: Per-CPU vs Per-Port #1013
Nice description of the problem, and I agree with your conclusions. For most real applications you should be able to reduce b) to c) at the cost of additional ports. Exceptions? Artificial constraints such as "dragster-race" competitions (Internet2 land-speed records) or unrealistic customer expectations ("we only ever buy kit that does line rate even with 64-byte christmas-tree-packet workloads"). Cost of additional ports may be a problem, but that needs to be weighed against development costs as well. (You can formulate that as a time-to-market argument where you have the choice of either getting a working system now and upgrading it to the desired throughput once additional ports have gotten cheaper, or waiting until a "more efficient" system is developed that can do the same work with just one port :-)
Relatedly: Nathan Owens pointed out to me via Twitter that the sexy Broadcom Tomahawk 32x100G switches only do line-rate with >= 250B packets. Seems to be confirmed on ipspace.net.
As far as other switches go, Mellanox Spectrum can do line-rate at all packet sizes. I haven't seen a number for Cavium Xpliant.
I don't think you should go out of your way to support what seems to be a bad NIC, i.e. if it requires you to move packets to certain places and thus decreases performance in Snabb it's a bad move. I want to be able to get wirespeed performance out of this by asking a vendor to produce a fast NIC and then just throwing more cores at it. If someone doesn't need wirespeed they can buy a bad/cheaper NIC (seemingly like this Mellanox) and use fewer cores. Most importantly the decision on pps/bps should be with the end-user :) The first 10G NICs I used didn't do much more than 5Gbps. I think it's too early in the life of 100G NICs to draw conclusions on general trends.
Here are some public performance numbers from Mellanox: https://www.mellanox.com/blog/2016/06/performance-beyond-numbers-stephen-curry-style-server-io/ The headlines there are line-rate 64B with 25G NIC and 74.4 Mpps max on 100G. (I am told they have squeezed a bit more than this on the 100G but I haven't found a published account of that.) Note that there are two different ASICs: "ConnectX-4" (100G) and "ConnectX-4 Lx" (10G/25G/40G/50G). If you needed more silicon horsepower per 100G, for example to do line-rate with 64B packets, maybe combining 4x25G NICs would be a viable option? (Is that likely to cause interop issues with 100G ports on switches/routers in practice?) |
Interesting graph. Is it some fixed buffer size that leads to the plateaus? |
Good question. It looks like the size of each packet is effectively being rounded up to a multiple of 64. I wonder what would cause this? Suspects to eliminate:
DMA/PCIe
I would really like to extend our PMU support to also track "uncore" counters like PCIe/RAM/NUMA activity. This way we could include all of those values in the data sets. Meanwhile I created a little table by hand. This shows the PCIe activity on both sides of the first four distinct plateaus.
This is based on me snipping bits from the output of the Intel Performance Counter Monitor tool. I found some discussion of its output here. Here is a very preliminary idea of how I am interpreting these columns:
How to interpret this? In principle it seems tempting to blame the "64B-wide plateau" issue on DMA if it is fetching data in 64B cache lines. Trouble is that then I would expect to see the same level of PCIe traffic for both sides of the plateau -- and probably with PCIe bandwidth maxed out at 128Gbps (PCIe 3.0 x16 slot). However, in most cases it seems like PCIe bandwidth is not maxed out and the right-hand side of the plateau is transferring more data. So: no smoking gun from looking at PCIe performance counters.
Ethernet MAC/PHY
I have never really looked closely at the internals of Layer-2 and Layer-1 on Ethernet. Just a few observations from reading wikipedia though:
So as a wild guess it seems possible that 100GbE would show some effects at 32-byte granularity (64 bit * 4 channel) based on the physical transport. However, this would only be 1/2 of the plateau size, and I don't know whether this 64-bit/4-channel grouping is visible in the MAC layer or just an internal detail of the physical layer. I am running a test on a 10G ConnectX-4 NIC now just out of curiosity. If this showed plateaus with 1/4 the width then it may be reasonable to wonder if the issue is PHY related (10GbE also uses 64b/66b but via only one channel). |
Probably want to look at the L3 miss rate and/or DDR read counters as it steps. Gen3 x16 will max out around ~112Gbps after the encoding overhead in practice.
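For reference, here is a rough back-of-envelope sketch (my own arithmetic, not a quoted figure) of where a ~112Gbps practical ceiling for a Gen3 x16 slot could come from, assuming 128b/130b line coding and a typical 256B max-payload TLP with roughly 24B of framing per TLP:

```python
# Rough Gen3 x16 ceiling: 128b/130b line coding, then per-TLP framing.
lanes       = 16
gt_per_lane = 8e9          # 8 GT/s per lane (Gen3)
encoding    = 128 / 130    # 128b/130b line coding

raw_gbps = lanes * gt_per_lane * encoding / 1e9
print("after encoding:      %.1f Gbps" % raw_gbps)                 # ~126.0

tlp_payload  = 256         # assumed max payload per TLP (bytes)
tlp_overhead = 24          # start + sequence + header + LCRC, roughly
efficiency   = tlp_payload / (tlp_payload + tlp_overhead)
print("after TLP overhead: ~%.0f Gbps" % (raw_gbps * efficiency))   # ~115
# DLLP acks and flow-control credits shave off a little more,
# which lands close to the ~112Gbps practical figure quoted above.
```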
I don't think 64b/66b has anything to do with this. That's just avoiding certain bit patterns on the wire and happens really close to the wire, nor do I think it's related to the AUI interface (which I assume you are referring to). Doesn't the NIC copy packets from RAM to some little circular transmit buffer just before it sends them out? Is that buffer carved up in 64 byte slices?
Yeah, 64/66 encoding is not connected to this at all; there's absolutely no flow control when you're at that level, it's 103.125Gbps or 0 Gbps with nothing in between. There should be some wide FIFOs before transferring to the CMAC and down the wire, but even then it should be at least a 64B-wide (read: 512-bit) interface, which means 512b x say 250MHz -> 128Gbps. More importantly that will affect the PPS rate, which even at 128B packets would clock in at 125Mpps (2 clocks @ 250MHz). My money is on L3/LLC or Uncore or QPI cache/request counts.
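To spell out that datapath arithmetic (assuming, purely for illustration, a 512-bit interface clocked at 250MHz rather than any particular NIC's real design):

```python
width_bits = 512           # 64B-wide internal interface (assumed)
clock_hz   = 250e6         # 250MHz clock (assumed)

print("datapath bandwidth: %.0f Gbps" % (width_bits * clock_hz / 1e9))   # 128

pkt_bytes = 128
beats     = -(-pkt_bytes // (width_bits // 8))   # ceil(128 / 64) = 2 clocks per packet
print("128B packets: %.0f Mpps" % (clock_hz / beats / 1e6))              # 125
```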
@fmadio Yes, this sounds like the most promising line of inquiry now: Can we explain the performance here, including the plateau every 64B, in terms of the way the memory subsystem is serving DMA requests. And if the memory subsystem is the bottleneck then can we improve its performance e.g. by serving more requests from L3 cache rather than DRAM. Time for me to read the Intel Uncore Performance Monitoring Reference Manual... |
Yup, it's probably QPI / L3 / DDR somewhere, somehow. Assuming the Tx payloads are unique memory locations the plateau is the PCIe requestor hitting a 64B line somewhere, and the drop is the additional latency to fetch the next line, probably Uncore -> QPI -> LLC/L3. Note that the PCIe EP on the Uncore does not do any prefetching such as the CPU's DCU streamer, thus it's a cold hard miss... back to the fun days of CPUs with no cache! If you really want to dig into it I suggest getting a PCIe sniffer, but those things are damn expensive :(
@fmadio Great thoughts, keep 'em coming :). I added a couple of modeled lines based on your suggestions:
Here is how it looks (graph omitted). This looks in line with the theory of a memory/uncore bottleneck:
One more interesting perspective is to change the Y-axis from Mpps to % of line rate: Looks to me like:
So the next step is to work out how to keep the PCIe pipe full with cache lines and break the 80G bottleneck. |
Cool, one thing I totally forgot is that 112Gbps is PCIe Posted Write bandwidth. As our capture device is focused on writes to DDR I have not tested what the max DDR read bandwidth would be; it's quite possible the system runs out of PCIe Tags, at which point peak read bandwidth would suffer. Probably the only way to prefetch data into the L3 is via the CPU, but that assumes the problem is an L3 / DDR miss and not something else. Would be interested if you limit the Tx buffer addresses to be < total L3 size, e.g. is the problem an L3 -> DDR miss or something else?
Also, for the Max PCIe/MLX4 green line: looks like you're off by one 64B line somehow?
This is an absolutely fascinating problem. Can't put it down :). @fmadio Great info! So on the receive path the NIC uses "posted" (fire and forget) PCIe operations to write packet data to memory but on the transmit path it uses "non-posted" (request/reply) operations to read packet data from memory. So the receive path is like UDP but the transmit path is more like TCP where performance can be constrained by protocol issues (analogous to window size, etc). I am particularly intrigued by the idea of "running out of PCIe tags." If I understand correctly the number of PCIe tags determines the maximum number of parallel requests. I found one PCIe primer saying that the typical number of PCIe tags is 32 (but can be extended up to 2048). Now I am thinking about bandwidth delay products. If we know how much PCIe bandwidth we have (~220M 64B cache-lines per second for 112Gbps) and we know how many requests we can make in parallel (32 cache lines) then we can calculate the point at which latency will impact PCIe throughput:
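Here is that back-of-envelope calculation under the stated assumptions (~112Gbps of PCIe read bandwidth, 64B cache-line reads, 32 outstanding tags, and the measured ~6.6ns-per-cache-line slope described below):

```python
pcie_gbps  = 112           # practical PCIe read bandwidth (assumed)
line_bytes = 64            # one cache line per read
tags       = 32            # outstanding non-posted requests (assumed)

lines_per_sec = pcie_gbps * 1e9 / (line_bytes * 8)        # ~218.75M reads/s
max_latency_s = tags / lines_per_sec                      # tolerable average latency
print("max average read latency: %.0f ns" % (max_latency_s * 1e9))    # ~146 ns

# The measured slope below is ~6.6ns per extra cache line; with 32 reads
# in flight that implies an actual average latency of:
actual_latency_s = 6.6e-9 * tags
print("implied actual latency:   %.0f ns" % (actual_latency_s * 1e9))  # ~211 ns

print("expected fraction of PCIe rate: %.0f%%"
      % (100 * max_latency_s / actual_latency_s))                      # ~69%, i.e. roughly 70%
```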
So the maximum (average) latency we could tolerate for PCIe-rate would be 146 nanoseconds per cache line under these assumptions. Could this be the truth? (Perhaps with slightly tweaked constants?) Is there a way to check without a hardware PCIe sniffer? I made a related visualization. This shows nanoseconds per packet (Y-axis) based on payload-only packet size in cache lines (X-axis). The black line is the actual measurements (same data set as before). The blue line is a linear model that seems to fit the data very well. The slope of the line says that each extra cache line of data costs an extra 6.6 nanoseconds. If we assumed that 32 reads are being made in parallel then the actual latency would be 211 nanoseconds. Comparing this with the calculated limit of 146 nanoseconds for PCIe line rate we would expect to achieve around 70% of PCIe line rate. This is a fairly elaborate model but it seems worth investigating because the numbers all seem to align fairly well to me. If this were the case then it would have major implications i.e. that the reason for all this fussing about L3 cache and DDIO is due to under-dimensioned PCIe protocol resources on the NIC creating artificially tight latency bounds on the CPU. (Relatedly: The Intel FM10K uses two PCIe x8 slots ("bifurcation") instead of one PCIe x16 slot. This seemed awkward to me initially but now I wonder if it was sound engineering to provision additional PCIe protocol/silicon resources that are needed to achieve 100G line rate in practice? This would put things into a different light.) |
Do I understand correctly that these are full duplex receive and transmit tests, and that they are being limited by the transmit side because of the non-posted semantics of the way the NIC is using the PCIe bus? |
No and maybe, in that order ;-). This is transmit-only (packetblaster) and this root cause is not confirmed yet, just idea de jour. Some more details of the setup over in #1007. |
The NIC probably has to use non-posted requests here - a read request needs a reply to get the data - but maybe it needs to make twice as many requests in parallel to achieve robust performance.
@wingo A direct analogy here is if your computer couldn't take advantage of your fast internet connection because it had an old operating system that advertises a small TCP window. Then you would only reach the advertised speed if latency is low e.g. downloading from a mirror at your ISP. Over longer distances there would not be enough packets in flight to keep the pipe full. Anyway, just a theory, fun if it were true... |
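To put illustrative numbers on that analogy (a classic un-scaled 64KB TCP window over a 1Gbps link; the figures are for illustration only, not from a measured system):

```python
window_bytes = 64 * 1024   # classic un-scaled TCP receive window
link_bps     = 1e9         # 1 Gbps access link

for rtt_ms in (1, 10, 50):
    throughput = min(link_bps, window_bytes * 8 / (rtt_ms / 1e3))
    print("RTT %2d ms -> at most %6.1f Mbps" % (rtt_ms, throughput / 1e6))
# 1ms -> ~524 Mbps, 10ms -> ~52 Mbps, 50ms -> ~10 Mbps: the longer the
# pipe, the more data (TCP window, or PCIe tags) you need in flight.
```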
A few things.
For 100G packet capture we don't care about latency much, just maximum throughput, thus we haven't dug around there much. We'll add full nanosecond-accurate 100G line rate PCAP replay in a few months, at which point latency and maximum non-posted read bandwidth become important.
@fmadio Thanks for the info! I am struck that "networks are networks" and all these PCIe knobs seem to have direct analogies in TCP. "Extended tags" is window scaling, "credits" is advertised window, bandwidth*delay=parallel constraint is the same. Sensible defaults change over time too e.g. you probably don't want to use Windows XP default TCP settings for a 1 Gbps home internet connection. (Sorry, I am always overdoing analogies.) So based on the info from @fmadio it sounds like my theory from the weekend may not actually fit the constants but let's go down the PCIe rabbit hole and find out anyway. I have poked around the PCIe specification and found a bunch of tunables but no performance breakthrough yet. Turns out that
Observations:
I have tried a few different settings (e.g. …). I would like to check if we are running out of "credits" and whether that is limiting parallelism. I suppose this depends on how much buffer space the processor uncore is making available to the device. Guess the first place to look is the CPU datasheet. I suppose that it would be handy to have a PCIe sniffer at this point. Then we could simply observe the number of parallel requests that are in flight. I wonder if there is open source Verilog/VHDL code for a PCIe sniffer? I could easily be tempted to buy a generic FPGA for this kind of activity but a single-purpose PCIe sniffer seems like overkill. Anyway - I reckon we will be able to extract enough information from the uncore performance counters in practice. BTW lspci continues with more parameters that may also include something relevant:
Just a note about the server that I am using for testing here (
Could be that we could learn interesting things by testing both 100G ports in parallel. Just briefly I tested with 60B and 1500B packets. In both cases the traffic is split around 50/50 between ports. On 1500B I see aggregate 10.75 Mpps (well above single-port rate of ~6.3 Mpps) and on 60B I see aggregate 76.2 Mpps (only modestly above the single-port rate of 68 Mpps). |
HDL these days is almost entirely packet based; all the flow control and processing inside those fancy ASICs is mostly packet based. So all the same algos are there, with different names and formats, but a packet is still a packet regardless of whether it contains a TCP header or a QPI header. Surprised the device shows up as x16 - means you've got a PLX chip there somewhere acting as a bridge. It should be 2 separate and distinct PCIe devices. You can't just make a PCIe sniffer; a bridge would be easier. You realize an oscilloscope capable of sampling PCIe3 signals will cost $100-$500K? Those things are damn expensive. A PCIe sniffer will "only" cost a meager $100K+ USD. On the FPGA side monitoring the credits is pretty trivial. Forget if the Intel PCM kit has anything about PCIe credits or monitoring Uncore PCIe FIFO sizes. It's probably there somewhere, so if you find something it would be very cool to share.
@fmadio ucevent has a mouth-watering list of events and metrics. I am working out which ones are actually supported on my processor and trying to untangle CPU/QPI/PCIe ambiguities. Do any happen to catch your eye? This area is obscure enough that googling doesn't yield much :-).
@fozog Great thoughts! I would love to see a packet processing framework designed around back-to-back contiguous packets. This would be a very interesting design and potentially in harmony with CPUs in many ways. This would also be quite aligned with the physical link i.e. ethernet is essentially a serial port of sorts. However, my suspicion is that this may be a short-term band-aid rather than the way forward. Seems like it is straightforward to do 40G/100G 64B line-rate using multiple 10G adapters and only a challenge with 40G/100G adapters. So what is the difference? Seems to me like the CPU, L3 cache, RAM, PCIe lanes, etc, all have enough capacity and the limitation is somewhere in the NIC. I would bet a cold beverage that the problem for high-packet-rate 40G/100G today is the PCIe endpoints in the NICs being under-dimensioned. Likely the PCIe endpoint is a generic IP core licensed from a third party who is now racing to fix the performance in time for the PCIe 4.0 refresh. Would you take that bet? :-) |
@lukego Thanks. A DMA transaction needs a completion. So we have the current timing: this means roughly 30M DMA transactions per second... If I am not mistaken, early completions are sent when the DMA is received at the uBOX, but because of delays you can't have too many early completions, hence the credit mechanism. In addition this theoretical reasoning assumes that all packets fit in L3, which may be possible for an L2 switch that just operates on MACs, but typically not possible for complex firewalls or OpenFlow switches with a decent number of rules... If we double DMA capacity for PCIe gen4 and use x16 ports, it may be possible to support 100Gbps line rate, but I think it is weak: it depends on a perfect distribution of packets across queues and limits the functionality that can be applied. In addition, the talks I have had with people designing for 400Gbps conclude that moving one packet at a time is a huge waste while moving packets in batches allows more time budget for packet handling (it is an intuitive truth, don't you think?). One scarce resource of data centers is square meters, or more precisely volume. On a 1U server you can squeeze in the CPU power to handle 200Gbps of data (I made the test with a simple web server built on a commercial TCP/IP stack leveraging DPDK) but only two NICs.... So using 10G adapters instead of 40G or 100G is not really efficient... Did I win the cold beverage?
@fozog I think you make an excellent case. I will buy you that cold beverage when our paths cross one day :). I would quite like to play with this design in Snabb. I have been thinking related thoughts on other topics like "Packet Copies: Cheap or Expensive?" (#648). I have a couple of objections in practical terms to adoption though: First is that we cannot implement this design using existing NICs (e.g. Intel) on the receive path, right? So the implementation benefits need to outweigh the compatibility issues. (Thinking VHS / BETA.) Second is that line-rate 64B is not actually an important workload. 64B packets are tiny and nobody really wants to send millions of them across their networks back-to-back. The real reason that people focus on 64B line-rate is to avoid creating a bottleneck by introducing a new device that can't keep up with the existing ones. If you bought an expensive router that does 64B line-rate then you will not want to connect that to other devices that can't keep up e.g. a switch or a monitoring probe because then you are "wasting" your capacity. On the other hand if you already bought a switch that only does 256B line-rate (e.g. 100G Tomahawk) then the other devices only need to keep up with that - there is no value in being faster because the bottleneck does not change. So, tongue only half in cheek, I would propose to solve this problem by specifying a minimum packet size of 256B for the network and enforcing this with policing in the core. This will be more than adequate for real-world internet traffic and it will save a lot of unnecessary engineering! |
You can use Chelsio NICs that implement such coalescing scheme since 2012 (not sure about the date, I mean this is not new) with their T4 chip. The kernel driver wraps SKBs around packets placed in the buffers managed by T5. It is unclear whether the DPDK driver supports T4 and T5 chips. The key challenge is to coalesce packets on the send side because it is the software that has to do that. According to statistics I gathered from Internet Exchanges a few years ago, 64 byte packets account for 30% of packets observed on a 10Gbps link (TCP SYN, SYN-ACK, ACK, RST; some UDP...). Being line rate at 256 bytes is the typical metric of telecom operator RFPs so I could agree with you that this is the practical line rate target. That said, when you have DDoS (TCP SYN or DNS amplified) then 64 byte packets are representing the vast majority of packets. So the 64 byte packet rate behavior becomes the metric for attack resilience. For a DPI or a security box, you want 64 byte line rate. For other nodes in the network, 256 line rate is good enough. |
I have an observation to make based on the "core box does 256B so we don't need to care about more" argument. There is a difference here between that core box and Snabb. The core device has tens or hundreds of 100G ports and sees all types of traffic in the network, of which the average packet size must be higher than 256B. Snabb is typically used to implement a specific application, so it will not see the network's average packet size - it will see the packet size for this specific application. If Snabb is a single application and it gets a DDoS attack of 200Gbps at 64B packets it will have to process all of those. For my core box the influx of 200Gbps of 64B packets will only marginally lower the average packet size since it is already shuffling around say 10Tbps of other traffic with a much higher average packet size.
@plajjan Good point! @fozog I really like this idea of storing packets contiguously in memory. So instead of having a ring containing pointers to ~100 packets we could have a ring containing ~10kb of serial packet data. Then if we want to transform the packets (e.g. add/remove an 802.1Q header) we make an efficient streaming copy with the transformation applied on the way through. I think this would be most interesting if the application code used a streaming API for accessing the packets rather than pretending they are scattered in memory (even if that may be necessary for compatibility in certain cases.) This way the streaming would be an optimization for both the CPU and the PCIe (rather than improving PCIe efficiency at the expense of the CPU.) This would be a very interesting experiment to play with :) even purely on the CPU side to see if the streaming programming model is convenient and efficient in practice. I would love to have a hackable NIC in the spirit of lowRISC to be able to experiment and productionize new features on the hardware side too... cc @fmadio :-)
@lukego I'm still game for this, except the HDL can not be open source. Could do source access with an NDA if you want to hack on it tho. Driver and other stuff can be full GPL. You're starting to realize some of the cool things that can be done if you throw out the existing ring/descriptor paradigm - the architecture starts looking pretty different :)
Just thinking that a simple way to experiment on the contiguous packet idea would be to process pcap files. These already have packets stored back-to-back with a modest amount of per-packet metadata (16B for timestamp + length.) Challenge could be along the lines of:
Could be that this would work fantastically well and we would want to update the whole of Snabb to operate on contiguous packet data instead of isolated buffers. Or could be disappointing. dunno :). |
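To make the pcap idea above concrete, here is a minimal sketch (assuming a classic little-endian, non-pcapng capture; the file name is hypothetical) that walks the back-to-back records the same way a contiguous packet ring would be walked:

```python
import struct

def scan_pcap(path):
    """Count packets/bytes by walking the back-to-back records of a
    classic pcap file (24B global header, 16B per-record header).
    Assumes a little-endian, non-pcapng capture."""
    packets = byte_total = 0
    with open(path, "rb") as f:
        f.read(24)                      # global header: magic, version, snaplen, linktype
        while True:
            rec = f.read(16)            # ts_sec, ts_usec, incl_len, orig_len
            if len(rec) < 16:
                break
            _, _, incl_len, orig_len = struct.unpack("<IIII", rec)
            f.read(incl_len)            # packet bytes sit contiguously before the next record
            packets += 1
            byte_total += orig_len
    return packets, byte_total

# Hypothetical usage:
# print(scan_pcap("capture.pcap"))
```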
@fmadio Have you seen lowRISC? I think this is an interesting model. They have sat down and built a base hardware implementation (CPU) completely from scratch. Huge investment of effort presumably. Now they want to start producing ASICs at regular intervals so that any contributions that are upstreamed will become silicon automatically. This could then create a feedback loop where lots of users want to improve the upstream repository so that they will have better hardware available for their future projects. Quite a bootstrapping challenge. If it works out for CPUs then maybe somebody can try it for NICs. |
@lukego Yes, saw lowRISC; it's non-profit and grant (pseudo VC) funded. We are obviously for-profit and bootstrapped, almost the complete opposite. The PCAP header is horribly inefficient and slow. We do use a 16B header except it's much better packed and optimized for HDL. Try working with us - maybe being fully open will be less important to you once we get a rapid feedback loop running. Could do some pretty cool things with our NIC + Snabb I think.
@fmadio I would love to play with you guys' stuff. Just now though I am not involved in any projects that need 64B line-rate or that can use non-COTS hardware. So I am limited to experimenting with software architectures and blue-sky dreams for future projects with more exotic hardware :-). |
@fozog I am really interested in the "store packets in contiguous serial memory" idea. I have been pondering it a little more. Current feeling is that it makes sense in niche applications - e.g. 100G 64B-line-rate packet capture using NICs that exist in 2017 - but that discrete packets are likely more suitable for general application development (e.g. main Snabb/DPDK use cases.) Few reasons:
So for the moment I am going to switch my daydreaming over to the topic of array programming in #1099 :-) |
@lukego I am aware of some silicon forwarding implementations that rely on this "aggregated frames" concept across the backplane switch fabric. As I recall, the packet processors were sending/receiving them to/from fab deaggregated because of an intermediate step in fabric tx/rx (which I reckon had more to do with locality breeding easier interconnect wrt deaggregated frames and packet processors). The way packets were brought across fabric to prevent HOL blocking in this case was to arbitrate frames in VoQ fashion. This assumes, I think, that one can A) model the output queues and B) determine or, perhaps, even simply predict the output device before bringing it in. If you have tighter control of the PCI interaction and expose a set of "VoQs" as (example) a separate credit recipient from PCI perspective, would this let you stop HOL blocking and potentially just drop in HW? Would the complexities of modeling some sorts of "predictive" queueing outweigh the benefits? And more importantly, is this idea really just too device-specific, situation-specific and overall just a pain to handle? Hoping this will breed some other ideas too, perhaps without the need for such complicated predictions in our COTS NIC HW. |
I was looking into the DPDK site's list of NICs (imagine how cool it would be if such a long list existed for snabb ...) for those that support 100G, and I think one more recent interesting addition - based on cost considerations - is obviously the entry "15. QEDE Poll Mode Driver" that refers to the Qlogic FastLinQ QL45000 series (resold by some major server vendors): http://dpdk.org/doc/guides/nics/qede.html The Qlogic site at http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/SearchByProduct.aspx?ProductCategory=325&Product=1263&Os=26 shows not just the drivers, but also allows you to download detailed documentation dated late 2016: the chipset register specification (approx. 350 pages) and a developer's guide (approx. 130 pages). The overall offering of real 100G NIC chips will probably keep on broadening - hopefully the industry reshuffling will also drive more players to open up their documentation. Qlogic's direction after the takeover by Cavium certainly helps. I am also curious about the outcome for another group, as Emulex was swallowed by Avago, which later took over Broadcom but retained the Broadcom brand name for the whole group (the stock ticker is still AVGO). We should expect a pretty competitive 100G NIC from a price/performance point of view from them. A quite open question here is whether Broadcom will remain rather locked up wrt. documentation, as this is how they managed it in the past. Certainly Qlogic/Cavium, Mellanox and others have embraced the idea of maximising their usefulness in the context of FOSS projects more than some others... the official website still shows no 100G NIC product, but IMHO it really can't take very much longer until they come out of stealth mode with something like that.
@lukego 5.9 cycles per RAM! That was a striking number when I mentally compared it to the DRAM timing characteristics I am aware of. The site says 5.9ns. Well, that seemed more realistic, but if you read on top it says "RAM Latency = 42 cycles + 51 ns" which is what I would have thought. I am afraid the 5.9ns calculation is close to the following reasoning: "with nine women we can get a baby in one month". But you are making very good points on the future of PCI and L3. I think the fundamental benefit of coalesced packets is the cache protection. By having 2K aligned packet buffers, I suspect that packets are not that well distributed and tend to hit particular areas of the L3 cache (no concrete measurements made) and the number of cache ways may not suffice to absorb the pressure. By contrast, consecutive packets should dispatch cachelines to all L3 slots.
@fozog You are right about 2KB aligned packet buffers. We have found that all data structures used in the traffic plane need to have a lot of entropy in the lower bits (mod 4096) of their addresses. Otherwise we see cache conflict misses that impact performance in a big way. This has bitten us on packet buffers (once upon a time 4096-aligned) and also on everything allocated as file-backed shared memory (kernel allocates on a 4096 byte page boundary.) I think the "RAM delivers one random-access read every 5.9ns" claim is following the reasoning that "with nine women we can produce nine babies in nine months" i.e. latency is 9 months but average throughput is one baby per month due to parallelism. In our code we are often performing the latency-sensitive operations on many packets in a row (like loading the first cache line of payload or making a lookup in a table) and so we should be well positioned to exploit instruction-level parallelism. (Compare with e.g. Linux kernel that tends to process one packet at a time and will need to find a way to overlap computation with individual high-latency operations like loading from memory and taking locks.) |
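A toy model of the alignment effect described above (a plain set-associative cache with 64B lines; real Xeon L3s hash addresses across slices, so this is only a cartoon of the conflict-miss mechanism):

```python
line = 64
sets = 64          # e.g. a 32KB 8-way L1: 32768 / (64 * 8) = 64 sets

def set_index(addr):
    # With 64B lines the set index comes from the bits just above bit 6,
    # and 4096-aligned addresses all share their low 12 bits.
    return (addr // line) % sets

aligned = [n * 4096 for n in range(16)]              # page-aligned packet buffers
spread  = [n * 4096 + n * 320 for n in range(16)]    # same buffers with low-bit entropy added

print(sorted({set_index(a) for a in aligned}))   # [0]: everything lands in one set
print(sorted({set_index(a) for a in spread}))    # 16 different sets
```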
Re: head-of-line blocking, I would expect it's reasonable to just drop packets on any egress port whose FIFO is full?
@fmadio You are right. If I am using contiguous buffers from both input and output then I will need to make a data copy to transfer from ingress queue memory to egress queue memory. I could make these copies in FIFO order. Once that copy is done then the ingress memory can be freed because the DMA is referencing the egress memory. If no egress memory is available then the packet is dropped (never held pinned in the ingress memory.) I have been imagining a zero-copy design where packet data stays in the same place i.e. the same memory is referenced by DMA descriptors for both ingress and egress. However that would seem to be the wrong design for contiguous packet buffers because then you have to somehow "garbage collect" memory that can be referenced by multiple DMA requests with indefinite lifetimes. Seems like the right alternative would be to accept the copies and aggressively optimize them. Is that how you are thinking too? |
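A minimal sketch of that copy-and-drop scheme (names and sizes are made up; the point is just to make the control flow concrete):

```python
class ContiguousRing:
    """Packets stored back-to-back in one buffer (hypothetical sketch)."""
    def __init__(self, size):
        self.buf  = bytearray(size)
        self.used = 0                  # bytes currently queued for egress DMA

    def append(self, pkt):
        if self.used + len(pkt) > len(self.buf):
            return False               # egress full: caller drops the packet
        self.buf[self.used:self.used + len(pkt)] = pkt
        self.used += len(pkt)
        return True

def forward(ingress_packets, egress):
    """Copy packets in FIFO order; ingress memory is reclaimable either way."""
    dropped = 0
    for pkt in ingress_packets:
        if not egress.append(pkt):
            dropped += 1               # never hold the packet pinned in ingress memory
    return dropped

egress = ContiguousRing(8192)
print(forward([b"\x00" * 300 for _ in range(40)], egress))   # 27 fit, 13 dropped
```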
(Could also perhaps implement the zero-copy mode with reference counters on the memory blocks. Then you increment the reference count when you create a DMA descriptor and you decrement it when the DMA completes. Memory is freed when count reaches zero. This is actually what we did in Snabb back in the distant past but our experience lead us to prefer a simpler design with fewer code paths to optimize.) |
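And for comparison, a minimal sketch of the reference-counted variant described in the parenthesis above (hypothetical API, just to show the lifecycle):

```python
class Block:
    """Reference-counted packet memory (hypothetical sketch)."""
    def __init__(self):
        self.refs  = 0
        self.freed = False

    def retain(self):          # a DMA descriptor starts referencing this memory
        self.refs += 1

    def release(self):         # that DMA completed
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # stand-in for returning the memory to the pool

blk = Block()
blk.retain(); blk.retain()     # referenced by an ingress and an egress descriptor
blk.release()
blk.release()
print(blk.freed)               # True: freed only after both DMAs complete
```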
@fmadio I totally agree that 100G sustained comes with costs but I just want to point out that Napatech actually does 100G sustained for all packet sizes ;) @lukego This is quite an interesting discussion that touches several good topics. One of the things is "contiguous serial memory" and in that regard I just want to stress that this approach from a PCIe perspective has the advantage that you avoid lots of overhead, enabling you to get sustained 100G with packet sizes < 128B. It's going to be a while before PCIe Gen4 is to be found in any Xeon systems, so right now if you really want to have 100G sustained my take is that "contiguous serial memory" is the only approach that works, and then you must RSS-scale to multiple cores to hide the overhead of the potentially involved memcpy().
@mlilja01 cool, when did you achieve sustained line rates? Last time I checked your 100G boards were burst capture only and limited to 40Gbps of host bandwidth? |
@fmadio I think we have had that since last summer :) |
@mlilja01 ok, guess I can no longer mouth off about it :) tho you need to tell the marketing folks. |
This issue is spinning off into an effort to design a "simple and pleases both hardware and software people" host-device interface called EasyNIC. Hope to chat with you guys on issues over there too :). |
Just a heads up that we looked at this again recently with a PCIe gen4 system and ConnectX-5 cards: #1471
I am pondering how to think about packet rates in the 100G era. How should we be designing and optimizing our software?
Consider these potential performance targets:
I have a whole mix of questions about these in my mind:
Raw brain dump...
So how would you optimize for each? In every case you would surely use multiple cores with RSS or equivalent traffic dispatching. Beyond that...
So which would be hardest to achieve? and why?
The one I have a bad feeling about is "B". Historically we are used to NICs that can do line-rate with 64B packets. However, those days may be behind us. If you read Intel datasheets then the smallest packet size at which they guarantee line rate is 64B for 10G (82599), 128B for 40G (XL710), and 256B for 100G (FM10K). (This is lower even than performance "A".) If our performance targets for the NICs are above what they are designed for then we are probably headed for trouble. I think if we want to support really high per-port packet rates then it will take a lot of work and we will be very constrained in which hardware we can choose (both vendor and enabled features).
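For context, here is what those guaranteed minimum packet sizes translate to in packets per second at line rate, counting the 20B of per-packet wire overhead (preamble + SFD + interframe gap):

```python
overhead = 20   # preamble (7B) + SFD (1B) + interframe gap (12B) per frame

def line_rate_mpps(link_gbps, frame_bytes):
    return link_gbps * 1e9 / ((frame_bytes + overhead) * 8) / 1e6

print("10G  @  64B: %6.1f Mpps" % line_rate_mpps(10, 64))    #  ~14.9 (82599)
print("40G  @ 128B: %6.1f Mpps" % line_rate_mpps(40, 128))   #  ~33.8 (XL710)
print("100G @ 256B: %6.1f Mpps" % line_rate_mpps(100, 256))  #  ~45.3 (FM10K)
print("100G @  64B: %6.1f Mpps" % line_rate_mpps(100, 64))   # ~148.8 (64B line rate)
```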
So, stumbling back towards the development de jour, I am tempted to initially accept the 64 Mpps per port limit observed in #1007 and focus on supporting "A" and "C". In practical terms this means spending my efforts on writing simple and CPU-efficient transmit/receive routines rather than looking for complex and CPU-expensive ways to squeeze more packets down a single card. We can always revisit the problem of squeezing the maximum packet rate out of a card in the context of specific applications (e.g. packetblaster and firehose) and there we may be able to "cheat" in some useful application-specific ways.
Early days anyway... next step is to see how the ConnectX-4 performs with simultaneous transmit+receive using generic routines. Can't take the 64 Mpps figure to the bank quite yet.
Thoughts?