Last year, after the long-awaited release of Zenoh 1.0, which included a C API unified between Zenoh-C and Zenoh-Pico, we decided to dedicate some time to measuring and improving the performance and efficiency of Zenoh-Pico. These modifications were released with Zenoh 1.1 earlier this year, and we present the results in this blog post.
## What is Zenoh-Pico?
Zenoh-Pico is the lightweight, native C implementation of the [Eclipse Zenoh](http://zenoh.io) protocol, designed specifically for constrained devices. It provides a streamlined, low-resource API while maintaining compatibility with the main [Rust Zenoh implementation](https://github.com/eclipse-zenoh/zenoh). Zenoh-Pico already supports a broad range of platforms and protocols, making it a versatile choice for embedded systems development.
## The results
To measure performance, we use a standardized throughput test and a latency test, which we run on a reference machine (Intel Xeon E3-1275 @ 3.6GHz, 32GB DDR4, Ubuntu 22.04). For embedded measurements, we ran those tests on an ESP32-WROOM-32 dev board.
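The benchmark sources are not reproduced here, but conceptually the publisher side of the throughput test boils down to sending a fixed-size payload in a tight loop while the subscriber side counts messages received per second. A simplified sketch, assuming the unified 1.x API (the key expression, payload size, and iteration count are illustrative):

```c
#include <stdint.h>
#include <string.h>
#include <zenoh-pico.h>

int main(void) {
    // Open a session with the default configuration.
    z_owned_config_t config;
    z_config_default(&config);
    z_owned_session_t session;
    if (z_open(&session, z_move(config), NULL) < 0) {
        return -1;
    }
    // Zenoh-Pico needs its background read/lease tasks running.
    zp_start_read_task(z_loan_mut(session), NULL);
    zp_start_lease_task(z_loan_mut(session), NULL);

    z_view_keyexpr_t ke;
    z_view_keyexpr_from_str(&ke, "test/thr");

    uint8_t buf[1024];  // One of the tested payload sizes.
    memset(buf, 1, sizeof(buf));

    // Publish as fast as possible; the receiving side derives
    // throughput from the number of messages it gets per second.
    for (int i = 0; i < 100000; i++) {
        z_owned_bytes_t payload;
        z_bytes_copy_from_buf(&payload, buf, sizeof(buf));
        z_put(z_loan(session), z_loan(ke), z_move(payload), NULL);
    }

    z_drop(z_move(session));
    return 0;
}
```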
These tests produce roughly a thousand measurements per payload size, from which we compute the median to obtain the following graphs (note that the y-axis is log scale):
We see a massive (up to 100x) improvement in throughput for payloads over 32KiB. This is because packets of these sizes are fragmented on the network, and we had an issue where their data was serialized byte by byte.
We also see a >10x improvement in throughput for smaller payloads when using manual batching (more info below), which was also introduced in 1.1.
Other than that, there are no significant changes because client performance is limited by the router.
Since peer-to-peer communication is not limited by router performance, we observe a bigger improvement on smaller payloads with batching (>20x), but a smaller one (>10x) for fragmented packets (>2KiB) because of UDP's smaller packet size.
In addition, we observe a 60% throughput increase for the other payload sizes, which results from the general library optimizations detailed below.
This plot shows a >50x improvement in fragmented-packet latency, again due to the data copy fix, but also a 35% improvement across the board from the general library optimizations.
Note that a big chunk of the latency value is due to the router (node-to-router hop + time to route the packet + router-to-node hop), and this value could be much lower using peer-to-peer TCP unicast.
### Performance limitations/regimes:
Before going into embedded results, let's spend some time understanding what limits performance.
For throughput, there are 3 distinct regions:
* Region 1 is limited by network and syscalls, with `send` and `recv` taking more than 90% of the execution time.
* Region 2 is limited by CPU speed / Zenoh-Pico performance, with TX taking slightly more CPU power than RX.
* Region 3 is limited by memory bandwidth, with `memcpy` taking more and more of the execution time as payload size grows.
{{< figure-inline
    class="figure-inline"
    alt="Latency limitations" >}}
For latency, there are 2 regions:
* Region 1 is limited by CPU speed / Zenoh-Pico performance.
* Region 2 is limited by memory bandwidth, similarly to throughput.
### Embedded throughput:
Since embedded systems are memory-constrained, we limited payload sizes to a maximum of 4KiB, which is still enough to observe fragmented-packet behavior at the 2KiB and 4KiB sizes.
The ESP32 really benefits from batching, with a >50x increase in throughput, which seems fair since we're going through a much slower Wi-Fi interface compared to the loopback interface, which uses a Unix pipe.
### Embedded latency:
class="figure-inline"
98
98
alt="Peer throughput" >}}
Latency values are in the ~10ms range, mostly because Wi-Fi itself is slow, as demonstrated by the ~4ms Zenoh-Pico PC latency measured on the same Wi-Fi network.
We do observe a big impact on latency when sending fragmented packets, which likely comes from both Wi-Fi and ESP32 bandwidth limitations.
## How performance was improved
To improve Zenoh-Pico performance, we traced it on PC using [samply](https://github.com/mstange/samply) and the Firefox Profiler to visualize the traces. That allowed us to detect choke points and parts of the code that could be improved.
As stated earlier, the most impactful changes were fixing the byte-by-byte copy issue for fragmented packets and introducing the manual batching mechanism.
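To give an idea of the class of fix involved, here is an illustrative sketch (not the actual Zenoh-Pico code): the fragmentation path went from appending payload bytes one at a time to copying whole slices at once.

```c
#include <stdint.h>
#include <string.h>

// Before (illustrative): appending the payload one byte at a time,
// paying per-byte overhead on every single byte.
static void serialize_slow(uint8_t *dst, const uint8_t *src, size_t len) {
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
    }
}

// After (illustrative): copying the whole slice at once lets memcpy
// use wide, vectorized moves.
static void serialize_fast(uint8_t *dst, const uint8_t *src, size_t len) {
    memcpy(dst, src, len);
}

int main(void) {
    static uint8_t src[64 * 1024], dst[64 * 1024];
    serialize_slow(dst, src, sizeof(src));
    serialize_fast(dst, src, sizeof(src));
    return 0;
}
```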
Besides that, we also streamlined how the stack creates, uses, and destroys data, to avoid redundant operations and unnecessary data copies. We also rationalized heap memory usage and fragmentation, although these changes were not quantified.
## Manual Batching
If you want to use Zenoh-Pico's recently introduced manual batching, there are only 3 functions you need to know about:
* `zp_batch_start`: Activates the batching mechanism. Any message that would have been sent on the network by a subsequent API call (e.g. `z_put`, `z_get`) will instead be stored until the batch is full, the batch is flushed, or batching is stopped.
* `zp_batch_stop`: Deactivates the batching mechanism and sends the currently batched messages on the network.
* `zp_batch_flush`: Sends the currently batched messages on the network.
Note that a batch will also be sent whenever a message needs to go out immediately, like when sending keep-alive messages or when the API pushes a message with the `is_express` QoS.
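For instance, a minimal sketch assuming the unified 1.x options API, with `session`, `ke`, and `payload` prepared as in the examples below:

```c
// An express put asks for immediate transmission, so any pending batch
// is sent right away along with it.
z_put_options_t options;
z_put_options_default(&options);
options.is_express = true;
z_put(z_loan(session), z_loan(ke), z_move(payload), &options);
```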
### Examples:
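In this first example, we batch a burst of puts so they are sent in as few network operations as possible (a minimal sketch assuming an already-opened `session`; the key expression, payload helper, and message count are illustrative):

```c
// Activate batching: subsequent messages are accumulated instead of sent.
zp_batch_start(z_loan(session));

z_view_keyexpr_t ke;
z_view_keyexpr_from_str(&ke, "demo/example");
for (int i = 0; i < 10; i++) {
    z_owned_bytes_t payload;
    z_bytes_copy_from_str(&payload, "hello");
    z_put(z_loan(session), z_loan(ke), z_move(payload), NULL);
}

// Deactivate batching and send the accumulated messages on the network.
zp_batch_stop(z_loan(session));
```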
In this second example, another thread is responsible for sending messages, and we simply keep batching active around its activity for a while (again a sketch; the one-second wait is illustrative):

```c
zp_batch_start(z_loan(session));

// The sending thread keeps publishing; its messages are accumulated
// into batches instead of going out on the network one by one.
z_sleep_s(1);

zp_batch_stop(z_loan(session));
```
## Wrapping up
As you saw, we improved throughput and latency across the board, in some cases reaching a 100x increase.
We also introduced manual batching which, besides improving the throughput of small messages, can be used to reduce power consumption on embedded devices by reducing network transmissions.
Now let's talk briefly about our next big feature. As we hinted above, client mode is limited by the router in both throughput and latency, but it is currently the only way to use TCP links in Zenoh-Pico...
That was true until the introduction of the new peer-to-peer unicast mode, which we will present in a future blog post!