Replace queue with linked list #21
base: main
Conversation
Replace the bounded queue with a linked list and condvar implementation, and replace the closed_slots system with double indirection via AsyncClient's own memory. This allows the system to correctly handle cases where it is not possible to allocate a new event and still ensure that the client's `onDisconnect()` will be queued for execution.
Wow! Thanks a lot!
src/AsyncTCP.h
Outdated
#ifndef CONFIG_ASYNC_TCP_MAX_ACK_TIME
#define CONFIG_ASYNC_TCP_MAX_ACK_TIME 5000
#endif

#ifndef CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST
#define CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST 1
What is the use of that?
First porting error - both the name and the implementation are bad. :(
There's a strange feature in the original code: AsyncClient has an intrusive list integrated (the prev/next pointers), but it's unclear as to why or what it's useful for. I removed it in my branch to save RAM as nobody in WLED (AsyncWebServer, AsyncMQTT) actually uses it for anything. I was trying to make it conditional, but default on. Will fix.
(default on for backwards compatibility, in case there is someone using it)
Well, I don't see any reason to keep this linked list... Do you, @me-no-dev? This is not even used.
@willmmiles : fyi, as seen with @me-no-dev, we will merge #19 first, do a v3.3.4, then focus on reviewing / merging your PR.
Sounds good, will do!
The semantics of this operation were error-prone; remove it to ensure safety.
I've only had a short glimpse here; an interesting approach indeed. But I also have a lot of questions to help me understand. First things first: how is the list size controlled here? Does it have any limits, or can it grow as much as resources allow?
This draft makes no attempt to limit the queue length, so long as there's heap available. A strictly fixed-size queue is impractical because we must ensure that disconnection events cannot be dropped, or resources will leak. It's possible to add a soft upper limit on non-critical events, but it didn't seem to be worth the extra complexity (or having to explain an arbitrary resource limit that's independent of what the heap can actually service). Implementing event coalescence for poll, recv, and sent events will put a functional upper bound on the queue size as well, based on the number of open connections. The rationale to replace the queue breaks down as:
I'm usually the first to recommend using library implementations for classic data structures; ultimately I judged that the maintenance burden for the limited requirements here was less than the cost of bringing in some external library. Replacing the close_slot system is otherwise straightforward (it's nothing more than strict ownership of the pcb* by the LwIP thread), but it is contingent on guaranteed disconnect event delivery. I did originally implement these as separate changes, but since it wasn't going to merge cleanly with the development line from my old fork, I judged it wasn't worth trying to break it down into separate commits.
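Not the PR's actual code, but a minimal sketch of the technique described above: an unbounded singly-linked list of caller-allocated events guarded by a mutex and condition variable. The `Event`/`EventList` names are invented. Because nodes are allocated by the producer before queuing, a pre-allocated end-of-life event can always be appended, which (as I understand it) is how the draft can guarantee `onDisconnect()` delivery even when the heap is exhausted.

```cpp
#include <condition_variable>
#include <mutex>

struct Event {
  Event *next = nullptr;
  // ... event payload (type, pcb, pbuf, err, ...) ...
};

class EventList {
 public:
  // Producer side (LwIP callbacks): append an already-allocated event.
  void push(Event *e) {
    std::lock_guard<std::mutex> lock(mutex_);
    e->next = nullptr;
    if (tail_) tail_->next = e; else head_ = e;
    tail_ = e;
    cv_.notify_one();  // wake the async service task
  }

  // Consumer side (async service task): block until an event is available.
  Event *pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return head_ != nullptr; });
    Event *e = head_;
    head_ = e->next;
    if (!head_) tail_ = nullptr;
    return e;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  Event *head_ = nullptr;
  Event *tail_ = nullptr;
};
```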
@willmmiles : fyi asynctcp is released and we'll do an asyncwebserver release tomorrow with the current v3.3.5 of asynctcp. So we are good to refresh this PR and have time to review / merge it. When I tested with autocannon:
I didn't test the client part yet. Will do later.
No reason for this to require a function call.
Use new(std::nothrow) instead.
If any of the TCP callbacks fails to allocate, return ERR_MEM.
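For illustration only, a sketch of the allocation pattern these comments describe, using a hypothetical simplified callback (the real LwIP callback signatures differ): `new (std::nothrow)` yields `nullptr` on failure instead of aborting, and the callback reports `ERR_MEM` back to the stack.

```cpp
#include <new>
#include "lwip/err.h"

struct Event { /* event payload */ };

static err_t example_tcp_callback(void *arg) {
  Event *e = new (std::nothrow) Event();
  if (e == nullptr) {
    return ERR_MEM;  // tell LwIP we could not service this callback
  }
  // ... fill in the event from `arg` and queue it for the async task ...
  return ERR_OK;
}
```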
src/AsyncTCP.cpp
Outdated
tcp_recv(pcb, &_tcp_recv);
tcp_sent(pcb, &_tcp_sent);
tcp_err(pcb, &_tcp_error);
tcp_poll(pcb, &_tcp_poll, 1);
Maybe use a constant here? It was CONFIG_ASYNC_TCP_POLL_TIMER before.
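As a sketch of the suggestion (the default of 1 is assumed from the literal in the snippet above; the previous definition isn't shown here):

```cpp
#ifndef CONFIG_ASYNC_TCP_POLL_TIMER
#define CONFIG_ASYNC_TCP_POLL_TIMER 1
#endif

// ... then, where the callbacks are registered:
// tcp_poll(pcb, &_tcp_poll, CONFIG_ASYNC_TCP_POLL_TIMER);
```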
#2 on the merge failure list!
Fixed in 2321755
tcpip_api_call(_tcp_connect_api, (struct tcpip_api_call_data *)&msg);
return msg.err == ESP_OK;
There is an issue with the client part, which is not working (testing with the Client.ino): connect() returns false from here.
I've changed the code to get the error:
tcpip_api_call(_tcp_connect_api, (struct tcpip_api_call_data *)&msg);
if (msg.err != ERR_OK) {
    log_e("tcpip_api_call error: %d", msg.err);
}
return msg.err == ESP_OK;
[ 1305][E][AsyncTCP.cpp:791] _connect(): tcpip_api_call error: -16
-16 is the invalid arg error.
Hope that helps!
Should be using the config setting.
We can store the newly allocated handle directly in the object member.
We don't plan on writing to it - might as well save the copying.
I've put a prototype event coalescing branch at https://github.com/willmmiles/AsyncTCP/tree/replace-queue-coalesce . My basic tests seem to work, but I don't yet have a good test case to really exercise the new logic.
@willmmiles agreed with most of your points above; with proper coalescing code the event chain size cap won't be that critical. And the benefits of correctness are worth the effort of proper locking code. Good job indeed! I'm not sure I understood your point on why
Hi folks! Just trying to understand the issues that this PR is hoping to fix. Is it just the fact that events might be missed because the queue overflows, or are there other things being fixed as well? Please excuse my ignorance. I've been out of the game for a bit :)
Here are the problematic behaviours that we currently have and that this implementation seems to solve:
Plus all the valid points Will explained above. I think it makes sense to include his changes, given the many efforts he has made in stabilising this library in the case of WLED. @willmmiles : I will retest your changes and the replace-queue-coalesce branch also; we have 2 use cases.
So memory leaks and lost events when running under high load, which might actually be connected.
Hi @mathieucarbou,
I just ran the perftest example on main, then connected to the AP and ran: autocannon -c 64 -w 32 -d 30 -t 30 http://192.168.4.1. I don't see any recovery after 20-30 sec, and if I run the CLI a second time, the ESP crashes with a memory error. Note: I do see a recovery with Will's code; the heap recovers in a few seconds.
@vortigont I think I was able to pinpoint the issue a bit more:
It runs 32 concurrent requests, spread across 32 workers. If I run this 3 times, I do not have any memory leak. Output:
Now, I am testing a different situation:
I still have 32 concurrent connections over 32 workers, but this time I am sending 96 requests total, so some workers will send more than 1 request. I ran this 3 times: autocannon tells me it served 96 requests each time I ran it (so 3 times). But the number of requests received is only 192, so 64 per test, which is 32 fewer than expected. And the logs show some heap loss that never recovers:
This is the test I did previously: it runs 64 connections concurrently, over 32 workers, and I am maintaining that for 30 seconds. So I have some workers sending several requests concurrently and also sequentially.
Exactly that. If an event which would've triggered `onDisconnect()` is dropped, the client's resources are never released. The only way through is to guarantee that the disconnect event is always delivered. The second set of changes was a removal of the closed_slots system.
This autocannon thing could be really confusing. To my surprise I found that
If we're fortunate enough to have exceptions enabled, then yes. Alas, it seems that Arduino ESP32 projects are built with exceptions disabled. Since
If using this team's releases, yes. We forked AsyncWebServer before those changes, and that decision blocks us from switching. We don't need the new features -- what we need is heap exhaustion safety, which can't be achieved without exceptions if you're using
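To illustrate the distinction being discussed (not code from either library; names invented): with exceptions disabled, a standard container has no way to report an allocation failure to its caller, whereas a nothrow allocation can be checked explicitly and the request refused.

```cpp
#include <new>
#include <vector>

struct Item { char payload[64]; };

// Caller can detect heap exhaustion and back off (retry, drop, or 503).
bool append_checked(Item *&slot) {
  slot = new (std::nothrow) Item();  // nullptr on allocation failure
  return slot != nullptr;
}

// With -fno-exceptions, a failed allocation here cannot be reported to the
// caller; the program effectively terminates instead.
void append_unchecked(std::vector<Item> &v) {
  v.push_back(Item{});
}
```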
@willmmiles : for espasyncwebserver, 3.x will probably stay as is because it is in bugfix / maintenance mode. We plan on starting a v4 branch once 3.x is more stable, to start cleaning stuff up and changing the APIs that need to change. This will be an opportunity to review all the memory allocations, since breaking the current API will be possible. Did you see, by the way, the stack trace I got today when testing your branch? I didn't have that yesterday. Thanks
Unfortunately your v4 plans are also of no help to us -- ESP8266 is 50%+ of our active user base. (Some of the other maintainers were very surprised to see those stats on our last release!) A lot of us have them deployed in hard-to-replace locations. They're capable little chips with many years of service life left, and it always saddens me to find so many projects wanting to turn them into e-waste. :(
Yup, working on it!
@willmmiles pls do not get me wrong, I'm not trying to persuade you into using
Somewhat ironically, it's the web requests themselves that "cause" the heap exhaustion. An idle ESP32-S2 system might have maybe 50-70kb of heap available. A single session, busy serializing 10kb of JSON, plus transport buffers, fits easily in available RAM and runs without issue. But when we have many requests in flight at once it can use a lot of heap quickly, with LwIP transfer buffers, upper layer serialization buffers, filesystem buffers, and so on... (Darn HomeAssistant never stops polling that big presets file!)

So the reported behaviour from the users is "everything works great, except sometimes it mysteriously resets". Crash dumps indicated heap exhaustion because too many web requests were being serviced in parallel -- the user just so happened to hit the server at the same moment as their other integrations, resulting in OOM and the world ending. My first attempt to mitigate the problem was to try adding a queue in AsyncWebServer, permitting only a couple of requests to be in flight at once, and 503ing anything we didn't think we could handle. It helped - at least from the end user perspective - but because I'm a glutton for punishment, I wasn't going to be satisfied until the web server could no longer bring down an otherwise correctly operating device. Unfortunately, even queuing responses quickly spirals out of control if you hit it with a stress test like

If I had control over the LwIP stack, I'd both (a) allow drastically smaller poll timers - something in the 10s of ms range; and (b) arrange a way to intentionally solicit retransmission/re-recv of a packet I declined to handle the first time so as not to spend my RAM on something the other TCP stack still has. (Kind of dirty, I know!)
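As an aside, a rough sketch of the kind of in-flight cap described above (hypothetical names and cap value; not the actual AsyncWebServer change):

```cpp
#include <atomic>

constexpr int kMaxInFlight = 2;   // assumed small cap
std::atomic<int> g_in_flight{0};

// Called when a request arrives; returns false if we should answer 503.
bool try_begin_request() {
  int current = g_in_flight.load();
  while (current < kMaxInFlight) {
    // compare_exchange_weak reloads `current` on failure, so the cap is re-checked.
    if (g_in_flight.compare_exchange_weak(current, current + 1)) return true;
  }
  return false;
}

// Called once the response has been fully sent.
void end_request() { g_in_flight.fetch_sub(1); }
```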
I haven't investigated it personally yet; maybe someday. Another aspect of the WLED project is that it's big on integrations - both those maintained by the core team and user-supplied modules that call on all kinds of third-party libraries. We also support a large variety of boards. That puts a cost on every non-standard environment decision we might make: the further we get from common platforms, the tougher it is for users to integrate their own features. Still, it's quite possible that exception support is not yet a bridge too far!
C++ Exceptions are enabled in ESP32 Arduino. Have been for a very long time now. Do you see a different result? Or is it something else that needs enabling?
…ts were not correctly cleaned up Fix inspired by PR #21 from @willmmiles
…ed with a null pcb Fix inspired by PR #21 from @willmmiles
pcb is already aborted by the constructor.
We're still supporting ESP8266 - not relevant for this repo, but very important for. That said, maybe someday I'll look into enabling exceptions on the 8266s; it might be the path of least resistance.
@mathieucarbou I haven't been able to reproduce this. I'm pushing a couple of small fixes I found through inspection, but I couldn't get it to crash for me. :(
Finally got a crash locally, though not in AsyncTCP -- this time the victim was
Replace the bounded queue with a linked list and "condvar" implementation, and replace the closed_slots system with double indirection via AsyncClient's own memory. This allows the system to correctly handle cases where it is not possible to allocate a new event while guaranteeing that the client's `onDisconnect()` will be run to free up any other related resources.

Key changes:
- The `CONFIG_ASYNC_TCP_QUEUE_SIZE` queue size limit is removed;
- `CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST` makes the built-in intrusive client list optional (default on for backwards compatibility);
- The closed_slots system is replaced with double indirection via `AsyncClient`'s own `_pcb` member. Once initialized, this member is written only by the LwIP thread, preventing most races; and the `AsyncClient` object itself is never deleted on the LwIP thread.

This draft rebases the changes from willmmiles/AsyncTCP:master to this development line. As this project is moving faster than I can keep up with, in the interests of making this code available for review sooner, I have performed only minimal testing on this port. It is likely there is a porting error somewhere.

Known issues as of this writing:
- The `AsyncClient::operator=(const AsyncClient&)` assignment operator is removed. The old code had implemented this operation with an unsafe partial move semantic, leaving the old object holding a dangling reference to the pcb. It's not clear to me what should be done here - copies of AsyncClient are not generally meaningful.
- When an `AsyncClient` is addressed from a third task (eg. from an Arduino `loop()`, not LwIP or the async task): the fact that LwIP reserves the right to invalidate tcp_pcbs on its thread at any time after `tcp_err` makes this extremely challenging to get both strictly correct and performant. Core operations that pass to the LwIP thread are safe, but I think state reads (`state()`, `space()`, etc.) are still risky.
- `_end_event` can ignore the limit to ensure `onDisconnect` gets run.

Future work:
- `lwip_tcp_event_packet_t::dns` should be unbundled to a separate event object type from the rest of the `lwip_tcp_event_packet_t` variants. It's easily twice the size of any of the others; this will reduce the memory footprint for normal operation.
- `recv` and `send` also permit sensible aggregation.
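As a closing illustration of the aggregation idea in the future-work list above, here is one plausible shape for recv coalescing (hypothetical names; not the actual replace-queue-coalesce branch): if the most recently queued event for a client is already a recv event, chain the new pbuf onto it instead of allocating another event node.

```cpp
#include "lwip/pbuf.h"

enum class EventKind { Recv, Sent, Poll, Error, Disconnect };

struct Event {
  Event *next = nullptr;
  void *client = nullptr;   // owning AsyncClient
  EventKind kind{};
  pbuf *pb = nullptr;       // payload for Recv events
};

// Returns true if the new packet was merged into an existing queued event,
// so no new Event allocation is needed.
bool try_coalesce_recv(Event *tail, void *client, pbuf *pb) {
  if (tail && tail->client == client && tail->kind == EventKind::Recv) {
    pbuf_cat(tail->pb, pb);  // append the new pbuf chain to the pending one
    return true;
  }
  return false;
}
```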