
Replace queue with linked list #21

Draft · wants to merge 15 commits into base: main
Conversation

willmmiles

Replace the bounded queue with a linked list and "condvar" implementation, and replace the closed_slots system with double indirection via AsyncClient's own memory. This allows the system to correctly handle cases where it is not possible to allocate a new event while guaranteeing that the client's onDisconnect() will be run to free up any other related resources.

Key changes:

  • Replaces the FreeRTOS fixed-size queue with a dynamically allocated linked list (sketched below)
  • Removes the CONFIG_ASYNC_TCP_QUEUE_SIZE queue size limit
  • Makes AsyncClient's intrusive list support optional via CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST
  • Replaces the closed-slot system with double indirection through AsyncClient's own _pcb member. Once initialized, this member is written only by the LwIP thread, preventing most races, and the AsyncClient object itself is never deleted on the LwIP thread.
  • Makes the internal callbacks private
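
As a rough sketch of the first two points: a list-plus-condvar queue boils down to the pattern below. This is not the actual implementation (the real code uses the library's own event type and RTOS primitives; all names here are illustrative); the point is simply that, once an event node has been allocated, linking it into the queue cannot fail.

  #include <condition_variable>
  #include <mutex>

  // Hypothetical event node; the real lwip_tcp_event_packet_t carries the
  // event type, the client pointer, and per-event payload.
  struct event_packet {
    event_packet *next = nullptr;
  };

  // Minimal intrusive FIFO guarded by a mutex and condition variable. Once a
  // node has been allocated, push() cannot fail: linking it in needs no further
  // allocation, so there is no "queue full" case to handle.
  class event_queue {
    event_packet *head_ = nullptr;
    event_packet *tail_ = nullptr;
    std::mutex mtx_;
    std::condition_variable cv_;

  public:
    void push(event_packet *ev) {
      std::lock_guard<std::mutex> lock(mtx_);
      ev->next = nullptr;
      if (tail_) tail_->next = ev; else head_ = ev;
      tail_ = ev;
      cv_.notify_one();
    }

    event_packet *pop() {  // blocks the async task until an event arrives
      std::unique_lock<std::mutex> lock(mtx_);
      cv_.wait(lock, [this] { return head_ != nullptr; });
      event_packet *ev = head_;
      head_ = ev->next;
      if (!head_) tail_ = nullptr;
      return ev;
    }
  };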

This draft rebases the changes from willmmiles/AsyncTCP:master onto this development line. As this project is moving faster than I can keep up with, and in the interest of making this code available for review sooner, I have performed only minimal testing on this port. It is likely there is a porting error somewhere.

Known issues as of this writing:

  • The AsyncClient::operator=(const AsyncClient&) assignment operator is removed. The old code implemented this operation with an unsafe partial-move semantic, leaving the old object holding a dangling reference to the pcb. It's not clear to me what should be done here - copies of AsyncClient are not generally meaningful.
  • Poll coalescing is not yet implemented. This could be done with a list walk, as before, or by storing pending poll event pointers in the AsyncClient objects.
  • Move construction and assignment is not yet implemented. It will require careful interlocking with the LwIP and async threads.
  • I'm about 85% confident there's still a race somewhere when an AsyncClient is addressed from a third task (eg. from an Arduino loop(), not LwIP or the async task). The fact that LwIP reserves the right to invalidate tcp_pcbs on its thread at any time after tcp_err makes this extremely challenging to get both strictly correct and performant. Core operations that pass to the LwIP thread are safe, but I think state reads (state(), space(), etc.) are still risky.
  • If strict memory bounds are required, a soft queue size limit could be reimplemented, with the caveat that each client's _end_event can ignore the limit to ensure onDisconnect gets run.

Future work:

  • lwip_tcp_event_packet_t::dns should be unbundled into a separate event object type from the rest of the lwip_tcp_event_packet_t variants. It's easily twice the size of any of the others; splitting it out will reduce the memory footprint for normal operation (see the size sketch after this list).
  • Event coalescing could be considered for other types -- recv and sent also permit sensible aggregation.
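
For the first item, a hedged illustration of why the dns variant dominates the union size; the field layouts below are assumptions made for the sake of the size argument, not the library's actual definitions.

  #include <cstdint>

  // Illustration only: these layouts are guesses, not the library's real fields.
  struct pbuf;                                   // lwIP packet buffer (opaque here)
  struct ip_addr_stub { uint32_t words[4]; };    // stand-in for a dual-stack ip_addr_t

  // Because the variants share one union, every queued event pays for the
  // largest member; a dns variant carrying a name pointer plus a full address
  // inflates every packet.
  union event_payload_sketch {
    struct { pbuf *pb; int8_t err; } recv;                  // a pointer and an error code
    struct { uint16_t len; } sent;                          // two bytes
    struct { const char *name; ip_addr_stub addr; } dns;    // pointer + 16-byte address
  };
  // Moving dns into its own, separately allocated event type lets the common
  // recv/sent/poll packets shrink to the size of their own payload.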

Replace the bounded queue with a linked list and condvar
implementation, and replace the closed_slots system with double
indirection via AsyncClient's own memory.  This allows the
system to correctly handle cases where it is not possible to allocate a
new event and still ensure that the client's `onDisconnect()` will
be queued for execution.
@mathieucarbou requested a review from a team on February 5, 2025, 14:32
@mathieucarbou
Member

Wow! Thanks a lot!
@vortigont and @me-no-dev will surely have a look!

@mathieucarbou
Member

I tested this implementation in ESPAsyncWebServer:

  • autocannon -c 16 -w 16 -d 20 http://192.168.4.1 => works, about same perf

  • autocannon -c 64 -w 32 -d 20 -t 30 http://192.168.4.1 => works fine, no errors, but some requests are dropped; I see a lot of:

[327202][E][AsyncTCP.cpp:1535] _accept(): TCP ACCEPT FAIL

  • sse perf: for i in {1..16}; do ( count=$(gtimeout 30 curl -s -N -H "Accept: text/event-stream" http://192.168.4.1/events 2>&1 | grep -c "^data:"); echo "Total: $count events, $(echo "$count / 4" | bc -l) events / second" ) & done;

slower (about 50-100 events per second less), but I did not see any queue overflow

  • slow requests (SlowChunkResponse.ino) works

  • request continuation works

  • and no memory leak in any of these tests, while when I run them on main, the heap decreases a lot.

  • websocket also works

So to me, this is a huge improvement over what we have.

src/AsyncTCP.h Outdated
#ifndef CONFIG_ASYNC_TCP_MAX_ACK_TIME
#define CONFIG_ASYNC_TCP_MAX_ACK_TIME 5000
#endif

#ifndef CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST
#define CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST 1
Member

What is the use of that?

Author

First porting error - both the name and the implementation are bad. :(

There's a strange feature in the original code: AsyncClient has an intrusive list integrated (the prev/next pointers), but it's unclear why or what it's useful for. I removed it in my branch to save RAM, as nobody in WLED (AsyncWebServer, AsyncMQTT) actually uses it for anything. I was trying to make it conditional, but default on. Will fix.
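
A sketch of how the conditional compilation might look; the macro name and default follow the discussion here, and the rest of the class body is elided, so treat this as illustrative rather than the actual header.

  #ifndef CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST
  #define CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST 1  // default on for backwards compatibility
  #endif

  class AsyncClient {
  public:
    // ... constructors, callbacks, state accessors ...
  #if CONFIG_ASYNCTCP_HAS_INTRUSIVE_LIST
    // Legacy prev/next pointers, kept only for users that chain clients themselves.
    AsyncClient *prev = nullptr;
    AsyncClient *next = nullptr;
  #endif
  };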

Author

(default on for backwards compatibility, in case there is someone using it)

Member

Well, I don't see any reason to keep this linked list... Do you @me-no-dev? It is not even used.


@mathieucarbou
Member

@willmmiles: fyi, as discussed with @me-no-dev we will merge #19 first, do a v3.3.4, then focus on reviewing / merging your PR.
And we will release a new 4.0.0 once merged.
So you can remove the operator overrides and the prev/next fields, which are anyway unused internally.
Thanks 👍

@willmmiles
Author

@willmmiles: fyi, as discussed with @me-no-dev we will merge #19 first, do a v3.3.4, then focus on reviewing / merging your PR. And we will release a new 4.0.0 once merged. So you can remove the operator overrides and the prev/next fields, which are anyway unused internally. Thanks 👍

Sounds good, will do!

@vortigont

I had just a short glimpse here, an interesting approach indeed. But I also have a lot of questions to help me understand it.
First, I like the idea of giving up the slot system, but replacing the RTOS queue with a poor man's linked list brings all the burden of its integrity into the lib's code: sync/locking, length management, etc...

First things first: how is the list size controlled here? Does it have any limit, or can it grow as far as resources allow?

@willmmiles
Author

First, I like the idea of giving up the slot system, but replacing the RTOS queue with a poor man's linked list brings all the burden of its integrity into the lib's code: sync/locking, length management, etc...
First things first: how is the list size controlled here? Does it have any limit, or can it grow as far as resources allow?

This draft makes no attempt to limit the queue length, so long as there's heap available. A strictly fixed-size queue is impractical because we must ensure that disconnection events cannot be dropped, or resources will leak. It's possible to add a soft upper limit on non-critical events (sketched below), but it didn't seem to be worth the extra complexity (or having to explain an arbitrary resource limit that's independent of what the heap can actually service). Implementing event coalescence for poll, recv, and sent events will also put a functional upper bound on the queue size, based on the number of open connections.
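
For reference, such a soft limit would be a small addition to the enqueue path. A hedged sketch follows: the names and the standalone globals are hypothetical stand-ins for the real queue state, and the pre-allocated end event simply bypasses the cap so onDisconnect() can never be starved.

  #include <cstddef>
  #include <mutex>

  struct event_packet { event_packet *next = nullptr; };

  constexpr std::size_t kSoftQueueLimit = 64;   // illustrative value only

  std::mutex queue_mutex;
  event_packet *queue_head = nullptr;
  event_packet *queue_tail = nullptr;
  std::size_t queue_len = 0;

  // Non-critical events respect the cap; a client's pre-allocated end event
  // passes is_end_event = true and is always accepted.
  bool try_enqueue(event_packet *ev, bool is_end_event) {
    std::lock_guard<std::mutex> lock(queue_mutex);
    if (!is_end_event && queue_len >= kSoftQueueLimit) {
      return false;                 // caller may drop or coalesce this event
    }
    ev->next = nullptr;
    if (queue_tail) queue_tail->next = ev; else queue_head = ev;
    queue_tail = ev;
    ++queue_len;
    return true;                    // linking never allocates, so this cannot fail
  }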

The rationale to replace the queue breaks down as:

  • (Correctness) There exist events that must never be dropped, so maintaining a strict upper limit requires an extremely complicated dance to invalidate accepted events (and abort yet more connections!) if the queue should fill up
  • (Correctness) The "purge events on close" sequence requires an external lock on the queue, or there's a potential risk that events will be re-inserted out of order if new events arrive during the operation. Using the RTOS queue means paying the locking penalty twice (since it has an internal lock).
  • (Correctness) Available list libraries in std will crash if allocation fails.
  • (Performance) The "purge events on close" requires locking the queue only once, instead of 2*N times
  • (Performance) The local queue implementation minimizes the amount of code space required -- it is significantly smaller than std::forward_list
  • (Simplicity) The "purge events on close" becomes trivial code.
  • (Simplicity) Once an event has been allocated, it is guaranteed that it can be added to the queue; this eliminates all the code paths that have to free the new event should the queue turn out to be full.

I'm usually the first to recommend using library implementations for classic data structures; ultimately I judged that the maintenance burden for the limited requirements here was less than the cost of bringing in some external library, given that std is unusable for my use case (specifically: not crashing when low on memory). If you know of a suitable alternative library (especially one that allows pre-allocation of items before locking the queue) and you're happy with bringing in a new dependency, I've otherwise got no objections.

Replacing the close_slot system is otherwise straightforward (it's nothing more than strict ownership of the pcb* by the LwIP thread), but it is contingent on guaranteed disconnect event delivery. I did originally implement these as separate changes, but since that wasn't going to merge cleanly with the development line from my old fork, I judged it wasn't worth trying to break it down into separate commits.

@mathieucarbou
Member

mathieucarbou commented Feb 6, 2025

@willmmiles : fyi asynctcp is released and we'll do an asyncwebserver release tomorrow with the current v3.3.5 of asynctcp.

So we are good to refresh this PR and take the time to review / merge it.

When I tested with autocannon: autocannon -c 64 -w 32 -d 20 -t 30 http://192.168.4.1, or different variants at 16 connections or more, I always found this implementation more stable compared to the current one in main, which has these issues:

  • memory leak, and a lot of it! Impossible to run the same test twice
  • it also logs errors regarding inability to allocate, ack timeouts, etc.

I didn't test the client part yet. Will do later.

No reason for this to require a function call.
Use new(std::nothrow) instead.
If any of the TCP callbacks fails to allocate, return ERR_MEM.
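
(Sketching the pattern those commit messages describe; the helper and the event layout below are stand-ins, not the library's actual code.)

  #include <lwip/err.h>
  #include <lwip/tcp.h>
  #include <new>

  // Minimal stand-in for the real event type defined in AsyncTCP.cpp.
  struct lwip_tcp_event_packet_t {
    lwip_tcp_event_packet_t *next = nullptr;
    void *client = nullptr;
    struct pbuf *pb = nullptr;
  };

  void enqueue_event(lwip_tcp_event_packet_t *ev);  // hypothetical: links the node, cannot fail

  static err_t example_recv_cb(void *arg, struct tcp_pcb *pcb, struct pbuf *pb, err_t err) {
    auto *ev = new (std::nothrow) lwip_tcp_event_packet_t;  // no exception on failure
    if (ev == nullptr) {
      return ERR_MEM;  // report the allocation failure to lwIP instead of crashing
    }
    ev->client = arg;
    ev->pb = pb;
    enqueue_event(ev);
    return ERR_OK;
  }
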
src/AsyncTCP.cpp Outdated
tcp_recv(pcb, &_tcp_recv);
tcp_sent(pcb, &_tcp_sent);
tcp_err(pcb, &_tcp_error);
tcp_poll(pcb, &_tcp_poll, 1);
Member

Maybe use a constant here? It was CONFIG_ASYNC_TCP_POLL_TIMER before.

Author
@willmmiles Feb 6, 2025

#2 on the merge failure list!

Author

Fixed in 2321755

Comment on lines +789 to +790
tcpip_api_call(_tcp_connect_api, (struct tcpip_api_call_data *)&msg);
return msg.err == ESP_OK;
Member

There is an issue with the client part, which is not working (testing with Client.ino).

connect() returns false from here.

I've changed the code to get the error:

  tcpip_api_call(_tcp_connect_api, (struct tcpip_api_call_data *)&msg);
  if(msg.err != ERR_OK) {
    log_e("tcpip_api_call error: %d", msg.err);
  }
  return msg.err == ESP_OK;
[  1305][E][AsyncTCP.cpp:791] _connect(): tcpip_api_call error: -16

-16 is the invalid arg error.

Hope that helps!

Author

And there's merge failure #3 - I missed an underscore (and put the lines in the wrong place) when importing the locking code.

Fixed in 26e49f6

Should be using the config setting.
We can store the newly allocated handle directly in the object member.
We don't plan on writing to it - might as well save the copying.
@willmmiles
Author

I've put a prototype event coalescing branch at https://github.com/willmmiles/AsyncTCP/tree/replace-queue-coalesce . My basic tests seem to work, but I don't yet have a good test case to really exercise the new logic.
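
For anyone curious, coalescing for poll events roughly takes the shape below. The member and function names are hypothetical, and this is not necessarily what the branch does; the idea is just that each client keeps at most one poll event in the queue at a time.

  #include <new>

  struct event_packet { event_packet *next = nullptr; };

  struct client_state {
    event_packet *pending_poll = nullptr;  // non-null while a poll event sits in the queue
  };

  // Called from the lwIP poll callback.
  bool on_poll(client_state &c) {
    if (c.pending_poll != nullptr) {
      return true;                               // a poll is already queued: coalesce
    }
    auto *ev = new (std::nothrow) event_packet;  // may fail under memory pressure
    if (ev == nullptr) {
      return false;
    }
    c.pending_poll = ev;
    // ... link ev into the event queue under the queue lock ...
    return true;
  }

  // Called from the async task once the queued poll event has been handled.
  void on_poll_handled(client_state &c) {
    c.pending_poll = nullptr;                    // allow the next poll to enqueue again
  }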

@vortigont

@willmmiles I agree with most of your points above; with proper coalescing code the event chain size cap won't be that critical. And the benefits in correctness are worth the effort of proper locking code. Good job indeed!

I'm not sure I understood your point on why std::forward_list is not usable here: a list node could be created with proper alloc exception handling, then the queue locked and the node moved into the list chain. I do not see much of a problem here, but on second thought it might be a good choice to use simplified list-struct code here. std containers on the ESP32 are linked against a quite large lib that can't be stripped down and affects firmware size considerably. On one hand we already use those containers in the AsyncWebServer code, so it will bring in the C++ lib anyway; but on the other hand, if we consider AsyncTCP a standalone lib that might be used elsewhere, it would be wise to minimize dependencies that considerably affect the resulting firmware size. Point taken!

@me-no-dev
Member

Hi folks! Just trying to understand the issues that this PR is hoping to fix. Is it just the fact that events might be missed because the queue overflows, or are there other things being fixed as well? Please excuse my ignorance, I've been out of the game for a bit :)

@mathieucarbou
Member

mathieucarbou commented Feb 7, 2025

Hi folks! Just trying to understand the issues that this PR is hoping to fix. Is it just the fact that events might be missed because the queue overflows, or are there other things being fixed as well? Please excuse my ignorance, I've been out of the game for a bit :)

Here are the problematic behaviours we currently have that this implementation seems to solve:

  • Memory leak: currently asynctcp has one or more memory leaks, which can easily be seen by running the Perf example with autocannon at around 32 concurrent connections. A big amount of heap is lost, and the test cannot be run twice: the MCU crashes because it cannot allocate
  • Running several concurrent connections with asynctcp also produces occasional ack timeouts and memory errors linked to the above
  • SSE events lost: currently, when running 16 concurrent SSE connections, the queue fills up quickly and discards some events, while I do not see events discarded with this implementation. The send rate with this implementation is a little higher, so that might be the reason

Plus all the valid points Will explained above. I think it makes sense to include his changes, following the many efforts he has made in stabilising this library in the context of WLED.

@willmmiles : I will retest your changes and the replace-queue-coalesce branch also, we have 2 use cases.

@mathieucarbou
Member

Here are some test results. I did not test the client part this time.

AsyncTCP#main

  • autocannon -c 64 -w 32 -d 30 -t 30 http://192.168.4.1

    • queue fills up quickly: Errors: [ 96708][D][AsyncTCP.cpp:386] _tcp_poll(): throttling
    • memory leak of 45k !

    2 errors, matching the 2 times I got the throttling message

    I cannot run the test a second time, or I have memory issues.

  • Slow chunk callback: time curl -N -v -G -d 'd=2000' -d 'l=10000' http://192.168.4.1/slow.html --output -

    works, can serve concurrent requests while a slow one is in progress, in about 1 second and we see as expected:

    [149824][D][AsyncTCP.cpp:175] _get_async_event(): coalescing polls, network congestion or async callbacks might be too slow!
    [149835][D][AsyncTCP.cpp:175] _get_async_event(): coalescing polls, network congestion or async callbacks might be too slow!
    [149846][D][AsyncTCP.cpp:175] _get_async_event(): coalescing polls, network congestion or async callbacks might be too slow!
    [149857][D][WebResponses.cpp:351] _ack(): (chunk) out of in-flight credits
    10240
    [149865][D][AsyncTCP.cpp:175] _get_async_event(): coalescing polls, network congestion or async callbacks might be too slow!
    [149876][D][WebResponses.cpp:351] _ack(): (chunk) out of in-flight credits
    [149887][D][WebResponses.cpp:351] _ack(): (chunk) out of in-flight credits
    [149894][D][WebResponses.cpp:351] _ack(): (chunk) out of in-flight credits
    

    Memory leak of ~500b

  • SSE event rate test


    Memory leak of ~500b

willmmiles:replace-queue

  • autocannon -c 64 -w 32 -d 30 -t 30 http://192.168.4.1

    • We see these (expected) errors: [144350][E][AsyncTCP.cpp:1487] _accept(): TCP ACCEPT FAIL
    • memory leak of 45k !

    2 timeouts => which is more expected than the errors above

    FASTER: serves more requests, more data transferred

    NO MEMORY LEAK

  • Slow chunk callback: time curl -N -v -G -d 'd=2000' -d 'l=10000' http://192.168.4.1/slow.html --output -

    I only see:

    [149857][D][WebResponses.cpp:351] _ack(): (chunk) out of in-flight credits
    

    NO MEMORY LEAK

  • SSE event rate test


    CRASH: I did not have this crash yesterday, everything was working fine and no events were lost. Now I get a crash in the delete inside:

    static void _free_event(lwip_tcp_event_packet_t *evpkt) {
      DEBUG_PRINTF("_FE: 0x%08x -> %d 0x%08x [0x%08x]", (intptr_t)evpkt, (int)evpkt->event, (intptr_t)evpkt->client, (intptr_t)evpkt->next);
      if ((evpkt->event == LWIP_TCP_RECV) && (evpkt->recv.pb != nullptr)) {
        // We must free the packet buffer
        pbuf_free(evpkt->recv.pb);
      }
      delete evpkt;
    }
    CORRUPT HEAP: Bad head at 0x3ffd389c. Expected 0xabba1234 got 0x3ffc8ba4
    
    assert failed: multi_heap_free multi_heap_poisoning.c:279 (head != NULL)
    
    
    Backtrace: 0x400832ed:0x3ffd0520 0x4008c1c9:0x3ffd0540 0x400923ae:0x3ffd0560 0x400910b7:0x3ffd0690 0x4008365b:0x3ffd06b0 0x400923e1:0x3ffd06d0 0x40152d29:0x3ffd06f0 0x40151e8d:0x3ffd0710 0x4015dcd0:0x3ffd0730 0x4015e9eb:0x3ffd0750 0x400d6476:0x3ffd0770 0x4015e9a2:0x3ffd0790 0x4015eb9a:0x3ffd07c0 0x4015ed2d:0x3ffd07f0 0x4015ee05:0x3ffd0810 0x4008cbce:0x3ffd0840
      #0  0x400832ed in panic_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp_system/panic.c:463
      #1  0x4008c1c9 in esp_system_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp_system/port/esp_system_chip.c:92
      #2  0x400923ae in __assert_func at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/assert.c:80
      #3  0x400910b7 in multi_heap_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/multi_heap_poisoning.c:279 (discriminator 1)
      #4  0x4008365b in heap_caps_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/heap_caps_base.c:70
      #5  0x400923e1 in free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/heap.c:39
      #6  0x40152d29 in operator delete(void*) at /Users/brnomac003/.gitlab-runner/builds/qR2TxTby/0/idf/crosstool-NG/.build/xtensa-esp-elf/src/gcc/libstdc++-v3/libsupc++/del_op.cc:49
      #7  0x40151e8d in operator delete(void*, unsigned int) at /Users/brnomac003/.gitlab-runner/builds/qR2TxTby/0/idf/crosstool-NG/.build/xtensa-esp-elf/src/gcc/libstdc++-v3/libsupc++/del_ops.cc:33
      #8  0x4015dcd0 in _free_event(lwip_tcp_event_packet_t*) at .pio/libdeps/arduino-3/AsyncTCP/src/AsyncTCP.cpp:151 (discriminator 1)
      #9  0x4015e9eb in AsyncClient::~AsyncClient() at .pio/libdeps/arduino-3/AsyncTCP/src/AsyncTCP.cpp:704
      #10 0x400d6476 in std::_Function_handler<void (void*, AsyncClient*), AsyncEventSourceClient::AsyncEventSourceClient(AsyncWebServerRequest*, AsyncEventSource*)::{lambda(void*, AsyncClient*)#2}>::_M_invoke(std::_Any_data const&, void*&&, AsyncClient*&&) at src/AsyncEventSource.cpp:183 (discriminator 1)
          (inlined by) __invoke_impl<void, AsyncEventSourceClient::AsyncEventSourceClient(AsyncWebServerRequest*, AsyncEventSource*)::<lambda(void*, AsyncClient*)>&, void*, AsyncClient*> at /Users/mat/.platformio/packages/toolchain-xtensa-esp-elf/xtensa-esp-elf/include/c++/13.2.0/bits/invoke.h:61 (discriminator 1)
          (inlined by) __invoke_r<void, AsyncEventSourceClient::AsyncEventSourceClient(AsyncWebServerRequest*, AsyncEventSource*)::<lambda(void*, AsyncClient*)>&, void*, AsyncClient*> at /Users/mat/.platformio/packages/toolchain-xtensa-esp-elf/xtensa-esp-elf/include/c++/13.2.0/bits/invoke.h:111 (discriminator 1)
          (inlined by) _M_invoke at /Users/mat/.platformio/packages/toolchain-xtensa-esp-elf/xtensa-esp-elf/include/c++/13.2.0/bits/std_function.h:290 (discriminator 1)
      #11 0x4015e9a2 in std::function<void (void*, AsyncClient*)>::operator()(void*, AsyncClient*) const at /Users/mat/.platformio/packages/toolchain-xtensa-esp-elf/xtensa-esp-elf/include/c++/13.2.0/bits/std_function.h:591
      #12 0x4015eb9a in AsyncClient::_error(signed char) at .pio/libdeps/arduino-3/AsyncTCP/src/AsyncTCP.cpp:960
          (inlined by) AsyncClient::_error(signed char) at .pio/libdeps/arduino-3/AsyncTCP/src/AsyncTCP.cpp:955
      #13 0x4015ed2d in AsyncClient_detail::handle_async_event(lwip_tcp_event_packet_t*) at .pio/libdeps/arduino-3/AsyncTCP/src/AsyncTCP.cpp:312
      #14 0x4015ee05 in _async_service_task(void*) at .pio/libdeps/arduino-3/AsyncTCP/src/AsyncTCP.cpp:363
      #15 0x4008cbce in vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:139
    

@me-no-dev
Member

So: memory leaks and lost events when running under high load, which might actually be connected.

@vortigont

Hi @mathieucarbou,
how do you find this mem leak, btw ("memory leak of 45k!")? I do not see that with current main; the free heap surely drops during the test, but once stopped it recovers after a while. Just want to be aligned on test cases here.

@mathieucarbou
Member

mathieucarbou commented Feb 7, 2025

Hi @mathieucarbou, how do you find this mem leak, btw ("memory leak of 45k!")? I do not see that with current main; the free heap surely drops during the test, but once stopped it recovers after a while. Just want to be aligned on test cases here.

I just run the perftest example on main, then connect to the AP and run: autocannon -c 64 -w 32 -d 30 -t 30 http://192.168.4.1.

I don't see any recovery after 20-30 sec, and if I run the CLI a second time, the ESP crashes with a memory error.

Note: I do see a recovery with Will's code; the heap recovers in a few seconds.

@mathieucarbou
Member

mathieucarbou commented Feb 7, 2025

@vortigont I think I was able to pinpoint the issue a bit more:

  • autocannon -c 32 -w 32 -a 32 -t 30 http://192.168.4.1

Runs 32 concurrent requests, spread across 32 workers.

If I run this 3 times, I do not have any memory leak. Output:

Uptime:  14 s, requests:   0, Free heap: 236148
Uptime:  16 s, requests:   0, Free heap: 236148
Uptime:  18 s, requests:   0, Free heap: 236148
Uptime:  20 s, requests:   0, Free heap: 236148
Uptime:  22 s, requests:   0, Free heap: 236148
[ 23777][D][AsyncTCP.cpp:175] _get_async_event(): coalescing polls, network congestion or async callbacks might be too slow!
Uptime:  24 s, requests:  22, Free heap: 135992
Uptime:  26 s, requests:  32, Free heap: 226404
Uptime:  28 s, requests:  32, Free heap: 235844
Uptime:  30 s, requests:  32, Free heap: 235844
Uptime:  32 s, requests:  32, Free heap: 235844
Uptime:  34 s, requests:  32, Free heap: 235844
Uptime:  36 s, requests:  32, Free heap: 234652
Uptime:  38 s, requests:  57, Free heap: 235844
Uptime:  40 s, requests:  64, Free heap: 235844
Uptime:  42 s, requests:  64, Free heap: 235844
Uptime:  44 s, requests:  64, Free heap: 235844
Uptime:  46 s, requests:  64, Free heap: 235844
Uptime:  48 s, requests:  64, Free heap: 235844
Uptime:  50 s, requests:  82, Free heap: 235844
[ 50775][D][AsyncTCP.cpp:175] _get_async_event(): coalescing polls, network congestion or async callbacks might be too slow!
Uptime:  52 s, requests:  96, Free heap: 235844
Uptime:  54 s, requests:  96, Free heap: 235844
Uptime:  56 s, requests:  96, Free heap: 235844
Uptime:  58 s, requests:  96, Free heap: 235844
Uptime:  60 s, requests:  96, Free heap: 235844
Uptime:  62 s, requests:  96, Free heap: 235844
Uptime:  64 s, requests:  96, Free heap: 235844
Uptime:  66 s, requests:  96, Free heap: 235844

Now, I am testing a different situation:

  • autocannon -c 32 -w 32 -a 96 -t 30 http://192.168.4.1

I still have 32 concurrent connections over 32 workers, but this time I am sending 96 requests in total, so some workers will send more than 1 request.

I run this 3 times:

autocannon tells me it served 96 requests each time I ran it (so 3 times).


But the number of requests received is only 192, so 64 per test, which is 32 less than the expected number. And the logs show some heap loss that never recovers:

Uptime:   2 s, requests:   0, Free heap: 236444
Uptime:   4 s, requests:   0, Free heap: 236444
Uptime:   6 s, requests:   0, Free heap: 236444
Uptime:   8 s, requests:  41, Free heap: 225492
Uptime:  10 s, requests:  58, Free heap: 233848
Uptime:  12 s, requests:  64, Free heap: 233848
Uptime:  14 s, requests:  64, Free heap: 233848
Uptime:  16 s, requests:  64, Free heap: 233848
Uptime:  18 s, requests:  64, Free heap: 233848
Uptime:  20 s, requests:  64, Free heap: 233848
Uptime:  22 s, requests:  64, Free heap: 233848
Uptime:  24 s, requests:  64, Free heap: 233848
Uptime:  26 s, requests:  64, Free heap: 233848
Uptime:  28 s, requests:  64, Free heap: 233848
Uptime:  30 s, requests:  64, Free heap: 233848
Uptime:  32 s, requests:  64, Free heap: 233848
Uptime:  34 s, requests:  64, Free heap: 233848
Uptime:  36 s, requests:  64, Free heap: 233848
Uptime:  38 s, requests:  64, Free heap: 233848
Uptime:  40 s, requests:  64, Free heap: 233848
Uptime:  42 s, requests:  64, Free heap: 233848
Uptime:  44 s, requests:  64, Free heap: 233848
Uptime:  46 s, requests:  64, Free heap: 233848
Uptime:  48 s, requests:  88, Free heap: 232504
Uptime:  50 s, requests: 114, Free heap: 233044
Uptime:  52 s, requests: 124, Free heap: 233044
Uptime:  54 s, requests: 128, Free heap: 233044
Uptime:  56 s, requests: 128, Free heap: 233044
Uptime:  58 s, requests: 128, Free heap: 233044
Uptime:  60 s, requests: 128, Free heap: 233044
Uptime:  62 s, requests: 128, Free heap: 233044
Uptime:  64 s, requests: 128, Free heap: 233044
Uptime:  66 s, requests: 128, Free heap: 233044
Uptime:  68 s, requests: 128, Free heap: 233044
Uptime:  70 s, requests: 128, Free heap: 233044
Uptime:  72 s, requests: 128, Free heap: 233044
Uptime:  74 s, requests: 128, Free heap: 233044
Uptime:  76 s, requests: 128, Free heap: 233044
Uptime:  78 s, requests: 128, Free heap: 233044
Uptime:  80 s, requests: 128, Free heap: 233044
Uptime:  82 s, requests: 128, Free heap: 233044
Uptime:  84 s, requests: 128, Free heap: 233044
Uptime:  86 s, requests: 128, Free heap: 233044
Uptime:  88 s, requests: 128, Free heap: 233044
Uptime:  90 s, requests: 158, Free heap: 226620
Uptime:  92 s, requests: 185, Free heap: 227068
Uptime:  94 s, requests: 192, Free heap: 227068
Uptime:  96 s, requests: 192, Free heap: 227068
Uptime:  98 s, requests: 192, Free heap: 227068
Uptime: 100 s, requests: 192, Free heap: 227068
Uptime: 102 s, requests: 192, Free heap: 227068
Uptime: 104 s, requests: 192, Free heap: 227068
Uptime: 106 s, requests: 192, Free heap: 227068
Uptime: 108 s, requests: 192, Free heap: 227068
Uptime: 110 s, requests: 192, Free heap: 227068
Uptime: 112 s, requests: 192, Free heap: 227068
Uptime: 114 s, requests: 192, Free heap: 227068
Uptime: 116 s, requests: 192, Free heap: 227068
Uptime: 118 s, requests: 192, Free heap: 227068
Uptime: 120 s, requests: 192, Free heap: 227068
Uptime: 122 s, requests: 192, Free heap: 227068
Uptime: 124 s, requests: 192, Free heap: 227068
Uptime: 126 s, requests: 192, Free heap: 227068
Uptime: 128 s, requests: 192, Free heap: 227068
Uptime: 130 s, requests: 192, Free heap: 227068
...
Uptime: 438 s, requests: 192, Free heap: 228892
Uptime: 440 s, requests: 192, Free heap: 228892
Uptime: 442 s, requests: 192, Free heap: 228892
Uptime: 444 s, requests: 192, Free heap: 228892
Uptime: 446 s, requests: 192, Free heap: 228892
Uptime: 448 s, requests: 192, Free heap: 228892
Uptime: 450 s, requests: 192, Free heap: 228892
  • autocannon -c 64 -w 32 -d 30 -t 30 http://192.168.4.1

This is the test I did previously: it runs 64 connections concurrently, over 32 workers, and I am maintaining that for 30 seconds. So I have some workers sending several requests concurrently and also sequentially.

@willmmiles
Author

Hi folks! Just trying to understand the issues that this PR is hoping to fix. Is it just the fact that events might be missed because the queue overflows, or are there other things being fixed as well? Please excuse my ignorance, I've been out of the game for a bit :)

So: memory leaks and lost events when running under high load, which might actually be connected.

Exactly that. If an event which would've triggered onDisconnect() (FIN or ERROR) is discarded, any resources held by AsyncClient's owner (eg. AsyncWebRequest or AsyncWebResponse) are orphaned and lost forever. Other event drops (RECV or SENT) disrupt the handler logic -- it's hard to parse request headers with a block in the middle missing; and the response code will wait indefinitely if it's never told some bytes were actually transmitted. So the reasonable response from AsyncTCP is to abort the connection, but then we're back to trying to enqueue an ERROR event when we already know the queue is full... you can see where this is going.

The only way through is to guarantee that the onDisconnect() triggering event can always be enqueued, no matter what. The key change in this code is that the end event packet is allocated when AsyncClient is constructed, or construction fails; and the queue implementation guarantees that no other resources are required to enqueue successfully, given an already allocated event.
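
A hedged sketch of that guarantee; the class shape and helper below are assumptions about the structure of the change (only the _end_event name comes from the discussion), not the literal code.

  #include <new>

  struct lwip_tcp_event_packet_t { lwip_tcp_event_packet_t *next = nullptr; };

  void enqueue_never_fails(lwip_tcp_event_packet_t *ev);  // hypothetical: pure pointer relinking

  class client_sketch {
    lwip_tcp_event_packet_t *_end_event = nullptr;

  public:
    client_sketch() {
      // The disconnect/error event is allocated up front. If even this fails,
      // the client is treated as unusable and the connection is refused now,
      // instead of leaking resources later when the event cannot be delivered.
      _end_event = new (std::nothrow) lwip_tcp_event_packet_t;
    }

    bool valid() const { return _end_event != nullptr; }

    void on_connection_end() {
      // Nothing to allocate here, so delivery of onDisconnect() cannot fail.
      enqueue_never_fails(_end_event);
      _end_event = nullptr;  // ownership passes to the queue
    }
  };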

The second set of changes was a removal of the closed_slot system in favor of using the memory already held by AsyncClient. This served two purposes: one, it eliminates the risk that the underlying LwIP had room for a new connection, but AsyncTCP still had objects yet to be cleaned up (leading to dropped connections that theoretically could have been serviced); and two, a slight reduction in static RAM usage and code complexity. The key insight is that the _pcb member is already owned by the LwIP task (since LwIP reserves the right to deallocate it out from underneath us! Thanks for nothing, tcp_err :( ). All the client calls were already implemented by running them on the LwIP task, so we could use double indirection over the tcp_pcb* to validate the up-to-the-moment current state when the operation is executed on the LwIP task with no extra memory overhead.
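
A hedged sketch of the double-indirection idea: the caller hands the lwIP thread a pointer to the client's own _pcb member, and the lwIP thread dereferences it at execution time, so a pcb that was invalidated in the meantime (e.g. after tcp_err) is simply seen as null and the call becomes a no-op. The struct and function names here are assumptions; only tcpip_api_call and the lwIP calls are real API.

  #include <lwip/tcp.h>
  #include <lwip/tcpip.h>

  struct close_msg {
    struct tcpip_api_call_data call;  // must be the first member for the cast below
    tcp_pcb **pcb_slot;               // points at the client's own _pcb member
    err_t err;
  };

  // Runs on the lwIP thread, which is the only writer of the pcb slot.
  static err_t do_close_on_lwip_thread(struct tcpip_api_call_data *data) {
    auto *msg = reinterpret_cast<close_msg *>(data);
    tcp_pcb *pcb = *msg->pcb_slot;     // read the slot at execution time
    if (pcb == nullptr) {
      msg->err = ERR_OK;               // pcb already invalidated: nothing to do
      return ERR_OK;
    }
    msg->err = tcp_close(pcb);
    if (msg->err == ERR_OK) {
      *msg->pcb_slot = nullptr;        // clear the slot so later calls see it is gone
    }
    return msg->err;
  }

  // Called from any other task; blocks until the lwIP thread has run the closure.
  err_t request_close(tcp_pcb **pcb_slot) {
    close_msg msg = {};
    msg.pcb_slot = pcb_slot;
    tcpip_api_call(do_close_on_lwip_thread, &msg.call);
    return msg.err;
  }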

@vortigont

This autocannon thing can be really confusing. To my surprise I found that the -w option does not make much sense at all; the number of workers does not correlate with the number of "connections", so maybe it's just there to exploit multicore CPUs on the client. I usually do not use it at all, one worker is enough to stress the esp :) but anyway...
The default queue size of 64 is definitely not enough to accept 64 parallel connections if they arrive at almost the same time; there is a high probability that some of the accept/close events will be missed. We could add some debug messages for such cases to see the losses. With the current approach one has to adjust the queue size to fit the required needs.

@willmmiles
Author

willmmiles commented Feb 7, 2025

I'm not sure I understood your point on why std::forward_list is not usable here: a list node could be created with proper alloc exception handling, then the queue locked and the node moved into the list chain.

If we're fortunate enough to have exceptions enabled, then yes, std::forward_list could suffice. It'd be a bit of a dance to manage the pre-allocation of the end events; AsyncClient would actually have to declare _end_event as another std::forward_list that could be splice_after()d to guarantee nothrow insertion; and this in turn would require exposing lwip_tcp_event_packet_t in the public header, so std::forward_list could be instantiated ... but it could be done, though it wouldn't be pretty.
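
If exceptions were available, that dance could look roughly like this. It is a hedged sketch only: the names are illustrative, and a real queue would also track a tail iterator to preserve FIFO order rather than splicing at the front.

  #include <forward_list>
  #include <mutex>

  struct lwip_tcp_event_packet_t { int event = 0; };

  std::mutex queue_mutex;
  std::forward_list<lwip_tcp_event_packet_t> queue;

  struct client_sketch {
    // One pre-allocated node; the allocation can throw here, at construction,
    // rather than at disconnect time.
    std::forward_list<lwip_tcp_event_packet_t> end_event;

    client_sketch() : end_event(1) {}

    void queue_end_event() {
      std::lock_guard<std::mutex> lock(queue_mutex);
      // splice_after() only relinks existing nodes: no allocation, cannot throw.
      queue.splice_after(queue.before_begin(), end_event);
    }
  };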

Alas, it seems that Arduino ESP32 projects are built with exceptions disabled. Since std::forward_list handles allocation and construction of member objects internally, there's no way to use it safely without risking crashing the whole system on an allocation failure. :( FWIW, I'm quite frustrated that this is the default setting here -- the C++ standards body has been very, very clear that "C++ without exceptions" is never going to be considered a meaningful language target, and the standard libraries will continue to evolve in ways that are incompatible.

boost::intrusive::list might work, but I for one don't want to deal with figuring out how to integrate all of Boost's infrastructure to a project like this ...

On one hand we already use those containers in AsyncWebServer code, so it will bring cxx lib anyway

If using this team's releases, yes. We forked AsyncWebServer before those changes; and that decision blocks us from switching. We don't need the new features -- what we need is heap exhaustion safety, which can't be achieved without exceptions if you're using std containers. I'm happy to share all the work we've done, but as long as you're using std containers, we have no choice but to go our own way. :(

@mathieucarbou
Member

mathieucarbou commented Feb 7, 2025

@willmmiles: for espasyncwebserver, 3.x will probably stay as is because it is in bugfix / maintenance mode. We plan on starting a v4 branch once 3.x is more stable, to start cleaning stuff up and changing the APIs that need to be changed. This will be an opportunity to review all the memory allocations, since breaking the current API will be possible.

Did you see by the way the stack trace I got today when testing your branch ? I didn't have that yesterday.

Thanks ☺️

@willmmiles
Author

@willmmiles: for espasyncwebserver, 3.x will probably stay as is because it is in bugfix / maintenance mode. We plan on starting a v4 branch once 3.x is more stable, to start cleaning stuff up and changing the APIs that need to be changed. This will be an opportunity to review all the memory allocations, since breaking the current API will be possible.

Unfortunately your v4 plans are also of no help to us -- ESP8266 is 50%+ of our active user base. (Some of the other maintainers were very surprised to see those stats on our last release!) A lot of us have them deployed in hard-to-replace locations. They're capable little chips with many years of service life left, and it always saddens me to find so many projects wanting to turn them into e-waste. :(

Did you see by the way the stack trace I got today when testing your branch ? I didn't have that yesterday.

Yup, working on it!

@vortigont

Since std::forward_list handles allocation and construction of member objects internally, there's no way to use it safely without risking crashing the whole system on an allocation failure

@willmmiles please don't get me wrong, I'm not trying to persuade you into using std stuff, just sharing ideas and experience here. It is interesting to understand other projects' demands. Though I could hardly imagine a situation where the heap is so tight that it can't fit a new dozen-byte node like a std::list<lwip_tcp_event_packet_t*> entry while still being able to run all the other heavy stuff like HTTP/WS requests :) As for me, the whole thing would die long before that.
BTW AsyncTCPSock's implementation is quite interesting in this respect, being based on selected socket events. But it is also full of std things, though. You are quite constrained in the available options here. Have you considered rebuilding arduino with exception handling, then? But I guess that would put even more stress on the available MCU resources.

@willmmiles
Author

@willmmiles please don't get me wrong, I'm not trying to persuade you into using std stuff, just sharing ideas and experience here. It is interesting to understand other projects' demands. Though I could hardly imagine a situation where the heap is so tight that it can't fit a new dozen-byte node like a std::list<lwip_tcp_event_packet_t*> entry while still being able to run all the other heavy stuff like HTTP/WS requests :) As for me, the whole thing would die long before that.

Somewhat ironically, it's the web requests themselves that "cause" the heap exhaustion. An idle ESP32-S2 system might have maybe 50-70kb of heap available. A single session, busy serializing 10kb of JSON, plus transport buffers, fits easily in available RAM and runs without issue. But when we have many requests in flight at once it can use a lot of heap quickly, with LwIP transfer buffers, upper layer serialization buffers, filesystem buffers, and so on... (Darn HomeAssistant never stops polling that big presets file!)

So the reported behaviour from the users is "everything works great, except sometimes it mysteriously resets". Crash dumps indicated heap exhaustion because too many web requests were being serviced in parallel -- the user just so happened to hit the server at the same moment as their other integrations, resulting in OOM and the world ending.

My first attempt to mitigate the problem was to try adding a queue in AsyncWebServer, permitting only a couple of requests to be in flight at once, and 503ing anything we didn't think we could handle. It helped - at least from the end user perspective - but because I'm a glutton for punishment, I wasn't going to be satisfied until the web server could no longer bring down an otherwise correctly operating device.

Unfortunately, even queuing responses quickly spirals out of control if you hit it with a stress test like autocannon -- you can't stop new connections from being attempted, LwIP allocating control blocks and receive buffers, or easily throttle incoming request headers. From a library perspective, it's also very difficult to predict how much heap will be needed to serve each request. And of course, frustratingly, when a new connection arrives, you can't avoid having LwIP buffer at least the first MSS of data (IIRC on ESP32 that's ~1.5k), even if you just want to tell them to come back later! Further, if you decline to ack() so as to squelch further incoming traffic, you have to either (a) set up your own timer, (b) orchestrate that other connections will let you know it's safe to continue, or (c) hang up the client for 1000ms waiting for the darn tcp_poll. :(

If I had control over the LwIP stack, I'd both (a) allow drastically smaller poll timers - something in the 10s of ms range; and (b) arrange a way to intentionally solicit retransmission/re-recv of a packet I declined to handle the first time so as not to spend my RAM on something the other TCP stack still has. (Kind of dirty, I know!)

Have you considered rebuilding arduino with exception handling then? But I guess it will put even more stress on available mcu resources.

I haven't investigated it personally yet; maybe someday. Another aspect of the WLED project is that it's big on integrations - both maintained by the core team and supplied as user modules - that call on all kinds of third party libraries. We also support a large variety of boards. This puts a cost on every non-standard environment decision we might make, as the further we get from common platforms, the tougher it is for users to integrate their own features. Still, it's quite possible that exception support is not yet a bridge too far!

@me-no-dev
Member

C++ Exceptions are enabled in ESP32 Arduino. Have been for a very long time now. Do you see a different result? Or is it something else that needs enabling?

mathieucarbou added a commit that referenced this pull request Feb 7, 2025
…ts were not correctly cleaned up

Fix inspired by PR #21 from @willmmiles
mathieucarbou added a commit that referenced this pull request Feb 7, 2025
…ts were not correctly cleaned up

Fix inspired by PR #21 from @willmmiles
mathieucarbou added a commit that referenced this pull request Feb 7, 2025
mathieucarbou added a commit that referenced this pull request Feb 7, 2025
mathieucarbou added a commit that referenced this pull request Feb 7, 2025
mathieucarbou added a commit that referenced this pull request Feb 8, 2025
@willmmiles
Author

C++ Exceptions are enabled in ESP32 Arduino. Have been for a very long time now. Do you see a different result? Or is it something else that needs enabling?

We're still supporting ESP8266 - not relevant for this repo, but very important for std usage in our AsyncWebServer selection. For this repo, as far as I'm aware std doesn't offer any containers with the ability to pre-allocate before insertion without exposing the complete definition of the value type.

That said, maybe someday I'll look in to enabling exceptions on 8266es; might be the path of least resistance.

@willmmiles
Author

Did you see by the way the stack trace I got today when testing your branch ? I didn't have that yesterday.

Yup, working on it!

@mathieucarbou I haven't been able to reproduce this. I'm pushing a couple of small fixes I found through inspection, but I couldn't get it to crash for me. :(

@willmmiles
Author

Did you see by the way the stack trace I got today when testing your branch ? I didn't have that yesterday.

Yup, working on it!

@mathieucarbou I haven't been able to reproduce this. I'm pushing a couple of small fixes I found through inspection, but I couldn't get it to crash for me. :(

Finally got a crash locally, though not in AsyncTCP -- this time the victim was AsyncWebServer::_handleDisconnect. There's definitely a use after free somewhere. I will continue to debug.
