From 4f0119936a068bffab91e155a0a0ed103051a479 Mon Sep 17 00:00:00 2001 From: Gillian Minnehan Date: Wed, 19 Feb 2025 17:09:22 -0500 Subject: [PATCH] Add grammar pass --- _data/authors.yml | 2 +- ...-02-20-why-sleep-for-is-broken-on-esp32.md | 142 +++++++++--------- 2 files changed, 71 insertions(+), 73 deletions(-) diff --git a/_data/authors.yml b/_data/authors.yml index a1ba8dd4..c3d11a51 100644 --- a/_data/authors.yml +++ b/_data/authors.yml @@ -386,4 +386,4 @@ stevenoonan: linkedin: https://www.linkedin.com/in/embeddeddev/ image: /img/author/stevenoonan.png blurb: - is Principal Embedded Systems Engineer at Nomo International, Inc. + is a Principal Embedded Systems Engineer at Nomo International, Inc. diff --git a/_posts/2025-02-20-why-sleep-for-is-broken-on-esp32.md b/_posts/2025-02-20-why-sleep-for-is-broken-on-esp32.md index 208dd782..911330e6 100644 --- a/_posts/2025-02-20-why-sleep-for-is-broken-on-esp32.md +++ b/_posts/2025-02-20-why-sleep-for-is-broken-on-esp32.md @@ -32,8 +32,8 @@ at version v4.4 of Espressif's (IDF). After about two days of work, the project compiled and ran in IDF v5.3. Except, something was not quite right. Things were sluggish. FreeRTOS's [system state API](https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/freertos_idf.html#_CPPv420uxTaskGetSystemStatePC12TaskStatus_tK11UBaseType_tPC27configRUN_TIME_COUNTER_TYPE) -indicated that one the two Xtensa LX6 processors on the ESP32 was pinned at 100% -usage. The culprits were two high priority threads, both of which had a +indicated that one of the two Xtensa LX6 processors on the ESP32 was pinned at +100% usage. The culprits were two high-priority threads, both of which had a relatively small loop with a 10 millisecond sleep between each iteration. Sleeping for 10 milliseconds should leave plenty of time to do other things. @@ -42,7 +42,7 @@ Plus, this code worked fine in IDF v4 without an issue. What did IDF v5 do? ## FreeRTOS on the ESP32 Let's go into some background first. Experienced embedded engineers will likely -find parts of this section as review, but we need to be solid in our +find parts of this section as a review, but we need to be solid in our understanding of the fundamentals before moving on. ### System Tick Period @@ -104,14 +104,14 @@ period of 10 milliseconds. Exactly how much time the thread will be sleep is unknown to this code. So to be clear: sleeping _"1 tick"_ does not mean it will sleep for one tick -_period of time_. In fact, it will basically always sleep some amount less than -one tick period, and sometimes it will basically not sleep at all. +_period of time_. In fact, it will always sleep some amount less than one tick +period, and sometimes it will not sleep at all. So then, is calling `vTaskDelay(1)` in a small loop a problem? Not intrinsically. Like most things, it's fine if you know what you are doing. If you want to ensure a thread (technically called a [Task](https://www.freertos.org/Documentation/02-Kernel/02-Kernel-features/01-Tasks-and-co-routines/00-Tasks-and-co-routines) -in FreeRTOS) exectues its logic is performed on every tick, this a way to do it. +in FreeRTOS) executes its logic is performed on every tick, this a way to do it. There are no problems so long as the logic is not sensitive to tick jitter and its processing only requires a small fraction of the tick period. @@ -141,9 +141,9 @@ what we want, because passing `0` to `vTaskDelay()` does nothing at all. Okay, but what if we _really_ need to wait for less than a tick period? We can [busy wait](https://en.wikipedia.org/wiki/Busy_waiting). The IDF has -`esp_rom_delay_us()` which does just that. But this is an +`esp_rom_delay_us()`, which does just that. But this is an [Internal API](https://docs.espressif.com/projects/esp-idf/en/stable/esp32/api-reference/system/internal-unstable.html) -which we don't have source code for, and it's functionality may change or be +which we don't have source code for, and its functionality may change or be removed in the future. ESP does give us `usleep()` though, and that allows precise delays. The caveat here is that the FreeRTOS scheduler is still free to context switch out of the busy loop due to the important bit I said above about @@ -152,7 +152,7 @@ the FreeRTOS scheduler (and perhaps all interrupts) for that period. The other option is to use a free running timer and trigger an interrupt when it expires. The interrupt can signal FreeRTOS to wake the waiting task. This is -great so long as your context switch overheads allow it, and you have a free +great as long as your context switch overheads allow it and you have a free hardware timer to do it. All of this is to say the following: @@ -165,7 +165,7 @@ All of this is to say the following: Alright then, so what does all this have to do with our issue related to a 10 millisecond delay? Well, I am not actually using `vTaskDelay(1)` directly. -Instead I am using C++ thread APIs, and therefore the actual code causing all +Instead, I am using C++ thread APIs, and therefore, the actual code causing all the CPU usage is `std::this_thread::sleep_for(10ms)`. Espressif provides @@ -177,7 +177,7 @@ So how does something like an ESP32 MCU support C++ threads anyway? ### `std::thread`, POSIX Threads, and FreeRTOS Tasks, oh my! -Espressif's IDF compiles with GCC, and therefore uses +Espressif's IDF compiles with GCC and therefore uses [The GNU C++ Library](https://gcc.gnu.org/onlinedocs/libstdc++/) (`libstdc++`). `libstdc++`'s implementation of `std::thread` is built on top of [`pthreads`](https://en.wikipedia.org/wiki/Pthreads), but it is up to the @@ -195,7 +195,7 @@ If that's not enough, we also need to consider the [C standard library](https://en.wikipedia.org/wiki/C_standard_library). IDF uses [newlib](https://en.wikipedia.org/wiki/Newlib) and has custom platform implementations of its sleeping functions. Thankfully, `newlib` is not -precompiled and you can easily find it's source code in the `components/newlib` +precompiled and you can easily find its source code in the `components/newlib` directory of an IDF install. My problem appears in IDF v5 because it upgraded GCC from 8.4 to 13.2, and @@ -236,7 +236,7 @@ If you are following along in your own IDF project, you can find the `xtensa-esp32-elf/esp-2021r2-patch5-8.4.0/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/8.4.0/thread` (or similar, depending on the version). -IDF does not define `_GLIBCXX_USE_NANOSLEEP`, so, no `nanosleep()`. That +IDF does not define `_GLIBCXX_USE_NANOSLEEP`, so no `nanosleep()`. That shouldn't be too surprising for an MCU implementation. Therefore, we go straight to `__sleep_for(__s, __ns)`. @@ -304,11 +304,11 @@ Let's clean it up by taking out the parts not compiled for the ESP32: Pretty simple: get the number of seconds and call `sleep()`, then get the number of microseconds and call `usleep()`. Both of these are calls into `newlib`. We -will get to those in a moment. First let's look at the IDF v5's `sleep_for()`. +will get to those in a moment. First, let's look at the IDF v5's `sleep_for()`. ### IDF 5.x: `std::this_thread::sleep_for()` -Once again `sleep_for()` is a wrapper for `__sleep_for()`. Here is the newer +Once again, `sleep_for()` is a wrapper for `__sleep_for()`. Here is the newer code ([see online](https://github.com/gcc-mirror/gcc/blob/1c0ecf06c0c9f3268ba52d4027e9b78aae4acb51/libstdc%2B%2B-v3/src/c%2B%2B11/thread.cc#L235)): @@ -341,22 +341,21 @@ __sleep_for(chrono::seconds __s, chrono::nanoseconds __ns) } ``` -Now we have a `while` loop. In it `usleep()` is called, but afterwards an -explicit check is made to determine if the minimal amount of time has actually -gone by. The only way out of the loop is if `now` is at or after the `target` -time. +Now we have a `while` loop. In it, `usleep()` is called, but afterward, there is +an explicit check to determine if the minimal amount of time has actually gone +by. The only way out of the loop is if `now` is at or after the `target` time. Why the change? This is because `nanosleep()` may return early if a signal is -delivered to the thread during its sleep. The change is found on -[commit cfef4c3](https://github.com/gcc-mirror/gcc/commit/cfef4c324ac300c0ad120f0fcee376159de84a0c) -The relevant commit comment is the following: +delivered to the thread during its sleep. The change is in +[commit cfef4c3](https://github.com/gcc-mirror/gcc/commit/cfef4c324ac300c0ad120f0fcee376159de84a0c), +with the relevant commit comment as the following: ```git Loop while sleep call is interrupted and until steady_clock - shows requested duration has elapsed. + shows requested duration has elapsed. ``` -ESP IDF doesn't support signals so the change should have no impact on the +ESP IDF doesn't support signals, so the change should have no impact on the ESP32. Oh, but it does! ## IDF Newlib @@ -403,7 +402,7 @@ int clock_gettime(clockid_t clock_id, struct timespec *tp) } ``` -Calling `esp_time_impl_get_time()` gets a value based on a free running hardware +Calling `esp_time_impl_get_time()` gets a value based on a free-running hardware timer with at least 1 us accuracy. So that is how the monotonic clock is implemented on the ESP32. Note that this clock is completely independent of the timer that is controlling the FreeRTOS system tick interrupt. No problems there. @@ -442,15 +441,15 @@ int usleep(useconds_t us) ``` Remember when I said that `esp_rom_delay_us()` is an internal API and we should -use `usleep()` instead? Well `usleep()` is just a wrapper to +use `usleep()` instead? Well, `usleep()` is just a wrapper to `esp_rom_delay_us()` for short periods (e.g. less than one tick period). We can consider `usleep()` the "public" exposure of `esp_rom_delay_us()`, but only when the specified time is less than a system tick period. As mentioned above, this is a busy wait, and since it does not disable the scheduler, it still allows -other threads _of equal or higher priority_ to run. So the timing represents a +other threads _of equal or higher priority_ to run. So, the timing represents a guaranteed _minimum_ only. More importantly, if there are other threads of lower -priority it will _not_ context switch during this busy time. It will just sit in -the thread until the wait is over. +priority, it will _not_ context switch during this busy time. It will just sit +in the thread until the wait is over. This is all good. A guaranteed minimum is how I expect `usleep()` to work. @@ -478,25 +477,24 @@ calculations would give us `vTaskDelay(2)`: 4. But since we are using integers, it is just `2` But what does `2` mean? It means two occurrences of the tick interrupt. Once -again we can ask: when is the next tick interrupt from the moment that +again, we can ask: when is the next tick interrupt from the moment that `vTaskDelay(2)` is called? The tick interrupt could be in 1 nanosecond all the way up to 10 milliseconds from now. After that, the second tick will be the system tick period of 10 milliseconds. So for a tick count of `2`, we will actually sleep between between 10 and 20 milliseconds. -Despite the fact that the comment says it is rounding up to compensate for the -first tick potentially not blocking at all, the compensation does not account -for the worse case minimal timing. In the example I gave, a 15 millisecond -request will sometimes only sleep for 10 milliseconds. Likewise, a 10 -millisecond `usleep()` will sometimes sleep about 0 milliseconds. The greatest -_potential_ differential comes with calling `usleep()` with a multiple of the -tick period. In that case, the actual time slept may be short by an entire tick -period. - -Whether this "shorter than specified" time matters depends on requirements of -course. In my experience a lot of application level logic would probably be fine -with a "10 to 20" millisecond wait. Hardware driver logic and audio pipelines on -the other have specific realtime characteristics where this would absolutely be +Even though the comment says it is rounding up to compensate for the first tick +potentially not blocking at all, the compensation does not account for the +worse-case minimal timing. In the example I gave, a 15-millisecond request will +sometimes only sleep for 10 milliseconds. Likewise, a 10 millisecond `usleep()` +will sometimes sleep about 0 milliseconds. The greatest _potential_ differential +comes with calling `usleep()` with a multiple of the tick period. In that case, +the time spent may be short by an entire tick period. + +Whether this "shorter than specified" time matters depends on requirements. In +my experience, a lot of application-level logic would probably be fine with a +"10 to 20" millisecond wait. On the other hand, hardware driver logic and audio +pipelines have specific realtime characteristics where this would absolutely be a problem. According to `man 3 sleep` and @@ -512,19 +510,19 @@ sleep more if needed. > scheduling of other activity by the system. The fact that ESP32's IDF `usleep()` can be short by up to one system tick -period means doesn't follow the specification. +period means it doesn't follow the specifications. ## Would You Like Precision or Efficiency? Let's recap. In IDF v4, calling `std::this_thread::sleep_for()` calls `usleep()` once. When -`std::this_thread::sleep_for(10ms)` is executed, it calls `vTaskDelay(1)` and +`std::this_thread::sleep_for(10ms)` is executed, it calls `vTaskDelay(1),` and the thread will sleep between 0 and 10 milliseconds. It will usually sleep for less than the time specified. In IDF v5, calling `std::this_thread::sleep_for(10ms)` almost always calls -`usleep()` _twice_. The first time will use `vTaskDelay(1)` and it will usually +`usleep()` _twice_. The first time will use `vTaskDelay(1)`, and it will usually sleep for less than the time specified. Then, back in `libstdc++` `__sleep_for()`, the monotonic clock will be checked and it will be seen that some fractional component of 10 milliseconds remains, causing a second call to @@ -539,48 +537,48 @@ Let's break it down further. For a 10 millisecond delay: But once again, saying `vTaskDelay(1)` means "wait for 1 tick interrupt." That will almost always be less than 10 milliseconds from now. When `vTaskDelay(1)` -returns, somewhere between 0.000001 to 10.000000 milliseconds will have actually -transpired. The newer version of `__sleep_for()` in IDF v5 / GCC 13 will double -check that the correct time has actually elapsed according to the monotonic -clock. That check will fail, so it calls `usleep()` again, this time with the -remainder of the 10 milliseconds. The specified time is less than a system tick -period, so the blocking `esp_rom_delay_us()` is now called. +returns, somewhere between 0.000001 and 10.000000 milliseconds will have +actually transpired. The newer version of `__sleep_for()` in IDF v5 / GCC 13 +will double-check that the correct time has actually elapsed according to the +monotonic clock. That check will fail, so it calls `usleep()` again, this time +with the remainder of the 10 milliseconds. The specified time is less than a +system tick period, so the blocking `esp_rom_delay_us()` is now called. So what about time slicing? Even if `esp_rom_delay_us()` blocks, the FreeRTOS scheduler can switch to another task. Firstly, if this thread is of a higher priority, _no lower priorities will ever run_. But even if everything is of the same priority, the CPU will just switch back to the blocking call on the next -round robin, continuing the blocking wait. Which in our current scenario is +round robin, continuing the blocking wait. In our current scenario, this is horribly inefficient, unnecessary, and unexpected. Any call to `sleep_for()` greater than the tick period has this problem because -the tick interrupt is completely asynchronous to the `sleep_for()` call. This -means when the scheduler returns from `vTaskDelay()` some random remainder of -time will be done with `esp_rom_delay_us()` in order to sleep for the _precise_ -amount of time requested. +the tick interrupt is asynchronous to the `sleep_for()` call. This means when +the scheduler returns from `vTaskDelay()` some random remainder of time will be +done with `esp_rom_delay_us()` in order to sleep for the _precise_ amount of +time requested. The new version of `sleep_for()` is much more precise, but it is at the cost of -computing efficiency on the ESP32, because some fraction of the tick period will +computing efficiency on the ESP32 because some fraction of the tick period will be _busy waited_ instead of yielded. That is very bad to do on an MCU. Of course, none of this is transparent to the application code, and I doubt it was something intentional from Espressif. It is just a consequence of upgrading -to a new version of GCC combined with the way IDF implements `usleep()`. +to a new version of GCC combined with how IDF implements `usleep()`. ## Is `usleep()` broken in IDF v5? Did Espressif actually implement `usleep()` wrong? Yes. It needs to be fixed. -For periods at or longer than system tick, `usleep()` can return before the +For periods at or longer than the system tick, `usleep()` can return before the specified time. It shouldn't do that. It must error on the side of sleeping too long to ensure it _never_ sleeps to little. So yes, it is broken in my view. `stdlibc++` isn't to blame. Since `usleep()` is sometimes short by 1 system tick period, we could just add -another tick count to our calculations. That would cause `usleep()` to sometimes -take longer, but never shorter than the time specified. This would prevent -`__sleep_for()` from calling `usleep()` a second time with a faction of a system -tick period, so no more blocking busy wait. +another tick count to our calculations. That would sometimes cause `usleep()` to +take longer, but never shorter than the specified time. This would prevent +`__sleep_for()` from calling `usleep()` a second time with a fraction of a +system tick period, so no more blocking busy wait. We can be a little more sophisticated by taking a cue from the newer `__sleep_for()` and keep checking the monotonic clock. Something like this: @@ -610,7 +608,7 @@ int usleep(useconds_t us) } ``` -With this approach, small sleep times are precise (as they always were), and +With this approach, short sleep times are precise (as they always were), and longer sleep times are non-blocking and always as long or longer than the requested time. By testing the monotonic clock, the extra tick is only applied when needed. @@ -618,10 +616,10 @@ when needed. I opened a [PR](https://github.com/espressif/esp-idf/pull/15132) with this change to IDF. Hopefully it gets approved. -Since I didn't want to work with a custom, patched IDF, I ended up replacing all -calls to `std::this_thread::sleep_for()` with our own function. It has the same -default signature, with the optional ability to specify a "sleep strategy." We -can now force the custom `sleep_for()` to busy wait or to yield: +Since I didn't want to work with a custom, patched IDF, I replaced all calls to +`std::this_thread::sleep_for()` with our own function. It has the same default +signature, with the option to specify a "sleep strategy." We can now force the +custom `sleep_for()` to busy wait or to yield: ```c++ enum class SleepStrategy { @@ -675,12 +673,12 @@ while allowing more control over how to perform the sleep when using C++. I cut my teeth on bare metal C code where _everything_ was statically allocated. No `malloc()`. No floating point math because there was no FPU. Custom linker scripts. Debugging using GPIO pins and an oscilloscope. Using precalculated -value tables to save a few microseconds in an ISR. We ran at 24 Mhz. At that -time, C was luxurious because the "old" stuff was running at 1 Mhz usually in +value tables to save a few microseconds in an ISR. We ran at 24 MHz. At that +time, C was luxurious because the "old" stuff was running at 1 MHz, usually in assembler. Back then common wisdom was that C++ was just not appropriate for -microcontrollers. Of course that was a long time ago, before "modern C++" and +microcontrollers. Of course, that was a long time ago, before "modern C++" and before MCUs were clocked at hundreds of MHz with L1 caching and multistage instruction pipelines.