Skip to content

Commit

Permalink
Add grammar pass
Browse files Browse the repository at this point in the history
  • Loading branch information
gminn committed Feb 19, 2025
1 parent 16174ae commit 4f01199
Show file tree
Hide file tree
Showing 2 changed files with 71 additions and 73 deletions.
2 changes: 1 addition & 1 deletion _data/authors.yml
Original file line number Diff line number Diff line change
Expand Up @@ -386,4 +386,4 @@ stevenoonan:
linkedin: https://www.linkedin.com/in/embeddeddev/
image: /img/author/stevenoonan.png
blurb:
is Principal Embedded Systems Engineer at Nomo International, Inc.
is a Principal Embedded Systems Engineer at Nomo International, Inc.
142 changes: 70 additions & 72 deletions _posts/2025-02-20-why-sleep-for-is-broken-on-esp32.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,8 +32,8 @@ at version v4.4 of Espressif's
(IDF). After about two days of work, the project compiled and ran in IDF v5.3.
Except, something was not quite right. Things were sluggish. FreeRTOS's
[system state API](https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/freertos_idf.html#_CPPv420uxTaskGetSystemStatePC12TaskStatus_tK11UBaseType_tPC27configRUN_TIME_COUNTER_TYPE)
indicated that one the two Xtensa LX6 processors on the ESP32 was pinned at 100%
usage. The culprits were two high priority threads, both of which had a
indicated that one of the two Xtensa LX6 processors on the ESP32 was pinned at
100% usage. The culprits were two high-priority threads, both of which had a
relatively small loop with a 10 millisecond sleep between each iteration.

Sleeping for 10 milliseconds should leave plenty of time to do other things.
Expand All @@ -42,7 +42,7 @@ Plus, this code worked fine in IDF v4 without an issue. What did IDF v5 do?
## FreeRTOS on the ESP32

Let's go into some background first. Experienced embedded engineers will likely
find parts of this section as review, but we need to be solid in our
find parts of this section as a review, but we need to be solid in our
understanding of the fundamentals before moving on.

### System Tick Period
Expand Down Expand Up @@ -104,14 +104,14 @@ period of 10 milliseconds. Exactly how much time the thread will be sleep is
unknown to this code.
So to be clear: sleeping _"1 tick"_ does not mean it will sleep for one tick
_period of time_. In fact, it will basically always sleep some amount less than
one tick period, and sometimes it will basically not sleep at all.
_period of time_. In fact, it will always sleep some amount less than one tick
period, and sometimes it will not sleep at all.
So then, is calling `vTaskDelay(1)` in a small loop a problem? Not
intrinsically. Like most things, it's fine if you know what you are doing. If
you want to ensure a thread (technically called a
[Task](https://www.freertos.org/Documentation/02-Kernel/02-Kernel-features/01-Tasks-and-co-routines/00-Tasks-and-co-routines)
in FreeRTOS) exectues its logic is performed on every tick, this a way to do it.
in FreeRTOS) executes its logic is performed on every tick, this a way to do it.
There are no problems so long as the logic is not sensitive to tick jitter and
its processing only requires a small fraction of the tick period.

Expand Down Expand Up @@ -141,9 +141,9 @@ what we want, because passing `0` to `vTaskDelay()` does nothing at all.

Okay, but what if we _really_ need to wait for less than a tick period? We can
[busy wait](https://en.wikipedia.org/wiki/Busy_waiting). The IDF has
`esp_rom_delay_us()` which does just that. But this is an
`esp_rom_delay_us()`, which does just that. But this is an
[Internal API](https://docs.espressif.com/projects/esp-idf/en/stable/esp32/api-reference/system/internal-unstable.html)
which we don't have source code for, and it's functionality may change or be
which we don't have source code for, and its functionality may change or be
removed in the future. ESP does give us `usleep()` though, and that allows
precise delays. The caveat here is that the FreeRTOS scheduler is still free to
context switch out of the busy loop due to the important bit I said above about
Expand All @@ -152,7 +152,7 @@ the FreeRTOS scheduler (and perhaps all interrupts) for that period.

The other option is to use a free running timer and trigger an interrupt when it
expires. The interrupt can signal FreeRTOS to wake the waiting task. This is
great so long as your context switch overheads allow it, and you have a free
great as long as your context switch overheads allow it and you have a free
hardware timer to do it.

All of this is to say the following:
Expand All @@ -165,7 +165,7 @@ All of this is to say the following:

Alright then, so what does all this have to do with our issue related to a 10
millisecond delay? Well, I am not actually using `vTaskDelay(1)` directly.
Instead I am using C++ thread APIs, and therefore the actual code causing all
Instead, I am using C++ thread APIs, and therefore, the actual code causing all
the CPU usage is `std::this_thread::sleep_for(10ms)`.

Espressif provides
Expand All @@ -177,7 +177,7 @@ So how does something like an ESP32 MCU support C++ threads anyway?

### `std::thread`, POSIX Threads, and FreeRTOS Tasks, oh my!

Espressif's IDF compiles with GCC, and therefore uses
Espressif's IDF compiles with GCC and therefore uses
[The GNU C++ Library](https://gcc.gnu.org/onlinedocs/libstdc++/) (`libstdc++`).
`libstdc++`'s implementation of `std::thread` is built on top of
[`pthreads`](https://en.wikipedia.org/wiki/Pthreads), but it is up to the
Expand All @@ -195,7 +195,7 @@ If that's not enough, we also need to consider the
[C standard library](https://en.wikipedia.org/wiki/C_standard_library). IDF uses
[newlib](https://en.wikipedia.org/wiki/Newlib) and has custom platform
implementations of its sleeping functions. Thankfully, `newlib` is not
precompiled and you can easily find it's source code in the `components/newlib`
precompiled and you can easily find its source code in the `components/newlib`
directory of an IDF install.

My problem appears in IDF v5 because it upgraded GCC from 8.4 to 13.2, and
Expand Down Expand Up @@ -236,7 +236,7 @@ If you are following along in your own IDF project, you can find the
`xtensa-esp32-elf/esp-2021r2-patch5-8.4.0/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/8.4.0/thread`
(or similar, depending on the version).
IDF does not define `_GLIBCXX_USE_NANOSLEEP`, so, no `nanosleep()`. That
IDF does not define `_GLIBCXX_USE_NANOSLEEP`, so no `nanosleep()`. That
shouldn't be too surprising for an MCU implementation. Therefore, we go straight
to `__sleep_for(__s, __ns)`.
Expand Down Expand Up @@ -304,11 +304,11 @@ Let's clean it up by taking out the parts not compiled for the ESP32:
Pretty simple: get the number of seconds and call `sleep()`, then get the number
of microseconds and call `usleep()`. Both of these are calls into `newlib`. We
will get to those in a moment. First let's look at the IDF v5's `sleep_for()`.
will get to those in a moment. First, let's look at the IDF v5's `sleep_for()`.
### IDF 5.x: `std::this_thread::sleep_for()`
Once again `sleep_for()` is a wrapper for `__sleep_for()`. Here is the newer
Once again, `sleep_for()` is a wrapper for `__sleep_for()`. Here is the newer
code
([see online](https://github.com/gcc-mirror/gcc/blob/1c0ecf06c0c9f3268ba52d4027e9b78aae4acb51/libstdc%2B%2B-v3/src/c%2B%2B11/thread.cc#L235)):
Expand Down Expand Up @@ -341,22 +341,21 @@ __sleep_for(chrono::seconds __s, chrono::nanoseconds __ns)
}
```

Now we have a `while` loop. In it `usleep()` is called, but afterwards an
explicit check is made to determine if the minimal amount of time has actually
gone by. The only way out of the loop is if `now` is at or after the `target`
time.
Now we have a `while` loop. In it, `usleep()` is called, but afterward, there is
an explicit check to determine if the minimal amount of time has actually gone
by. The only way out of the loop is if `now` is at or after the `target` time.

Why the change? This is because `nanosleep()` may return early if a signal is
delivered to the thread during its sleep. The change is found on
[commit cfef4c3](https://github.com/gcc-mirror/gcc/commit/cfef4c324ac300c0ad120f0fcee376159de84a0c)
The relevant commit comment is the following:
delivered to the thread during its sleep. The change is in
[commit cfef4c3](https://github.com/gcc-mirror/gcc/commit/cfef4c324ac300c0ad120f0fcee376159de84a0c),
with the relevant commit comment as the following:

```git
Loop while sleep call is interrupted and until steady_clock
shows requested duration has elapsed.
shows requested duration has elapsed.
```

ESP IDF doesn't support signals so the change should have no impact on the
ESP IDF doesn't support signals, so the change should have no impact on the
ESP32. Oh, but it does!

## IDF Newlib
Expand Down Expand Up @@ -403,7 +402,7 @@ int clock_gettime(clockid_t clock_id, struct timespec *tp)
}
```
Calling `esp_time_impl_get_time()` gets a value based on a free running hardware
Calling `esp_time_impl_get_time()` gets a value based on a free-running hardware
timer with at least 1 us accuracy. So that is how the monotonic clock is
implemented on the ESP32. Note that this clock is completely independent of the
timer that is controlling the FreeRTOS system tick interrupt. No problems there.
Expand Down Expand Up @@ -442,15 +441,15 @@ int usleep(useconds_t us)
```
Remember when I said that `esp_rom_delay_us()` is an internal API and we should
use `usleep()` instead? Well `usleep()` is just a wrapper to
use `usleep()` instead? Well, `usleep()` is just a wrapper to
`esp_rom_delay_us()` for short periods (e.g. less than one tick period). We can
consider `usleep()` the "public" exposure of `esp_rom_delay_us()`, but only when
the specified time is less than a system tick period. As mentioned above, this
is a busy wait, and since it does not disable the scheduler, it still allows
other threads _of equal or higher priority_ to run. So the timing represents a
other threads _of equal or higher priority_ to run. So, the timing represents a
guaranteed _minimum_ only. More importantly, if there are other threads of lower
priority it will _not_ context switch during this busy time. It will just sit in
the thread until the wait is over.
priority, it will _not_ context switch during this busy time. It will just sit
in the thread until the wait is over.
This is all good. A guaranteed minimum is how I expect `usleep()` to work.
Expand Down Expand Up @@ -478,25 +477,24 @@ calculations would give us `vTaskDelay(2)`:
4. But since we are using integers, it is just `2`
But what does `2` mean? It means two occurrences of the tick interrupt. Once
again we can ask: when is the next tick interrupt from the moment that
again, we can ask: when is the next tick interrupt from the moment that
`vTaskDelay(2)` is called? The tick interrupt could be in 1 nanosecond all the
way up to 10 milliseconds from now. After that, the second tick will be the
system tick period of 10 milliseconds. So for a tick count of `2`, we will
actually sleep between between 10 and 20 milliseconds.
Despite the fact that the comment says it is rounding up to compensate for the
first tick potentially not blocking at all, the compensation does not account
for the worse case minimal timing. In the example I gave, a 15 millisecond
request will sometimes only sleep for 10 milliseconds. Likewise, a 10
millisecond `usleep()` will sometimes sleep about 0 milliseconds. The greatest
_potential_ differential comes with calling `usleep()` with a multiple of the
tick period. In that case, the actual time slept may be short by an entire tick
period.
Whether this "shorter than specified" time matters depends on requirements of
course. In my experience a lot of application level logic would probably be fine
with a "10 to 20" millisecond wait. Hardware driver logic and audio pipelines on
the other have specific realtime characteristics where this would absolutely be
Even though the comment says it is rounding up to compensate for the first tick
potentially not blocking at all, the compensation does not account for the
worse-case minimal timing. In the example I gave, a 15-millisecond request will
sometimes only sleep for 10 milliseconds. Likewise, a 10 millisecond `usleep()`
will sometimes sleep about 0 milliseconds. The greatest _potential_ differential
comes with calling `usleep()` with a multiple of the tick period. In that case,
the time spent may be short by an entire tick period.
Whether this "shorter than specified" time matters depends on requirements. In
my experience, a lot of application-level logic would probably be fine with a
"10 to 20" millisecond wait. On the other hand, hardware driver logic and audio
pipelines have specific realtime characteristics where this would absolutely be
a problem.
According to `man 3 sleep` and
Expand All @@ -512,19 +510,19 @@ sleep more if needed.
> scheduling of other activity by the system.
The fact that ESP32's IDF `usleep()` can be short by up to one system tick
period means doesn't follow the specification.
period means it doesn't follow the specifications.
## Would You Like Precision or Efficiency?
Let's recap.
In IDF v4, calling `std::this_thread::sleep_for()` calls `usleep()` once. When
`std::this_thread::sleep_for(10ms)` is executed, it calls `vTaskDelay(1)` and
`std::this_thread::sleep_for(10ms)` is executed, it calls `vTaskDelay(1),` and
the thread will sleep between 0 and 10 milliseconds. It will usually sleep for
less than the time specified.
In IDF v5, calling `std::this_thread::sleep_for(10ms)` almost always calls
`usleep()` _twice_. The first time will use `vTaskDelay(1)` and it will usually
`usleep()` _twice_. The first time will use `vTaskDelay(1)`, and it will usually
sleep for less than the time specified. Then, back in `libstdc++`
`__sleep_for()`, the monotonic clock will be checked and it will be seen that
some fractional component of 10 milliseconds remains, causing a second call to
Expand All @@ -539,48 +537,48 @@ Let's break it down further. For a 10 millisecond delay:
But once again, saying `vTaskDelay(1)` means "wait for 1 tick interrupt." That
will almost always be less than 10 milliseconds from now. When `vTaskDelay(1)`
returns, somewhere between 0.000001 to 10.000000 milliseconds will have actually
transpired. The newer version of `__sleep_for()` in IDF v5 / GCC 13 will double
check that the correct time has actually elapsed according to the monotonic
clock. That check will fail, so it calls `usleep()` again, this time with the
remainder of the 10 milliseconds. The specified time is less than a system tick
period, so the blocking `esp_rom_delay_us()` is now called.
returns, somewhere between 0.000001 and 10.000000 milliseconds will have
actually transpired. The newer version of `__sleep_for()` in IDF v5 / GCC 13
will double-check that the correct time has actually elapsed according to the
monotonic clock. That check will fail, so it calls `usleep()` again, this time
with the remainder of the 10 milliseconds. The specified time is less than a
system tick period, so the blocking `esp_rom_delay_us()` is now called.
So what about time slicing? Even if `esp_rom_delay_us()` blocks, the FreeRTOS
scheduler can switch to another task. Firstly, if this thread is of a higher
priority, _no lower priorities will ever run_. But even if everything is of the
same priority, the CPU will just switch back to the blocking call on the next
round robin, continuing the blocking wait. Which in our current scenario is
round robin, continuing the blocking wait. In our current scenario, this is
horribly inefficient, unnecessary, and unexpected.
Any call to `sleep_for()` greater than the tick period has this problem because
the tick interrupt is completely asynchronous to the `sleep_for()` call. This
means when the scheduler returns from `vTaskDelay()` some random remainder of
time will be done with `esp_rom_delay_us()` in order to sleep for the _precise_
amount of time requested.
the tick interrupt is asynchronous to the `sleep_for()` call. This means when
the scheduler returns from `vTaskDelay()` some random remainder of time will be
done with `esp_rom_delay_us()` in order to sleep for the _precise_ amount of
time requested.
The new version of `sleep_for()` is much more precise, but it is at the cost of
computing efficiency on the ESP32, because some fraction of the tick period will
computing efficiency on the ESP32 because some fraction of the tick period will
be _busy waited_ instead of yielded. That is very bad to do on an MCU.
Of course, none of this is transparent to the application code, and I doubt it
was something intentional from Espressif. It is just a consequence of upgrading
to a new version of GCC combined with the way IDF implements `usleep()`.
to a new version of GCC combined with how IDF implements `usleep()`.
## Is `usleep()` broken in IDF v5?
Did Espressif actually implement `usleep()` wrong? Yes. It needs to be fixed.
For periods at or longer than system tick, `usleep()` can return before the
For periods at or longer than the system tick, `usleep()` can return before the
specified time. It shouldn't do that. It must error on the side of sleeping too
long to ensure it _never_ sleeps to little. So yes, it is broken in my view.
`stdlibc++` isn't to blame.
Since `usleep()` is sometimes short by 1 system tick period, we could just add
another tick count to our calculations. That would cause `usleep()` to sometimes
take longer, but never shorter than the time specified. This would prevent
`__sleep_for()` from calling `usleep()` a second time with a faction of a system
tick period, so no more blocking busy wait.
another tick count to our calculations. That would sometimes cause `usleep()` to
take longer, but never shorter than the specified time. This would prevent
`__sleep_for()` from calling `usleep()` a second time with a fraction of a
system tick period, so no more blocking busy wait.
We can be a little more sophisticated by taking a cue from the newer
`__sleep_for()` and keep checking the monotonic clock. Something like this:
Expand Down Expand Up @@ -610,18 +608,18 @@ int usleep(useconds_t us)
}
```

With this approach, small sleep times are precise (as they always were), and
With this approach, short sleep times are precise (as they always were), and
longer sleep times are non-blocking and always as long or longer than the
requested time. By testing the monotonic clock, the extra tick is only applied
when needed.

I opened a [PR](https://github.com/espressif/esp-idf/pull/15132) with this
change to IDF. Hopefully it gets approved.

Since I didn't want to work with a custom, patched IDF, I ended up replacing all
calls to `std::this_thread::sleep_for()` with our own function. It has the same
default signature, with the optional ability to specify a "sleep strategy." We
can now force the custom `sleep_for()` to busy wait or to yield:
Since I didn't want to work with a custom, patched IDF, I replaced all calls to
`std::this_thread::sleep_for()` with our own function. It has the same default
signature, with the option to specify a "sleep strategy." We can now force the
custom `sleep_for()` to busy wait or to yield:

```c++
enum class SleepStrategy {
Expand Down Expand Up @@ -675,12 +673,12 @@ while allowing more control over how to perform the sleep when using C++.
I cut my teeth on bare metal C code where _everything_ was statically allocated.
No `malloc()`. No floating point math because there was no FPU. Custom linker
scripts. Debugging using GPIO pins and an oscilloscope. Using precalculated
value tables to save a few microseconds in an ISR. We ran at 24 Mhz. At that
time, C was luxurious because the "old" stuff was running at 1 Mhz usually in
value tables to save a few microseconds in an ISR. We ran at 24 MHz. At that
time, C was luxurious because the "old" stuff was running at 1 MHz, usually in
assembler.

Back then common wisdom was that C++ was just not appropriate for
microcontrollers. Of course that was a long time ago, before "modern C++" and
microcontrollers. Of course, that was a long time ago, before "modern C++" and
before MCUs were clocked at hundreds of MHz with L1 caching and multistage
instruction pipelines.

Expand Down

0 comments on commit 4f01199

Please sign in to comment.