m-mcgowan left a comment
An excellent piece of work and fantastic to have this feature available! We will release it in 0.8.0-rc.2, so please ensure corresponding docs are available. In general, I'd like to see more source code comments describing why things are done the way they are - I've added some inline comments for areas that I feel need some explanation.
os_thread_t os_thread_current();
typedef struct {
    uint8_t reserved;
do we need a size field here for extensibility?
That's a question in the original PR as well :)
It's currently only used within a single module and is not exported, so that shouldn't be a concern at the moment. If we later decide to export it, the first reserved field becomes a version field, as usual.
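To make the extensibility question concrete, here is a minimal sketch of the size-field approach the reviewer is asking about. It is not code from this PR: `example_opts_t` and `example_opts_load` are hypothetical names, and the fields are invented for illustration. The callee reads the caller-supplied size first and copies only the part of the struct that both sides know about.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical exported struct; not the struct from this PR. */
typedef struct {
    uint16_t size;     /* set by the caller to sizeof(its view of the struct) */
    uint16_t flags;
    uint32_t timeout;  /* imagined field added in a later revision */
} example_opts_t;

/* Copies the part of the caller's struct that both sides understand and
 * leaves any newer fields zeroed. Returns 0 on success, -1 on bad input. */
int example_opts_load(example_opts_t* dst, const void* src) {
    uint16_t src_size;
    size_t n;

    memset(dst, 0, sizeof(*dst));
    if (src == NULL) {
        return -1;
    }
    /* The size field is always first, so it can be read before anything else. */
    memcpy(&src_size, src, sizeof(src_size));
    if (src_size < sizeof(src_size)) {
        return -1;
    }
    n = src_size < sizeof(*dst) ? src_size : sizeof(*dst);
    memcpy(dst, src, n);
    dst->size = (uint16_t)sizeof(*dst);  /* normalize to the callee's layout */
    return 0;
}
```

A caller built against the older, shorter layout passes a smaller `size`, and the callee sees `timeout` as zero. A single version byte, as proposed in the reply, achieves a similar effect with less overhead as long as every layout change bumps the version.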
# C compiler flags
CFLAGS += -g3 -gdwarf-2 -Os -mcpu=cortex-m3 -mthumb
CFLAGS += -g3 -gdwarf-2 -Os -mcpu=cortex-m3 -mthumb -fomit-frame-pointer
could you say a few words about the omit-frame-pointer option. is it only an optimization or does it have a functional purpose?
This optimization is already enabled by -Os (actually by -O1, see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html), but it's better to explicitly specify it.
Disabling this option will lead to increased flash and stack usage (at the very least +16 bytes on each function call without tail-call optimization) and will interfere with the stacktracer implemented in this PR. MbedTLS bignum assembly optimizations will also fail to build, because R7 would then hold the frame pointer and would not be available to them.
#include <malloc.h>
#include "timer_hal.h"
extern "C" void vTaskGetStackInfo( TaskHandle_t pxTask, void** stack_ptr, void** start_stack_ptr, void** end_stack_ptr );
Can this be pushed to a FreeRTOS header?
    return 0;
}
os_result_t os_thread_dump(os_thread_t thread, os_thread_dump_callback_t callback, void* ptr)
We should document that the callback is executed in a critical section and must therefore complete independently of the progress of other threads.
👍
TracerService also only calls it with interrupts disabled.
int count = 0;
for (uint32_t* sp = (uint32_t*)info->stack_start; sp <= (uint32_t*)info->stack_end; sp++) {
Somewhat documented blackmagic :)
} else {
    // Insert new at the end
    if (freeSpace() < (sizeof(ThreadEntry) + size(chkpt) + strnlen(info->name, maxThreadNameLength) + 1)) {
        return TRACER_ERROR_NO_SPACE;
should we use a circular buffer so that the most recent entries are preserved?
That's a good idea; however, in theory we should not run out of space easily. There are at least 1K bytes (~700 on Electron) of system retained RAM, and we do not have that many threads (especially on Electron).
We do not generate full stacktraces on every call (only when needed, e.g. on a USB request or in the hardfault/watchdog ISR), and every thread entry only has a header struct (8 bytes) + its name (up to 16 bytes) + a stored checkpoint (4 bytes).
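The figures in the reply above can be turned into a rough capacity estimate. The sketch below is only a back-of-the-envelope calculation, not code from the PR: the constant and function names are hypothetical, it assumes the 16-byte name limit already includes the terminator, and it assumes each entry stores a single 4-byte instruction address checkpoint (textual checkpoints would take more space).

```c
#include <stddef.h>

/* Sizes quoted in the review reply; the names are hypothetical. */
enum {
    ENTRY_HEADER_BYTES = 8,   /* per-thread header struct */
    MAX_NAME_BYTES     = 16,  /* thread name, assumed to include the terminator */
    CHECKPOINT_BYTES   = 4    /* one instruction address checkpoint */
};

/* Worst-case size of a single thread entry in retained RAM. */
size_t worst_case_entry_bytes(void) {
    return ENTRY_HEADER_BYTES + MAX_NAME_BYTES + CHECKPOINT_BYTES;
}

/* Number of worst-case thread entries that fit into the given budget. */
size_t max_thread_entries(size_t retained_bytes) {
    return retained_bytes / worst_case_entry_bytes();
}
```

Under these assumptions an entry takes at most 28 bytes, so 1024 bytes hold 36 entries and 700 bytes hold 25. Full stack traces need additional space on top of this, which is where a circular buffer might still become relevant.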
# platforms.h
ifeq ($(PLATFORM_ID),3)
INCLUDE_DIRS += $(PROJECT_ROOT)/platform/shared/inc
INCLUDE_DIRS += $(PROJECT_ROOT)/platform/MCU/gcc/inc
This is unexpected - could you explain why the gcc platform is brought in for all user apps?
Only for GCC platform: ifeq ($(PLATFORM_ID),3)
BUILD_PATH_EXT=$(COMMUNICATION_BUILD_PATH_EXT)
DEPENDENCIES = hal dynalib services wiring crypto
DEPENDENCIES = hal dynalib services wiring crypto platform
I imagine this is because of the traces in the logs?
Correct. platform_tracer.h mainly.
system_part3_ram_end = 0x2001D800 /* 0x20200000-10K */;
system_part3_ram_start = 0x2001c000 /* end of SRAM - 16K */;
system_part3_ram_end = 0x2001D800 - 1K /* 0x20200000-10K-1K */;
what's the 1K reserved for?
At the time of writing, system-part3 RAM was overflowing; this change might no longer be needed.
int tracer_save_checkpoint__(tracer_checkpoint_t* chkpt, uint32_t flags, void* reserved) {
    if (callbacks2.tracer_save_checkpoint) {
        return callbacks2.tracer_save_checkpoint(chkpt, flags, reserved);
Can you explain please why we need two distinct callback stores?
One for system-part1, the other for system-part3. Ideally these should be moved into an .inc file and built as part of module-specific source files. Alternatively, both could be enabled simultaneously and built as weak symbols, as otherwise they will probably cause issues with non-clean builds. 👍
…ed_system section size
…hen built by C (as opposed to C++) compiler. Replace const variables like OS_THREAD_INVALID_HANDLE in concurrent_hal.h with defines when being built by C compiler
…d of passing an integer argument to __builtin_return_address
…opy of previous trace data
8e40494 to
95628c0
submission notes
Problem
This is a rebase of #1369 on develop with renaming according to [CH8404] and some additional backup SRAM overflow checks.
Solution
When a device hard faults, crashes, freezes or is otherwise misbehaving, it is presently difficult to know what the last action the device took was and whether the fault lies in application code, system code, peripheral code, etc.
A checkpoints API would allow application and system firmware to record their current execution progress and provide an indication of where execution halted prior to the crash. This information is then published to the cloud on next successful connection.
It is also possible to get a stack trace of all threads using a naive stack unwinder that simply scans through each thread's stack, looking for flash addresses (within the system and user part bounds), and checks whether there is a branch instruction at that location.
It would be useful to store per-thread stack traces in addition to checkpoint info whenever the device enters a hardfault or a panic state, or on demand. This will provide a better overview of the system state and help in understanding the cause of a crash or deadlock for difficult-to-reproduce issues.
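The scanning idea can be sketched as follows. This is an illustration under stated assumptions, not the implementation from this PR: `flash_region_t` and `scan_stack` are hypothetical names, flash contents are passed in as an array so the sketch runs on a host, and "branch instruction at that location" is interpreted as "the instruction just before the candidate return address is a Thumb `BL` or `BLX <Rm>`".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one flash region (e.g. a system or user part).
 * The code is passed as an array so that the sketch can run on a host. */
typedef struct {
    uint32_t base;         /* address of code[0] */
    uint32_t size;         /* region length in bytes */
    const uint16_t* code;  /* Thumb halfwords stored in the region */
} flash_region_t;

/* Thumb-2 BL: 11110xxxxxxxxxxx followed by 11x1xxxxxxxxxxxx. */
static int is_bl(uint16_t hw1, uint16_t hw2) {
    return (hw1 & 0xF800u) == 0xF000u && (hw2 & 0xD000u) == 0xD000u;
}

/* Thumb BLX <Rm>: 010001111mmmm000. */
static int is_blx_reg(uint16_t hw) {
    return (hw & 0xFF87u) == 0x4780u;
}

/* Returns 1 if `word` looks like a return address into region `r`. */
static int looks_like_return_address(const flash_region_t* r, uint32_t word) {
    uint32_t addr, off, idx;
    if ((word & 1u) == 0) {
        return 0;  /* Thumb return addresses have bit 0 set */
    }
    addr = word & ~1u;
    if (addr < r->base) {
        return 0;
    }
    off = addr - r->base;
    if (off < 2 || off > r->size) {
        return 0;
    }
    idx = off / 2;  /* halfword index of the return target */
    if (is_blx_reg(r->code[idx - 1])) {
        return 1;
    }
    return idx >= 2 && is_bl(r->code[idx - 2], r->code[idx - 1]);
}

/* Scans `words` stack words and stores plausible return addresses in `out`.
 * Returns the number of addresses stored (at most `max_out`). */
size_t scan_stack(const uint32_t* stack, size_t words,
                  const flash_region_t* regions, size_t nregions,
                  uint32_t* out, size_t max_out) {
    size_t count = 0;
    assert(stack != NULL && regions != NULL && out != NULL);
    for (size_t i = 0; i < words && count < max_out; i++) {
        for (size_t r = 0; r < nregions; r++) {
            if (looks_like_return_address(&regions[r], stack[i])) {
                out[count++] = stack[i] & ~1u;
                break;
            }
        }
    }
    return count;
}
```

Because nothing is actually unwound, stale return addresses left over from earlier calls can show up in the result, and a data word that happens to match the pattern produces a false positive. That is the price of not needing frame pointers or unwind tables.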
System code may use the following macros defined in tracer_service.h:
- TRACER_CHECKPOINT() - regular instruction address type checkpoint
- TRACER_PANIC_CHECKPOINT() - special instruction address type checkpoint which should only be called from a panic handler, which will prevent further modification of diagnostic data
- TRACER_CRASH_CHECKPOINT(pc) - special instruction address type checkpoint which should only be called from a crash (hardfault) handler, which will prevent further modification of diagnostic data. An instruction address needs to be manually passed as an argument.
- TRACER_UPDATE() - forces a full system state update (including stacktraces).
All LOG statements, except for LOG_DUMP, make a call to TRACER_CHECKPOINT().
Both hardfault and panic handlers make a call to their respective macros: TRACER_CRASH_CHECKPOINT(), TRACER_PANIC_CHECKPOINT()
Application code may use the following macros defined in spark_wiring_tracer.h:
- CHECKPOINT() - a standard variant of the CHECKPOINT() macro. Its implementation depends on the preprocessor macro TRACER_ELF_AVAILABLE: the CHECKPOINT() macro either internally calls TRACER_CHECKPOINT(), making it an instruction address type checkpoint, or internally makes a call to TRACER_TEXT_CHECKPOINT(), which saves the location of the checkpoint in textual format: __FILENAME__:__LINE__
- CHECKPOINT(text) - a specialized variant of the CHECKPOINT() macro, which forces a textual checkpoint with the text passed as an argument
All Logger class calls, except for print, printf, write and dump, internally include a call to the _LOG_CHECKPOINT() macro defined in spark_wiring_diagnostic.h, which is only enabled when TRACER_ELF_AVAILABLE == 1. It acquires the address of the calling function and uses that as the checkpoint instruction address.
Steps to Test
app/checkpoint (see README.md)
Example App
app/checkpoint
References
Completeness