perf: reduce CPU load on F051 (and other low-end MCUs)#379
Draft
AlexKlimaj wants to merge 1 commit into
Draft
Conversation
The F051 at 48MHz was running out of CPU cycles at high eRPM with bidirectional dshot. This removes the major cycle sinks: Shared code (all MCUs): - replace the divide-by-2000 duty cycle scaling in the 20khz routine with a Q16 reciprocal multiply, the scale factor is recomputed in the main loop only when tim1_arr changes (cortex-m0 has no divider, each soft division cost 50+ cycles at 20khz) - replace the recursive map() calls for throttle to duty cycle in setInput with startup-computed Q16 slopes (exact straight line, the old binary interpolation deviated up to 5/2000 from linear) - make the current filter window a compile time constant so the divide becomes a multiply sequence - decode dshot frames in a single pass into one packed word instead of a 16-entry int array plus three re-reads (also frees 64B of RAM) - only write NVIC priorities when the dshot priority scheme actually changes instead of every main loop iteration - enable LTO so hot helpers inline across translation units, CAN builds keep it off (gcc 10 maybe-uninitialized false positives in canard/dsdl) as does E230 (flash too tight for -O3 LTO inlining) F051 specific: - run the commutation hot path from 0WS RAM instead of 1WS flash: comparator/commutation/20khz IRQ handlers, commutate, comStep, interruptRoutine, PeriodElapsedCallback, tenKhzRoutine via a new .ramfunc section copied by the existing startup data init - inline getCompOutputLevel, it was an out of line call in the comparator ISR zero cross filter loop (up to 12 calls per interrupt) - stop enabling the SysTick interrupt, its handler is empty Fixes found by LTO type checking: - g431: commutation_interval extern was uint16_t, actual is uint32_t - a153: aTxBuffer extern was int8_t[18], actual is uint8_t[49] - dshot.c: remove unused mismatched zero_crosses extern All 253 targets build. Worst case F051: flash 92.3%, RAM keeps 2.1KB headroom above the heap/stack reserve. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
The F051 (Cortex-M0, 48 MHz, no hardware divider, 1 flash wait-state) runs out of CPU cycles at high eRPM with bidirectional DShot. This PR removes the major cycle sinks in the interrupt hot paths. Most changes benefit every MCU; a few are F051-specific.
Shared-code changes (all MCUs)
tenKhzRoutine—(duty_cycle * tim1_arr) / 2000ran a ~50–200 cycle soft division (plus the prop-brake/active-brake variants) every 50 µs tick. Now a Q16 reciprocal multiply ((duty * scale) >> 16); the scale factor is recomputed in the main loop at idle priority only whentim1_arrchanges. Verified within 1 timer count of the exact division for all duty values and ARR up to 6012 (max with 8 kHz PWM setting).setInput— the two hotmap()calls (recursive binary interpolation, ~10 nested calls each, runs at DShot frame rate) replaced with startup-computed Q16 straight lines. Note this is more linear than before: the oldmap()deviated up to 5/2000 (0.25 % throttle) from a true straight line; servo-input and low-ratemap()calls are unchanged.computeDshotDMApacked pulses into a 16-entryintarray and re-read it three times to build CRC and value. Now a single packed word with the same CRC/value math (bit-for-bit equivalent, incl. the inverted-CRC bidirectional case). Frees 64 B RAM.numReadingswas aconstvariable, forcing a runtime soft division per 1 kHz sample. Now a#defineso the compiler emits a multiply sequence. Window stays 50 samples, zero behavior change.-flto) — inlines hot helpers across translation units (getCompOutputLevel,comStep,commutate, LL wrappers); the whole firmware already compiles+links in one gcc invocation so no build restructuring was needed. Opt-outs:_CANbuilds: gcc 10 LTO trips-Werror=maybe-uninitializedfalse positives in canard/dsdl code, and CAN targets have ample flash anyway.-O3LTO inlining overflows its 27 K flash.F051-specific changes
.ramfuncinput section inside.data(loaded by the existing startup copy loop),RAM_FUNCmacro intargets.h(empty on other MCUs). In RAM:ADC1_COMP/TIM6_DAC/TIM14IRQ handlers,interruptRoutine,PeriodElapsedCallback,commutate,comStep,tenKhzRoutine,getBemfState, comparator helpers. ~2.9 KB of code; worst-case F051 target keeps 2.1 KB RAM headroom above the 1.5 KB heap/stack reserve.getCompOutputLevelinlined incomparator.h— it was an out-of-line cross-TU call inside the comparator-ISR zero-cross filter loop (up tofilter_level= 12 calls per interrupt, ~15–20 cycles each, now ~4).UTILITY_TIMERis TIM17).Latent bugs found by LTO type checking
Mcu/g431/Src/stm32g4xx_it.c:commutation_intervaldeclaredvolatile uint16_t, actual definition isvolatile uint32_t— the early-zero-cross guard read only half the variable.Mcu/a153/Src/DMA.c:aTxBufferdeclaredint8_t[18], actual isuint8_t[49].Src/dshot.c: removed unusedextern int zero_crosses(actual isvolatile uint32_t).filename[]needed__attribute__((used))to survive LTO into its.file_namesection.Verification
make all, exit 0, no region overflows). Worst-case F051 flash: 92.3 % (ARK_4IN1).__aeabi_idivcalls are in the 1 kHz PID block only); all tagged functions confirmed at0x2000xxxxaddresses.map()(new version is the exact line).What still needs bench testing (why this is a draft)
filter_level12). Noise immunity on real hardware should be verified at high eRPM / high current;filter_levelmapping may want retuning.LOOP_FREQUENCY_HZback to 10 kHz on F051, PID fixed-point scale change from 10000 to 8192 (would require retuning), extendingRAM_FUNCto G031/G071.🤖 Generated with Claude Code