Skip to content

perf: reduce CPU load on F051 (and other low-end MCUs)#379

Draft
AlexKlimaj wants to merge 1 commit into
am32-firmware:mainfrom
AlexKlimaj:f051-cpu-optimizations
Draft

perf: reduce CPU load on F051 (and other low-end MCUs)#379
AlexKlimaj wants to merge 1 commit into
am32-firmware:mainfrom
AlexKlimaj:f051-cpu-optimizations

Conversation

@AlexKlimaj

Copy link
Copy Markdown
Contributor

Motivation

The F051 (Cortex-M0, 48 MHz, no hardware divider, 1 flash wait-state) runs out of CPU cycles at high eRPM with bidirectional DShot. This PR removes the major cycle sinks in the interrupt hot paths. Most changes benefit every MCU; a few are F051-specific.

Shared-code changes (all MCUs)

  • Q16 duty-cycle scaling in tenKhzRoutine(duty_cycle * tim1_arr) / 2000 ran a ~50–200 cycle soft division (plus the prop-brake/active-brake variants) every 50 µs tick. Now a Q16 reciprocal multiply ((duty * scale) >> 16); the scale factor is recomputed in the main loop at idle priority only when tim1_arr changes. Verified within 1 timer count of the exact division for all duty values and ARR up to 6012 (max with 8 kHz PWM setting).
  • Throttle→duty slopes in setInput — the two hot map() calls (recursive binary interpolation, ~10 nested calls each, runs at DShot frame rate) replaced with startup-computed Q16 straight lines. Note this is more linear than before: the old map() deviated up to 5/2000 (0.25 % throttle) from a true straight line; servo-input and low-rate map() calls are unchanged.
  • Single-pass DShot decodecomputeDshotDMA packed pulses into a 16-entry int array and re-read it three times to build CRC and value. Now a single packed word with the same CRC/value math (bit-for-bit equivalent, incl. the inverted-CRC bidirectional case). Frees 64 B RAM.
  • Current filternumReadings was a const variable, forcing a runtime soft division per 1 kHz sample. Now a #define so the compiler emits a multiply sequence. Window stays 50 samples, zero behavior change.
  • NVIC priority writes — the DShot priority swap wrote 3–4 NVIC registers every main-loop iteration; now only on actual state change (first pass always writes, preserving boot behavior).
  • LTO enabled (-flto) — inlines hot helpers across translation units (getCompOutputLevel, comStep, commutate, LL wrappers); the whole firmware already compiles+links in one gcc invocation so no build restructuring was needed. Opt-outs:
    • _CAN builds: gcc 10 LTO trips -Werror=maybe-uninitialized false positives in canard/dsdl code, and CAN targets have ample flash anyway.
    • E230: -O3 LTO inlining overflows its 27 K flash.

F051-specific changes

  • Hot code runs from RAM (0WS) instead of flash (1WS) — new .ramfunc input section inside .data (loaded by the existing startup copy loop), RAM_FUNC macro in targets.h (empty on other MCUs). In RAM: ADC1_COMP/TIM6_DAC/TIM14 IRQ handlers, interruptRoutine, PeriodElapsedCallback, commutate, comStep, tenKhzRoutine, getBemfState, comparator helpers. ~2.9 KB of code; worst-case F051 target keeps 2.1 KB RAM headroom above the 1.5 KB heap/stack reserve.
  • getCompOutputLevel inlined in comparator.h — it was an out-of-line cross-TU call inside the comparator-ISR zero-cross filter loop (up to filter_level = 12 calls per interrupt, ~15–20 cycles each, now ~4).
  • SysTick interrupt disabled — its handler is empty; the counter keeps running, only the 1 kHz interrupt (and its ISR entry/exit cost) is gone. Nothing on F051 uses SysTick (UTILITY_TIMER is TIM17).

Latent bugs found by LTO type checking

  • Mcu/g431/Src/stm32g4xx_it.c: commutation_interval declared volatile uint16_t, actual definition is volatile uint32_t — the early-zero-cross guard read only half the variable.
  • Mcu/a153/Src/DMA.c: aTxBuffer declared int8_t[18], actual is uint8_t[49].
  • Src/dshot.c: removed unused extern int zero_crosses (actual is volatile uint32_t).
  • filename[] needed __attribute__((used)) to survive LTO into its .file_name section.

Verification

  • All 253 targets build cleanly (make all, exit 0, no region overflows). Worst-case F051 flash: 92.3 % (ARK_4IN1).
  • Disassembly: the 20 kHz handler's per-tick path is division-free (remaining __aeabi_idiv calls are in the 1 kHz PID block only); all tagged functions confirmed at 0x2000xxxx addresses.
  • Math equivalence checked exhaustively host-side: Q16 duty scale within 1 timer count of exact; throttle slope within 5/2000 of old map() (new version is the exact line).

What still needs bench testing (why this is a draft)

  • Comparator filter timing: inlining the comparator read shrinks the zero-cross filter loop's sampling window (~5 µs → ~1.2 µs at filter_level 12). Noise immunity on real hardware should be verified at high eRPM / high current; filter_level mapping may want retuning.
  • RAM-resident ISRs change interrupt latency characteristics (for the better, but worth scoping).
  • Normal motor operation: startup, sine mode, bidirectional DShot telemetry, RC-car reverse, brake-on-stop, current limiting.
  • Not attempted here (possible follow-ups): dropping LOOP_FREQUENCY_HZ back to 10 kHz on F051, PID fixed-point scale change from 10000 to 8192 (would require retuning), extending RAM_FUNC to G031/G071.

🤖 Generated with Claude Code

The F051 at 48MHz was running out of CPU cycles at high eRPM with
bidirectional dshot. This removes the major cycle sinks:

Shared code (all MCUs):
- replace the divide-by-2000 duty cycle scaling in the 20khz routine
  with a Q16 reciprocal multiply, the scale factor is recomputed in the
  main loop only when tim1_arr changes (cortex-m0 has no divider, each
  soft division cost 50+ cycles at 20khz)
- replace the recursive map() calls for throttle to duty cycle in
  setInput with startup-computed Q16 slopes (exact straight line,
  the old binary interpolation deviated up to 5/2000 from linear)
- make the current filter window a compile time constant so the
  divide becomes a multiply sequence
- decode dshot frames in a single pass into one packed word instead of
  a 16-entry int array plus three re-reads (also frees 64B of RAM)
- only write NVIC priorities when the dshot priority scheme actually
  changes instead of every main loop iteration
- enable LTO so hot helpers inline across translation units, CAN
  builds keep it off (gcc 10 maybe-uninitialized false positives in
  canard/dsdl) as does E230 (flash too tight for -O3 LTO inlining)

F051 specific:
- run the commutation hot path from 0WS RAM instead of 1WS flash:
  comparator/commutation/20khz IRQ handlers, commutate, comStep,
  interruptRoutine, PeriodElapsedCallback, tenKhzRoutine via a new
  .ramfunc section copied by the existing startup data init
- inline getCompOutputLevel, it was an out of line call in the
  comparator ISR zero cross filter loop (up to 12 calls per interrupt)
- stop enabling the SysTick interrupt, its handler is empty

Fixes found by LTO type checking:
- g431: commutation_interval extern was uint16_t, actual is uint32_t
- a153: aTxBuffer extern was int8_t[18], actual is uint8_t[49]
- dshot.c: remove unused mismatched zero_crosses extern

All 253 targets build. Worst case F051: flash 92.3%, RAM keeps 2.1KB
headroom above the heap/stack reserve.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant