Fix gfx1250 FPSan payload and mulmod paths#20
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses two gfx1250 FPSan correctness/performance issues encountered under RocJITsu: (1) excessive time spent in compiler-emitted 128-bit remainder helper loops during modular multiplication for small moduli, and (2) incorrect handling of packed fp4/fp6 sub-byte storage payloads when widening during unpack paths.
Changes:
- Add a payload-preserving
subbyte_widen_to<...>()helper to widen fp4/fp6 storage codes into the destinationValue<...>without applying an algebraic cast. - Optimize
alg_mulmod()to use a reduced 64-bit multiply/mod path when the modulus fits in 32 bits, preserving the existing 128-bit path for true 64-bit moduli. - Update gfx1250 cvt_scale unpack implementations for fp4/fp6 to use the new payload-preserving widening helper.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| include/fpsan/detail/subbyte_widen.hpp | Adds subbyte_widen_to() to preserve sign-extended storage payload bits when widening into a destination Value. |
| include/fpsan/detail/algebraic.hpp | Optimizes modular multiplication for small moduli to avoid 128-bit remainder helper slow paths. |
| include/fpsan/amdgcn_cvt.hpp | Routes fp4/fp6 scaled-unpack FPSan-family paths through subbyte_widen_to() to preserve storage payload semantics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
This looks good (and Codex agrees) but @lua1235 please do a follow-up to:
- Sweep the entire codebase for other instances of unnecessary 128-bit arithmetic that could be avoided in common cases thanks to early-returns like this.
- Make unit-test changes to cover any new early-return paths.
- Taking a step back: the need for this runtime logic only exists because the current design is not taking advantage of the fact that certain quantities are compile-time constants. That is, we have been very conservative with templatization, and I wonder if that was a mistake: now we have more expensive runtime code causing overhead and the issues that Jakub dealt with here. So as a significantly bigger project, @lua1235 could explore a substantial redesign introducing more templatization such that these paths become dispatched at compile-time. Concretely in the specific instance here, n could be a compile-time constant, and the specific types here like u64 would be replaced by Element-type-specific, Semantics-specific types.
EDIT: rereading algebraic.hpp, I see some constexpr keywords -- some functions such as
FPSAN_HOST_DEVICE constexpr AlgModulus alg_modulus(AlgVariant v, unsigned w)are constexpr , meaning that even though we may get an AlgModulus as the return value from that function, it may still be a compile-time constant, which would mean that Jakub's early return evaluates at compile-time. If that's the case, that's pretty good --- much lighter syntax than templatization --- but it's still possible that templatization would still provide higher benefits if it allows, for instance, to avoid using u64 unconditionally when narrow element types don't need it.
My project suggestion becomes more of an audit: assess what is already being evaluated at compile time, what the resulting code looks like, and what if any is the room for improvement.
FWIW, constexpr functions can also be used at runtime, which makes some constexpr annotations misleading if they are never used in a context producing compile-time constants. |
|
Yes, and assessing which case we are falling into here --- which of these constexpr functions are actually evaluating at compile time --- is a key part of the audit I'd suggest doing now. |
Found while running the full hip-fpsan suite under RocJITsu:
alg_mulmod(). For small moduli, we now reduce the operands and use a 64-bit multiply/mod path, while preserving the existing 128-bit path for true 64-bit cases.With these fixes, the rocjitsu-wrapped gfx1250 hip-fpsan suite passes: 783/783 tests.