Skip to content

Fix gfx1250 FPSan payload and mulmod paths#20

Merged
kuhar merged 1 commit into
mainfrom
users/kuhar/fix-payload-mulmod
Jun 19, 2026
Merged

Fix gfx1250 FPSan payload and mulmod paths#20
kuhar merged 1 commit into
mainfrom
users/kuhar/fix-payload-mulmod

Conversation

@kuhar

@kuhar kuhar commented Jun 19, 2026

Copy link
Copy Markdown
Member

Found while running the full hip-fpsan suite under RocJITsu:

  1. SWMMAC FPSan tests appeared to hang; tracing showed they were making progress, but spending excessive time in compiler-emitted 128-bit remainder helper loops from alg_mulmod(). For small moduli, we now reduce the operands and use a 64-bit multiply/mod path, while preserving the existing 128-bit path for true 64-bit cases.
  2. The remaining fp6 conversion failure came from unpacking sub-byte storage payloads and then applying an algebraic cast. fp4/fp6 are storage-only formats here, so the FPSan path should preserve the sign-extended storage payload bits in the destination value. The unpack helpers now route through a payload-preserving widening helper instead.

With these fixes, the rocjitsu-wrapped gfx1250 hip-fpsan suite passes: 783/783 tests.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses two gfx1250 FPSan correctness/performance issues encountered under RocJITsu: (1) excessive time spent in compiler-emitted 128-bit remainder helper loops during modular multiplication for small moduli, and (2) incorrect handling of packed fp4/fp6 sub-byte storage payloads when widening during unpack paths.

Changes:

  • Add a payload-preserving subbyte_widen_to<...>() helper to widen fp4/fp6 storage codes into the destination Value<...> without applying an algebraic cast.
  • Optimize alg_mulmod() to use a reduced 64-bit multiply/mod path when the modulus fits in 32 bits, preserving the existing 128-bit path for true 64-bit moduli.
  • Update gfx1250 cvt_scale unpack implementations for fp4/fp6 to use the new payload-preserving widening helper.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
include/fpsan/detail/subbyte_widen.hpp Adds subbyte_widen_to() to preserve sign-extended storage payload bits when widening into a destination Value.
include/fpsan/detail/algebraic.hpp Optimizes modular multiplication for small moduli to avoid 128-bit remainder helper slow paths.
include/fpsan/amdgcn_cvt.hpp Routes fp4/fp6 scaled-unpack FPSan-family paths through subbyte_widen_to() to preserve storage payload semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@bjacob bjacob left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good (and Codex agrees) but @lua1235 please do a follow-up to:

  • Sweep the entire codebase for other instances of unnecessary 128-bit arithmetic that could be avoided in common cases thanks to early-returns like this.
  • Make unit-test changes to cover any new early-return paths.
  • Taking a step back: the need for this runtime logic only exists because the current design is not taking advantage of the fact that certain quantities are compile-time constants. That is, we have been very conservative with templatization, and I wonder if that was a mistake: now we have more expensive runtime code causing overhead and the issues that Jakub dealt with here. So as a significantly bigger project, @lua1235 could explore a substantial redesign introducing more templatization such that these paths become dispatched at compile-time. Concretely in the specific instance here, n could be a compile-time constant, and the specific types here like u64 would be replaced by Element-type-specific, Semantics-specific types.

EDIT: rereading algebraic.hpp, I see some constexpr keywords -- some functions such as

FPSAN_HOST_DEVICE constexpr AlgModulus alg_modulus(AlgVariant v, unsigned w)

are constexpr , meaning that even though we may get an AlgModulus as the return value from that function, it may still be a compile-time constant, which would mean that Jakub's early return evaluates at compile-time. If that's the case, that's pretty good --- much lighter syntax than templatization --- but it's still possible that templatization would still provide higher benefits if it allows, for instance, to avoid using u64 unconditionally when narrow element types don't need it.

My project suggestion becomes more of an audit: assess what is already being evaluated at compile time, what the resulting code looks like, and what if any is the room for improvement.

@kuhar

kuhar commented Jun 19, 2026

Copy link
Copy Markdown
Member Author

are constexpr , meaning that even though we may get an AlgModulus as the return value from that function, it may still be a compile-time constant, which would mean that Jakub's early return evaluates at compile-time.

FWIW, constexpr functions can also be used at runtime, which makes some constexpr annotations misleading if they are never used in a context producing compile-time constants.

@kuhar kuhar merged commit c55cbaa into main Jun 19, 2026
1 check passed
@bjacob

bjacob commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Yes, and assessing which case we are falling into here --- which of these constexpr functions are actually evaluating at compile time --- is a key part of the audit I'd suggest doing now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants