-
Notifications
You must be signed in to change notification settings - Fork 287
_mm512_reduce_add_ps and friends are setting fast-math flags they should not set #1533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
It is definitely incorrect for these intrinsics to be using the fast-math flag. Here is the LLVM IR that clang generates for this intrinsic: %0 = tail call reassoc noundef float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %x) |
That allows LLVM to add the elements in any order, and also do re-association optimizations when the result is fed into another
(It seems like Intel uses |
Specifically, if I were to do something like let sum = _mm512_reduce_add_ps(a);
let vec = _mm512_set_pd(sum, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0);
let sum2 = _mm512_reduce_add_ps(vec); then in my reading of I don't think this is a correct implementation of the Intel vendor intrinsic. |
I do not believe it is, no. |
@nikic has confirmed to me that Ralf's concerns about So yes, these are very not correct implementations. |
The problem is that the lack of the |
Intel specifies a very specific order of summation. It's not left-to-right, which is what the no-reassoc version would do. From what I understand there is anyway no hardware operation that actually performs this particular kind of summation, so either LLVM needs to have support for this specific operation (and lower it to the best instruction sequence), or we need to do the lowering ourselves in the implementation of |
@RalfJung Hmm. There are a few different possible sequences a compiler can use, but one of the "obvious" ones is a sequence that just repeatedly uses the "do one round of tree-reduction" instruction, which works like you might imagine from that description I just gave. |
Yes, in practice reassoc on reductions produces a tree reduction. Of course, this is not guaranteed from a semantics perspective. |
@RalfJung fwiw, Niki mentions that "perform a tree reduction" was proposed in the past as a possible annotation for the reduces, so perhaps that's the tree we should be barking up this time. |
If that's a possibility then that would make most sense, yes -- have an intrinsic that reduces in a well-defined order that matches what the Intel docs say (i.e., tree reduction). |
My understanding is that clang generates the same IR, so we should probably file this as an LLVM issue as well. How does one call these intrinsics in C? |
|
Thanks! Filed an issue: llvm/llvm-project#82813 |
@nikic it seems I don't know how to talk to LLVM people, they don't seem to agree with me on what it even means to have a LangRef. :/ Maybe you can help move the discussion in llvm/llvm-project#82813 somewhere productive? |
We can also not use |
This is now fixed by #1594: all of these are now implemented by explicitly expanding to a sequence of operations instead of using the LLVM intrinsics. |
Should we remove the |
They may still be useful for generic simd (cc @workingjubilee) |
Today I learned about the existence of the
simd_reduce_add_unordered
intrinsic. When called on a float, this compiles to LLVM'svector.reduce.fadd
with the "fast" flag set, which means that passing in NAN or INF is UB and optimizations are allowed "to treat the sign of a zero argument or zero result as insignificant" (which I think means the sign of input zeros is non-deterministically swapped and returned zeros have non-deterministic sign).This intrinsic is not used a lot in stdarch, but it has a total of 8 uses (all in
avx512f.rs
). 4 of these are integer intrinsics, where this should be entirely equivalent tosimd_reduce_add
; not sure why the "unordered" version is used. The other 4 are float intrinsics,_mm512_reduce_add_ps
being the first:https://github.com/rust-lang/stdarch/blob/4d9c0bb591336792c4c4baf293d0acc944e57e28/crates/core_arch/src/x86/avx512f.rs#L31262-L31270
Neither the docs here nor Intel's docs mention that this is UB on NAN or INF, and the concerns around signed zeros and doing the addition in an unspecified order. Given that the Intel docs should be the authoritative docs (since this is a vendor intrinsic), why is it even correct to use fast-math flags here? Either the docs need to be updated to state the fast-math preconditions, or the implementation needs to be updated to avoid the fast-math flag. Maybe it should only use "reassoc", not the full but unsafe "fast" flag? But even that should probably be mentioned in the docs.
The text was updated successfully, but these errors were encountered: