-
Notifications
You must be signed in to change notification settings - Fork 13.3k
Missed autovectorization for slice.iter.fold, works for slice.iter.copied.fold #113789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
pub fn check_range_fold(keys: &[u32], max: u32) -> bool {
keys.iter().fold(true, |a, x| a & (*x < max))
} does vectorize. Then the question is why that's not necessary for the |
@rustbot label A-codegen I-heavy I-slow |
@the8472 thanks, that's a very good data point that I did not consider. That would be consistent with llvm not eagerly de-referencing. In fact this also gets vectorized: pub fn check_range_fold(keys: &[u32], max: u32) -> bool {
keys.iter().fold(true, |a, x| (*x < max) && a)
} I could live with that behavior and explanation and would be ok with closing the issue. |
The IR for the closure does mark the pointer as dereferencable, which means it could be hoisted out of the comparison. ; example::check_range_fold::{{closure}}
; Function Attrs: inlinehint nonlazybind uwtable
define internal noundef zeroext i1 @"_ZN7example16check_range_fold28_$u7b$$u7b$closure$u7d$$u7d$17hcf54473b332a3b17E"(ptr noalias noundef align 8 dereferenceable(8) %_1, i1 noundef zeroext %a, ptr noalias noundef readonly align 4 dereferenceable(4) %x) unnamed_addr #0 !dbg !121 {
start:
%_0 = alloca i8, align 1
br i1 %a, label %bb2, label %bb1, !dbg !123
bb1: ; preds = %start
store i8 0, ptr %_0, align 1, !dbg !123
br label %bb3, !dbg !123
bb2: ; preds = %start
%_5 = load i32, ptr %x, align 4, !dbg !124
%_7 = load ptr, ptr %_1, align 8, !dbg !125
%_6 = load i32, ptr %_7, align 4, !dbg !125
%_4 = icmp ult i32 %_5, %_6, !dbg !124
%0 = zext i1 %_4 to i8, !dbg !123
store i8 %0, ptr %_0, align 1, !dbg !123
br label %bb3, !dbg !123
bb3: ; preds = %bb2, %bb1
%1 = load i8, ptr %_0, align 1, !dbg !126
%2 = trunc i8 %1 to i1, !dbg !126
ret i1 %2, !dbg !126
} After optimizations the unrolled loop loads the values unconditionally, so it's not like it found it beneficial to skip the memory accesses and bail out early .LBB0_5:
cmp dword ptr [rdi + 4*r8], edx
setb r9b
and r9b, al
cmp dword ptr [rdi + 4*r8 + 4], edx
setb al
cmp dword ptr [rdi + 4*r8 + 8], edx
setb r10b
and r10b, al
and r10b, r9b
cmp dword ptr [rdi + 4*r8 + 12], edx
setb al
cmp dword ptr [rdi + 4*r8 + 16], edx
setb r9b
and r9b, al
cmp dword ptr [rdi + 4*r8 + 20], edx
setb r11b
and r11b, r9b
and r11b, r10b
cmp dword ptr [rdi + 4*r8 + 24], edx
setb r9b
cmp dword ptr [rdi + 4*r8 + 28], edx
setb al
and al, r9b
and al, r11b
add r8, 8
cmp rsi, r8
jne .LBB0_5 CC @nikic |
This issue looks similar to #105259, with slightly simpler code to reproduce it |
I was looking into optimizing a function that checks that all values in a slice are in range. It is not that surprising that the version with
all
does not get optimized because returning early (although in theory rust should be allowed to read more elements from the slice before breaking), but it is surprising that addingcopied
before folding makes a difference in autovectorization.Sample code (https://rust.godbolt.org/z/5eznWbMcf):
check_range_all
compares one element per loop iteration, usingcopied
does not change the assembly at all (both functions are merged)check_range_fold
unrolls the check 8 times, each iteration it branchless, but does not use any vector instructionscheck_range_copied_fold
usesavx
instructions and checks 32 elements per loop iterationThe text was updated successfully, but these errors were encountered: