Skip to content

Missed autovectorization for slice.iter.fold, works for slice.iter.copied.fold #113789

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jhorstmann opened this issue Jul 17, 2023 · 5 comments
Labels
A-autovectorization Area: Autovectorization, which can impact perf or code size A-codegen Area: Code generation C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-heavy Issue: Problems and improvements with respect to binary size of generated code. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@jhorstmann
Copy link
Contributor

I was looking into optimizing a function that checks that all values in a slice are in range. It is not that surprising that the version with all does not get optimized because returning early (although in theory rust should be allowed to read more elements from the slice before breaking), but it is surprising that adding copied before folding makes a difference in autovectorization.

Sample code (https://rust.godbolt.org/z/5eznWbMcf):

pub fn check_range_all(keys: &[u32], max: u32) -> bool {
    keys.iter().all(|x| *x < max)
}

pub fn check_range_fold(keys: &[u32], max: u32) -> bool {
    keys.iter().fold(true, |a, x| a && *x < max)
}

pub fn check_range_copied_fold(keys: &[u32], max: u32) -> bool {
    keys.iter().copied().fold(true, |a, x| a && x < max)
}
  • check_range_all compares one element per loop iteration, using copied does not change the assembly at all (both functions are merged)
  • check_range_fold unrolls the check 8 times, each iteration it branchless, but does not use any vector instructions
  • check_range_copied_fold uses avx instructions and checks 32 elements per loop iteration
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jul 17, 2023
@the8472
Copy link
Member

the8472 commented Jul 18, 2023

pub fn check_range_fold(keys: &[u32], max: u32) -> bool {
    keys.iter().fold(true, |a, x| a & (*x < max))
}

does vectorize. Then the question is why that's not necessary for the copied() iterator.

@Jules-Bertholet
Copy link
Contributor

@rustbot label A-codegen I-heavy I-slow

@rustbot rustbot added A-codegen Area: Code generation I-heavy Issue: Problems and improvements with respect to binary size of generated code. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jul 18, 2023
@jhorstmann
Copy link
Contributor Author

@the8472 thanks, that's a very good data point that I did not consider. That would be consistent with llvm not eagerly de-referencing. In fact this also gets vectorized:

pub fn check_range_fold(keys: &[u32], max: u32) -> bool {
    keys.iter().fold(true, |a, x| (*x < max) && a)
}

I could live with that behavior and explanation and would be ok with closing the issue.

@the8472
Copy link
Member

the8472 commented Jul 18, 2023

The IR for the closure does mark the pointer as dereferencable, which means it could be hoisted out of the comparison.

; example::check_range_fold::{{closure}}
; Function Attrs: inlinehint nonlazybind uwtable
define internal noundef zeroext i1 @"_ZN7example16check_range_fold28_$u7b$$u7b$closure$u7d$$u7d$17hcf54473b332a3b17E"(ptr noalias noundef align 8 dereferenceable(8) %_1, i1 noundef zeroext %a, ptr noalias noundef readonly align 4 dereferenceable(4) %x) unnamed_addr #0 !dbg !121 {
start:
  %_0 = alloca i8, align 1
  br i1 %a, label %bb2, label %bb1, !dbg !123

bb1:                                              ; preds = %start
  store i8 0, ptr %_0, align 1, !dbg !123
  br label %bb3, !dbg !123

bb2:                                              ; preds = %start
  %_5 = load i32, ptr %x, align 4, !dbg !124
  %_7 = load ptr, ptr %_1, align 8, !dbg !125
  %_6 = load i32, ptr %_7, align 4, !dbg !125
  %_4 = icmp ult i32 %_5, %_6, !dbg !124
  %0 = zext i1 %_4 to i8, !dbg !123
  store i8 %0, ptr %_0, align 1, !dbg !123
  br label %bb3, !dbg !123

bb3:                                              ; preds = %bb2, %bb1
  %1 = load i8, ptr %_0, align 1, !dbg !126
  %2 = trunc i8 %1 to i1, !dbg !126
  ret i1 %2, !dbg !126
}

After optimizations the unrolled loop loads the values unconditionally, so it's not like it found it beneficial to skip the memory accesses and bail out early

.LBB0_5:
        cmp     dword ptr [rdi + 4*r8], edx
        setb    r9b
        and     r9b, al
        cmp     dword ptr [rdi + 4*r8 + 4], edx
        setb    al
        cmp     dword ptr [rdi + 4*r8 + 8], edx
        setb    r10b
        and     r10b, al
        and     r10b, r9b
        cmp     dword ptr [rdi + 4*r8 + 12], edx
        setb    al
        cmp     dword ptr [rdi + 4*r8 + 16], edx
        setb    r9b
        and     r9b, al
        cmp     dword ptr [rdi + 4*r8 + 20], edx
        setb    r11b
        and     r11b, r9b
        and     r11b, r10b
        cmp     dword ptr [rdi + 4*r8 + 24], edx
        setb    r9b
        cmp     dword ptr [rdi + 4*r8 + 28], edx
        setb    al
        and     al, r9b
        and     al, r11b
        add     r8, 8
        cmp     rsi, r8
        jne     .LBB0_5

CC @nikic

@workingjubilee workingjubilee added the A-autovectorization Area: Autovectorization, which can impact perf or code size label Jul 22, 2023
@Noratrieb Noratrieb removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jul 30, 2023
@jhorstmann
Copy link
Contributor Author

This issue looks similar to #105259, with slightly simpler code to reproduce it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-autovectorization Area: Autovectorization, which can impact perf or code size A-codegen Area: Code generation C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-heavy Issue: Problems and improvements with respect to binary size of generated code. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

6 participants