Byte count has poor codegen with autovectorization #136500

Open
danielhuang opened this issue Feb 3, 2025 · 0 comments
Labels
A-autovectorization Area: Autovectorization, which can impact perf or code size A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

danielhuang (Contributor) commented Feb 3, 2025

I'm currently writing some code that counts the occurrences of a certain byte (newlines in this example) in a large byte slice (>1GB):

fn count(b: &[u8]) -> usize {
    b.iter().filter(|&&x| x == b'\n').count()
}

When compiled with -C target-feature=+avx2, autovectorization does emit AVX2 instructions, but the result is still around 2x slower than the bytecount crate.
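A likely contributor to the gap (an assumption here, not something established in this issue) is that `count()` accumulates into a `usize`, so the vectorized loop has to widen each byte comparison result toward 64-bit lanes. A common workaround is to count into byte-wide counters and flush them into the `usize` total before they can overflow. The stable-Rust sketch below does that; `count_chunked` and `LANES` are hypothetical names, and whether LLVM actually keeps the inner loop in byte lanes should be checked on Compiler Explorer with `-C target-feature=+avx2` rather than assumed.

```rust
/// Width of the byte-counter array (hypothetical tuning choice; 32 bytes = one AVX2 register).
const LANES: usize = 32;

/// Counts occurrences of `needle`, accumulating in u8 counters that are
/// flushed into a usize before they can overflow.
fn count_chunked(haystack: &[u8], needle: u8) -> usize {
    let mut total = 0usize;
    // Each block holds at most 255 full LANES-byte rows, so every u8 counter ends at <= 255.
    for block in haystack.chunks(LANES * 255) {
        let mut acc = [0u8; LANES];
        let mut rows = block.chunks_exact(LANES);
        for row in &mut rows {
            // A fixed-size array view lets the compiler prove the indexing is in bounds.
            let row: &[u8; LANES] = row.try_into().unwrap();
            for i in 0..LANES {
                acc[i] += (row[i] == needle) as u8;
            }
        }
        // Flush the byte counters into the wide total.
        total += acc.iter().map(|&c| c as usize).sum::<usize>();
        // Bytes that did not fill a whole row (only possible in the last block).
        total += rows.remainder().iter().filter(|&&x| x == needle).count();
    }
    total
}
```

The flush every `LANES * 255` bytes costs one small horizontal sum per block, which is negligible next to the per-byte work on a >1GB input.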

Using portable_simd, the code can be made faster:

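#![feature(portable_simd)] // nightly-only
use std::simd::prelude::*;

fn count_simd(b: &[u8]) -> usize {
    // Split into an unaligned prefix, aligned 64-byte vectors, and an unaligned suffix.
    let (begin, mid, end) = b.as_simd::<64>();
    // The scalar `count` from above handles the unaligned edges.
    count(begin)
        + count(end)
        + mid
            .iter()
            .map(|x| {
                // One lane per byte: 1 where the byte is '\n', else 0.
                // The per-vector sum is at most 64, so it cannot overflow a u8.
                x.simd_eq(Simd::splat(b'\n'))
                    .select(Simd::splat(1u8), Simd::splat(0u8))
                    .reduce_sum()
            })
            .map(|x| x as usize)
            .sum::<usize>()
}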

This performs similarly to bytecount::count.

https://rust.godbolt.org/z/6b5bTKoa9
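For comparison, a hand-written stable-Rust baseline using `std::arch` intrinsics could compare 32 bytes at a time, turn the comparison into a bitmask with `_mm256_movemask_epi8`, and add its population count. This is a sketch, not what bytecount or the compiler does; `count_movemask` and `count_avx2` are hypothetical names, AVX2 is detected at runtime, and other targets fall back to the scalar filter from the issue.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{
    __m256i, _mm256_cmpeq_epi8, _mm256_loadu_si256, _mm256_movemask_epi8, _mm256_set1_epi8,
};

/// Counts `needle` in `b`, using AVX2 when the CPU supports it.
fn count_movemask(b: &[u8], needle: u8) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { count_avx2(b, needle) };
        }
    }
    // Portable scalar fallback (same shape as the `count` function in the issue).
    b.iter().filter(|&&x| x == needle).count()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_avx2(b: &[u8], needle: u8) -> usize {
    // Broadcast the needle into all 32 byte lanes.
    let splat = _mm256_set1_epi8(needle as i8);
    let mut chunks = b.chunks_exact(32);
    let mut total = 0usize;
    for chunk in &mut chunks {
        // SAFETY: `chunk` is exactly 32 bytes; loadu has no alignment requirement.
        let v = unsafe { _mm256_loadu_si256(chunk.as_ptr() as *const __m256i) };
        // 0xFF in each lane that matches, 0x00 otherwise.
        let eq = _mm256_cmpeq_epi8(v, splat);
        // One bit per lane; the popcount is the number of matches in this chunk.
        total += (_mm256_movemask_epi8(eq) as u32).count_ones() as usize;
    }
    // Scalar tail for the final < 32 bytes.
    total + chunks.remainder().iter().filter(|&&x| x == needle).count()
}
```

The popcount-per-chunk design avoids any widening accumulator in the hot loop, at the cost of a scalar `popcnt` and add per 32 bytes; whether it beats the byte-counter approach is something to measure, not assume.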

rustbot added the needs-triage label on Feb 3, 2025
jieyouxu added the T-compiler, A-autovectorization, C-optimization, and A-target-feature labels and removed the needs-triage and A-target-feature labels on Feb 3, 2025
saethlin added the A-LLVM label on Feb 3, 2025