Let's consider a simple reduction, such as a dot product:
pub fn add(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).fold(0.0, |c, (x, y)| c + x * y)
}
Using rustc 1.68 with -C opt-level=3 -C target-feature=+avx2,+fma, I get:
.LBB0_5:
vmovss xmm1, dword ptr [rdi + 4*rsi]
vmulss xmm1, xmm1, dword ptr [rdx + 4*rsi]
vmovss xmm2, dword ptr [rdi + 4*rsi + 4]
vaddss xmm0, xmm0, xmm1
vmulss xmm1, xmm2, dword ptr [rdx + 4*rsi + 4]
vaddss xmm0, xmm0, xmm1
vmovss xmm1, dword ptr [rdi + 4*rsi + 8]
vmulss xmm1, xmm1, dword ptr [rdx + 4*rsi + 8]
vaddss xmm0, xmm0, xmm1
vmovss xmm1, dword ptr [rdi + 4*rsi + 12]
vmulss xmm1, xmm1, dword ptr [rdx + 4*rsi + 12]
lea rax, [rsi + 4]
vaddss xmm0, xmm0, xmm1
mov rsi, rax
cmp rcx, rax
jne .LBB0_5
which is a scalar implementation with loop unrolling that does not even contract the mul+add pairs into FMAs. Going from this code to SIMD code should be easy, so why does rustc not vectorize it?
If I replace f32 with i32, I get the desired auto-vectorization:
.LBB0_5:
vmovdqu ymm4, ymmword ptr [rdx + 4*rax]
vmovdqu ymm5, ymmword ptr [rdx + 4*rax + 32]
vmovdqu ymm6, ymmword ptr [rdx + 4*rax + 64]
vmovdqu ymm7, ymmword ptr [rdx + 4*rax + 96]
vpmulld ymm4, ymm4, ymmword ptr [rdi + 4*rax]
vpaddd ymm0, ymm4, ymm0
vpmulld ymm4, ymm5, ymmword ptr [rdi + 4*rax + 32]
vpaddd ymm1, ymm4, ymm1
vpmulld ymm4, ymm6, ymmword ptr [rdi + 4*rax + 64]
vpmulld ymm5, ymm7, ymmword ptr [rdi + 4*rax + 96]
vpaddd ymm2, ymm4, ymm2
vpaddd ymm3, ymm5, ymm3
add rax, 32
cmp r8, rax
jne .LBB0_5
This is because floating-point addition is not associative: in general,
a + (b + c) != (a + b) + c. Summing up floats is therefore a serial task, because the compiler will not reorder ((a + b) + c) + d into (a + b) + (c + d). The latter can be vectorized, the former cannot. In most cases, though, the programmer does not care about this difference in summation order.
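A quick demonstration of the non-associativity (the values are mine, chosen to make the rounding visible):

fn main() {
    let (a, b, c) = (1.0e8_f32, -1.0e8_f32, 1.0_f32);
    // Left to right: (1e8 + -1e8) + 1 == 0 + 1, prints 1
    println!("{}", (a + b) + c);
    // Reassociated: -1e8 + 1 rounds back to -1e8 (f32 has ~7 digits), prints 0
    println!("{}", a + (b + c));
}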
gcc and clang provide the -fassociative-math flag, which allows the compiler to reorder floating-point operations for performance. rustc does not provide such a flag, and as far as I know LLVM does not accept any option that would change this behavior either.
In nightly Rust you can use #![feature(core_intrinsics)] to get the optimization.
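A minimal sketch using the fadd_fast intrinsic, which applies LLVM's fast-math flags to just these additions so they may be reassociated (the function name is mine):

#![feature(core_intrinsics)]
use std::intrinsics::fadd_fast;

pub fn dot_fast(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b.iter())
        // Unsafe: fadd_fast assumes finite inputs and outputs.
        .fold(0.0, |c, (x, y)| unsafe { fadd_fast(c, x * y) })
}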
This does not use FMA, because each multiplication is still a strict IEEE operation that LLVM will not contract. So for FMA you have to mark the multiplications as fast as well.
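A sketch that additionally uses fmul_fast, allowing each mul+add pair to be contracted into an FMA (again, the function name is mine):

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_fast, fmul_fast};

pub fn dot_fma(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b.iter())
        // Unsafe: both intrinsics assume finite inputs and outputs.
        .fold(0.0, |c, (x, y)| unsafe { fadd_fast(c, fmul_fast(*x, *y)) })
}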
I am not aware of a stable Rust solution that does not involve explicit SIMD intrinsics.
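For completeness, a sketch of that explicit-intrinsics route on stable (x86_64 only; it assumes the same -C target-feature=+avx2,+fma flags as above, otherwise the calls must be guarded with #[target_feature] and runtime detection; the function name is mine):

#[cfg(target_arch = "x86_64")]
pub fn dot_simd(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 8;
    let mut total;
    unsafe {
        let mut acc = _mm256_setzero_ps();
        for i in 0..chunks {
            // Eight lanes at a time: acc += a[8i..8i+8] * b[8i..8i+8]
            let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
            acc = _mm256_fmadd_ps(va, vb, acc);
        }
        // Horizontal sum of the eight lanes.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        total = lanes.iter().sum::<f32>();
    }
    // Scalar tail for the last len % 8 elements.
    for i in chunks * 8..a.len() {
        total += a[i] * b[i];
    }
    total
}

Note that this sums in a different order than the sequential fold, so the result can differ in the last bits; that is exactly the reordering the compiler is not allowed to perform on its own.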