I have std::vector<double> X, Y, both of size N (with N % 16 == 0), and I want to calculate sum(X[i]*Y[i]). That's a classical use case for Fused Multiply and Add (FMA), which should be fast on AVX-capable processors. I know all my target CPUs are Intel, Haswell or newer.
How do I get GCC to emit that AVX code? -mfma is part of the solution, but do I need other switches?
And is std::vector<double>::operator[] hindering this? I know I can transform
size_t N = X.size();
double sum = 0.0;
for (size_t i = 0; i != N; ++i) sum += X[i] * Y[i];
to
size_t N = X.size();
double sum = 0.0;
double const* Xp = &X[0];
double const* Yp = &Y[0];
for (size_t i = 0; i != N; ++i) sum += Xp[i] * Yp[i];
so the compiler can spot that &X[0] doesn't change in the loop. But is this sufficient or even necessary?
Current compiler is GCC 4.9.2, Debian 8, but could upgrade to GCC 5 if necessary.
Did you look at the assembly? I put the code into http://gcc.godbolt.org/ and looked at the assembly in GCC 4.9.2 with -O3 -mfma, and I see that it uses fma. However, it does not vectorize the loop (the s in sd means single, i.e. not packed, and the d means double floating point).
To vectorize the loop you need to enable associative math, e.g. with -Ofast. Using -Ofast -mavx2 -mfma, the loop is vectorized (pd means packed doubles). However, it's not unrolled. This is currently a limitation of GCC. You need to unroll several times due to the dependency chain: every fma accumulates into the same register, so each one has to wait for the previous one to finish. If you want the compiler to do this for you, consider using Clang, which unrolls four times; otherwise unroll by hand with intrinsics.
Note that unlike GCC, Clang does not use fma by default with -mfma. In order to use fma with Clang, use -ffp-contract=fast (e.g. -O3 -mfma -ffp-contract=fast) or #pragma STDC FP_CONTRACT ON, or enable associative math with e.g. -Ofast. You're going to want to enable associative math anyway if you want to vectorize the loop with Clang.
See "Fused multiply add and default rounding modes" and https://stackoverflow.com/a/34461738/2542702 for more info about enabling fma with different compilers.
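To illustrate the dependency-chain point, here is a minimal sketch of manual unrolling with four independent accumulators. It is plain C++ rather than intrinsics, the function name dot_unrolled4 is made up for this example, and it assumes both vectors have the same size and that the size is a multiple of 4 (the question guarantees N % 16 == 0). Whether the compiler then turns each accumulator into a packed fma still depends on the flags discussed above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: dot product with four independent partial sums.
// Assumes X.size() == Y.size() and that the size is a multiple of 4;
// any trailing elements beyond the last full group of 4 are ignored.
double dot_unrolled4(const std::vector<double>& X, const std::vector<double>& Y)
{
    const std::size_t N = X.size();
    const double* Xp = X.data();
    const double* Yp = Y.data();

    // Four accumulators break the single long add chain into four
    // shorter, independent chains that can overlap in the pipeline.
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (std::size_t i = 0; i + 4 <= N; i += 4) {
        s0 += Xp[i + 0] * Yp[i + 0];
        s1 += Xp[i + 1] * Yp[i + 1];
        s2 += Xp[i + 2] * Yp[i + 2];
        s3 += Xp[i + 3] * Yp[i + 3];
    }
    // Combine the partial sums at the end.
    return (s0 + s1) + (s2 + s3);
}
```

Note that this changes the order of the additions compared with the straightforward loop, which is exactly the reassociation the compiler is not allowed to do on its own without associative math.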
GCC creates a lot of extra code to handle misalignment and for N not a multiple of 8. You can tell the compiler to assume the arrays are aligned using __builtin_assume_aligned and that N is a multiple of 8 using N & -8. The following code with -Ofast -mavx2 -mfma produces the following simple assembly
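A minimal sketch of how those two hints might be applied is shown here. It is an illustration, not necessarily the exact code the assembly was generated from: the function name dot_aligned is made up, __builtin_assume_aligned is a GCC/Clang builtin, and the 32-byte alignment is an assumption the caller has to guarantee (std::vector with the default allocator does not promise it).

```cpp
#include <cstddef>

// Hypothetical helper: dot product with alignment and trip-count hints.
// Assumes x and y really are 32-byte aligned; passing unaligned pointers
// is undefined behaviour once the compiler relies on the hint.
double dot_aligned(const double* x, const double* y, std::size_t n)
{
    // Promise the compiler that both arrays start on a 32-byte boundary,
    // so it can drop the code that peels off misaligned leading elements.
    const double* xp = static_cast<const double*>(__builtin_assume_aligned(x, 32));
    const double* yp = static_cast<const double*>(__builtin_assume_aligned(y, 32));

    // Round n down to a multiple of 8 (equivalent to n & -8), so the
    // compiler can see that no scalar remainder loop is needed.
    n &= ~static_cast<std::size_t>(7);

    double sum = 0.0;
    for (std::size_t i = 0; i != n; ++i)
        sum += xp[i] * yp[i];
    return sum;
}
```

If N is not actually a multiple of 8, the mask silently drops the last few elements, so this only makes sense when that property is guaranteed, as it is in the question.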