I have std::vector<double> X,Y both of size N (with N%16==0) and I want to calculate sum(X[i]*Y[i]). That's a classical use case for Fused Multiply and Add (FMA), which should be fast on AVX-capable processors. I know all my target CPUs are Intel, Haswell or newer.
How do I get GCC to emit that AVX code? -mfma is part of the solution, but do I need other switches?
And is std::vector<double>::operator[] hindering this? I know I can transform
size_t N = X.size();
double sum = 0.0;
for (size_t i = 0; i != N; ++i) sum += X[i] * Y[i];
to
size_t N = X.size();
double sum = 0.0;
double const* Xp = &X[0];
double const* Yp = &Y[0];
for (size_t i = 0; i != N; ++i) sum += Xp[i] * Yp[i];
so the compiler can spot that &X[0] doesn't change in the loop. But is this sufficient or even necessary?
Current compiler is GCC 4.9.2, Debian 8, but could upgrade to GCC 5 if necessary.
Did you look at the assembly? I put the loop into http://gcc.godbolt.org/ and looked at the assembly from GCC 4.9.2 with -O3 -mfma. The loop contains a scalar FMA instruction, so it uses FMA. However, it does not vectorize the loop (the "s" in the "sd" suffix means single, i.e. not packed, and the "d" means double floating point).
To vectorize the loop you need to enable associative math, e.g. with -Ofast. Using -Ofast -mavx2 -mfma, the FMA instruction has a "pd" suffix instead, so now it's vectorized ("pd" means packed doubles). However, it's not unrolled. This is currently a limitation of GCC. You need to unroll several times due to the dependency chain. If you want the compiler to do this for you, consider using Clang, which unrolls four times; otherwise, unroll by hand with intrinsics.
Note that unlike GCC, Clang does not use FMA by default with -mfma. To use FMA with Clang, use -ffp-contract=fast (e.g. -O3 -mfma -ffp-contract=fast) or #pragma STDC FP_CONTRACT ON, or enable associative math with e.g. -Ofast. You're going to want to enable associative math anyway if you want to vectorize the loop with Clang.
See "Fused multiply add and default rounding modes" and https://stackoverflow.com/a/34461738/2542702 for more info about enabling FMA with different compilers.
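As a rough sketch of what unrolling by hand with intrinsics could look like: the function below (the name dot_fma_unrolled is made up for illustration) keeps four independent AVX accumulators so that consecutive FMAs don't wait on each other. Four is only an illustrative choice; the best count depends on the FMA latency and throughput of the target CPU. It assumes N is a multiple of 16, as in the question, and uses unaligned loads so it works on the storage of a std::vector<double>. The target attribute is only there so the function builds without -mavx2 -mfma on the command line; the CPU still has to support both.

```cpp
#include <cstddef>
#include <immintrin.h>

// Hypothetical helper, not from the original post.
// Assumes n is a multiple of 16 (4 accumulators x 4 doubles per AVX register).
// The target attribute enables AVX2/FMA for this function only; with
// -mavx2 -mfma on the command line it is not needed.
__attribute__((target("avx2,fma")))
double dot_fma_unrolled(const double* x, const double* y, std::size_t n) {
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd();
    __m256d acc3 = _mm256_setzero_pd();

    for (std::size_t i = 0; i != n; i += 16) {
        // Each accumulator is its own dependency chain, so these four
        // packed FMAs (acc = x * y + acc) can overlap in the pipeline.
        acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i),
                               _mm256_loadu_pd(y + i), acc0);
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 4),
                               _mm256_loadu_pd(y + i + 4), acc1);
        acc2 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 8),
                               _mm256_loadu_pd(y + i + 8), acc2);
        acc3 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 12),
                               _mm256_loadu_pd(y + i + 12), acc3);
    }

    // Combine the four partial sums, then add up the four lanes.
    __m256d acc = _mm256_add_pd(_mm256_add_pd(acc0, acc1),
                                _mm256_add_pd(acc2, acc3));
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

With the vectors from the question it would be called as dot_fma_unrolled(&X[0], &Y[0], X.size()). Because the partial sums are added in a different order than in the scalar loop, the result can differ in the last bits, which is the same reordering that -Ofast permits.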
GCC creates a lot of extra code to handle misalignment and N not being a multiple of 8. You can tell the compiler to assume the arrays are aligned using __builtin_assume_aligned and that N is a multiple of 8 using N & -8. Code written this way produces much simpler assembly with -Ofast -mavx2 -mfma.
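A minimal sketch of those two hints, assuming the caller really does pass 32-byte-aligned pointers (the function name dot_aligned and the __restrict qualifiers are choices made for this illustration):

```cpp
#include <cstddef>

// Hypothetical helper, not from the original post.
// Precondition: x and y are 32-byte aligned. If they are not, the
// alignment promise below is false and the behaviour is undefined.
double dot_aligned(const double* __restrict x,
                   const double* __restrict y,
                   std::size_t n) {
    // Promise the compiler 32-byte alignment (GCC/Clang builtin).
    const double* xa =
        static_cast<const double*>(__builtin_assume_aligned(x, 32));
    const double* ya =
        static_cast<const double*>(__builtin_assume_aligned(y, 32));

    // N & -8: clear the low three bits so the trip count is a multiple
    // of 8. Any tail of up to 7 elements is deliberately ignored.
    const std::size_t n8 = n & static_cast<std::size_t>(-8);

    double sum = 0.0;
    for (std::size_t i = 0; i != n8; ++i) sum += xa[i] * ya[i];
    return sum;
}
```

Note that std::vector<double> with the default allocator does not promise 32-byte alignment (on typical x86-64 Linux it is 16 bytes), so this version needs storage from an aligned allocator or arrays declared with alignas(32). With N%16==0 as in the question, the mask removes nothing; it only tells the compiler that no remainder loop is needed.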