I've got some code written with numpy, and I'm considering porting it to Fortran for better performance.
One operation I do several times is summing the element-wise product of two arrays:
sum(A*B)
It looks like fused multiply-add instructions would help with this. My current processor doesn't support these instructions, so I can't test things yet. However, I may upgrade to a new processor that does support FMA3 (an Intel Haswell processor).
Does anyone know if compiling the program with "-march=native" (or the ifort equivalent) will be enough to get the compiler (either gfortran or ifort) to use SIMD and FMA instructions sensibly for that code, or will I have to baby the compiler or restructure the code?
Thanks to Xiaolei Zhu's tip, I now know that gfortran will use fused multiply-add to optimize sum(A*B). For example, I can compile a small test program, sum.f95, with
f95 sum.f95 -o sum -O3 -march=core-avx2
and then
objdump -d sum | grep vfmadd
shows that gfortran unrolled the loop and emitted 7 fused multiply-add instructions. If I create larger, random, multi-dimensional arrays, I still see vfmadd231ss (the scalar single-precision form) pop up, but only once, so the loop isn't unrolled in that case.
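For reference, a minimal test program of the kind described might look like the sketch below. The array length, the use of random_number to fill the arrays, and the names are assumptions for illustration, not necessarily the exact program that produced the result above, and the number of vfmadd instructions you see may differ with compiler version and array size.

```fortran
program sum_fma
  implicit none
  integer, parameter :: n = 8   ! assumed: a small, fixed array length
  real :: a(n), b(n), total

  ! Fill the arrays at run time so the compiler cannot fold the
  ! whole reduction into a compile-time constant.
  call random_number(a)
  call random_number(b)

  ! The operation in question: sum of the element-wise product.
  total = sum(a*b)

  ! Print the result so the computation is not optimized away.
  print *, total
end program sum_fma
```

Saved as sum.f95, this can be built and inspected with the same two commands shown above. For rank-1 real arrays, the intrinsic dot_product(a, b) computes the same quantity and may be worth comparing in the disassembly.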