Will gfortran or ifort compilers wisely use SIMD instructions when summing the product of two arrays?


I've got some code written with numpy, and I'm considering porting it to Fortran for better performance.

One operation I do several times is summing the element-wise product of two arrays:

sum(A*B)

It looks like fused multiply-add instructions would help with this. My current processor doesn't support these instructions, so I can't test things yet. However, I may upgrade to a new processor that does support FMA3 (an Intel Haswell processor).

Does anyone know if compiling the program with "-march=native" (or the ifort equivalent) will be enough to get the compiler (either gfortran or ifort) to wisely use SIMD instructions to optimize that code, or do you think I'll have to baby the compilers or code?


There are 3 best solutions below

0
On BEST ANSWER

Thanks to Xiaolei Zhu's tip, I now know that gfortran will use fused multiply-add to optimize sum(A*B). For example, with this code:

program test
  implicit none
  real, dimension(7) :: a, b

  a = (/ 2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0 /)
  b = (/ 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0 /)

  ! Sum of the element-wise product of a and b
  print *, sum(a*b)
end program test

I can compile it with f95 sum.f95 -o sum -O3 -march=core-avx2, and objdump -d sum | grep vfmadd displays

40088b: c4 e2 71 99 44 24 30 vfmadd132ss 0x30(%rsp),%xmm1,%xmm0
400892: c4 e2 69 b9 44 24 34 vfmadd231ss 0x34(%rsp),%xmm2,%xmm0
400899: c4 e2 61 b9 44 24 38 vfmadd231ss 0x38(%rsp),%xmm3,%xmm0
4008a0: c4 e2 59 b9 44 24 3c vfmadd231ss 0x3c(%rsp),%xmm4,%xmm0
4008a7: c4 e2 51 b9 44 24 40 vfmadd231ss 0x40(%rsp),%xmm5,%xmm0
4008ae: c4 e2 49 b9 44 24 44 vfmadd231ss 0x44(%rsp),%xmm6,%xmm0
4008b5: c4 e2 41 b9 44 24 48 vfmadd231ss 0x48(%rsp),%xmm7,%xmm0

So gfortran unrolled the loop and emitted 7 fused multiply-add instructions. If I create larger, random, multi-dimensional arrays, vfmadd231ss still shows up, but only once, so in that case the loop is not unrolled.
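The larger-array experiment might look something like the sketch below. The array size n and the program name are hypothetical, chosen only for illustration; the answer does not say what sizes were actually used.

```fortran
program sum_large
  implicit none
  integer, parameter :: n = 512      ! hypothetical size, not from the answer
  real, allocatable :: a(:,:), b(:,:)

  allocate(a(n, n), b(n, n))

  ! Fill both arrays with uniform random values in [0, 1)
  call random_number(a)
  call random_number(b)

  ! Same reduction as in the 7-element example, now over 2-D arrays
  print *, sum(a*b)
end program sum_large
```

Compiling this with the same flags as above (-O3 -march=core-avx2) and running objdump -d on the result, filtered with grep vfmadd, should show whether the compiler still emits an FMA instruction for the reduction.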

1
On

If you use -march=native on a machine with SIMD, the compiler should generate SIMD instructions, although with ifort I've always used the -xHost flag instead.

But I am not so sure how to make them do it "wisely". My feeling is that at -O3 level ifort and gfortran both tend to be overly aggressive on vectorization (that is, they use the SIMD functionality more often than they should). Very often I have to turn off vectorization to get the most efficient code. This, of course, may or may not be true for you.

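It will usually be better to use library routines that are optimized for this task. Note that vdMul in MKL and gsl_vector_mul in GSL compute only the element-wise product; for the summed product, a BLAS dot-product routine (sdot or ddot, which MKL provides, or gsl_blas_ddot in GSL) does the whole operation in one call.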

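Using -march=NEWARCH allows the compiler to use every instruction NEWARCH supports (including FMA) and tunes the code for it, but the resulting binary may not run on an earlier architecture. The -mtune=NEWARCH flag, where NEWARCH is the architecture of your new processor, only tunes the code for the new architecture and keeps it executable on the old one; it does not enable new instructions such as FMA. Since you do not yet have the new machine, -mtune is what you can actually run at the moment, while -march=NEWARCH can still be used to compile the code and inspect the generated assembly.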

With ifort you can use the vectorization report flags to show which parts of the program have been vectorized. For example, the ifort flag -vec-report=1 will give you such information during compilation. In gfortran, -fopt-info-vec gives a similar report in recent GCC versions.

2
On

gfortran versions where sum(a*b) gave better vectorization than dot_product(a,b) are long obsolete. The code you show is using scalar FMA instructions, not vector ones: the ss suffix means each instruction handles a single single-precision value.

For a dot_product without indirect indexing or other complications (a simple loop by itself), FMA will likely be slower than the combination of SIMD parallel multiply and add instructions, because the multiply can then be done off the latency-critical path. gfortran's use of parallel SIMD FMA for dot_product can be quite effective in the more complicated cases.
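As a small illustration of the two spellings being compared, the sketch below evaluates both on the 7-element data from the accepted answer. It only shows that the intrinsics compute the same reduction; it says nothing about which one a given gfortran version vectorizes better. The program name and the variables s1 and s2 are made up for this example.

```fortran
program dot_vs_sum
  implicit none
  real, dimension(7) :: a, b
  real :: s1, s2

  a = (/ 2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0 /)
  b = (/ 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0 /)

  s1 = sum(a*b)             ! element-wise product, then reduction
  s2 = dot_product(a, b)    ! intrinsic dot product of two rank-1 arrays

  ! Both values are 722 for this data (all terms are small integers)
  print *, s1, s2
end program dot_vs_sum
```

Note that dot_product only accepts rank-1 arrays, so for multi-dimensional data sum(a*b) remains the direct form.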

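To vectorize this you will need either -O2 -ftree-vectorize -ffast-math -march=native or -O3 -ffast-math -march=native (as well as suitable vector lengths). -ffast-math matters because vectorizing a floating-point sum changes the order of the additions, which the compiler will not do under strict floating-point rules. gfortran may also fail to vectorize inside an OpenMP parallel region.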
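To try those flag combinations, a test case along the following lines could be used. It is only a sketch: the vector length n, the file name dot_loop.f90 and the program name are assumptions made for illustration. The reduction is written as an explicit loop so the accumulation into s, which is the part the vectorizer has to reorder, is visible.

```fortran
! Possible compile lines, using the flags quoted above
! (dot_loop.f90 is a hypothetical file name):
!   gfortran -O3 -ffast-math -march=native dot_loop.f90 -o dot_loop
!   gfortran -O2 -ftree-vectorize -ffast-math -march=native dot_loop.f90 -o dot_loop
program dot_loop
  implicit none
  integer, parameter :: n = 100000   ! hypothetical length, long enough to be worth vectorizing
  real, allocatable :: a(:), b(:)
  real :: s
  integer :: i

  allocate(a(n), b(n))
  call random_number(a)
  call random_number(b)

  ! Scalar reduction: each iteration adds one product to s.
  ! A vectorized version keeps several partial sums and combines
  ! them at the end, which changes the order of the additions.
  s = 0.0
  do i = 1, n
    s = s + a(i)*b(i)
  end do

  print *, s
end program dot_loop
```

Inspecting the binary with objdump -d and looking for packed instructions (a ps suffix rather than ss) is one way to check whether the loop was actually vectorized.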

gfortran 4.9 appears to have dropped the option -ftree-vectorizer-verbose. -fdump-tree-vect writes details of vectorization passes to a .vect file, with different names chosen for different major gcc versions.