GCC 6.1.0 vs Intel Compiler 15 and Auto Vectorization performance


Is it possible to achieve a similar level of SSE2/AVX auto-vectorization performance with GCC?

It looks like Intel Compiler 15 is superior in auto-vectorization efficiency. As a benchmark I used the classic flops.c (https://github.com/AMDmi3/flops/blob/master/flops.c).

Here are the results on my Intel Xeon E5-2690 (Sandy Bridge):

Intel Compiler 15 [ /O2 /arch:AVX /fp:fast ]

FLOPS C Program (double Precision), V2.0 18 Dec 1992

Module     Error        RunTime      MFLOPS
                        (usec)
 1    -2.5613e-010      0.0034   4177.1562
 2    -1.4166e-013      0.0058   1209.1768
 3     3.1904e-010      0.0011  15487.5445
 4     9.0594e-014      0.0011  14065.9341
 5    -6.2284e-014      0.0034   8652.6807
 6     3.3640e-014      0.0021  13994.3450
 7     9.4360e-012      0.0101   1193.4732
 8     3.7637e-014      0.0022  13677.6492

Iterations      =  512000000
NullTime (usec) =     0.0000
MFLOPS(1)       =  1730.8542
MFLOPS(2)       =  2971.1755
MFLOPS(3)       =  6296.4960
MFLOPS(4)       = 14153.0984

GCC 6.1.0 [ -m32 -mavx -Ofast ]

FLOPS C Program (double Precision), V2.0 18 Dec 1992

Module     Error        RunTime      MFLOPS
                        (usec)
 1     1.8119e-013      0.0034   4177.1562
 2    -1.4166e-013      0.0055   1283.6676
 3     8.2157e-015      0.0013  13600.0000
 4     1.8874e-015      0.0023   6655.1127
 5    -2.7645e-014      0.0048   6060.4082
 6     5.1903e-014      0.0041   7159.1128
 7    -8.4583e-011      0.0200    598.5387
 8    -1.4488e-014      0.0041   7293.4473

Iterations      =  512000000
NullTime (usec) =     0.0000
MFLOPS(1)       =  1823.5616
MFLOPS(2)       =  1585.2039
MFLOPS(3)       =  3663.4158
MFLOPS(4)       =  7799.1296

Something tells me that I forgot to enable some special switch in GCC.
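
One way to check this is to ask GCC which loops it actually vectorized, using its -fopt-info-vec report. Below is a minimal sketch, assuming a hypothetical kernel integrate_poly that is only loosely in the spirit of the flops.c modules (it is not taken from flops.c). It is a floating-point sum reduction, which GCC will generally only vectorize when reassociation is allowed (-ffast-math, which -Ofast implies).

```c
#include <assert.h>

/*
 * Hypothetical test kernel (an assumption for illustration, not from flops.c):
 * midpoint-rule integration of f(x) = 0.25*x^3 + 0.5*x^2 + x over [0, 1].
 *
 * To see GCC's vectorization report, compile to an object file, e.g.:
 *   gcc -Ofast -mavx -fopt-info-vec-optimized -c vec_check.c
 *   gcc -Ofast -mavx -fopt-info-vec-missed    -c vec_check.c
 * The file name vec_check.c is arbitrary.
 */
double integrate_poly(int n)
{
    assert(n > 0);

    const double h = 1.0 / (double)n;   /* width of each sub-interval */
    double sum = 0.0;

    /* Sum reduction over doubles: vectorizing it changes the order of
       the additions, so the compiler needs permission to reassociate. */
    for (int i = 0; i < n; ++i) {
        double x = ((double)i + 0.5) * h;          /* midpoint of interval i */
        sum += ((0.25 * x + 0.5) * x + 1.0) * x;   /* Horner form of f(x) */
    }
    return sum * h;
}
```

Running the same two commands on flops.c itself should list which of its loops were vectorized and which were missed, which may narrow down where the two compilers differ.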

P.S. Yes, I know that the Intel Compiler runs with reduced precision.
