I am looking for an simple example where using vectorization and parallelization on Xeon Phi this has better perfomance than only-Xeon. Could you help me please?
I am trying with the next example. I comment the lines 14, 18 and 19 for run on only-Xeon and uncoment these for Xeon-Phi, but only-Xeon has better performance than Xeon-phi
1.void main(){
2.double *a, *b, *c;
3.int i,j,k, ok, n=100;
4.int nPadded = ( n%8 == 0 ? n : n + (8-n%8) );
5.ok = posix_memalign((void**)&a, 64, n*nPadded*sizeof(double));
6.ok = posix_memalign((void**)&b, 64, n*nPadded*sizeof(double));
7.ok = posix_memalign((void**)&c, 64, n*nPadded*sizeof(double));
8.for(i=0; i<n; i++)
9.{
10. a[i] = (int) rand();
11. b[i] = (int) rand();
12. c[i] = 0.0;
13.}
14.#pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded))
15.#pragma omp parallel for
16.for( i = 0; i < n; i++ )
17. for( k = 0; k < n; k++ )
18. #pragma vector aligned
19. #pragma ivdep
20. for( j = 0; j < n; j++ ){
21. c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j]
22.}
First couple words about autovectorization. Advantage of autovectorization is simplicity. You need to set some keywords than magic happens and compiler make fast code for you. If you want to go this way try this manual.
The disadvantage of this approach is that there is no easy way to understand how compiler make his work. In vectorization report you will see "LOOP WAS VECTORIZED" or "LOOP WAS NOT VECTORIZED". But if you want truly understand how your code works the only way is look in your program assembly. This is not a problem to get assembly. You need to compile program with
-fcode-asm
. But I think if you need to read assembly to check how "simple autovectorization" method works it is not so simple.Alternative to autovectorization are intrinsics (actually, this is not single alternative). Think about intrinsics like assembly wrapped with C functions. Many intrinsics internally wrap single assembly command.
I recommend to use this intrinsics guide.
So my simple way steps:
Let's do it with your program. There are many differences with your program:
My code work time:
Intel Xeon 5680
Intel Xeon Phi (KNC) SE10X
Code: