vectorization and parallelization Xeon Phi

I am looking for a simple example where vectorization and parallelization on a Xeon Phi give better performance than a Xeon alone. Could you help me, please?

I am trying with the following example. I comment out lines 14, 18 and 19 to run on the Xeon only, and uncomment them for the Xeon Phi, but the Xeon alone still has better performance than the Xeon Phi.

1.int main(){
2.    double *a, *b, *c;
3.    int i, j, k, ok, n=100;
4.    int nPadded = ( n%8 == 0 ? n : n + (8-n%8) );
5.    ok = posix_memalign((void**)&a, 64, n*nPadded*sizeof(double));
6.    ok = posix_memalign((void**)&b, 64, n*nPadded*sizeof(double));
7.    ok = posix_memalign((void**)&c, 64, n*nPadded*sizeof(double));
8.    for(i=0; i<n; i++)
9.    {
10.        a[i] = (int) rand();
11.        b[i] = (int) rand();
12.        c[i] = 0.0;
13.    }
14.    #pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded))
15.    #pragma omp parallel for
16.    for( i = 0; i < n; i++ )
17.        for( k = 0; k < n; k++ )
18.            #pragma vector aligned
19.            #pragma ivdep
20.            for( j = 0; j < n; j++ )
21.                c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];
22.    return 0;
23.}
1 Answer

First, a couple of words about autovectorization. Its advantage is simplicity: you set a few keywords, magic happens, and the compiler makes fast code for you. If you want to go this way, try this manual.

The disadvantage of this approach is that there is no easy way to understand how the compiler does its work. In the vectorization report you will see "LOOP WAS VECTORIZED" or "LOOP WAS NOT VECTORIZED". But if you want to truly understand how your code works, the only way is to look at your program's assembly. Getting the assembly is not a problem: compile the program with -fcode-asm. But I think that if you need to read assembly to check how the "simple autovectorization" method works, then it is not so simple.
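
For illustration, here is the kind of loop that autovectorizes well, a minimal sketch using the same Intel pragmas as in the question (the function name and the flags in the comment are mine, one way of asking icc for a vectorization report):

/* Compile with e.g. icc -O3 -vec-report=2 to see whether the loop
   was vectorized. The pragmas are promises from you to the compiler. */
void saxpy(float* x, float* y, float a, int n){
    #pragma ivdep             /* promise: no loop-carried dependencies */
    #pragma vector aligned    /* promise: x and y are 64-byte aligned  */
    for(int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}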

An alternative to autovectorization is intrinsics (though it is not the only alternative). Think of intrinsics as assembly wrapped in C functions. Most intrinsics internally wrap a single assembly instruction.

I recommend using the Intel Intrinsics Guide.
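
As a minimal sketch of how thin that wrapper is (the function name is mine):

#include <immintrin.h>

/* _mm_add_ps wraps the single SSE instruction addps:
   four single-precision additions in one instruction. */
__m128 add_four(__m128 x, __m128 y){
    return _mm_add_ps(x, y);
}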

So my simple approach is:

  1. Make a single-threaded reference implementation. You will use it to check the correctness of the intrinsics versions.
  2. Implement an SSE intrinsics version. SSE intrinsics are much simpler and can be tested on the Xeon.
  3. Implement the 512-bit version for the Xeon Phi (on first-generation Xeon Phi the _mm512_* intrinsics map to IMCI, the predecessor of AVX-512).
  4. Measure your speed.

Let's do it with your program. There are several differences from your code:

  1. I use float instead of double.
  2. I use _mm_malloc instead of posix_memalign.
  3. I assume n is divisible by 16 without remainder (16 floats fit in a 512-bit vector register). I don't handle loop peeling here; see the sketch after this list.
  4. I use native mode instead of offload mode. KNL is bootable, so it is no longer necessary to use offload mode.
  5. Also, I think your program is not correct, because it modifies the c array from several threads at the same moment. But let's assume that is not important, and we just need some computational work.
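
For completeness, here is what the peeling could look like in the SSE version, a sketch assuming the same headers as the full listing below and that nPadded stays a multiple of 4 so the rows remain 16-byte aligned (the function name is mine):

void intrinsics_peeled(float* a, float* b, float* c, int n, int nPadded){
  #pragma omp parallel for
  for(int i = 0; i < n; i++ )
    for(int k = 0; k < n; k++ ){
      __m128 aPart = _mm_load_ps1(a + i*nPadded+k);  /* broadcast a[i][k] */
      int j = 0;
      for( ; j < n - n % 4; j += 4 ){                /* full-vector body */
        __m128 bPart = _mm_load_ps(b + k*nPadded+j);
        __m128 cPart = _mm_load_ps(c + i*nPadded+j);
        _mm_store_ps(c + i*nPadded+j, _mm_add_ps(cPart, _mm_mul_ps(aPart, bPart)));
      }
      for( ; j < n; j++ )                            /* scalar remainder */
        c[i*nPadded+j] += a[i*nPadded+k] * b[k*nPadded+j];
    }
}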

Run times of my code:

Intel Xeon 5680

  • reference calc time: 97.677505 seconds
  • Intrinsics calc time: 6.189296 seconds

Intel Xeon Phi (KNC) SE10X

  • reference calc time: 199.0 seconds
  • Intrinsics calc time: 2.78 seconds

Code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
#include <immintrin.h>
#include <assert.h>

/* Minimal stand-ins for the logging helpers used below (the originals came
   from a separate test harness and were not shown in the answer). */
#define debug(M, ...) printf(M "\n", __VA_ARGS__)
#define debug_arr(NAME, FMT, ARR, FROM, TO, STEP) do {       \
    printf("%s: ", (NAME));                                   \
    for(int _i = (FROM); _i <= (TO); _i += (STEP))            \
      printf(FMT " ", (ARR)[_i]);                             \
    printf("\n");                                             \
  } while(0)

/* Approximate equality for floats: X equals Y to N decimal places. */
#define F_E_Q(X,Y,N) (round((X) * pow(10, N)-(Y) * pow(10, N)) == 0)

void reference(float* a, float* b, float* c, int n, int nPadded);
void intrinsics(float* a, float* b, float* c, int n, int nPadded);

/* Runs both implementations and checks that they agree.
   Returns NULL on success (minunit-style). */
char *test(){
  int n=4800;
  int nPadded = n;    /* n is already a multiple of 16, so no extra padding */

  assert(n%16 == 0);

  float* a = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
  float* b = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
  float* cRef = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
  float* c = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
  assert(a != NULL);
  assert(b != NULL);
  assert(cRef != NULL);
  assert(c != NULL);

  for(int i=0, max = n*nPadded; i<max; i++){
    /* Scale rand() output down to small float values. */
    a[i] = rand() / 1804289408.0;
    b[i] = rand() / 1804289408.0;
    cRef[i] = 0.0;
    c[i] = 0.0;
  }
  debug_arr("a", "%f", a, 0, 9, 1);
  debug_arr("b", "%f", b, 0, 9, 1);
  debug_arr("cRef", "%f", cRef, 0, 9, 1);
  debug_arr("c", "%f", c, 0, 9, 1);

  double t1 = omp_get_wtime();
  reference(a, b, cRef, n, nPadded);
  double t2 = omp_get_wtime();
  debug("reference calc time: %f", t2-t1);

  t1 = omp_get_wtime();
  intrinsics(a, b, c, n, nPadded);
  t2 = omp_get_wtime();
  debug("Intrinsics calc time: %f", t2-t1);

  debug_arr("cRef", "%f", cRef, 0, 9, 1);
  debug_arr("c", "%f", c, 0, 9, 1);

  for(int i=0, max = n*nPadded; i<max; i++){
    assert(F_E_Q(cRef[i], c[i], 2));
  }

  _mm_free(a);                
  _mm_free(b);
  _mm_free(cRef);
  _mm_free(c);
  return NULL;
}

/* Single-threaded naive triple loop: the correctness and speed baseline. */
void reference(float* a, float* b, float* c, int n, int nPadded){
  for(int i = 0; i < n; i++ )
    for(int k = 0; k < n; k++ )
      for(int j = 0; j < n; j++ )
        c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];
}

#if __MIC__

void intrinsics(float* a, float* b, float* c, int n, int nPadded){
  /* Xeon Phi (KNC) version: 16 floats per 512-bit register. The extload
     broadcasts a[i*nPadded+k] into all 16 lanes. */
  #pragma omp parallel for
  for(int i = 0; i < n; i++ )
    for(int k = 0; k < n; k++ )
      for(int j = 0; j < n; j+=16 ){
        __m512 aPart = _mm512_extload_ps(a + i*nPadded+k, _MM_UPCONV_PS_NONE, _MM_BROADCAST_1X16, _MM_HINT_NONE);
        __m512 bPart = _mm512_load_ps(b + k*nPadded+j);
        __m512 cPart = _mm512_load_ps(c + i*nPadded+j);
        cPart = _mm512_add_ps(cPart, _mm512_mul_ps(aPart, bPart));
        _mm512_store_ps(c + i*nPadded+j, cPart);
      }
}

#else

void intrinsics(float* a, float* b, float* c, int n, int nPadded){
  /* Host (SSE) version: 4 floats per 128-bit register. _mm_load_ps1
     broadcasts a[i*nPadded+k] into all 4 lanes. */
  #pragma omp parallel for
  for(int i = 0; i < n; i++ )
    for(int k = 0; k < n; k++ )
      for(int j = 0; j < n; j+=4 ){
        __m128 aPart = _mm_load_ps1(a + i*nPadded+k);
        __m128 bPart = _mm_load_ps(b + k*nPadded+j);
        __m128 cPart = _mm_load_ps(c + i*nPadded+j);
        cPart = _mm_add_ps(cPart, _mm_mul_ps(aPart, bPart));
        _mm_store_ps(c + i*nPadded+j, cPart);
      }
}

#endif
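
The listing has no entry point of its own; here is a minimal driver, assuming the minunit-style convention above that test() returns NULL on success (the build line in the comment is one possibility, not from the original answer):

/* Build for the host: icc -O3 -openmp matmul.c
   Build natively for KNC: add -mmic. */
int main(){
  char *msg = test();
  if(msg != NULL){
    printf("FAILED: %s\n", msg);
    return 1;
  }
  printf("OK\n");
  return 0;
}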