Intel Xeon Phi supports the "IMCI" instruction set, and I used it to compute "c = a*b", like this:
float* x = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT);
float* y = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT);
float z[N] __attribute__((aligned(64)));   // must be 64-byte aligned for the aligned store below; for large N, heap-allocate z like x and y
_Cilk_for(size_t i = 0; i < N; i+=16)
{
    __m512 x_1Vec = _mm512_load_ps(x+i);          // load 16 floats from x
    __m512 y_1Vec = _mm512_load_ps(y+i);          // load 16 floats from y
    __m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
    _mm512_store_ps(z+i, ans);                    // _mm512_store_pd was a typo: ans holds floats, so the ps store is the correct one
}
I tested its performance: when N is 1048576, it takes 0.083317 sec. I wanted to compare this with auto-vectorization, so the other version of the code looks like this:
_Cilk_for(size_t i = 0; i < N; i++)
    z[i] = x[i] * y[i];
This version takes 0.025475 sec (but sometimes 0.002285 sec or less, and I don't know why).
If I change the _Cilk_for to #pragma omp parallel for, the performance gets worse.
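For reference, the OpenMP variant I mean is simply the same loop with the pragma swapped in:

#pragma omp parallel for
for (size_t i = 0; i < N; i++)
    z[i] = x[i] * y[i];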
So if that is the result, why do we need to use intrinsics at all?
Did I make any mistakes anywhere?
Can someone give me some good suggestions for optimizing the code?
My answer below applies equally to Intel Xeon and Intel Xeon Phi.
In your second code snippet you seem to use "explicit" vectorization, which is currently achievable with the Cilk Plus and OpenMP 4.0 "frameworks" supported by all recent versions of the Intel Compiler and also by GCC 4.9. (I say that you seem to use explicit vectorization because _Cilk_for was originally invented for multi-threading; however, the most recent versions of the Intel Compiler may automatically parallelize and vectorize the loop when _Cilk_for is used.)
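For example, explicit vectorization of such a loop can be requested with either the Cilk Plus or the OpenMP 4.0 SIMD pragma (a minimal sketch; exact behavior depends on the compiler version and flags):

// Cilk Plus explicit vectorization pragma (Intel Compiler)
#pragma simd
for (size_t i = 0; i < N; i++)
    z[i] = x[i] * y[i];

// OpenMP 4.0 equivalent
#pragma omp simd
for (size_t i = 0; i < N; i++)
    z[i] = x[i] * y[i];

Combining these with a threading construct (e.g. _Cilk_for, or the OpenMP 4.0 combined form #pragma omp parallel for simd) gives both multi-threading across cores and vectorization within each thread.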