I am trying to optimize an algorithm that masks an array. The initial code looks like this:
void mask(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *m)
{
unsigned int rep=size_x*size_y;
while (rep--)
{
*(source++) &= *(m++);
}
}
I have tried to do Loop Unrolling + prefetching
void mask_LU4(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{ // in place
unsigned int rep;
rep= size_x* size_y;
rep/= 4 ;
while (rep--)
{
_mm_prefetch(&source[16], _MM_HINT_T0);
_mm_prefetch(&mask[16], _MM_HINT_T0);
source[0] &= mask[0];
source[1] &= mask[1];
source[2] &= mask[2];
source[3] &= mask[3];
source += 4;
mask += 4;
}
}
and use intrinsics
void fmask_SIMD(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{ // in place
unsigned int rep;
__m128i *s,*m ;
s = (__m128i *) source;
m = (__m128i *) mask;
rep= size_x* size_y;
rep/= 4 ;
while (rep--)
{
*s = _mm_and_si128(*s,*m);
source+=4;mask+=4;
s = (__m128i *) source;
m = (__m128i *) mask;
}
}
However the performance is the same. I have tried to perform sw prefetch both to the SIMD and Loop Unrolling version and it I couldn't see any improvement. Any ideas on how I could optimize this algorithm?
P.S.1: I am using gcc 4.8.1 and I compile with -march=native and -Ofast.
P.S.2: I am using an Intel Core i5 3470 @3.2Ghz, Ivy bridge architecture. L1 DCache 4X32KB (8-way), L2 4x256, L3 6MB, RAM-DDR3 4Gb (Dual Channel, DRAM @798,1Mhz)
Your operation is memory bandwidth bound. However, that does not necessarily mean your operation is achieving the maximum memory bandwidth. To get closer to the maximum memory bandwidth you need to use multiple threads. Using OpenMP (add
-fopenmpto GCC's options) you can do this:If you wanted to not modify the source but use a different destination then you can use stream instructions like this:
This would not be any faster than using the same destination and source. However, if you already planned to use a destination not equal to the source this would likely be faster (for some value of
rep) than using_mm_store_si128.