I am trying to optimize an algorithm that masks an array. The initial code looks like this:
void mask(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *m)
{
    unsigned int rep = size_x * size_y;
    while (rep--)
    {
        *(source++) &= *(m++);
    }
}
I have tried loop unrolling plus software prefetching:
void mask_LU4(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{   // in place
    unsigned int rep = size_x * size_y;
    rep /= 4;
    while (rep--)
    {
        _mm_prefetch(&source[16], _MM_HINT_T0);
        _mm_prefetch(&mask[16], _MM_HINT_T0);
        source[0] &= mask[0];
        source[1] &= mask[1];
        source[2] &= mask[2];
        source[3] &= mask[3];
        source += 4;
        mask += 4;
    }
}
and to use SSE intrinsics:
void fmask_SIMD(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{   // in place
    unsigned int rep;
    __m128i *s, *m;
    s = (__m128i *) source;
    m = (__m128i *) mask;
    rep = size_x * size_y;
    rep /= 4;
    while (rep--)
    {
        *s = _mm_and_si128(*s, *m);
        source += 4;
        mask += 4;
        s = (__m128i *) source;
        m = (__m128i *) mask;
    }
}
However, the performance is the same. I have also tried adding software prefetching to both the SIMD and the loop-unrolled versions, but I couldn't see any improvement. Any ideas on how I could optimize this algorithm?
P.S.1: I am using gcc 4.8.1 and I compile with -march=native and -Ofast.
P.S.2: I am using an Intel Core i5 3470 @ 3.2 GHz, Ivy Bridge architecture. L1 DCache 4x32KB (8-way), L2 4x256KB, L3 6MB, RAM DDR3 4GB (Dual Channel, DRAM @ 798.1 MHz).
Your problem might be memory bound, but that does not mean you cannot process more per cycle. Usually when the per-element work is small (like yours here, it's only an AND after all), it makes sense to combine many loads and stores per iteration. On most CPUs, consecutive loads to the same cache line are served by a single cache-line fill. I'd suggest unrolling the loop to at least 4 SIMD packets per iteration, along with the prefetching. You'd still be memory bound, but you would get fewer cache misses, giving you slightly better performance.
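As a rough illustration, here is a minimal sketch of that combined version, assuming the buffers are 16-byte aligned and the element count is a multiple of 16; the function name mask_SIMD4 and the prefetch distance of one iteration (64 bytes) ahead are my own choices, not taken from your code:

#include <stdint.h>
#include <immintrin.h>

// Sketch: 4 SSE packets (16 uint32_t) per iteration, prefetching one
// iteration ahead. Assumes 16-byte aligned pointers and a count that
// is a multiple of 16.
void mask_SIMD4(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{   // in place
    unsigned int rep = (size_x * size_y) / 16;
    __m128i *s = (__m128i *) source;
    __m128i *m = (__m128i *) mask;
    while (rep--)
    {
        _mm_prefetch((const char *)(s + 4), _MM_HINT_T0);
        _mm_prefetch((const char *)(m + 4), _MM_HINT_T0);
        s[0] = _mm_and_si128(s[0], m[0]);
        s[1] = _mm_and_si128(s[1], m[1]);
        s[2] = _mm_and_si128(s[2], m[2]);
        s[3] = _mm_and_si128(s[3], m[3]);
        s += 4;
        m += 4;
    }
}

With -Ofast and -march=native the compiler may already vectorize and unroll the plain loop similarly, so it is worth checking the generated assembly before and after this change.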