Cilk_for returns wrong data in array


I am new to multithreaded programming. I recently started using cilk_for in a project. Here is the code:

void myfunction(short *myarray)
{
    __m128i *array = (__m128i*) myarray;
    cilk_for (int i = 0; i < N_LOOP1; i++)
    {
        for (int z = 0; z < N_LOOP2; z += 8)
        {
            array[z]   = _mm_and_si128(array[z],   mym128i);
            array[z+1] = _mm_and_si128(array[z+1], mym128i);
            array[z+2] = _mm_and_si128(array[z+2], mym128i);
            array[z+3] = _mm_and_si128(array[z+3], mym128i);
            array[z+4] = _mm_and_si128(array[z+4], mym128i);
            array[z+5] = _mm_and_si128(array[z+5], mym128i);
            array[z+6] = _mm_and_si128(array[z+6], mym128i);
            array[z+7] = _mm_and_si128(array[z+7], mym128i);
            array += 8;
        }
    }
}

When I run this code, something strange happens: the data in the array isn't updated correctly. For example, with an array of 1000 elements, sometimes the whole array is updated correctly (all 1000 elements are AND-ed). Other times, parts of the array are omitted (say, elements 1 to 300 are AND-ed, elements 301 to 505 are not, elements 506 to 707 are AND-ed, and so on). The omitted parts differ randomly from run to run, so I think the problem is related to cache misses. Am I right? Any help is appreciated. :)


There is 1 best solution below


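The issue is a data race on the array pointer, not cache misses (a cache miss can make code slower, but it cannot change the result). The array variable is declared outside the cilk_for, so it is shared by all the workers Cilk uses, and it is incremented in every loop iteration without any synchronization. This works only in serial execution. In your code snippet, several workers can read the same pointer value and process the same elements of the array, while other parts of the array are not processed at all.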

To solve this, I would compute the base address inside the outer loop from the loop index i, so that every iteration can work out its own part of the array independently of the others. Maybe you can do something like:

void myfunction(short *myarray)
{
    cilk_for (int i = 0; i < N_LOOP1; i++)
    {
        // The base pointer depends only on i, so iterations share no state.
        // Assumes the buffer holds N_LOOP1 chunks of N_LOOP2 __m128i values each.
        __m128i *array = (__m128i*) myarray + i * N_LOOP2;
        for (int z = 0; z < N_LOOP2; z += 8)
        {
            array[z]   = _mm_and_si128(array[z],   mym128i);
            array[z+1] = _mm_and_si128(array[z+1], mym128i);
            array[z+2] = _mm_and_si128(array[z+2], mym128i);
            array[z+3] = _mm_and_si128(array[z+3], mym128i);
            array[z+4] = _mm_and_si128(array[z+4], mym128i);
            array[z+5] = _mm_and_si128(array[z+5], mym128i);
            array[z+6] = _mm_and_si128(array[z+6], mym128i);
            array[z+7] = _mm_and_si128(array[z+7], mym128i);
            // No "array += 8" here: z already advances by 8 per pass.
        }
    }
}
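
One way to convince yourself that deriving the base offset from i is safe is to check that the chunks never overlap and together cover the whole buffer. Below is a minimal sketch in plain C, run serially. The sizes and the function name covers_exactly_once are made up for this illustration, and it assumes each outer iteration owns one contiguous chunk of N_LOOP2 elements:

```c
#include <string.h>

/* Hypothetical sizes, chosen only for this illustration. */
enum { N_LOOP1 = 4, N_LOOP2 = 16, TOTAL = N_LOOP1 * N_LOOP2 };

/* Counts how often each element is touched when every outer iteration
   derives its own base offset from i.
   Returns 1 if every element is touched exactly once, 0 otherwise. */
int covers_exactly_once(void)
{
    int visits[TOTAL];
    memset(visits, 0, sizeof visits);

    /* This is the loop that would become a cilk_for. */
    for (int i = 0; i < N_LOOP1; i++) {
        int base = i * N_LOOP2;          /* depends only on i, no shared state */
        for (int z = 0; z < N_LOOP2; z++)
            visits[base + z]++;
    }

    for (int k = 0; k < TOTAL; k++)
        if (visits[k] != 1)
            return 0;                    /* a gap or an overlap */
    return 1;
}
```

Because the range of each iteration is a function of i alone, the order in which the iterations run (or which worker runs them) does not change which elements get processed.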

BTW: Why do you unroll the loop manually here? The compiler should be able to do that automatically.
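
For comparison, here is a rough sketch of the same idea without manual unrolling. The function name and_all and its parameters are my own, not from the question. The __cilk check is an assumption about how a Cilk-enabled compiler identifies itself; without it the sketch falls back to a plain serial for loop. It also assumes the buffer is 16-byte aligned and holds n_chunks * chunk_len vectors:

```c
#include <stddef.h>
#include <emmintrin.h>   /* SSE2: __m128i, _mm_and_si128 */

#ifdef __cilk
#include <cilk/cilk.h>
#else
#define cilk_for for     /* serial fallback so the sketch builds without Cilk */
#endif

/* ANDs every vector in a buffer of n_chunks * chunk_len __m128i values
   with mask. Each outer iteration derives its own chunk pointer from i,
   so no pointer is shared or modified across iterations. */
void and_all(__m128i *data, int n_chunks, int chunk_len, __m128i mask)
{
    cilk_for (int i = 0; i < n_chunks; i++) {
        __m128i *chunk = data + (size_t)i * (size_t)chunk_len;
        /* Rolled loop: any unrolling is left to the compiler. */
        for (int z = 0; z < chunk_len; z++)
            chunk[z] = _mm_and_si128(chunk[z], mask);
    }
}
```

A call could look like and_all((__m128i*) myarray, N_LOOP1, N_LOOP2, mym128i), provided myarray really is 16-byte aligned. Whether the rolled version is as fast as the hand-unrolled one depends on the compiler and optimization level, so it is worth measuring both.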