Cilk_for returns wrong data in array

I am new to multithreaded programming. I recently started using cilk_for in a project. Here is the code:

void myfunction(short *myarray)
{
    __m128i *array = (__m128i*) myarray;
    cilk_for(int i=0; i<N_LOOP1; i++)
    {
        for(int z = 0; z<N_LOOP2; z+=8)
        {
            array[z]        =  _mm_and_si128(array[z],mym128i);
            array[z+1]        =  _mm_and_si128(array[z+1],mym128i);
            array[z+2]        =  _mm_and_si128(array[z+2],mym128i);
            array[z+3]        =  _mm_and_si128(array[z+3],mym128i);
            array[z+4]        =  _mm_and_si128(array[z+4],mym128i);
            array[z+5]        =  _mm_and_si128(array[z+5],mym128i);
            array[z+6]        =  _mm_and_si128(array[z+6],mym128i);
            array[z+7]        =  _mm_and_si128(array[z+7],mym128i);
            array+=8;
        }
    }
}

After running the code above, something strange happens: the data in the array isn't updated correctly. For example, with an array of 1000 elements, sometimes the whole array is updated correctly (all 1000 elements are AND-ed), but sometimes parts of it are skipped (elements 1 to 300 are AND-ed, elements 301 to 505 are not, elements 506 to 707 are, and so on). The skipped parts differ from run to run, so I suspect the problem is related to cache misses. Am I right? Any help is appreciated. :)

Answer by Alexander Weggerle

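The issue is a data race rather than cache misses, which can slow code down but cannot change its results. All iterations of the cilk_for share the single array pointer, and every iteration increments it (array += 8) without any synchronization. That only works when the iterations run strictly one after another. When Cilk runs them in parallel, several workers read and advance the same pointer at the same time, so some elements of the array are processed more than once while other parts are not processed at all.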

To solve this I would propose to calculate the index within the outer loop so that every thread spawned with Cilk is able to calculate the address independently. Maybe you can do something like:

    void myfunction(short *myarray)
    {
        /* Assumes N_LOOP2 is a multiple of 8 and that each outer iteration
           owns N_LOOP2 consecutive vectors (8 shorts per __m128i). */
        cilk_for (int i = 0; i < N_LOOP1; i++)
        {
            /* Local to this iteration: the start address depends only on i. */
            __m128i *array = (__m128i*) (myarray + i * N_LOOP2 * 8);
            for (int z = 0; z < N_LOOP2; z += 8)
            {
                array[z]   = _mm_and_si128(array[z],   mym128i);
                array[z+1] = _mm_and_si128(array[z+1], mym128i);
                array[z+2] = _mm_and_si128(array[z+2], mym128i);
                array[z+3] = _mm_and_si128(array[z+3], mym128i);
                array[z+4] = _mm_and_si128(array[z+4], mym128i);
                array[z+5] = _mm_and_si128(array[z+5], mym128i);
                array[z+6] = _mm_and_si128(array[z+6], mym128i);
                array[z+7] = _mm_and_si128(array[z+7], mym128i);
                /* No array += 8 here: z already advances by 8 per pass. */
            }
        }
    }
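
The idea of deriving each iteration's start address from the loop index alone can be checked in isolation. The following is only a sketch under some assumptions: the function name and its parameters are hypothetical, the mask is passed in as a plain short, and a serial for loop stands in for cilk_for so that it compiles without a Cilk toolchain. It also uses unaligned loads and stores, since nothing in the question says whether the buffer is 16-byte aligned.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: ANDs n_outer * n_inner vectors (8 shorts each)
 * in data with mask. Each outer iteration owns n_inner vectors. */
void and_blocks_by_index(short *data, size_t n_outer, size_t n_inner,
                         short mask)
{
    assert(data != NULL);
    const __m128i vmask = _mm_set1_epi16(mask);

    /* This outer loop is the one that would become cilk_for. */
    for (size_t i = 0; i < n_outer; i++) {
        /* The base depends only on i; nothing is carried over from
         * earlier iterations, so iterations may run in any order. */
        __m128i *base = (__m128i *)data + i * n_inner;
        for (size_t z = 0; z < n_inner; z++) {
            __m128i v = _mm_loadu_si128(base + z);
            _mm_storeu_si128(base + z, _mm_and_si128(v, vmask));
        }
    }
}
```

Because no iteration reads a value written by another one, the result should not depend on the order in which the outer iterations are executed.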

BTW: Why do you unroll the loop manually here? The compiler should be able to do that automatically.
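
For comparison, here is a rolled version without any intrinsics, as a rough sketch. The function name and signature are made up for illustration. Whether a given compiler actually unrolls and vectorizes this loop depends on the compiler and the optimization flags, so it is worth checking the generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: ANDs every element of data with mask.
 * The loop is left rolled so the optimizer can decide how to
 * unroll and vectorize it. */
void and_all(short *data, size_t n, short mask)
{
    assert(data != NULL || n == 0);
    for (size_t k = 0; k < n; k++) {
        data[k] = (short)(data[k] & mask);
    }
}
```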