Is vectorization profitable in this case?


I broke a kernel down into several loops in order to vectorize each of them afterwards. One of these loops looks like:

int *array1; // its size is size+1
int *array2; // its size is size+1

// all positions of array1 and array2 are set to 0 here

int *sArray1 = array1 + 1; // shift one position so I start writing at pos 1
int *sArray2 = array2 + 1; // shift one position so I start writing at pos 1

int bb = 0;

for(int i=0; i<size; i++){

   if(A[i] + bb > B[i]){
      bb = 1;
      sArray1[i] = S;
      sArray2[i] = 1;
   }
   else
      bb = 0;

}

Please note the loop-carried dependency on bb: each comparison depends on bb's value, which is set in the previous iteration.

What I thought about:

  • I can be absolutely certain of some cases. For example, when A[i] is already greater than B[i], I do not need to know what value bb carries over from the previous iteration;
  • When A[i] equals B[i], I do need to know what value bb carries over from the previous iteration. However, I also need to account for the case where this happens in two consecutive positions. When I started to work these cases out, they became overly complicated and it seemed that vectorization doesn't pay off (see the scalar sketch after this list).
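
For clarity, the same loop can be written with the two comparisons pulled out of the branch. This is only a scalar rewrite of the code above (no vectorization yet), but it makes the "set or propagate" structure of bb explicit:

// Equivalent scalar formulation of the loop above.
// "set" forces bb to 1 regardless of the previous value;
// "prop" only keeps an already-set bb alive.
int bb = 0;
for (int i = 0; i < size; i++) {
    int set  = A[i] > B[i];    // condition holds even when bb == 0
    int prop = A[i] >= B[i];   // i.e. A[i] + 1 > B[i], the condition when bb == 1
    bb = set | (prop & bb);
    if (bb) {
        sArray1[i] = S;
        sArray2[i] = 1;
    }
}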

Essentially, I'd like to know if this can be vectorized in an effective manner or if it is simply better to run this without any vectorization whatsoever.

There are 2 solutions below.

Solution 1:

You might not want to iterate over single elements, but instead loop over chunks, where a chunk is a run of consecutive elements that all yield the same bb.

The search for the chunk boundaries could be vectorized (by hand, probably using compiler-specific SIMD intrinsics), and the action to be taken for a single chunk with bb = 1 could be vectorized too. The loop transformation is as follows:

size_t i_chunk_start = 0, i_chunk_end;
int bb_chunk = A[0] > B[0] ? 1 : 0;

while (i_chunk_start < size) {
  if(bb_chunk) {
    /* find end of current chunk */
    for (i_chunk_end = i_chunk_start + 1; i_chunk_end < size; ++i_chunk_end) {
      if(A[i_chunk_end] < B[i_chunk_end]) {
        break;
      }
    }
    /* process current chunk */
    for(size_t i = i_chunk_start; i < i_chunk_end; ++i) {
      sArray1[i] = S;
      sArray2[i] = 1;
    }
    bb_chunk = 0;
  } else {
    /* find end of current chunk */
    for (i_chunk_end = i_chunk_start + 1; i_chunk_end < size; ++i_chunk_end) {
      if(A[i_chunk_end] > B[i_chunk_end]) {
        break;
      }
    }
    bb_chunk = 1;
  }

  /* prepare for next chunk */
  i_chunk_start = i_chunk_end;
}

Now, each of the inner loops (all for loops) could potentially get vectorized.
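
For instance, the boundary search for a bb = 0 chunk (finding the first i with A[i] > B[i]) might look roughly like this with SSE2 intrinsics. This is only a sketch: find_first_greater is an illustrative helper name, and the GCC/Clang builtin __builtin_ctz is assumed to be available.

#include <emmintrin.h> // SSE2

/* Return the first index i in [start, size) with A[i] > B[i],
   or size if there is none. Scans 4 ints per iteration. */
static size_t find_first_greater(const int *A, const int *B,
                                 size_t start, size_t size)
{
    size_t i = start;
    for (; i + 4 <= size; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(A + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(B + i));
        /* 16-bit mask, 4 bits per 32-bit lane where A > B */
        int mask = _mm_movemask_epi8(_mm_cmpgt_epi32(va, vb));
        if (mask)
            return i + (__builtin_ctz(mask) >> 2); /* first matching lane */
    }
    for (; i < size; ++i)   /* scalar tail */
        if (A[i] > B[i])
            break;
    return i;
}

The boundary search for a bb = 1 chunk is analogous, with the comparison reversed (A[i] < B[i]).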

Whether or not vectorization in this manner beats the non-vectorized loop depends on whether the chunks are long enough on average. You will only find out by benchmarking.

Solution 2:

The effect of your loop body depends on two conditions:

  • A[i] > B[i]
  • A[i] + 1 > B[i]

Their calculation can be vectorized easily. Assuming int has 32 bits, and vectorized instructions work on 4 int values at a time, there are 8 bits per vectorized iteration (4 bits for each condition).

You can harvest those bits from an SSE register with _mm_movemask_epi8. It is a bit inconvenient that it works on bytes rather than ints, but you can take care of that with a suitable shuffle.

Afterwards, use the 8 bits as an index into a 256-entry LUT that stores 4-bit masks. These masks can then be used to store the elements into the destination conditionally, using _mm_maskmoveu_si128.

I am not sure such a complicated program is worthwhile: it involves a lot of bit-fiddling for at most a 4x improvement in speed. Maybe it is better to build the masks by examining the decision bits individually. But vectorizing your comparisons and stores seems worthwhile in any case.
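
A rough sketch along those lines is below. It takes the simpler route from the last paragraph: the carry bit bb is resolved per lane in scalar code instead of through the 256-entry LUT, and _mm_movemask_ps on the cast compare results stands in for the byte shuffle. The function name kernel_sse and the assumption that size is a multiple of 4 (a scalar tail would handle the rest) are illustrative, not part of the answer.

#include <emmintrin.h> // SSE2

void kernel_sse(const int *A, const int *B, int *sArray1, int *sArray2,
                int size, int S)
{
    const __m128i vS   = _mm_set1_epi32(S);
    const __m128i vone = _mm_set1_epi32(1);
    int bb = 0;

    for (int i = 0; i + 4 <= size; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(A + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(B + i));

        /* set: A[i] > B[i]; prop: A[i] + 1 > B[i], i.e. A[i] >= B[i] */
        __m128i set  = _mm_cmpgt_epi32(va, vb);
        __m128i prop = _mm_xor_si128(_mm_cmpgt_epi32(vb, va),
                                     _mm_set1_epi32(-1));

        /* one condition bit per 32-bit lane */
        int set_bits  = _mm_movemask_ps(_mm_castsi128_ps(set));
        int prop_bits = _mm_movemask_ps(_mm_castsi128_ps(prop));

        /* resolve the bb carry across the 4 lanes (this part could be the LUT) */
        int write_bits = 0;
        for (int k = 0; k < 4; ++k) {
            bb = ((set_bits >> k) & 1) | (((prop_bits >> k) & 1) & bb);
            write_bits |= bb << k;
        }

        /* expand the 4-bit write mask to a full byte mask per lane */
        __m128i wm = _mm_set_epi32(write_bits & 8 ? -1 : 0,
                                   write_bits & 4 ? -1 : 0,
                                   write_bits & 2 ? -1 : 0,
                                   write_bits & 1 ? -1 : 0);

        /* conditionally store S and 1 into the shifted output arrays */
        _mm_maskmoveu_si128(vS,   wm, (char *)(sArray1 + i));
        _mm_maskmoveu_si128(vone, wm, (char *)(sArray2 + i));
    }
}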