Is vectorization profitable in this case?


I broke a kernel down into several loops in order to vectorize each of them afterwards. One of these loops looks like:

int *array1; // its size is size+1
int *array2; // its size is size+1

// all positions of array1 and array2 are set to 0 here

int *sArray1 = array1 + 1; // shift one position so I start writing at pos 1
int *sArray2 = array2 + 1; // shift one position so I start writing at pos 1

int bb = 0;

for(int i=0; i<size; i++){

   if(A[i] + bb > B[i]){
      bb = 1;
      sArray1[i] = S;
      sArray2[i] = 1;
   }
   else
      bb = 0;

}

Please note the loop-carried dependency on bb: each comparison depends on bb's value, which is set in the previous iteration.

What I thought about:

  • I can be absolutely certain of some cases. For example, when A[i] is already greater than B[i], I do not need to know what value bb carries over from the previous iteration;
  • When A[i] equals B[i], I do need to know what value bb carries over from the previous iteration. However, I also need to account for the case where this happens in two consecutive positions. When I started to work these cases out, they became overly complicated and it seemed that vectorization doesn't pay off (see the scalar sketch after this list).
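
For clarity, the same loop can be written with the two comparisons pulled out of the branch. This is only a scalar rewrite of the code above (no vectorization yet), but it makes the "set or propagate" structure of bb explicit:

// Equivalent scalar formulation of the loop above.
// "set" forces bb to 1 regardless of the previous value;
// "prop" only keeps an already-set bb alive.
int bb = 0;
for (int i = 0; i < size; i++) {
    int set  = A[i] > B[i];    // condition holds even when bb == 0
    int prop = A[i] >= B[i];   // i.e. A[i] + 1 > B[i], the condition when bb == 1
    bb = set | (prop & bb);
    if (bb) {
        sArray1[i] = S;
        sArray2[i] = 1;
    }
}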

Essentially, I'd like to know if this can be vectorized in an effective manner or if it is simply better to run this without any vectorization whatsoever.

There are 2 solutions below.

Solution 1:

You might not want to iterate over single elements, but instead loop over chunks, where a chunk is a run of consecutive elements that all yield the same bb.

The search for the chunk boundaries could be vectorized (by hand, probably using compiler-specific SIMD intrinsics), and the action to be taken for a single chunk with bb = 1 could be vectorized too. The loop transformation is as follows:

size_t i_chunk_start = 0, i_chunk_end;
int bb_chunk = A[0] > B[0] ? 1 : 0;

while (i_chunk_start < size) {
  if(bb_chunk) {
    /* find end of current chunk */
    for (i_chunk_end = i_chunk_start + 1; i_chunk_end < size; ++i_chunk_end) {
      if(A[i_chunk_end] < B[i_chunk_end]) {
        break;
      }
    }
    /* process current chunk */
    for(size_t i = i_chunk_start; i < i_chunk_end; ++i) {
      sArray1[i] = S;
      sArray2[i] = 1;
    }
    bb_chunk = 0;
  } else {
    /* find end of current chunk */
    for (i_chunk_end = i_chunk_start + 1; i_chunk_end < size; ++i_chunk_end) {
      if(A[i_chunk_end] > B[i_chunk_end]) {
        break;
      }
    }
    bb_chunk = 1;
  }

  /* prepare for next chunk */
  i_chunk_start = i_chunk_end;
}

Now, each of the inner loops (all for loops) could potentially get vectorized.
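
For instance, the boundary search for a bb = 0 chunk (finding the first i with A[i] > B[i]) might look roughly like this with SSE2 intrinsics. This is only a sketch: find_first_greater is an illustrative helper name, and the GCC/Clang builtin __builtin_ctz is assumed to be available.

#include <emmintrin.h> // SSE2

/* Return the first index i in [start, size) with A[i] > B[i],
   or size if there is none. Scans 4 ints per iteration. */
static size_t find_first_greater(const int *A, const int *B,
                                 size_t start, size_t size)
{
    size_t i = start;
    for (; i + 4 <= size; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(A + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(B + i));
        /* 16-bit mask, 4 bits per 32-bit lane where A > B */
        int mask = _mm_movemask_epi8(_mm_cmpgt_epi32(va, vb));
        if (mask)
            return i + (__builtin_ctz(mask) >> 2); /* first matching lane */
    }
    for (; i < size; ++i)   /* scalar tail */
        if (A[i] > B[i])
            break;
    return i;
}

The boundary search for a bb = 1 chunk is analogous, with the comparison reversed (A[i] < B[i]).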

Whether or not vectorization in this manner beats the non-vectorized loop depends on whether the chunks are long enough on average. You will only find out by benchmarking.

Solution 2:

The effect of your loop body depends on two conditions:

  • A[i] > B[i]
  • A[i] + 1 > B[i]

Their calculation can be vectorized easily. Assuming int has 32 bits, and vectorized instructions work on 4 int values at a time, there are 8 bits per vectorized iteration (4 bits for each condition).

You can harvest those bits from an SSE register with _mm_movemask_epi8. It is a bit inconvenient that it works on bytes rather than ints, but you can take care of that with a suitable shuffle.

Afterwards, use the 8 bits as an index into a 256-entry LUT that stores 4-bit masks. These masks can then be used to store the elements into the destination conditionally, using _mm_maskmoveu_si128.

I am not sure such a complicated program is worthwhile: it involves a lot of bit-fiddling for at most a 4x improvement in speed. Maybe it is better to build the masks by examining the decision bits individually. But vectorizing your comparisons and stores seems worthwhile in any case.
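
A rough sketch along those lines is below. It takes the simpler route from the last paragraph: the carry bit bb is resolved per lane in scalar code instead of through the 256-entry LUT, and _mm_movemask_ps on the cast compare results stands in for the byte shuffle. The function name kernel_sse and the assumption that size is a multiple of 4 (a scalar tail would handle the rest) are illustrative, not part of the answer.

#include <emmintrin.h> // SSE2

void kernel_sse(const int *A, const int *B, int *sArray1, int *sArray2,
                int size, int S)
{
    const __m128i vS   = _mm_set1_epi32(S);
    const __m128i vone = _mm_set1_epi32(1);
    int bb = 0;

    for (int i = 0; i + 4 <= size; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(A + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(B + i));

        /* set: A[i] > B[i]; prop: A[i] + 1 > B[i], i.e. A[i] >= B[i] */
        __m128i set  = _mm_cmpgt_epi32(va, vb);
        __m128i prop = _mm_xor_si128(_mm_cmpgt_epi32(vb, va),
                                     _mm_set1_epi32(-1));

        /* one condition bit per 32-bit lane */
        int set_bits  = _mm_movemask_ps(_mm_castsi128_ps(set));
        int prop_bits = _mm_movemask_ps(_mm_castsi128_ps(prop));

        /* resolve the bb carry across the 4 lanes (this part could be the LUT) */
        int write_bits = 0;
        for (int k = 0; k < 4; ++k) {
            bb = ((set_bits >> k) & 1) | (((prop_bits >> k) & 1) & bb);
            write_bits |= bb << k;
        }

        /* expand the 4-bit write mask to a full byte mask per lane */
        __m128i wm = _mm_set_epi32(write_bits & 8 ? -1 : 0,
                                   write_bits & 4 ? -1 : 0,
                                   write_bits & 2 ? -1 : 0,
                                   write_bits & 1 ? -1 : 0);

        /* conditionally store S and 1 into the shifted output arrays */
        _mm_maskmoveu_si128(vS,   wm, (char *)(sArray1 + i));
        _mm_maskmoveu_si128(vone, wm, (char *)(sArray2 + i));
    }
}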