I broke a kernel down into several loops in order to vectorize each of them afterwards. One of these loops looks like:
```cpp
int *array1; // its size is "size+1"
int *array2; // its size is "size+1"
// all positions of array1 and array2 are set to 0 here
int *sArray1 = array1+1; // shift one position so I start writing at pos 1
int *sArray2 = array2+1; // shift one position so I start writing at pos 1
int bb = 0;
for(int i=0; i<size; i++){
    if(A[i] + bb > B[i]){
        bb = 1;
        sArray1[i] = S;
        sArray2[i] = 1;
    }
    else
        bb = 0;
}
```
Please note the loop-carried dependency in `bb`: each comparison depends upon `bb`'s value, which is modified in the previous iteration.
What I thought about:

- I can be absolutely certain of some cases. For example, when `A[i]` is already greater than `B[i]`, I do not need to know what value `bb` carries from the previous iteration.
- When `A[i]` equals `B[i]`, I need to know what value `bb` carries from the previous iteration. However, I also need to account for the case when this happens in two consecutive positions.

When I started to shape up these cases, it seemed that they become overly complicated and vectorization doesn't pay off.
Essentially, I'd like to know if this can be vectorized in an effective manner or if it is simply better to run this without any vectorization whatsoever.
You might not want to iterate over single elements, but instead loop over chunks (where a chunk is defined as a run of consecutive elements yielding the same `bb`). The search for chunk boundaries could be vectorized (by hand, probably using compiler-specific SIMD intrinsics). And the action taken for a single chunk with bb=1 could be vectorized, too. The loop transformation is as follows:
Now, each of the inner loops (all for loops) could potentially get vectorized.
Whether or not vectorization in this manner is superior to non-vectorization depends on whether the chunks are sufficiently long on average. You will only find out by benchmarking.
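As an illustration of the hand-vectorized boundary search mentioned above, here is a sketch using SSE2 intrinsics (assumptions: x86 with SSE2, a GCC/Clang compiler for `__builtin_ctz`; the helper name `find_chunk_end_bb0` is mine):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Sketch: find the first index i in [0, n) with A[i] > B[i], i.e. the end
// of a bb == 0 chunk. Processes 4 ints per step; illustrative only.
static int find_chunk_end_bb0(const int *A, const int *B, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(A + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(B + i));
        // per-lane signed 32-bit compare: 0xFFFFFFFF where A > B
        __m128i gt = _mm_cmpgt_epi32(va, vb);
        int mask = _mm_movemask_epi8(gt);   // 4 mask bits per matching lane
        if (mask != 0)
            // each lane covers 4 consecutive mask bits
            return i + (__builtin_ctz(mask) >> 2);
    }
    for (; i < n; ++i)                      // scalar tail
        if (A[i] > B[i])
            return i;
    return n;                               // no boundary found
}
```

The bb == 1 boundary search is analogous with the comparison `A[i] >= B[i]` (e.g. by negating an `A < B` compare).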