I have a loop that reverses elements in an array. I have simplified and reduced the problem to the following:
    for (int x = 0; x < w / 2; ++x) {
        int il = x;
        int ir = w - 1 - x;
        type_copy l = data[il];
        type_copy r = data[ir];
        data[il] = r;
        data[ir] = l;
    }
This code reverses the elements, but is rather slow. For one thing, it can't be auto-vectorized, since the array accesses are discontiguous. For another, the accesses on the right-hand side run backwards relative to an ideal cache traversal. Lastly, there is probably some stalling, because the load for the next iteration can't be issued before the stores from the previous one have committed: the compiler probably can't prove that the two accesses into the same array never overlap.
In my case, sizeof(type_copy) is either 4*sizeof(uint8_t) = 4 or else 4*sizeof(float) = 4*4 = 16. Therefore, note that byte-level reversal is unacceptable.
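To make the element-level versus byte-level distinction concrete, here is a minimal sketch. The `pixel4` struct and both function names are hypothetical stand-ins for the 4-byte case of `type_copy`; they are not taken from the actual code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical 4-byte element (assumption: stands in for type_copy).
struct pixel4 {
    std::uint8_t c[4];
};
static_assert(sizeof(pixel4) == 4, "element must be exactly 4 bytes");

// Reverses the order of whole elements; the bytes inside each element
// keep their original order. This is the behavior the loop above has.
inline void reverse_elements(pixel4* data, std::size_t w) {
    std::reverse(data, data + w);
}

// Reverses every byte of the buffer. This also flips the bytes inside
// each element, which is why it is not an acceptable substitute.
inline void reverse_bytes(pixel4* data, std::size_t w) {
    auto* bytes = reinterpret_cast<std::uint8_t*>(data);
    std::reverse(bytes, bytes + w * sizeof(pixel4));
}
```

For two elements {1,2,3,4} and {5,6,7,8}, the element-level version yields {5,6,7,8},{1,2,3,4}, while the byte-level version yields {8,7,6,5},{4,3,2,1}.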
My question is: how can this code be optimized, if it can be at all?
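Assuming your data types are like the ones described in the question (four uint8_t values per element for uint8_t_data, four floats per element for float_data), you can try SSE intrinsics. For uint8_t_data there is quite a good speed improvement: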
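A minimal sketch of what an SSE2 version for the 4-byte case might look like. The struct layout and the function name `reverse_sse2` are assumptions made for illustration, and the code presumes an x86 target with SSE2 available. Four elements fit in one 128-bit register, so a block from the left and a block from the right are loaded, their 32-bit lanes are reversed, and the two blocks are stored swapped.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical layout (assumption): four bytes per element.
struct uint8_t_data {
    std::uint8_t v[4];
};
static_assert(sizeof(uint8_t_data) == 4, "expected a packed 4-byte element");

// Reverses w elements in place, four elements (16 bytes) per register.
inline void reverse_sse2(uint8_t_data* data, std::size_t w) {
    std::size_t lo = 0;  // first unprocessed element on the left
    std::size_t hi = w;  // one past the last unprocessed element on the right

    // While at least 8 elements remain, swap a left and a right block of 4.
    while (hi - lo >= 8) {
        __m128i l = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + lo));
        __m128i r = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + hi - 4));
        // Reverse the order of the four 32-bit lanes; bytes within a lane
        // (that is, within one element) are left untouched.
        l = _mm_shuffle_epi32(l, _MM_SHUFFLE(0, 1, 2, 3));
        r = _mm_shuffle_epi32(r, _MM_SHUFFLE(0, 1, 2, 3));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(data + lo), r);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(data + hi - 4), l);
        lo += 4;
        hi -= 4;
    }

    // Scalar tail for the remaining 0..7 elements in the middle.
    while (hi - lo >= 2) {
        uint8_t_data tmp = data[lo];
        data[lo] = data[hi - 1];
        data[hi - 1] = tmp;
        ++lo;
        --hi;
    }
}
```

Unaligned loads and stores are used so the sketch does not depend on how the array was allocated; both loads happen before either store, so the two blocks never interfere with each other.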
Output:
However, for float_data there is not much speed improvement:
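A minimal sketch of the 16-byte case, again with an assumed struct layout and a hypothetical function name `reverse_sse`. Here each element already fills a whole 128-bit register, so no lane shuffle is needed and the loop is just a swap of two registers per iteration. One plausible reason for the small gain is that a compiler may already copy a 16-byte struct with vector moves in the scalar version.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical layout (assumption): four floats per element.
struct float_data {
    float v[4];
};
static_assert(sizeof(float_data) == 16, "expected a packed 16-byte element");

// Reverses w elements in place, one element per 128-bit register.
inline void reverse_sse(float_data* data, std::size_t w) {
    for (std::size_t x = 0; x < w / 2; ++x) {
        float* left = data[x].v;
        float* right = data[w - 1 - x].v;
        // Load both elements before storing, then write them back swapped.
        __m128 l = _mm_loadu_ps(left);
        __m128 r = _mm_loadu_ps(right);
        _mm_storeu_ps(left, r);
        _mm_storeu_ps(right, l);
    }
}
```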
Output: