Optimizing Reversal of Items


I have a loop that reverses elements in an array. I have simplified and reduced the problem to the following:

for (int x=0;x<w/2;++x) {
    int il =     x;
    int ir = w-1-x;
    type_copy l = data[il];
    type_copy r = data[ir];
    data[il] = r;
    data[ir] = l;
}

This code reverses the elements, but is rather slow. For one thing, it can't be auto-vectorized, since the array accesses are discontiguous. For another, the accesses on the right-hand side run backwards relative to an ideal cache traversal. Lastly, there is probably some stalling because the loads for the next iteration can't start before the stores from the previous one have committed: the compiler probably can't tell that the two accesses into the same array never overlap.

In my case, sizeof(type_copy) is either 4*sizeof(uint8_t) = 4 or else 4*sizeof(float) = 4*4 = 16, so byte-level reversal is not an option.

My question is: how can this code be optimized, if it can be at all?


There are 4 answers below.

Best answer (score 2)

Assuming your data types are like:

struct float_data
{
    float f1;
    float f2;
    float f3;
    float f4;
};

struct uint8_t_data
{
    uint8_t f1;
    uint8_t f2;
    uint8_t f3;
    uint8_t f4;
};

you can try SSE intrinsics. For uint8_t_data there is quite a good speed improvement:

typedef uint8_t_data type_copy;

// Assumes w is a multiple of 8: each iteration swaps 4 elements
// from the front with 4 elements from the back.
for (int x = 0; x < w / 2; x += 4)
{
    int il = x;
    int ir = w - 1 - x - 3;  // first of the 4 rightmost elements

    __m128i dl = _mm_loadu_si128((const __m128i*)&data[il]);
    __m128i dr = _mm_loadu_si128((const __m128i*)&data[ir]);
    // _MM_SHUFFLE(0, 1, 2, 3) reverses the four 32-bit lanes (PSHUFD).
    _mm_storeu_si128((__m128i*)&data[ir], _mm_shuffle_epi32(dl, _MM_SHUFFLE(0, 1, 2, 3)));
    _mm_storeu_si128((__m128i*)&data[il], _mm_shuffle_epi32(dr, _MM_SHUFFLE(0, 1, 2, 3)));
}

Output:

g++ -O3 non vectorized: 16ms
g++ -O3 vectorized: 5ms

However, for float_data there is not much speed improvement:

typedef float_data type_copy;

// Assumes w is a multiple of 4: each iteration swaps 2 elements
// from the front with 2 elements from the back.
for (int x = 0; x < w / 2; x += 2) {
    int il = x;
    int ir = w - 1 - x - 1;  // first of the 2 rightmost elements

    __m256 dl = _mm256_loadu_ps((const float*)&data[il]);
    __m256 dr = _mm256_loadu_ps((const float*)&data[ir]);

    // Swapping the two 128-bit lanes swaps the two 16-byte elements.
    _mm256_storeu_ps((float*)&data[ir], _mm256_permute2f128_ps(dl, dl, 1));
    _mm256_storeu_ps((float*)&data[il], _mm256_permute2f128_ps(dr, dr, 1));
}

Output:

g++ -O3 -mavx non vectorized: 27ms
g++ -O3 -msse4.2 non vectorized: 25ms
g++ -O3 -mavx vectorized: 24ms
Answer (score 0)

Hopefully this is better:

for (int x=0;x<w/2;++x) {
    std::swap(data[x], data[w-1-x]);
}

If you prefer not to use the standard library, you can at least reduce the number of assignments and local variables compared to your implementation:

  • 3 fewer local variables
  • 3 fewer assignment operations

for (int x=0;x<w/2;++x) {
    type_copy l = data[x];
    data[x] = data[w-1-x];
    data[w-1-x] = l;
}

Answer (score 2)

The reason your code cannot be parallelized very well is because there is a dependency between the data:

for (int x=0;x<w/2;++x) {
   int il =     x;
   int ir = w-1-x;
   type_copy l = data[il];
   type_copy r = data[ir];
   data[il] = r;
   data[ir] = l;
}

There are 3 operations here: compute l/r indexes, read from array, write to array. Each one of these is dependent on the result of the previous operation so they cannot be done in parallel. Notice I group operations for left or right under the same category.

The first thing to do is to try and break that dependency.

Instead of reading and writing in the same cycle, try reading the data for iteration N while writing the data for iteration N-1; the two can then overlap.

int il =   0;
int ir = w-1;
type_copy l = data[il];
type_copy r = data[ir];

for (int x=0;x<w/2;++x) {
   data[il] = r;
   data[ir] = l;
   il =     x+1;   // read one pair ahead of the pair just written
   ir = w-2-x;
   l = data[il];
   r = data[ir];
}

Or even better, precompute the indexes used for reading as well:

int il_0 =   0;
int ir_0 = w-1;
int il_1 =   1;
int ir_1 = w-2;
type_copy l = data[il_0];
type_copy r = data[ir_0];

for (int x=0;x<w/2;++x) {
   data[il_0] = r;
   data[ir_0] = l;       
   l = data[il_1];
   r = data[ir_1];
   il_0 = il_1;
   ir_0 = ir_1;       
   ++il_1;
   --ir_1;
}

One other thing worth trying is copying more than one data sample per iteration, e.g. reading/writing 2, 4, ... samples in the same cycle. This should improve the parallelism of the code.

Answer (score 1)

The code certainly can be optimized, but doing so may be platform dependent. For instance, AMD64 inherits a number of useful vector instructions from x86, including PSHUFD (SSE2) and VPPERM (AMD XOP), which can reorder words arbitrarily within a vector, and MASKMOVDQU (SSE2), which can perform masked partial writes.