Vectorize 2d-array access (GCC)

I understand the basic ideas of vectorization, and I am thinking of transforming one of my programs into a vectorized version. But it seems complicated.

There is a table (2D array) table[M][N] and two vectors X[1..4] and Y[1..4]. Can I do the operation below? Any thoughts?

X[1..4] = table[X[1..4]][Y[1..4]]

(sequential version: X[i] = table[X[i]][Y[i]])

In other words, can the following loop be vectorized?

    for (int k = 0; k < 4; k++) {
        int tmp1 = X[k];
        int tmp2 = Y[k];
        X[k] = table[tmp1][tmp2];
    }

Note: X[] always contains distinct values.

Currently, it is implemented in C.

4 Answers

Answer 1

If you were copying adjacent memory cells, you could use memcpy() to copy a whole chunk of data. But since that's not the case here, there's no shortcut: you have to use a loop.
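To make the distinction concrete, here is a minimal sketch (function names and the row length N are placeholders, not from the question) of the case memcpy does cover versus the scattered access the question needs:

    #include <string.h>

    #define N 4   /* row length; a placeholder */

    /* Adjacent cells (one contiguous row): memcpy applies. */
    void copy_row(int *dst, const int (*table)[N], int i)
    {
        memcpy(dst, table[i], N * sizeof table[i][0]);
    }

    /* Scattered cells, as in the question: a loop (or a hardware gather). */
    void gather4(int X[4], const int Y[4], const int (*table)[N])
    {
        for (int k = 0; k < 4; k++)
            X[k] = table[X[k]][Y[k]];
    }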

Answer 2

1) Technically, hayesti has answered your question (insert/horizontal instructions on SSE/AVX, or vgather on AVX2, make it possible). But note also that both GCC 4.9 and ICC will vectorize the given code snippet (so there is no need for intrinsics or hand-coding for truly random access), although for GCC you'll probably need #pragma omp simd, and you may also need -vec-threshold0 for ICC on SSE machines. A sketch of the GCC route follows.
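A minimal sketch of that GCC route (the function name and table dimensions are my own placeholders; compile with -fopenmp-simd, or -fopenmp, plus e.g. -O3 -mavx2):

    #define M 64   /* placeholder table dimensions, not given in the question */
    #define N 64

    void lookup4(int X[4], const int Y[4], int table[M][N])
    {
        /* GCC 4.9+ honors the pragma with -fopenmp-simd; on an AVX2
           target the indirect loads can become vpgatherdd. */
        #pragma omp simd
        for (int k = 0; k < 4; k++)
            X[k] = table[X[k]][Y[k]];
    }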

2) Practically, if you have to vectorize the given code "as is", it will never give a very good speed-up, because you need to amortize the big overhead (latency) of vgather or the vinsert instructions with sufficient vector computation (which you don't have in your example) to make vectorization profitable. Needless to say, you will also need appropriate loop trip counts, etc.

I just checked the static cost-model estimate for the vectorized version of your code, using the report output of one of the fresh ICC compilers (the "Intel Vectorization Advisor").

  • For SSE code generation it was 0.5x (i.e., a slowdown)
  • For AVX code generation it was a 1.1x speed-up (as an upper bound)
  • For AVX2 code generation it was a 1.3x-1.4x speed-up (as an upper bound)

Now, remember that all these speed-ups are optimistic upper bounds produced by a very good optimizing compiler (I don't know whether GCC would do better or worse). Depending on your index layout, matrix size, overall bandwidth/latency balance, and some other factors, you will often end up well below the quoted 1.4x, even with AVX2, so don't expect significant speed-ups. To make such an access pattern really profitable, you need some extra (vectorized) computation on X[k] in the loop body to amortize the overhead (as opposed to just copying data from one place to another); see the sketch below.
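For instance, a hypothetical variant (the polynomial is arbitrary, standing in for real work) that gives the vectorizer arithmetic to overlap with the gather latency:

    for (int k = 0; k < 4; k++) {
        int t = table[X[k]][Y[k]];   /* gathered load */
        X[k] = t * t + 3 * t + 1;    /* extra vector math amortizes its latency */
    }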

At the same time, there is good news as well. On near-future AVX-512 machines (KNL Xeon Phi and some future Xeons), vgather performance will likely improve, so even a simple data copy may gain some extra speed-up; but in any case this is not something you will observe on today's AVX/AVX2 machines.

One very last minor note: if you are dealing with a sparse matrix (which would explain the indirect references), you might consider a compressed-row storage format and, as a result, different SIMD trade-offs, although that is rather far from the original scope of the question.
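For reference, a minimal sketch of that compressed-row storage layout (field names are my own):

    /* Compressed Row Storage (CRS/CSR): only nonzeros are stored. */
    struct csr_matrix {
        int     nrows;
        int    *row_ptr;  /* nrows+1 entries; row i spans [row_ptr[i], row_ptr[i+1]) */
        int    *col_idx;  /* column index of each stored nonzero */
        double *values;   /* the nonzero values themselves */
    };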

Answer 3

In theory this is fine, but it depends on which processor you have. You need vector-gather functionality, which was added to x86 processors with AVX2 and first appeared in the Haswell microarchitecture. The pseudocode would look something like this:

vr1 := simd_load4(x)
vr2 := simd_load4(y)
vr3 := vr1 * N                 // multiply by the row length (N, the number of columns)
vr4 := vr3 + vr2               // flat element index: x*N + y
vr5 := simd_gather(base=&table, offsets=vr4)
simd_store(x, vr5)

With intrinsics (the gather itself requires AVX2, and _mm_mullo_epi32 requires SSE4.1), it might look like this:

#include <immintrin.h>

__m128i vr1 = _mm_load_si128((const __m128i *)x);         // load X[0..3]
__m128i vr2 = _mm_load_si128((const __m128i *)y);         // load Y[0..3]
__m128i vr3 = _mm_mullo_epi32(vr1, _mm_set1_epi32(N));    // x * N (row length)
__m128i vr4 = _mm_add_epi32(vr3, vr2);                    // element index x*N + y
__m128i vr5 = _mm_i32gather_epi32(&table[0][0], vr4, 4);  // scale 4 = sizeof(int)
_mm_store_si128((__m128i *)x, vr5);                       // store results back to X
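Two caveats on the calling context, which the snippet leaves implicit (the declarations below are my assumptions): _mm_load_si128/_mm_store_si128 require 16-byte-aligned pointers (use _mm_loadu_si128/_mm_storeu_si128 otherwise), and the gather indexes table as a flat int array:

    #include <stdint.h>

    #define M 64   /* placeholder dimensions; N must match the multiplier above */
    #define N 4

    _Alignas(16) int32_t x[4], y[4];  /* 16-byte aligned for the aligned load/store */
    int32_t table[M][N];              /* rows contiguous, so flat index = x*N + y */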
Answer 4

It can be done on ARM NEON via the VTBL instruction.

NEON can process lookup tables of up to 32 bytes rather quickly.
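A minimal sketch of such a lookup with NEON intrinsics (32-bit-ARM-style vtbl4_u8; the function name and setup are mine):

    #include <arm_neon.h>

    /* Look up 8 byte-indices in a 32-byte table; indices >= 32 yield 0. */
    uint8x8_t lut32_lookup(const uint8_t table[32], uint8x8_t idx)
    {
        uint8x8x4_t t;
        t.val[0] = vld1_u8(table +  0);
        t.val[1] = vld1_u8(table +  8);
        t.val[2] = vld1_u8(table + 16);
        t.val[3] = vld1_u8(table + 24);
        return vtbl4_u8(t, idx);
    }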