How to gather arbitrary indexes in VCL with AVX2 enabled


I want to vectorize the following code using gather instructions in VCL. Some operations should be performed on the elements of vSource at the indexes given by another vector, vInd:

vector<int> vSource;
vector<int> vInd;
for (size_t i = 0; i < vInd.size(); i++) {
    vSource[ vInd[i] ]; // some work on each gathered element
}

vInd contains completely random indexes, so I cannot shuffle them or use another cheap workaround. Desired output example:

vector<int> vSource = {1,2,3,4,5,6,7,8,9,10,11,12,13};
vector<int> vInd = {2,1,5,3,10,5,8,2,10,2,5,3};
3   2   6   4   11  6   9   3   

I can vectorize my code using AVX2 intrinsics directly:

void intrinGather(vector<int> & vSource, vector <int> & vInd) {
    __m256i ind = _mm256_loadu_si256((__m256i*) & vInd[0]);
    __m256i vec = _mm256_i32gather_epi32(&vSource[0], ind, 4);
}

However VCL version compiles only if I use compile-time indexes. How to pass arbitrary indexes as a parameter to VCL?

void VCLGather(vector<int> & vSource, vector<int> ind) {
    Vec8i vec;
    vec=gather8i<2,1,5,3,10,5,8,2>(&vSource[0]); //compiles
    //vec=gather8i<ind[0],ind[3],ind[2],ind[10],ind[6],ind[8],ind[7],ind[1]>(&vSource[0]); //doesn't compile
}

I'm perfectly fine with intrinGather function, but want to keep code in the same VCL-using style and features like multi-architecture code. Is it possible?

3 Answers

A Fog (accepted answer)

The VCL template function lookup<n>(index, table) is indeed intended for this purpose.

VCL will search for the optimal implementation of your function. It will use a permute instruction rather than a gather instruction if n is not too big, because permute instructions are much faster than gather instructions. The n parameter exists to enable this optimization.

The lookup<n> templates limit each index to the interval 0 ≤ i < n for security reasons. If you don't want this security, you may set n = INT_MAX. I will change the code to make sure the interval check is optimized away in this case.

Peter Cordes

VCL types can implicitly convert to/from __m256i (thanks to overloaded cast operators), so you can just use _mm256_i32gather_epi32.

Since you have run-time variable indices, you know they can't be template parameters; that template is, I think, for letting template metaprogramming optimize a fixed gather into some loads + shuffles, e.g. if multiple elements come from near each other.

If you search for gather in https://github.com/vectorclass/version2/blob/master/vectori256.h, you'll find that there's a wrapper function template<int n> Vec8i lookup(Vec8i const index, void const * table), but that tries to emulate shuffles which just use the low few bits of the index: it clamps or modulos the vector of indices before using it with _mm256_i32gather_epi32.

Then there are the template functions you found for fixed indices, like gather8i.


So there don't appear to be any wrappers for plain _mm256_i32gather_epi32. That's not surprising: VCL isn't trying to hide the Intel intrinsics, just to add convenience on top of them, like operator overloads. When a raw intrinsic does exactly what you want, just use it, especially if a quick search of the header file doesn't find another function that uses it without stuff you don't want.


If you want to write code that's adaptable to different vector widths the way you can with VCL wrapper functions and operators, you could write your own overloaded wrappers.

#include <immintrin.h>
#include <vectorclass.h>

// works with GCC with -O2 or higher.
// clang, or gcc -O0, would need hard-coded or template-parameter scale

#ifdef __AVX512F__
// VCL should define Vec16i if AVX-512 is available.
inline __attribute__((always_inline))  // because scale needs to be a compile-time constant
Vec16i vpgatherdd(Vec16i idx, const void *base, int scale){
   // __m512i version, intrinsic takes void* and this arg order
   return _mm512_i32gather_epi32(idx, base, scale);
}
#endif

// AVX2
inline __attribute__((always_inline))
Vec8i vpgatherdd(Vec8i idx, const void *base, int scale){
   // __m256i version introduced with AVX2, intrinsic takes int* and other arg order
   return _mm256_i32gather_epi32((const int*)base, idx, scale);
}

inline __attribute__((always_inline))
Vec4i vpgatherdd(Vec4i idx, const void *base, int scale){
   // __m128i version, same as __m256i version
   return _mm_i32gather_epi32((const int*)base, idx, scale);
}

If you always use it with scale=4 you might omit that function arg and hard-code it into the definition, like I did on Godbolt to check that this would compile. (scale has to be an immediate, so a constant expression for the intrinsic, at least after inlining + constant propagation with optimization enabled. GCC allows this, but clang still complains even with optimization enabled, so you'd have to use a template parameter, perhaps with a default of 4. Or of course just hard-code the 4 into the wrapper functions if you don't need to use it any other way.)
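A sketch of that template-parameter variant (shown with raw __m256i so it stands alone without VCL; since Vec8i converts to/from __m256i, a Vec8i overload would be analogous; the names are made up):

```cpp
#include <immintrin.h>

// scale as a template parameter: a constant expression even at -O0
// and with clang, which rejects a plain function argument here.
template <int scale = 4>
__attribute__((target("avx2")))
__m256i vpgatherdd(const void *base, __m256i idx) {
    return _mm256_i32gather_epi32((const int *)base, idx, scale);
}

// Plain-callable helper: gather 8 ints from src at positions ind into out.
__attribute__((target("avx2")))
void gather8(const int *src, const int *ind, int *out) {
    __m256i idx = _mm256_loadu_si256((const __m256i *)ind);
    _mm256_storeu_si256((__m256i *)out, vpgatherdd<4>(src, idx));
}
```

The target("avx2") attribute keeps the file compiling even without -mavx2; callers should still check CPU support at run time (e.g. with __builtin_cpu_supports("avx2")) before executing the gather.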

Taking void* for the base makes it easy to use with any pointer, although you might want to take int* to prevent accidentally passing it the address of a std::vector control block, like &vec instead of vec.data(), especially if you fold the scale=4 into the function.

As is, this is a pure wrapper for exactly what the asm instruction can do, nothing more, nothing less, just like the intrinsic. You can use it with base=0 and scale=1 to dereference 32-bit pointers, instead of indexing an array. Or with scale=8 to grab an int from 2-element structs, or with scale=1 or 2 to do potentially unaligned loads, or use byte offsets.
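For instance, a sketch of the scale=8 case, pulling the first int out of an array of 2-int structs (hypothetical names; needs AVX2 at run time):

```cpp
#include <immintrin.h>

struct Pair { int first, second; };  // 8 bytes per element

// Gather pts[ind[k]].first for k = 0..7: the same vpgatherdd instruction,
// but scale=8 turns each index into a byte offset of index*8.
__attribute__((target("avx2")))
void gather_first(const Pair *pts, const int *ind, int *out) {
    __m256i idx = _mm256_loadu_si256((const __m256i *)ind);
    __m256i v   = _mm256_i32gather_epi32((const int *)pts, idx, 8);
    _mm256_storeu_si256((__m256i *)out, v);
}
```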

(Well, the asm instruction also takes a mask, _mm256_mask_i32gather_epi32, but mostly that's about being able to make partial progress on a page fault on one element. You can of course start with a mask that's not all-ones. The instruction isn't faster in that case, so it's not great if your masks are often sparse.)

You might want to name your wrapper function something more generic that doesn't include the element size, but C++ overloads only work based on args, not return value, so a generic gather(Vec8i) function couldn't distinguish vpgatherdd from vpgatherdq using 32-bit indices to access 64-bit elements.

You could I guess template on the destination type and make template overloads, as a way to let you write code like gather<T>(vec, base, sizeof (dst[0])). Maybe you'd want to bake scale into the overloads / template specializations instead of having the caller need to come up with it.
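One way to sketch that idea without dispatching on the return type is to overload on the table's pointer type instead (hypothetical names; vpgatherdq gathers 4 64-bit elements using 4 32-bit indices):

```cpp
#include <immintrin.h>
#include <cstdint>

// vpgatherdd: 8 x 32-bit elements, 8 x 32-bit indices
__attribute__((target("avx2")))
__m256i gather(const std::int32_t *base, __m256i idx) {
    return _mm256_i32gather_epi32(base, idx, 4);
}

// vpgatherdq: 4 x 64-bit elements, 4 x 32-bit indices
__attribute__((target("avx2")))
__m256i gather(const long long *base, __m128i idx) {
    return _mm256_i32gather_epi64(base, idx, 8);
}

// Plain-callable helper for the 64-bit case: out[k] = table[ind[k]], k = 0..3
__attribute__((target("avx2")))
void gather4x64(const long long *table, const int *ind, long long *out) {
    __m128i idx = _mm_loadu_si128((const __m128i *)ind);
    _mm256_storeu_si256((__m256i *)out, gather(table, idx));
}
```

Here overload resolution picks the instruction from the table's element type, and the scale is baked into each overload, as suggested above.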

Vladislav Kogan

Answer 1 (best answer)

VCL has a lookup<n>() function. Given INT_MAX as the template parameter, it performs almost the same as the raw intrinsic. No need to reinvent the wheel.

#include <climits>
Vec8i lookingup, idx;
for (size_t i = 0; i < vecsize; i += 8) {
    idx.load(&vInd[i]);                        // load the next 8 indices
    lookingup = lookup<INT_MAX>(idx, &vSource[0]);
    lookingup.store(&vDest[i]);
}

Answer 2: call the raw intrinsics conditionally

VCL doesn't have a direct equivalent of _mm256_i32gather_epi32. The best way is to call the AVX2 intrinsic directly when AVX2 is enabled, using the VCL INSTRSET macro (a compile-time constant).

#if INSTRSET >= 8   // AVX2 or higher; add an AVX-512 branch with the _mm512 gather if you want
    intrinGather(vSource, vInd);
#else               // the AVX2 intrinsic wouldn't even compile for lower targets
    loadScalar(vSource, vInd);
#endif

Answer 3: write custom wrapper for VCL

You can wrap Vec16i, Vec8i and Vec4i in the same fashion. Note that Intel's argument order for the _mm512_ gather differs slightly from the _mm256_ and _mm_ versions.

#ifdef __AVX512F__
__attribute__((always_inline))
inline Vec16i vpgatherdd(const int *base, Vec16i idx){
   return _mm512_i32gather_epi32(idx, base, sizeof(int));
}
#endif
#ifdef __AVX2__
__attribute__((always_inline))
inline Vec8i vpgatherdd(const int *base, Vec8i idx){
   return _mm256_i32gather_epi32(base, idx, sizeof(int));
}
#endif

int main()
{
    std::vector<int> vSource = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    std::vector<int> vInd = {2,1,5,3,10,5,8,2,10,2,5,3,10,12,2,14,11,5,8};
    // Overloaded-wrapper example: gathers at the first 8 indices
    Vec8i ind;
    ind.load(&vInd[0]);
    Vec8i gathered = vpgatherdd(&vSource[0], ind);
}