GCC/Clang function attributes per template instantiation


I have some hand-vectorized C++ code that I'm trying to turn into a distributable binary via function multiversioning. Since the code uses SIMD intrinsics for different instruction sets (SSE2, AVX2, AVX512), it uses template specializations to decide which intrinsics to use.

The overall structure is roughly as follows:

template <unsigned W, unsigned N> struct SIMD {};  // SIMD abstraction

template <> struct SIMD<128, 8> {  // specialization for specific dimensions
  using Vec = __m128i;
  static always_inline Vec add(Vec a, Vec b) { return _mm_add_epi8(a, b); }  // always_inline is a macro for __attribute__((always_inline))
  ...  // many other SIMD methods
};

... // many other dimension specializations for different instruction sets

template <unsigned W, unsigned N> class Worker {
  void doComputation(int x) {
    using S = SIMD<W, N>;
    ... // do computations using S:: methods
  }
};

Now the issue is that I need different instantiations of Worker to have different attributes, since each will target a different instruction set. Something like this:

template __attribute__((target("avx2")))     void Worker<256, 8>::doComputation(int x);
template __attribute__((target("avx512bw"))) void Worker<512, 8>::doComputation(int x);
...

so that these different instantiations get compiled for those different targets. However, this still produces an error on Clang:

error: always_inline function 'add' requires target feature 'avx2', but
       would be inlined into function 'doComputation' that is compiled
       without support for 'avx2'

If I annotate the original method with __attribute__((target("avx2,avx512"))) it compiles, but it crashes with an illegal hardware instruction at runtime on CPUs without AVX-512 support, so my intuition of annotating the explicit instantiations as above apparently doesn't work.

Is there a way to express this with Clang or GCC using function attributes?

3 Answers

Answer by elfprince13:

I have found that trying to use different attributes (even standardized ones like [[noreturn]]) on different specializations of the same function template is a recipe for a bad time. My suggestion here is to add a layer of lexical indirection: split your implementations out into WorkerSSE2, WorkerAVX2, WorkerAVX512BW, etc., and have Worker select between them based on the features detected at run time. If you want to precompile all of the architecture-specific implementations into a single "universal" binary, with generic code stitching it all together, then somewhere in the control flow you need a runtime indirection that chooses which implementation to dispatch to.

You can do this through a function table consulted every time a wrapper function is called, or by setting up a bunch of function pointers once, when the program first loads (this is how, e.g., most OpenGL implementations work).

Answer by Michael Haephrati:

Have you tried using 'target_clones'? See the answer at https://stackoverflow.com/questions/71000786/how-to-tell-gccs-target-clones-to-compile-for-all-simd-levels.

Looking at your code, once you make the changes below and use 'target_clones', it should solve the problem in Clang.

template <unsigned W, unsigned N> struct SIMD {};  // SIMD abstraction

template <> struct SIMD<128, 8> {  // specialization for the specific dimensions
  using Vec = __m128i;
  static Vec add(Vec a, Vec b) { return _mm_add_epi8(a, b); }
  // ...  // many other SIMD methods
};

// ... // many other dimension specializations for different instruction sets

template <unsigned W, unsigned N> class Worker {
  void doComputation(int x) __attribute__((target_clones("default", "sse2", "avx2", "avx512bw")));
};

template <unsigned W, unsigned N>
void Worker<W, N>::doComputation(int x) {
  using S = SIMD<W, N>;
  // ... // do computations using S:: methods
}

// Specializations for different instruction sets
template <>
void Worker<256, 8>::doComputation(int x) {
  // Implementation for AVX2
}

template <>
void Worker<512, 8>::doComputation(int x) {
  // Implementation for AVX-512
}
Answer by user541686:

If I understand the question correctly, the premise is wrong, and the answer is in the question itself.

Namely, the solution is to explicitly instantiate the templates via template __attribute__((target(...))) ..., and it works just fine:

#include <xmmintrin.h>
#include <immintrin.h>

template <unsigned W, unsigned N> struct SIMD {};

template <> struct SIMD<128, 8> {
    using Vec = __m128i;
    __attribute__((always_inline, target("sse4.1"))) static Vec stream_load(void const *p) { return _mm_stream_load_si128(static_cast<Vec const *>(p)); }
    __attribute__((always_inline, target("sse2"))) static void stream(void *p, Vec value) { return _mm_stream_si128(static_cast<Vec *>(p), value); }
};

template <> struct SIMD<256, 8> {
    using Vec = __m256i;
    __attribute__((always_inline, target("avx2"))) static Vec stream_load(void const *p) { return _mm256_stream_load_si256(static_cast<Vec const *>(p)); }
    __attribute__((always_inline, target("avx"))) static void stream(void *p, Vec value) { return _mm256_stream_si256(static_cast<Vec *>(p), value); }
};

template<unsigned W, unsigned N>
struct Worker {
    void doComputation(void *t, void const *p) {
        using S = SIMD<W, N>;
        return S::stream(t, S::stream_load(p));
    }
};

template __attribute__((target("sse2,sse4.1"))) void Worker<128, 8>::doComputation(void *, void const *);
template __attribute__((target("avx,avx2"))) void Worker<256, 8>::doComputation(void *, void const *);

If I'm wrong here, someone please post a non-compiling example.