I have some hand-vectorized C++ code that I'm trying to build into a distributable binary using function multiversioning. Since the code uses SIMD intrinsics for different instruction sets (SSE2, AVX2, AVX512), it uses template specializations to select which intrinsics to use.
The overall structure is roughly as follows:
template <unsigned W, unsigned N> struct SIMD {}; // SIMD abstraction
template <> struct SIMD<128, 8> { // specialization for specific dimensions
using Vec = __m128i;
static always_inline Vec add(Vec a, Vec b) { return _mm_add_epi8(a, b); }
... // many other SIMD methods
};
... // many other dimension specializations for different instruction sets
template <unsigned W, unsigned N> class Worker {
void doComputation(int x) {
using S = SIMD<W, N>;
... // do computations using S:: methods
}
};
Now the issue is that I need different instantiations of Worker to have different attributes, since each will target a different instruction set. Something like this:
template __attribute__((target("avx2"))) void Worker<256, 8>::doComputation(int x);
template __attribute__((target("avx512bw"))) void Worker<512, 8>::doComputation(int x);
...
so that these different instantiations get compiled for those different targets. However, this still produces an error on Clang:
error: always_inline function 'add' requires target feature 'avx2', but
would be inlined into function 'doComputation' that is compiled
without support for 'avx2'
If I annotate the original method with __attribute__((target("avx2,avx512"))) it compiles but executes an illegal hardware instruction at runtime if there is no AVX-512 support, so I guess my intuition of using the annotated specializations as above doesn't work.
Is there a way to express this with Clang or GCC using function attributes?
I have found that trying to use different attributes (even standardized ones like [[noreturn]]) on different specializations of the same function template is a recipe for a bad time. My solution here would be to add a layer of lexical indirection: split your implementations out into WorkerSSE2, WorkerAVX2, WorkerAVX512BW, etc., and then have Worker select between them depending on what features are detected at run time.
If you're trying to precompile all of the different architectures' implementations into a single "universal" binary, with generic code stitching it all together, then at some point in the control flow you need a runtime indirection that chooses which implementation to dispatch to. You can do this through a function table that is consulted every time a wrapper function is called, or set up a bunch of function pointers that get set once when the program first loads (this is how, e.g., most OpenGL implementations work).
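As a rough illustration of the "function pointer set once" variant, here is a minimal sketch, assuming GCC or Clang on x86 (it relies on their `__builtin_cpu_supports` builtin and the `target` attribute). The names `addScalar`, `addAVX2`, `selectAdd`, and `addBytes` are hypothetical stand-ins for your real kernels, and the byte-wise add is just a placeholder computation. Each implementation is an ordinary, separately named function with its own attribute, so no template specialization has to carry a different target.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

// Common signature shared by every implementation (hypothetical kernel:
// element-wise wrapping add of two byte arrays).
using AddFn = void (*)(const std::uint8_t*, const std::uint8_t*,
                       std::uint8_t*, std::size_t);

// Baseline version: compiled for the default target, runs anywhere.
static void addScalar(const std::uint8_t* a, const std::uint8_t* b,
                      std::uint8_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
}

#ifdef HAVE_X86_DISPATCH
// AVX2 version: only this function is compiled with AVX2 enabled, so the
// intrinsics can be inlined here without passing -mavx2 for the whole file.
__attribute__((target("avx2")))
static void addAVX2(const std::uint8_t* a, const std::uint8_t* b,
                    std::uint8_t* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_add_epi8(va, vb));
    }
    for (; i < n; ++i)  // scalar tail for the remaining < 32 bytes
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
}
#endif

// Picks an implementation based on what the running CPU reports.
static AddFn selectAdd() {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return addAVX2;
#endif
    return addScalar;
}

// Generic entry point: the pointer is resolved on the first call and
// reused afterwards, so callers never see the per-ISA functions.
void addBytes(const std::uint8_t* a, const std::uint8_t* b,
              std::uint8_t* out, std::size_t n) {
    static const AddFn impl = selectAdd();
    impl(a, b, out, n);
}
```

An AVX-512 variant would presumably follow the same pattern: one more function with its own target attribute and one more check in `selectAdd`, ordered from most to least capable. In the question's setup, the per-ISA functions could be thin wrappers that instantiate the corresponding Worker and call into it.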