I am using Agner Fog's vectorclass library to use SIMD instructions (AVX-512 specifically) in my application. Since it is best to use struct-of-arrays data structures to make SIMD easy to apply, I quite often use:
std::vector<Vec8d> some_var;
or even
struct some_struct {
std::vector<Vec8d> a;
std::vector<Vec8d> b;
};
I wonder if this is bad (performance-wise, or even outright wrong?) considering that std::vector's internal Vec8d* array may in fact not be aligned?
I would generally use vector<double>, and standard SIMD load/store intrinsics to access the data. That avoids tying the interface, and all code that touches it, to that specific SIMD vector width and wrapper library. You can still pad the size to a multiple of 8 doubles so you don't have to include cleanup handling in your loops.

However, you might want to use a custom allocator for that vector<double> so you can get it to align your doubles. Unfortunately, even if that allocator's underlying memory allocation is compatible with new/delete, it will have a different C++ type than vector<double>, so you can't freely assign / move it to such a container if you use plain vector<double> elsewhere.

I'd worry that if you do ever want to access individual double elements of your vector, doing Vec8vec[i][j] might lead to much worse asm (e.g. a SIMD load and then a shuffle, or a store/reload from VCL's operator[]) than doublevec[i*8 + j] (presumably just a vmovsd), especially if it means you need to write a nested loop where you wouldn't otherwise need one.

avec.load(&doublevec[8]); should generate (almost or exactly) identical asm to avec = Vec8vec[1];. If the data is in memory, the compiler will need to use a load instruction to load it. It doesn't matter what "type" it had; types are a C++ thing, not an asm thing; a SIMD vector is just a reinterpretation of some bytes in memory.

But if this is the easiest way you can convince a C++17 compiler to align a dynamic array by 64, then it's maybe worth considering. It's still nasty, though, and will cause future pain if/when porting to ARM NEON or SVE, because Agner's VCL only wraps x86 SIMD last I checked. Even porting to AVX2 will be painful.
A better way might be a custom allocator (I think Boost has some already written) that you can use as the 2nd template param to something like std::vector<double, aligned_allocator<64>>. This is also type-incompatible with std::vector<double> if you want to pass it around and assign it to other vector<>s, but at least it's not tied to AVX512 specifically.

If you aren't using a C++17 compiler (so std::vector doesn't respect alignof(T) > alignof(max_align_t), i.e. 16), then don't even consider this; it will fault when compilers like GCC and Clang use vmovapd (alignment-required) to store a __m512d.

You'll want to get your data aligned; 64-byte alignment makes a bigger difference with AVX512 than with AVX2 on current AVX512 CPUs (Skylake-X).
MSVC (and I think ICC) for some reason always choose unaligned load/store instructions even when compile-time alignment guarantees exist (except when folding loads into memory source operands, which with legacy SSE instructions does require 16-byte alignment). I assume that's why it happens to work for you.
For an SoA data layout, you might want to share a common size for all the arrays, and use aligned_alloc (compatible with free, not delete) or something similar to manage the storage behind double * members. Unfortunately there's no standard aligned allocator that supports an aligned_realloc, so you always have to copy, even if there was free virtual address space following your array that a non-crappy API could have let your array grow into without copying. Thanks, C++.