I've been trying to get up to speed on where we are taking advantage of vectorisation.
Of course the answer to optimisation is always to profile, make a change and profile again, but you don't necessarily know what CPU will be used when your application is deployed, let alone what capabilities the next CPU around the corner will have.
It seems the best option is AoSoA style programming.
So we kind of know collectively that the layout of a structure should be something like (simplified pseudo-code):
struct block
{
    ALIGN_AND_PAD int16_t field1[blockSize];
    ALIGN_AND_PAD int32_t field2[blockSize];
};
struct AoSoA
{
    block* blocks[arraySize/blockSize];
};
rather than:
std::vector< someStruct >
We can observe that if blockSize=1 we have AoS and if blockSize=arraySize we have SoA.
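To make that concrete, here is a minimal C++17 sketch of the layout, assuming a 64-byte cache line and a block size of 16. The names kAlignment, kBlockSize and the field2() accessor are made up for illustration, and alignas stands in for the ALIGN_AND_PAD placeholder above. C++17 is assumed so that std::vector allocates the over-aligned blocks correctly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed values, chosen only for illustration.
constexpr std::size_t kAlignment = 64;  // typical x86 cache-line size in bytes
constexpr std::size_t kBlockSize = 16;  // elements per block

// One block: each field is stored contiguously and starts on its own
// 64-byte boundary; the compiler inserts the padding between the arrays.
struct Block
{
    alignas(kAlignment) std::int16_t field1[kBlockSize];
    alignas(kAlignment) std::int32_t field2[kBlockSize];
};

static_assert(alignof(Block) == kAlignment, "block must be cache-line aligned");
static_assert(sizeof(Block) % kAlignment == 0, "block must span whole cache lines");

// The AoSoA container is then an array of such blocks.
struct AoSoA
{
    std::vector<Block> blocks;

    // Round the element count up to a whole number of blocks.
    explicit AoSoA(std::size_t elementCount)
        : blocks((elementCount + kBlockSize - 1) / kBlockSize) {}

    // Element i lives in block i / kBlockSize, at lane i % kBlockSize.
    std::int32_t& field2(std::size_t i)
    {
        assert(i / kBlockSize < blocks.size());
        return blocks[i / kBlockSize].field2[i % kBlockSize];
    }
};
```

With these assumed values, 100 elements need 7 blocks, and element 17 ends up in block 1 at lane 1.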
It is unclear what block size is best given the various widths of buses and cache lines, beyond the requirement that a block should occupy a whole multiple of 64 bytes.
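The padding cost of a given block size can at least be worked out at compile time. The following is only a sketch under the assumption of a 64-byte cache line and the two field types from the pseudo-code; roundUpToLine, blockBytes and paddingBytes are hypothetical helper names.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// Round a byte count up to the next multiple of the cache-line size.
constexpr std::size_t roundUpToLine(std::size_t bytes)
{
    return (bytes + kCacheLine - 1) / kCacheLine * kCacheLine;
}

// Bytes occupied by one block when each field array is padded separately.
constexpr std::size_t blockBytes(std::size_t blockSize)
{
    return roundUpToLine(blockSize * sizeof(std::int16_t))
         + roundUpToLine(blockSize * sizeof(std::int32_t));
}

// Bytes per block that hold no data at all.
constexpr std::size_t paddingBytes(std::size_t blockSize)
{
    return blockBytes(blockSize)
         - blockSize * (sizeof(std::int16_t) + sizeof(std::int32_t));
}

static_assert(blockBytes(16) == 128, "16 elements: one line per field");
static_assert(paddingBytes(16) == 32, "16 elements: the int16_t line is half empty");
static_assert(paddingBytes(32) == 0, "32 elements: both fields fill whole lines");
```

Under these assumptions a block size of 32 is the smallest one that wastes no space, because it is the first size at which the narrowest field (int16_t) fills a whole cache line. That says nothing about which size is actually fastest, which still has to be measured.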
Not so long ago AVX2 was introduced. It contains gather instructions specifically aimed at "enabling vector elements to be loaded from non-contiguous memory locations". I dimly recall learning about gather/scatter back in the 90s when I was using a SPARC (though I may have been reading a book about a Cray or some such thing at the time).
Gather as a mainstream operation would appear to reduce the advantages of using AoSoA, or rather to reduce the disadvantages of using a conventional AoS layout. I think I am correct in assuming it is not a sufficient gain (yet) to render AoSoA obsolete.
If I want to make my code clean, future proof and performant on a wide variety of architectures how should I approach this problem?
How should I choose the appropriate block size and alignment?
My thinking is to roll my own, make the block size either a run-time or compile-time parameter, and calculate strides and indices to access fields directly, i.e. write functions like:
Container::Container(blockSize); //constructor
int16_t Container::getField1(index);
int32_t Container::getField2(index);
Container::insert(someStruct); //disassemble
someStruct Container::getStruct(index); //reassemble
Is this sensible? I can't help being concerned that by putting the index calculation in my code rather than letting the compiler generate it, I risk making things worse.
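A rough sketch of the kind of container I have in mind follows, here with the block size as a compile-time template parameter rather than a constructor argument. SomeStruct is a hypothetical stand-in for someStruct, the 64-byte alignment is an assumption, and C++17 is assumed so that std::vector respects the over-aligned block type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical element type standing in for someStruct.
struct SomeStruct
{
    std::int16_t field1;
    std::int32_t field2;
};

template <std::size_t BlockSize>
class Container
{
    static_assert(BlockSize > 0, "BlockSize must be positive");

    struct Block
    {
        alignas(64) std::int16_t field1[BlockSize];  // 64 is an assumed line size
        alignas(64) std::int32_t field2[BlockSize];
    };

    std::vector<Block> blocks_;
    std::size_t size_ = 0;

public:
    std::size_t size() const { return size_; }

    // Disassemble: store each member in its own per-field array.
    void insert(const SomeStruct& s)
    {
        if (size_ % BlockSize == 0)
            blocks_.emplace_back();  // start a new zero-initialised block
        Block& b = blocks_[size_ / BlockSize];
        b.field1[size_ % BlockSize] = s.field1;
        b.field2[size_ % BlockSize] = s.field2;
        ++size_;
    }

    std::int16_t getField1(std::size_t index) const
    {
        assert(index < size_);
        return blocks_[index / BlockSize].field1[index % BlockSize];
    }

    std::int32_t getField2(std::size_t index) const
    {
        assert(index < size_);
        return blocks_[index / BlockSize].field2[index % BlockSize];
    }

    // Reassemble: gather the members back into one struct.
    SomeStruct getStruct(std::size_t index) const
    {
        return SomeStruct{getField1(index), getField2(index)};
    }
};
```

Usage would look like `Container<16> c; c.insert(SomeStruct{1, 2}); c.getField2(0);`. Because BlockSize is a compile-time constant and the index is unsigned, I would expect the division and modulo to reduce to a shift and a mask when BlockSize is a power of two, but I have not checked what gcc or clang actually generate, and a run-time block size would not get that benefit.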
Why can't mainstream compilers like gcc & clang create this representation automatically as an optimisation pass and also decide what blockSize is best?
I think I saw an SoA annotation for an Intel compiler somewhere, and there are definitely a few research papers that suggest it.
There are a few template libraries that help create AoSoA for C++, but some are quite old and some seem compiler-specific.
Is there any work towards making something more standard? For example, a compiler annotation that would work in gcc, clang or both, or a Boost library?