Assuming I have SSE to SSE4.1, but not AVX(2), what is the fastest way to load a packed memory layout like this (all 32-bit integers):
a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3
into four vectors a, b, c, d?
a: {a0, a1, a2, a3}
b: {b0, b1, b2, b3}
c: {c0, c1, c2, c3}
d: {d0, d1, d2, d3}
I'm not sure whether this is relevant, but in my actual application I have 16 vectors, so a0 and a1 are 16*4 bytes apart in memory.
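What you need here is 4 loads followed by a 4x4 transpose. The 16*4-byte distance between a0 and a1 only changes the addresses of the four loads; the transpose itself is the same.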
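A minimal sketch of the four loads and the transpose, using only SSE2 integer intrinsics. The helper name `load_transpose_4x4` and its `stride` parameter (the distance between consecutive rows, counted in 32-bit elements) are assumptions made for illustration, not something from the question:

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: loads four rows of four 32-bit integers, each row
 * `stride` elements after the previous one, and transposes them so that
 * *a = {a0,a1,a2,a3}, *b = {b0,b1,b2,b3}, and so on. */
static inline void load_transpose_4x4(const int32_t *src, size_t stride,
                                      __m128i *a, __m128i *b,
                                      __m128i *c, __m128i *d)
{
    /* Four unaligned 128-bit loads, one per row. */
    __m128i r0 = _mm_loadu_si128((const __m128i *)(src + 0 * stride)); /* a0 b0 c0 d0 */
    __m128i r1 = _mm_loadu_si128((const __m128i *)(src + 1 * stride)); /* a1 b1 c1 d1 */
    __m128i r2 = _mm_loadu_si128((const __m128i *)(src + 2 * stride)); /* a2 b2 c2 d2 */
    __m128i r3 = _mm_loadu_si128((const __m128i *)(src + 3 * stride)); /* a3 b3 c3 d3 */

    /* Step 1: interleave 32-bit elements of row pairs. */
    __m128i t0 = _mm_unpacklo_epi32(r0, r1); /* a0 a1 b0 b1 */
    __m128i t1 = _mm_unpacklo_epi32(r2, r3); /* a2 a3 b2 b3 */
    __m128i t2 = _mm_unpackhi_epi32(r0, r1); /* c0 c1 d0 d1 */
    __m128i t3 = _mm_unpackhi_epi32(r2, r3); /* c2 c3 d2 d3 */

    /* Step 2: interleave 64-bit halves to finish the transpose. */
    *a = _mm_unpacklo_epi64(t0, t1); /* a0 a1 a2 a3 */
    *b = _mm_unpackhi_epi64(t0, t1); /* b0 b1 b2 b3 */
    *c = _mm_unpacklo_epi64(t2, t3); /* c0 c1 c2 c3 */
    *d = _mm_unpackhi_epi64(t2, t3); /* d0 d1 d2 d3 */
}
```

For the packed layout shown in the question, `stride` would be 4. For the 16-vector case, where a0 and a1 are 16*4 bytes apart, it would be 16, and the same call would be repeated at offsets 0, 4, 8 and 12 to cover all 16 vectors. If the rows are known to be 16-byte aligned, `_mm_load_si128` could replace the unaligned loads.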
Note: this is probably more efficient than using AVX2 gathered loads, since they generate a read cycle per element, which makes them really only useful when the access pattern is unknown or difficult to work with.