_mm256_blendv_pd() looks only at the bits in positions 63, 127, 191 and 255 of its mask operand.
Is there an efficient way to scatter the 4 low bits of a uint8_t into these positions of an AVX register?
Alternatively, is there an efficient way to broadcast these bits so that, as in the result of _mm256_cmp_pd(),
each bit is repeated across the corresponding 64-bit element of an AVX register?
The target instruction set is AVX2 (a Ryzen CPU, in case other extensions are relevant).
The most efficient approach would be to use a lookup table of 16 256-bit entries, indexed by the low 4 bits of the uint8_t.
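
A minimal sketch of such a table in C11, in case it helps. All names here (mask_lut, mask_from_bits, blend_by_bits, blend4_by_bits, mask_to_u64, AVX_FN) are made up for illustration, and the target attribute assumes GCC or Clang. Each entry holds all-ones or all-zero 64-bit lanes, so it covers both the "scatter to the sign bits" and the "broadcast like _mm256_cmp_pd()" variants of the question.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX for these functions without -mavx on
   the command line; drop it if the file is built with -mavx2 or -march=native. */
#define AVX_FN __attribute__((target("avx")))

/* Lane k of entry i is all-ones if bit k of i is set, otherwise zero. */
#define LANE(i, k) ((((i) >> (k)) & 1) ? UINT64_MAX : UINT64_C(0))
#define ROW(i) { LANE(i, 0), LANE(i, 1), LANE(i, 2), LANE(i, 3) }

/* 16 entries x 32 bytes = 512 bytes. The 32-byte alignment keeps every
   entry inside a single cache line and allows aligned loads. */
static const _Alignas(32) uint64_t mask_lut[16][4] = {
    ROW(0),  ROW(1),  ROW(2),  ROW(3),  ROW(4),  ROW(5),  ROW(6),  ROW(7),
    ROW(8),  ROW(9),  ROW(10), ROW(11), ROW(12), ROW(13), ROW(14), ROW(15),
};
static_assert(sizeof mask_lut == 16 * 32, "expected 16 entries of 256 bits");

/* Bit k of `bits` (k = 0..3) becomes an all-ones or all-zero 64-bit lane k.
   The upper 4 bits of `bits` are ignored. */
AVX_FN static inline __m256d mask_from_bits(uint8_t bits)
{
    return _mm256_castsi256_pd(
        _mm256_load_si256((const __m256i *)mask_lut[bits & 0x0F]));
}

/* Per lane: take b where the bit is set, a where it is clear. */
AVX_FN static inline __m256d blend_by_bits(__m256d a, __m256d b, uint8_t bits)
{
    return _mm256_blendv_pd(a, b, mask_from_bits(bits));
}

/* Array-based wrapper around blend_by_bits, convenient for quick experiments. */
AVX_FN static inline void blend4_by_bits(const double *a, const double *b,
                                         uint8_t bits, double *out)
{
    _mm256_storeu_pd(out, blend_by_bits(_mm256_loadu_pd(a),
                                        _mm256_loadu_pd(b), bits));
}

/* Inspection helper: writes the four mask lanes to out[0..3]. */
AVX_FN static inline void mask_to_u64(uint8_t bits, uint64_t out[4])
{
    _mm256_storeu_si256((__m256i *)out,
                        _mm256_castpd_si256(mask_from_bits(bits)));
}
```

With this layout, blend4_by_bits(a, b, 0x6, out) takes lanes 1 and 2 from b and lanes 0 and 3 from a. The cost per mask is one 32-byte load; whether that beats computing the mask in registers depends on how hot the 512-byte table stays in L1, so it is worth measuring in the actual loop.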