Having 4 bits, how to produce a mask for an AVX register?


_mm256_blendv_pd() looks at bits in positions 63, 127, 191 and 255. Is there an efficient way to scatter the 4 lower bits of a uint8_t into these positions of an AVX register?

Alternatively, is there an efficient way to broadcast these bits, so that, like the result of _mm256_cmp_pd(), each bit is repeated across the corresponding 64-bit element of an AVX register?

The instruction set is AVX2 (Ryzen CPU if other features are needed).
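
For concreteness, the intended use is something like the sketch below; mask_from_nibble() is a hypothetical name for the function being asked for:

  #include <immintrin.h>
  #include <stdint.h>

  /* Hypothetical: turn the low 4 bits of 'bits' into a blend mask
     so that bit k of 'bits' selects b over a in 64-bit element k. */
  __m256d mask_from_nibble(uint8_t bits);

  __m256d blend_by_nibble(__m256d a, __m256d b, uint8_t bits)
  {
      /* _mm256_blendv_pd() only reads bit 63 of each 64-bit element. */
      return _mm256_blendv_pd(a, b, mask_from_nibble(bits));
  }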


There are 3 answers below.


The most efficient approach would be to use a lookup vector containing 16 256-bit entries, indexed by the uint8_t.
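
A minimal sketch of that idea in intrinsics (the table layout, initialization, and names below are mine, not from the answer):

  #include <immintrin.h>
  #include <stdint.h>

  /* 16 precomputed 256-bit masks: entry i has 64-bit element k all-ones
     iff bit k of i is set. Filled once, e.g. at startup. */
  static uint64_t mask_lut[16][4];

  static void init_mask_lut(void)
  {
      for (int i = 0; i < 16; ++i)
          for (int k = 0; k < 4; ++k)
              mask_lut[i][k] = ((i >> k) & 1) ? ~0ull : 0ull;
  }

  static inline __m256d mask_from_nibble_lut(uint8_t bits)
  {
      return _mm256_loadu_pd((const double *)mask_lut[bits & 0xF]);
  }

The lookup itself is a single 256-bit load; the cost is the 16 x 32 = 512 bytes of table that have to stay hot in cache.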


Assuming that the uint8_t exists in a general purpose register, the approach is:

  1. Use PDEP to spread the four bits into the highest bit of each of four bytes
  2. Transfer those four bytes from the 32-bit GPR to the low part of a YMM register
  3. Put the values in place (bits 63, 127, 191, 255)

So I came up with two versions, one with memory and one without:

Approach with memory:

.data
  ; Always use the highest byte of each QWORD as target / 128 means 'set ZERO'
  ddqValuesDistribution:    .byte  128,128,128,128,128,128,128,0, 128,128,128,128,128,128,128,1, 128,128,128,128,128,128,128,2, 128,128,128,128,128,128,128,3
.code
  ; Input value in lower 4 bits of EAX
  mov          edx, 0b10000000100000001000000010000000
  pdep         eax, eax, edx     ; bit k of the input -> bit 7 of byte k
  vmovd        xmm0, eax
  vpbroadcastd ymm0, xmm0        ; VPSHUFB cannot cross 128-bit lanes, so copy the dword into both lanes
  vpshufb      ymm0, ymm0, ymmword ptr [ddqValuesDistribution]

This one comes out at 5 uOps on Haswell and Skylake.
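
Roughly the same sequence written with intrinsics (an untested sketch; the function name is mine, and the compiler decides whether the shuffle control stays in memory):

  #include <immintrin.h>
  #include <stdint.h>

  static inline __m256d mask_from_nibble_pdep_shuffle(uint8_t bits)
  {
      /* Spread bit k of the input into bit 7 of byte k. */
      uint32_t spread = _pdep_u32(bits, 0x80808080u);
      /* Broadcast the dword so both 128-bit lanes see the 4 bytes. */
      __m256i v = _mm256_broadcastd_epi32(_mm_cvtsi32_si128((int)spread));
      /* Move byte k to the top byte of 64-bit element k; 0x80 writes zero. */
      const __m256i ctrl = _mm256_setr_epi8(
          -128, -128, -128, -128, -128, -128, -128, 0,
          -128, -128, -128, -128, -128, -128, -128, 1,
          -128, -128, -128, -128, -128, -128, -128, 2,
          -128, -128, -128, -128, -128, -128, -128, 3);
      return _mm256_castsi256_pd(_mm256_shuffle_epi8(v, ctrl));
  }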


Approach without a memory variable (improved thanks to @Peter Cordes):

  mov       edx, 0b10000000100000001000000010000000
  pdep      eax, eax, edx       ; bit k of the input -> bit 7 of byte k
  vmovd     xmm0, eax
  vpmovsxbq ymm0, xmm0          ; sign-extend each byte into a 64-bit element

This one comes out at 4 uOps on Haswell and Skylake(!), and it can be trimmed further by hoisting the mov of the constant mask into EDX out of any loop.
The output differs from the first version: sign-extension sets almost all bits of each selected qword rather than only the highest bit, which is fine because _mm256_blendv_pd() only reads bit 63.
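
The same idea in intrinsics might look like this (untested sketch; the function name is mine):

  #include <immintrin.h>
  #include <stdint.h>

  static inline __m256d mask_from_nibble_pdep_sx(uint8_t bits)
  {
      /* Bit k of the input becomes bit 7 of byte k (0x80 or 0x00). */
      uint32_t spread = _pdep_u32(bits, 0x80808080u);
      /* Sign-extend each byte to 64 bits: 0x80 -> 0xFFFFFFFFFFFFFF80, 0x00 -> 0,
         so bit 63 of element k equals bit k of the input. */
      __m256i m = _mm256_cvtepi8_epi64(_mm_cvtsi32_si128((int)spread));
      return _mm256_castsi256_pd(m);
  }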


The obvious solution: use those 4 bits as an index into a lookup table. You already knew that, so let's try something else.

The variable-shift based approach: broadcast that byte into every qword, then shift the qwords left by { 63, 62, 61, 60 } respectively, lining up the right bit in the MSB of each element. Not tested; something like this:

_mm256_sllv_epi64(_mm256_set1_epi64x(mask), _mm256_set_epi64x(60, 61, 62, 63))

As a bonus, since the load of the shift-count vector does not depend on the mask, it can be hoisted out of loops.
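
Wrapped up as a function, the sketch might look like this (untested; the function name is mine, and the compiler will typically keep the shift counts as a loop-invariant constant):

  #include <immintrin.h>
  #include <stdint.h>

  static inline __m256d mask_from_nibble_sllv(uint8_t bits)
  {
      /* Broadcast the byte to every 64-bit element... */
      __m256i v = _mm256_set1_epi64x(bits);
      /* ...then shift element k left by 63-k, so bit k lands in the sign bit. */
      const __m256i counts = _mm256_set_epi64x(60, 61, 62, 63);
      return _mm256_castsi256_pd(_mm256_sllv_epi64(v, counts));
  }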

This is not necessarily a great idea on Ryzen: 256-bit loads from memory have a higher throughput than even just the vpsllvq by itself (which is 2 µops, like most 256-bit operations on Ryzen), and here we also need a vmovq (if that byte does not come from a vector register) and a wide vpbroadcastq (2 µops again).

Whether it is worth doing depends on the context.