bit twiddling to right pack bits


I have the following code, which right-packs every 4-bit nibble of a 64-bit int: within each nibble, the set bits are moved down to the least significant positions. This is the naive way of doing it, with a lookup table and a loop. Is there a faster bit-twiddling, SWAR/SIMD, or otherwise parallel way to do this? (msb() returns the index of the most significant set bit.)

def msb(x):
    # index of the most significant set bit, e.g. msb(0b1000) == 3
    return x.bit_length() - 1

def pack(X):
    # compact[i] = i with its set bits packed to the right (popcount(i) low bits set)
    compact = [
        0b0000,  # 0000
        0b0001,  # 0001
        0b0001,  # 0010
        0b0011,  # 0011
        0b0001,  # 0100
        0b0011,  # 0101
        0b0011,  # 0110
        0b0111,  # 0111
        0b0001,  # 1000
        0b0011,  # 1001
        0b0011,  # 1010
        0b0111,  # 1011
        0b0011,  # 1100
        0b0111,  # 1101
        0b0111,  # 1110
        0b1111,  # 1111
    ]

    K = 0
    while X:
        i = msb(X)
        j = (i // 4) * 4                # start of the nibble holding the msb
        a = (X & (0b1111 << j)) >> j    # extract that nibble
        K |= compact[a] << j            # write its right-packed value
        X &= ~(0b1111 << j)             # clear the nibble and repeat
    return K

There are 2 best solutions below


An alternative that does not need any special SIMD instructions is to handle the four bit positions of each nibble separately. The running value in each nibble is always of the form 2^k - 1 (k bits packed to the right), so adding the next input bit either leaves it unchanged or carries it up to 2^(k+1) - 1, and OR-ing the sum back in keeps that form:

def pack(x):
    m = 0x1111111111111111          # bit 0 of every nibble in a 64-bit word
    r = x & m
    r |= r + ((x >> 1) & m)
    r |= r + ((x >> 2) & m)
    r |= r + ((x >> 3) & m)
    return r
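
The same trick translates directly to a fixed 64-bit word in C; this is only a sketch of the idea above (the function name pack64 is made up here):

#include <stdint.h>

/* SWAR version of the bit-plane trick for one 64-bit word: within each
 * nibble the running result is always (1 << k) - 1, and adding the next
 * input bit either leaves it alone or grows it to (1 << (k+1)) - 1. */
static inline uint64_t pack64(uint64_t x)
{
    const uint64_t m = 0x1111111111111111ULL;  /* bit 0 of every nibble */
    uint64_t r = x & m;
    r |= r + ((x >> 1) & m);
    r |= r + ((x >> 2) & m);
    r |= r + ((x >> 3) & m);
    return r;
}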

Most SIMD ISAs have a byte shuffle that can be used to implement a 16-entry LUT with 4-bit indices, e.g. x86 SSSE3 pshufb or ARM/AArch64 vtbl/tbl.

Apparently msb() is just an optimization to skip all-zero nibbles, not a real data dependency, and this is pure vertical SIMD across nibbles.

So it's just a matter of splitting into 4-bit nibbles and packing them back again. For x86, an odd/even (low/high nibble) split with the LUT applied twice is probably better than packing the nibbles together first (e.g. with punpcklbw or movlhps).

; asm pseudocode; translate into intrinsics in a language of your choice

; constants:
    XMM7 = _mm_set1_epi8(0x0f)
    XMM6 = LUT
; input in XMM0, perhaps from  vmovq xmm0, rdi  or a load

    vpsrld xmm1, xmm0, 4          ; v >> 4
    vpand  xmm0, xmm0,  XMM7      ; v &= 0x0f
    vpand  xmm1, xmm1,  XMM7
    vpshufb xmm0, XMM6, xmm0      ; low nibbles
    vpshufb xmm1, XMM6, xmm1      ; high nibbles
    vpslld xmm1, xmm1, 4          ; high << 4   ; alternative: make a shifted copy of the LUT to avoid this
    vpor   xmm0, xmm0, xmm1

 ; result in low qword of XMM0; in C you might want _mm_cvtsi128_si64
 ;   vmovq  rax, xmm0    ; get it back into an integer register if necessary
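
A possible translation of that pseudocode into C with SSE2/SSSE3 intrinsics (just a sketch; the function names pack_nibbles_vec and pack_nibbles_u64 are made up here):

#include <immintrin.h>
#include <stdint.h>

/* Core: right-pack the bits of every nibble of each byte in v,
 * using _mm_shuffle_epi8 (pshufb) as a 16-entry nibble LUT. */
static inline __m128i pack_nibbles_vec(__m128i v)
{
    const __m128i lut  = _mm_setr_epi8(   /* entry i = (1 << popcount(i)) - 1 */
        0, 1, 1, 3, 1, 3, 3, 7, 1, 3, 3, 7, 3, 7, 7, 15);
    const __m128i mask = _mm_set1_epi8(0x0f);

    __m128i lo = _mm_and_si128(v, mask);                     /* v & 0x0f        */
    __m128i hi = _mm_and_si128(_mm_srli_epi32(v, 4), mask);  /* (v >> 4) & 0x0f */
    lo = _mm_shuffle_epi8(lut, lo);                          /* LUT, low nibbles  */
    hi = _mm_shuffle_epi8(lut, hi);                          /* LUT, high nibbles */
    return _mm_or_si128(lo, _mm_slli_epi32(hi, 4));          /* recombine         */
}

/* Scalar wrapper for a single 64-bit input. */
static inline uint64_t pack_nibbles_u64(uint64_t x)
{
    __m128i r = pack_nibbles_vec(_mm_cvtsi64_si128((long long)x));
    return (uint64_t)_mm_cvtsi128_si64(r);
}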

This can actually do two 64-bit integers in parallel, in the high and low halves of XMM0, if you're doing this in a loop.
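
For example, reusing the pack_nibbles_vec helper from the sketch above (so not standalone; also assumes SSE4.1 for _mm_extract_epi64), two inputs per call could look like:

static inline void pack_nibbles_x2(uint64_t a, uint64_t b,
                                   uint64_t *pa, uint64_t *pb)
{
    __m128i v = _mm_set_epi64x((long long)b, (long long)a);  /* a in the low qword */
    __m128i r = pack_nibbles_vec(v);
    *pa = (uint64_t)_mm_cvtsi128_si64(r);                    /* low half  */
    *pb = (uint64_t)_mm_extract_epi64(r, 1);                 /* high half */
}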

With AVX-512 VBMI for vpermb, you don't need to AND away the high bits before the LUT lookups. (vpshufb uses the high bit of each index byte to conditionally zero that element in the output, so it usually needs to be zero when you use pshufb as a LUT.)
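
For illustration, a sketch of that variant (assumes AVX-512VBMI + AVX-512VL; with a 16-byte table, vpermb uses only the low 4 bits of each index byte, so both ANDs disappear):

#include <immintrin.h>

static inline __m128i pack_nibbles_vec_vbmi(__m128i v)
{
    const __m128i lut = _mm_setr_epi8(    /* entry i = (1 << popcount(i)) - 1 */
        0, 1, 1, 3, 1, 3, 3, 7, 1, 3, 3, 7, 3, 7, 7, 15);
    /* vpermb ignores the high bits of each index byte, so no masking needed. */
    __m128i lo = _mm_permutexvar_epi8(v, lut);                    /* low nibbles  */
    __m128i hi = _mm_permutexvar_epi8(_mm_srli_epi32(v, 4), lut); /* high nibbles */
    return _mm_or_si128(lo, _mm_slli_epi32(hi, 4));
}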

Doing only one vpshufb could involve vpunpcklbw to duplicate each byte, possibly allowing re-combining with vpmaddubsw with a constant like set1_epi16(0x1001) to shift-and-add pairs of bytes before vpackuswb. Or perhaps a broadcast load to duplicate the whole 64-bit input, then AVX2 vpsrlvq to right-shift only the high half, then AND/vpshufb once each instead of twice, then vpunpckhqdq + vpslld + vpor to get the high half back down and combine. Neither of these seems great.