I have the following code, which right-packs the set bits within every 4-bit nibble of a 64-bit int. This is the naive way of doing it, using a lookup table and a loop. Is there a faster bit-twiddling / SWAR / SIMD / parallel way to do this? (msb() returns the index of the most significant set bit.)
def msb(x):
    # index of the most significant set bit (x > 0)
    return x.bit_length() - 1

def pack(X):
    # compact[i] is the 4-bit value i with its set bits packed to the right:
    # compact[i] == (1 << popcount(i)) - 1
    compact = [
        0b0000,  # 0000
        0b0001,  # 0001
        0b0001,  # 0010
        0b0011,  # 0011
        0b0001,  # 0100
        0b0011,  # 0101
        0b0011,  # 0110
        0b0111,  # 0111
        0b0001,  # 1000
        0b0011,  # 1001
        0b0011,  # 1010
        0b0111,  # 1011
        0b0011,  # 1100
        0b0111,  # 1101
        0b0111,  # 1110
        0b1111,  # 1111
    ]
    K = 0
    while X:
        i = msb(X)                    # jump straight to the highest non-zero nibble
        j = (i // 4) * 4              # bit offset of that nibble
        a = (X & (0b1111 << j)) >> j  # extract the nibble
        K |= compact[a] << j          # store its packed value in place
        X &= ~(0b1111 << j)           # clear the nibble and continue
    return K

# e.g. pack(0xA5) == 0x33: each nibble's set bits end up at its low end
Most SIMD ISAs have a byte shuffle that can be used to implement a 16-entry LUT with 4-bit indices, e.g. x86 SSSE3 pshufb or ARM/AArch64 vtbl/tbl.

Apparently msb() is just an optimization to skip all-zero nibbles, not a real data dependency, so this is pure vertical SIMD across nibbles. It's just a matter of splitting the input up into 4-bit nibbles and packing them back again. For x86, an odd/even split and doing the nibble LUT twice is probably better than packing the nibbles together first (e.g. with punpcklbw or movlhps). This can actually do two 64-bit integers in parallel, in the high and low halves of XMM0, if you're doing this in a loop.
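In C with intrinsics, the two-lookup version might look something like this (a minimal untested sketch; pack_nibbles_ssse3 is a made-up name, and this only handles one 64-bit value in the low half of the vector):

#include <immintrin.h>
#include <stdint.h>

// Apply the question's compact[] LUT to every nibble of x via SSSE3 pshufb.
// Needs x86-64 for _mm_cvtsi64_si128 / _mm_cvtsi128_si64.
static uint64_t pack_nibbles_ssse3(uint64_t x)
{
    // Same 16-entry table as compact[]: entry i == (1 << popcount(i)) - 1
    const __m128i lut = _mm_setr_epi8(
        0x0, 0x1, 0x1, 0x3, 0x1, 0x3, 0x3, 0x7,
        0x1, 0x3, 0x3, 0x7, 0x3, 0x7, 0x7, 0xF);
    const __m128i low4 = _mm_set1_epi8(0x0F);

    __m128i v  = _mm_cvtsi64_si128((long long)x);
    __m128i lo = _mm_and_si128(v, low4);                    // low nibble of each byte
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low4); // high nibble moved down
    lo = _mm_shuffle_epi8(lut, lo);   // LUT lookup on the low nibbles
    hi = _mm_shuffle_epi8(lut, hi);   // LUT lookup on the high nibbles (the AND above
                                      // also cleared bit 7, which pshufb requires)
    hi = _mm_slli_epi16(hi, 4);       // move results back to high-nibble position
    return (uint64_t)_mm_cvtsi128_si64(_mm_or_si128(lo, hi));
}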
With AVX-512 VBMI for vpermb, you don't need to AND away the high bit before LUT lookups. (vpshufb uses the high bit of each index byte to conditionally zero that element in the output, meaning you need it to be zero in most cases of using it as a LUT.)
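That variant might look like this, assuming AVX-512VBMI plus AVX-512VL for the 128-bit form of vpermb (again an untested sketch with a made-up name):

#include <immintrin.h>
#include <stdint.h>

// With a 16-byte table, vpermb reads only the low 4 bits of each index byte,
// so neither set of indices needs masking first.
static uint64_t pack_nibbles_vbmi(uint64_t x)
{
    const __m128i lut = _mm_setr_epi8(
        0x0, 0x1, 0x1, 0x3, 0x1, 0x3, 0x3, 0x7,
        0x1, 0x3, 0x3, 0x7, 0x3, 0x7, 0x7, 0xF);

    __m128i v  = _mm_cvtsi64_si128((long long)x);
    __m128i lo = _mm_permutexvar_epi8(v, lut);                    // high index bits ignored
    __m128i hi = _mm_permutexvar_epi8(_mm_srli_epi16(v, 4), lut); // no AND needed here either
    hi = _mm_slli_epi16(hi, 4);       // results back to high-nibble position
    return (uint64_t)_mm_cvtsi128_si64(_mm_or_si128(lo, hi));
}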
Doing only one vpshufb could involve vpunpcklbw to duplicate each byte, possibly allowing re-combining with vpmaddubsw with a constant like set1_epi16(0x1001) to shift-and-add pairs of bytes, before vpackuswb. Or perhaps a broadcast load to duplicate the whole 64-bit input, then AVX2 vpsrlvq to right-shift only the high half, then AND / vpshufb once each instead of twice, then vpunpckhqdq + vpslld + vpor to get the high half back down and combine. Neither of these seems great.
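For what it's worth, the first of those ideas might come out something like this (untested sketch, made-up name; a single pshufb, paid for with the unpack/madd/pack chain):

#include <immintrin.h>
#include <stdint.h>

static uint64_t pack_nibbles_one_shuffle(uint64_t x)
{
    const __m128i lut = _mm_setr_epi8(
        0x0, 0x1, 0x1, 0x3, 0x1, 0x3, 0x3, 0x7,
        0x1, 0x3, 0x3, 0x7, 0x3, 0x7, 0x7, 0xF);

    __m128i v = _mm_cvtsi64_si128((long long)x);
    // Interleave the bytes with a copy shifted right by 4, then mask, giving
    // one nibble index per byte: [lo0, hi0, lo1, hi1, ...]
    v = _mm_unpacklo_epi8(v, _mm_srli_epi16(v, 4));
    v = _mm_and_si128(v, _mm_set1_epi8(0x0F));
    v = _mm_shuffle_epi8(lut, v);     // one LUT lookup for all 16 nibbles
    // Shift-and-add byte pairs: each 16-bit lane = lo*0x01 + hi*0x10 = lo | hi<<4
    v = _mm_maddubs_epi16(v, _mm_set1_epi16(0x1001));
    v = _mm_packus_epi16(v, v);       // narrow the 16-bit lanes back down to bytes
    return (uint64_t)_mm_cvtsi128_si64(v);
}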