I have the following code, which right-packs the set bits within each 4-bit group (nibble) of a 64-bit integer. This is the naive approach: it uses a lookup table and a loop. Is there a faster bit-twiddling, SWAR/SIMD, or otherwise parallel way to do this? (msb() returns the index of the most significant set bit.)
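def msb(x):
    # One possible definition: index of the most significant set bit (x must be nonzero).
    return x.bit_length() - 1

def pack(X):
    # compact[a] is the nibble a with its set bits packed to the right,
    # i.e. (1 << popcount(a)) - 1.
    compact = [
        0b0000,  # 0000
        0b0001,  # 0001
        0b0001,  # 0010
        0b0011,  # 0011
        0b0001,  # 0100
        0b0011,  # 0101
        0b0011,  # 0110
        0b0111,  # 0111
        0b0001,  # 1000
        0b0011,  # 1001
        0b0011,  # 1010
        0b0111,  # 1011
        0b0011,  # 1100
        0b0111,  # 1101
        0b0111,  # 1110
        0b1111,  # 1111
    ]
    K = 0
    while X:
        i = msb(X)
        j = (i // 4) * 4          # bit offset of the nibble containing the MSB
        a = (X >> j) & 0b1111     # extract that nibble
        K |= compact[a] << j      # write its packed form into the result
        X &= ~(0b1111 << j)       # clear the nibble so the loop advances
    return K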
 
                        
An alternative that does not need any special SIMD instructions is to handle each of the 4 bit positions within a nibble separately:
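As one possible reading of this idea, here is a minimal SWAR sketch in Python. It is not necessarily the exact formulation intended here; the names `pack_swar` and `LOW` are made up for illustration. It processes all 16 nibbles at once: for each of the 4 bit positions, any nibble that has that bit set grows its packed value from 2^c - 1 to 2^(c+1) - 1.

```python
LOW = 0x1111111111111111  # lowest bit of every nibble (hypothetical helper constant)

def pack_swar(x):
    """Right-pack the set bits within each nibble of a 64-bit integer."""
    x &= 0xFFFFFFFFFFFFFFFF           # Python ints are unbounded; restrict to 64 bits
    r = 0                             # every nibble of r always has the form 2**c - 1
    for k in range(4):
        b = (x >> k) & LOW            # bit k of every nibble, moved to the nibble's low bit
        grow = (r << 1) | LOW         # each nibble as it would look with one more set bit
        r |= grow & (b * 0xF)         # apply the growth only where bit k was set
    return r
```

The shift `r << 1` never crosses a nibble boundary, because a nibble can hold at most 0b0111 before its fourth bit is examined. In a language with fixed-width integers the loop would unroll into a dozen or so plain ALU operations; whether that beats the table-driven loop in practice would need to be measured.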