I need to shuffle a 16 bit unsigned integer in a way that the even indexes land in the lower byte, and the odd indexes land in the upper byte.
input:
fedcba98 76543210 (contiguously numbered)
output:
fdb97531 eca86420 (even and odd separated)
My code looks like this at the moment:
typedef unsigned short u16;
u16 segregate(u16 x)
{
u16 g = (x & 0x0001);
u16 h = (x & 0x0004) >> 1;
u16 i = (x & 0x0010) >> 2;
u16 j = (x & 0x0040) >> 3;
u16 k = (x & 0x0100) >> 4;
u16 l = (x & 0x0400) >> 5;
u16 m = (x & 0x1000) >> 6;
u16 n = (x & 0x4000) >> 7;
u16 o = (x & 0x0002) << 7;
u16 p = (x & 0x0008) << 6;
u16 q = (x & 0x0020) << 5;
u16 r = (x & 0x0080) << 4;
u16 s = (x & 0x0200) << 3;
u16 t = (x & 0x0800) << 2;
u16 u = (x & 0x2000) << 1;
u16 v = (x & 0x8000);
return g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v;
}
I wonder if there is a more elegant solution than simply extracting and shifting each individual bit?
There is a very convenient web resource that helps solving many bit permutation problems: Code generator for bit permutations. In this particular case feeding "0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15" to this page produces pretty fast code.
Unfortunately this code generator cannot produce 64-bit code (though anyone could download sources and add this option). So if we need to perform 4 permutations in parallel using 64-bit instructions, we have to extend all involved bitmasks to 64 bits manually:
Level of parallelism could be increased even more (8 or 16 permutations at once) with SSE instructions. (And recent versions of gcc can vectorize this code automatically).
If parallelism is not required and data cache is not extensively used by other parts of the program, better alternative would be to use lookup table. Various LUT approacehes are already discussed in other answers, still some more could be said here:
Here is implementation of second approach:
But fastest approach (if portability is not an issue) is using
pext
instruction from BMI2 instruction set as noted by Nils Pipenbrinck. With a pair of 64-bitpext
we could perform 4 16-bit shuffles in parallel. Sincepext
instruction is intended exactly for this kind of bit permutations, this approach easily outperforms all others.