I need to unpack 12-bit data that's stored packed: 2 unsigned 12-bit fields in every 24 bits (3 bytes). I'd like to store them in a `byte[]` in little-endian uint16 order.
The packed format is a bit odd: `byte[0]` is the upper 8 significant bits of the first 12-bit number, and `byte[2]` is the upper 8 significant bits of the second 12-bit number. The middle `byte[1]` has the lower 4 bits of both: the first value's in the lower nibble and the second value's in the upper.
Here's a visual: the full boxes are bytes; letters represent nibbles. Lower addresses are on the left, so a SIMD bit-shift left will actually move data to the right across boxes, but left within boxes.
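A reconstruction of that layout, using one letter per nibble (A–C = the first value's nibbles, high to low; D–F = the second's; the original diagram's lettering may have differed):

```
 lower addresses on the left
+---------+---------+---------+
| byte[0] | byte[1] | byte[2] |
|  A | B  |  F | C  |  D | E  |
+---------+---------+---------+
first value  = 0xABC   (C = byte[1]'s lower nibble)
second value = 0xDEF   (F = byte[1]'s upper nibble)

desired output (little-endian uint16): 0x0ABC, 0x0DEF
  as bytes: | B|C | 0|A | E|F | 0|D |
```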

I have written 2 working versions in C#.
```csharp
private byte[] Unpack12b2(byte[] input) // slower
{
    int eod = input.Length / 3 * 3; // protect against file sizes that aren't a multiple of 3
    byte[] output = new byte[eod / 3 * 4]; // 3 packed bytes -> 2 u16 -> 4 output bytes
    int j = 0;
    int loop = 0;
    for (int i = 0; i < eod; i += 3)
    {
        j = i + loop++; // output index advances 4 bytes per 3 input bytes
        output[j] = (byte)((input[i] << 4) | (input[i + 1] & 0xf));
        output[j + 1] = (byte)(input[i] >> 4);
        output[j + 2] = (byte)((input[i + 2] << 4) | (input[i + 1] >> 4));
        output[j + 3] = (byte)(input[i + 2] >> 4);
    }
    return output;
}
```
```csharp
private ushort[] Unpack12b(byte[] input) // slightly faster
{
    int outputIndex = 0;
    int byte1, byte2, byte3;
    int eod = input.Length / 3 * 3; // protect against file sizes that aren't a multiple of 3
    ushort[] output = new ushort[eod / 3 * 2];
    for (int i = 0; i < eod; i += 3)
    {
        byte1 = input[i];
        byte2 = input[i + 1];
        byte3 = input[i + 2];
        output[outputIndex++] = (ushort)((byte1 << 4) | (byte2 & 0xf));
        output[outputIndex++] = (ushort)((byte3 << 4) | (byte2 >> 4));
    }
    return output;
}
```
The closest answer I've found is SIMD unpack 12-bit fields to 16-bit, but the packed format in that question is much easier to deal with.
I'd really like to speed this up. The output is an array of 200+ million bytes, so it's a lot of loop iterations, and this function is called repeatedly.
Any thoughts on how to speed this up would be appreciated. Ideally I'd like to implement something with AVX2 in C++, but I'm getting lost in how to shuffle nibbles rather than whole bytes.
If you can't use C# `System.Runtime.Intrinsics.X86` to do this in C#, then yes, calling out to a function created by a C++ compiler could be good. To balance marshalling overhead against cache misses, you might want to work in chunks of 64K of input producing 85K of output, which you read while it's still hot in L2 cache before moving on to the next chunk. But if you need random access to the unpacked output, you might be stuck with doing the whole thing at once and getting at best L3 cache hits, or even misses all the way to DRAM.

Most of the techniques from SIMD unpack 12-bit fields to 16-bit are applicable, like doing a 32-byte load that splits down the middle of two 12-byte halves, setting up for a `vpshufb` which will get the 3 bytes we want into each 4-byte chunk.

Moving nibbles around requires bit-shifts (or AVX-512 instructions like `vpmultishiftqb`, a bitfield extract). x86 SIMD only has 16-bit and wider shifts, not 8-bit element size, unfortunately. (Left-shift by 1 bit can be done by adding a vector to itself; otherwise you can use a wider shift and AND with a mask to clear any bits that got shifted in across byte boundaries.)

Visualizing how a left or right shift will move bits between bytes is vastly easier with a diagram where shift-left is left, which one might call "big endian": highest byte on the left, like Intel uses in their manuals for SIMD shuffle instructions (e.g. `punpcklbw`). And like the comments in my answer on the linked 12-to-16 unpack question for a normal bit layout, where that layout makes it look like the two contiguous 12-bit fields it is (unlike here, where it is indeed weird).

I'd normally use letters starting with A, but picked different ones to avoid confusion with your diagram. (I used earlier letters for more-significant nibbles in this case; I sometimes go the other way around, to match up with the element numbering for shuffles, where `0` is the right-most element that gets loaded/stored from the lowest address, so D C B A in a diagram makes sense.)
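To pin down the letters (a summary inferred from the operations below, in big-endian diagram style):

```
input bytes, highest address on the left:   QR  ST  UV
  byte2 = QR   byte1 = ST   byte0 = UV

desired u16 outputs per dword:  0Q RS | 0U VT
  high word: value2 = 0QRS   (S = high nibble of byte1)
  low word:  value1 = 0UVT   (T = low nibble of byte1)
```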
The initial byte-shuffle (`vpshufb`) that expanded 12 bytes to 16 (in each lane) could give us any arrangement we want within each 32-bit chunk, like `ST ST UV QR` or `UV ST QR UV`, if either of those was a useful setup for 32-bit or 16-bit shifts (and AND/OR).

For example, if we had `ST QR` in the high u16, our desired `0QRS` could be obtained with a rotate-left by 4 bits of the 16-bit `STQR` to bring the `S` around to the bottom (and a left-shift for the `UV` part). Then mask to clear the garbage (`T`) in the high nibble. But we don't have SIMD rotates until AVX-512, and even then only in 32 and 64-bit element sizes. And we'd need something different for the other 16-bit word.

A rotate can be done with `(x<<4) | (x>>12)`. But if we're emulating it anyway, we could start with two different inputs, and/or shift by amounts that don't add up to 16 or 32.

Starting with `UV ST | QR ST`, 32-bit shifts (by 20 and 12) can create `00 00 | 0U VS` and `xQ RS | x0 00`. We could OR those together without messing up the `UVS` or the `QRS` (slightly cheaper than a `vpblendw` on Intel), but that's not exactly what we want. Still, in general for problems like this, it's worth considering 32-bit shifts.

A simple right shift by 4 bits (`_mm256_srli_epi16(v, 4)`) would turn `QR ST` into the `0Q RS` we want in the high u16 (word) of each u32 dword element. So that's ready to `_mm256_blend_epi16` like in the older Q&A, if we can come up with something that generates `0U VT` at the bottom of a 32-bit element.

`0UVT` is trickier: neither byte order (`UV ST` or `ST UV`) has the bits we want contiguous with each other. But with `UV ST`, the right shift we want for the high half also puts the `U` and `V` nibbles in the right place, leaving just the problem of replacing the low 4 bits (`S`) with the `T` nibble. In the original initial `v` (before the shift), we had a copy of `T` there, so 3 bitwise operations can "blend" it in.

So it turns out `QR ST | UV ST` is a good layout for the initial `vpshufb`, making the required data movement more similar between the two u16 halves. With just a shift and 3 bitwise operations, AND/ANDNOT/OR between those two vectors can produce `0Q RS | 0U VT`, only needing to replace the low nibble in the low word (otherwise keeping everything from the shift result).

Putting that into a function that loads 24 bytes and returns 32 (ready for the caller to store in a loop), borrowing the code from my answer on SIMD unpack 12-bit fields to 16-bit: I adjusted the shuffle-control vector by swapping the low 2 bytes in each 4-byte chunk vs. that answer (`setr` takes args in little-endian order). That gives us `UV ST` in the low word of each dword while still having `QR ST` in the high word.
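A compilable sketch of that function (my reconstruction following the structure described above, not verbatim code from either answer; the function name, the `in - 4` load that centers the two 12-byte halves on the lane boundary, and the exact shuffle indices are as I derived them):

```c++
#include <immintrin.h>
#include <stdint.h>

// Unpack 24 packed bytes (two 12-byte halves) into 16 u16 results.
// The load starts 4 bytes early so each 128-bit lane holds the 12 source
// bytes it needs; the caller must make the first such load safe (e.g.
// special-case the start of the buffer).
static inline __m256i unpack12to16_weird(const uint8_t* in)
{
    __m256i v = _mm256_loadu_si256((const __m256i*)(in - 4));
    // Bytes per output dword, little-endian: [ST][UV][ST][QR], i.e.
    // QR ST in the high word, UV ST in the low (big-endian diagram).
    // vs. the linked answer, the low 2 bytes of each chunk are swapped.
    const __m256i bytegrouping = _mm256_setr_epi8(
         5, 4,  5, 6,   8, 7,  8, 9,  11,10, 11,12,  14,13, 14,15,  // low lane: data bytes 0..11 at offsets 4..15
         1, 0,  1, 2,   4, 3,  4, 5,   7, 6,  7, 8,  10, 9, 10,11); // high lane: data bytes 12..23 at offsets 0..11
    v = _mm256_shuffle_epi8(v, bytegrouping);            // QR ST | UV ST

    __m256i shifted = _mm256_srli_epi16(v, 4);           // 0Q RS | 0U VS
    // Keep the shift result everywhere except the low nibble of the low
    // word, which takes the T nibble still sitting at the bottom of v:
    const __m256i lownib = _mm256_set1_epi32(0x0000000F);
    return _mm256_or_si256(_mm256_andnot_si256(lownib, shifted),
                           _mm256_and_si256(v, lownib)); // 0Q RS | 0U VT
}
```

A caller loop would then do something like `_mm256_storeu_si256((__m256i*)out, unpack12to16_weird(in)); in += 24; out += 32;`, which also fits the chunked-64K strategy mentioned earlier.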
This is only 4 instructions after `vpshufb`, and three of them are cheap bitwise booleans that can run on any vector execution port on most recent Intel / AMD CPUs, even Haswell (https://uops.info/). So that's one more uop than the simpler data arrangement costs, in terms of front-end throughput. Also, it's only one extra vector constant beyond the `vpshufb` control vector.

One AVX-512 `vpternlogd` could replace the three AND/ANDNOT/OR instructions, using the same constant to blend with bit granularity. (Compilers will do this for you if you compile with `-march=skylake-avx512` or `znver4` or whatever; see Godbolt.)
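For illustration, with the variables from the sketch above (again mine, not the answer's code):

```c++
// AVX-512VL: one vpternlogd in place of the andnot/and/or. Immediate
// 0xCA computes a?b:c bitwise, so this selects v where lownib is 1
// (the low nibble of each dword) and shifted elsewhere.
__m256i result = _mm256_ternarylogic_epi32(lownib, v, shifted, 0xCA);
```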
Blends with byte or wider granularity can use SSE4 / AVX2 `vpblendvb` with a control vector, which is 2 uops on Intel (1 on AMD), or only 1 uop for the SSE version.

**Multiply-add instructions for moving+combining in a single uop**

One other possibility for moving nibbles around is to multiply by a power of 2, with `pmaddubsw` (`_mm256_maddubs_epi16`, which treats one input as signed, the other as unsigned, and adds horizontal pairs of bytes into 16-bit results). With the unsigned input as the data we want to combine (so it gets zero-extended to 16-bit), we can use `1<<4` = 16 as the signed multiplier.

After masking our inputs to clear the nibble we don't want from each 16-bit word, can we do everything with one `vpmaddubsw`? No, because as a multiply it can only shift left. So we can't get the `S` from `ST` to the bottom of the `0QRS` output we want. (And we can't instead produce `QRSx` and right-shift, because our 8-bit multiplier constants can't hold `256`.)

We could `vpblendw` between `QR ST | UV ST` and a `vpsrlw`-by-4 shift of that to generate `0Q RS | UV ST`... but that doesn't quite work as an input for `vpmaddubsw` either: the `Q` would need to get multiplied by `256`. But `0Q RS` is what we already want for that element, so we could just blend after `vpmaddubsw`, between it and a shift result; that's better instruction-level parallelism anyway, since the multiply-add and the shift can happen in parallel.

Details on getting `0UVT` from `UV ST`: AND to mask away the `S`, giving `UV 0T`. Then, treating that as two u8 integers `UV` and `0T`, do `UV*16 + 0T*1` to get `UVT`. So the other input to `pmaddubsw` for this element should be `10 01` (hex).
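A sketch of that variant, reusing `v` from the shuffle above (my illustration of the masking and `10 01` multiplier logic just described, not code from the answer):

```c++
__m256i masked  = _mm256_and_si256(v, _mm256_set1_epi32((int)0xFFFFFF0F)); // QR ST | UV 0T
// In each dword's low word, pmaddubsw computes UV*0x10 + 0T*0x01 = 0UVT.
// The high word's product is a don't-care (multipliers 0): blended away.
__m256i madd    = _mm256_maddubs_epi16(masked, _mm256_set1_epi32(0x00001001));
__m256i shifted = _mm256_srli_epi16(v, 4);                       // 0Q RS | 0U VS
__m256i result  = _mm256_blend_epi16(madd, shifted, 0b10101010); // 0Q RS | 0U VT
```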
This would just cost one extra instruction (an 8-bit multiply-add) vs. the version for a simpler bit order. `vpmaddubsw` is a multiply, so it's not the most efficient instruction, but modern mainstream x86 cores have good throughput for it. (2/clock since Skylake and Zen 3, 1/clock on Zen 2: https://uops.info/. On Intel at least it competes for throughput with vector shifts, but Skylake and later can run those on port 0 or 1. Its latency isn't a concern: out-of-order exec hides it, and we're only doing a short chain of ops for each vector.) Hopefully it doesn't waste too much power and reduce turbo frequency.

This is strictly worse than the shift/AND/ANDNOT/OR version, which uses the same number of uops but more of them are cheaper, and which has fewer vector constants to load as setup. I came up with this `pmaddubsw` version first; I'm leaving it in the answer as an example of a bit-movement technique that's sometimes useful in other problems. If we didn't need the blend at the end to treat the two u16 halves differently, the `madd` version could be better.

Note that `madd` could also work with a `QR ST | ST UV` byte order: you'd just line up the `0x10` multiplier with the other byte. That's unlike 16 or 32-bit shifts, where having the bits contiguous across byte boundaries is what matters.