I need to unpack each 24 bits of input into two 16-bit values, each holding a 12-bit field (3 bytes -> 4 bytes). I already did it the naïve way, but I'm not happy with the performance.
For example, InBuffer is __m128i:
value1 = (uint16_t)InBuffer[0:11] // bit-ranges
value2 = (uint16_t)InBuffer[12:23]
value3 = (uint16_t)InBuffer[24:35]
value4 = (uint16_t)InBuffer[36:47]
... for all the 128 bits.
After the unpacking, the values should be stored in a __m256i variable.
How can I solve this with AVX2? Probably using unpack / shuffle / permute intrinsics?
I'm assuming you're doing this in a loop over a large array. If you only used __m128i loads, you'd have 15 useful bytes, which would only produce 20 output bytes in your __m256i output. (Well, I guess the 21st byte of output would be present, since the 16th byte of the input vector holds the first 8 bits of a new bitfield. But then your next vector would need to shuffle differently.)

Much better to use 24 bytes of input, producing 32 bytes of output. Ideally with a load that splits down the middle, so the low 12 bytes are in the low 128-bit "lane", avoiding the need for a lane-crossing shuffle like _mm256_permutexvar_epi32. Instead you can just _mm256_shuffle_epi8 to put bytes where you want them, setting up for some shift/and; a sketch of that approach is below. It compiles nicely (Godbolt) with clang -O3 -march=znver2; of course an inline version would load the vector constants once, outside the loop.
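Something like this (untested; the helper name is made up, it assumes little-endian bit packing within each 3-byte group as in your diagram, and it over-reads 4 bytes past each 24-byte group, which is fine in the middle of a large buffer):

```c
#include <immintrin.h>
#include <stdint.h>

// Sketch only: unpack 24 bytes (sixteen 12-bit fields) starting at src
// into sixteen uint16_t elements of a __m256i.  Reads 28 bytes total.
static inline __m256i unpack12_avx2(const uint8_t *src)
{
    // Load so each 128-bit lane holds the 12 bytes it needs:
    // low lane gets src[0..11], high lane gets src[12..23] (plus 4 don't-care bytes each).
    __m128i lo128 = _mm_loadu_si128((const __m128i*)src);
    __m128i hi128 = _mm_loadu_si128((const __m128i*)(src + 12));
    __m256i v = _mm256_inserti128_si256(_mm256_castsi128_si256(lo128), hi128, 1);

    // In-lane byte shuffle: each 16-bit element gets the two bytes containing its 12-bit field.
    const __m256i shuf = _mm256_setr_epi8(
        0,1, 1,2, 3,4, 4,5, 6,7, 7,8, 9,10, 10,11,     // low lane
        0,1, 1,2, 3,4, 4,5, 6,7, 7,8, 9,10, 10,11);    // high lane (same per-lane pattern)
    __m256i w = _mm256_shuffle_epi8(v, shuf);

    __m256i odd  = _mm256_srli_epi16(w, 4);                          // odd fields were at bits [15:4]; shift down
    __m256i even = _mm256_and_si256(w, _mm256_set1_epi16(0x0FFF));   // even fields: keep bits [11:0], clear garbage
    return _mm256_blend_epi16(even, odd, 0xAA);    // vpblendw: odd elements come from the shifted copy
}
```

Splitting the load like this lets the compiler fold the upper half into vinserti128 with a memory source, so getting 12 useful bytes per lane doesn't cost an extra shuffle.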
On Intel CPUs (before Ice Lake), vpblendw only runs on port 5 (https://uops.info/), competing with vpshufb (...shuffle_epi8). But it's a single uop (unlike vpblendvb variable-blend) with an immediate control. Still, that means a back-end ALU bottleneck of at best one vector per 2 cycles on Intel. If your src and dst are hot in L2 cache (or maybe only L1d), that might be the bottleneck, but this is already 5 uops for the front end, so with loop overhead and a store you're already close to a front-end bottleneck.

Blending with another vpand/vpor would cost more front-end uops but would mitigate the back-end bottleneck on Intel (before Ice Lake); a sketch of that variant follows. It would be worse on AMD, where vpblendw can run on any of the 4 FP execution ports, and worse on Ice Lake where vpblendw can run on p1 or p5. And like I said, cache load/store throughput might be a bigger bottleneck than port 5 anyway, so fewer front-end uops are definitely better to let out-of-order exec see farther.
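For reference, the AND/OR merge could look something like this (untested), taking the shuffled vector w from the sketch above; it's one extra ALU uop, but none of these ops are limited to port 5 on Skylake:

```c
#include <immintrin.h>

// Untested: AND/OR merge instead of vpblendw.  'w' is the shuffled vector
// from the AVX2 sketch above (two source bytes per 16-bit element).
static inline __m256i merge_without_blend(__m256i w)
{
    const __m256i even_mask = _mm256_set1_epi32(0x00000FFF);        // 0x0FFF in even 16-bit elements, 0 in odd
    const __m256i odd_mask  = _mm256_set1_epi32((int)0xFFFF0000);   // keep only the odd 16-bit elements
    __m256i even = _mm256_and_si256(w, even_mask);                       // even fields isolated, odd slots zeroed
    __m256i odd  = _mm256_and_si256(_mm256_srli_epi16(w, 4), odd_mask);  // odd fields shifted into place, even slots zeroed
    return _mm256_or_si256(even, odd);
}
```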
This may not be optimal; perhaps there's some way to set up for vpunpcklwd by getting the even (low) and odd (high) bit fields into the bottom 8 bytes of two separate input vectors even more cheaply? Or set up so we can blend with OR instead of needing to clear garbage in one input with vpblendw, which only runs on port 5 on Skylake? Or something we can do with vpsrlvd? (But not vpsrlvw - that would require AVX-512.)
vpmultishiftqbis a parallel bitfield-extract. You'd just need to shuffle the right 3-byte pairs into the right 64-bit SIMD elements, then one_mm256_multishift_epi64_epi8to put the good bits where you want them, and a_mm256_and_si256to zero the high 4 bits of each 16-bit field will do the trick. (Can't quite take care of everything with 0-masking, or shuffling some zeros into the input for multishift, because there won't be any contiguous with the low 12-bit field.) Or you could set up for just ansrli_epi16that works for both low and high, instead of needing an AND constant, by having the multishift bitfield-extract line up both output fields with the bits you want at the top of the 16-bit element.This may also allow a shuffle with larger granularity than bytes, although
This may also allow a shuffle with larger granularity than bytes, although vpermb is actually fast on CPUs with AVX512VBMI, and unfortunately Ice Lake's vpermw is slower than vpermb.

With AVX-512 but not AVX512VBMI, working in 256-bit chunks lets us do the same thing as AVX2 but avoiding the blend. Instead, use merge-masking for the right shift, or vpsrlvw with a control vector to only shift the odd elements (sketched below). For 256-bit vectors, this is probably as good as vpmultishiftqb.
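Untested sketch (AVX512BW + AVX512VL; same made-up naming and 4-byte over-read as the AVX2 version):

```c
#include <immintrin.h>
#include <stdint.h>

// Untested: AVX512BW + AVX512VL version.  Same layout as the AVX2 sketch,
// but vpsrlvw shifts only the odd 16-bit elements, so no blend is needed.
static inline __m256i unpack12_avx512bw(const uint8_t *src)
{
    __m128i lo128 = _mm_loadu_si128((const __m128i*)src);
    __m128i hi128 = _mm_loadu_si128((const __m128i*)(src + 12));
    __m256i v = _mm256_inserti128_si256(_mm256_castsi128_si256(lo128), hi128, 1);

    const __m256i shuf = _mm256_setr_epi8(
        0,1, 1,2, 3,4, 4,5, 6,7, 7,8, 9,10, 10,11,
        0,1, 1,2, 3,4, 4,5, 6,7, 7,8, 9,10, 10,11);
    __m256i w = _mm256_shuffle_epi8(v, shuf);

    // Per-element shift counts: 0 for even fields, 4 for odd fields.
    const __m256i counts = _mm256_set1_epi32(0x00040000);
    w = _mm256_srlv_epi16(w, counts);                       // vpsrlvw
    // (or with merge-masking:  w = _mm256_mask_srli_epi16(w, 0xAAAA, w, 4); )
    return _mm256_and_si256(w, _mm256_set1_epi16(0x0FFF));  // clear the 4 garbage bits in even elements
}
```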