I wonder whether there is a fast way to quantize an array of audio samples from 24 bits to 16 bits (using intrinsics or asm).
The source format is signed 24-bit little-endian.
Update: I managed to get the conversion working as described:
static void __cdecl Convert24bitToStereo16_SSE2(uint8_t* src, uint8_t* dst, int len)
{
    __m128i shuffleMask = _mm_setr_epi8(-1,0,1,2,-1,3,4,5,-1,6,7,8,-1,9,10,11);
    __asm
    {
        mov eax, [src]                          // src
        mov edi, [dst]                          // dst
        mov ecx, [len]                          // len
        movdqu xmm0, xmmword ptr [shuffleMask]
    convertloop:
        movdqu xmm1, [eax]                      // read 4 samples
        lea eax, [eax + 12]                     // inc pointer
        pshufb xmm1, xmm0                       // shuffle using mask
        psrldq xmm1, 2                          // shift right
        movdqu xmm2, [eax]                      // read next 4 samples
        lea eax, [eax + 12]                     // inc pointer
        pshufb xmm2, xmm0                       // shuffle
        psrldq xmm2, 2                          // shift right
        packusdw xmm1, xmm2                     // pack upper and lower samples
        movdqu [edi], xmm1                      // write 8 samples
        lea edi, [edi + 16]
        sub ecx, 24
        jg convertloop
    }
}
Now for the dithering: how can I avoid quantization artifacts?
Any hints are welcome. Thanks.
Your final code looks weird. Why shuffle and then do a bytewise shift of the entire register? Instead, set up your shuffle control mask to put things in the right place to start with.
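To make the "put things in the right place to start with" idea concrete, here is a minimal portable sketch that models what pshufb does with a control mask. The names `pshufb_model` and `kPackHigh16` are hypothetical, chosen only for this illustration; the mask is one possible choice that moves the middle and high byte of each of the four samples directly into the low 8 bytes, so no follow-up shift is needed.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of PSHUFB: for each output byte, a mask byte with its top bit
   set produces 0; otherwise the low 4 bits select an input byte. */
static void pshufb_model(const uint8_t in[16], const int8_t mask[16], uint8_t out[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : in[mask[i] & 0x0F];
    memcpy(out, tmp, 16);   /* tmp allows in and out to alias */
}

/* Hypothetical control mask: bytes 1,2 of sample 0, bytes 1,2 of sample 1, ...
   land back-to-back in output bytes 0..7; output bytes 8..15 are zeroed. */
static const int8_t kPackHigh16[16] = {
    1, 2, 4, 5, 7, 8, 10, 11, -1, -1, -1, -1, -1, -1, -1, -1
};
```

With this mask, an input sample stored as the bytes 0x56 0x34 0x12 (the value 0x123456) comes out as 0x34 0x12, which is 0x1234 in little-endian order.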
Also, packusdw doesn't convert full-range 32-bit to full-range 16-bit. It saturates (to 0xffff) any 32-bit element greater than 2^16-1. So you have to right-shift the data yourself to go from 24-bit full range to 16-bit full range. (In audio, the conversion from 16 to 24 bits is done by adding 8 zero bits as the least-significant bits, not the most-significant.)
Anyway, the implication of this is that we want to pack the high 16 bits of every 24 bits of input back-to-back. We can do this with just a shuffle.
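The difference between saturating and keeping the high 16 bits can be sketched in plain C. Both function names below are hypothetical: `packusdw_lane` models a single 32-bit lane of packusdw (signed input, unsigned saturation), and `s24le_to_s16_trunc` is a scalar reference for the truncating conversion described above.

```c
#include <stdint.h>

/* Scalar model of one PACKUSDW lane: signed 32-bit in,
   unsigned-saturated 16-bit out. */
static uint16_t packusdw_lane(int32_t x)
{
    if (x < 0) return 0;
    if (x > 0xFFFF) return 0xFFFF;
    return (uint16_t)x;
}

/* Truncating 24 -> 16 conversion of one little-endian sample:
   drop the low byte, keep the middle and high bytes. */
static int16_t s24le_to_s16_trunc(const uint8_t s[3])
{
    int32_t v = (int32_t)s[1] | ((int32_t)s[2] << 8);  /* 0..65535 */
    if (v >= 0x8000)
        v -= 0x10000;                                   /* sign-extend */
    return (int16_t)v;
}
```

For the 24-bit value 0x123456, the saturating pack yields 0xFFFF, while keeping the high 16 bits yields 0x1234.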
Also, be careful about reading past the end of the array. Each movdqu reads 16B, but you only use the first 12.
I could have used the same mask twice, and used PUNPCKLQDQ to put the high 8B into the top half of the reg holding the low 8B. However, punpck instructions compete for the same port as pshufb (ports 1 and 5 on Nehalem/Sandybridge/IvyBridge, port 5 only on Haswell). por can run on any of ports 0, 1, 5, even on Haswell, so it doesn't create a port 5 bottleneck problem.
Loop overhead is too high without unrolling to saturate port 5 even on Haswell, but it's close. (9 fused-domain uops, 2 of them requiring port 5. There's no loop-carried dependency, and enough of the uops are loads/stores that 4 uops per cycle should be possible.) Unrolling by 2 or 3 should do the trick. Nehalem/Sandybridge/IvyBridge won't bottleneck on execution ports, since they can shuffle on two ports. Core2 takes 4 uops for PSHUFB, and can only sustain 1 per 2 cycles, but it's still the fastest way to do this data movement. Penryn (aka Wolfdale) should be fast for this too, but I haven't looked at the details. Decoder throughput will be an issue on pre-Nehalem, though.
So if everything's in L1 cache, we can generate 16B of 16b audio per 2 cycles. (Or less, with some unrolling, on pre-Haswell.)
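As a rough sketch of the approach discussed above (two pshufb shuffles with different masks, merged with por), the following uses intrinsics rather than inline asm. The function name, the masks, and the `TARGET_SSSE3` / `HAVE_X86_SIMD` macros are assumptions made for this illustration. The loop is not unrolled, it assumes the CPU supports SSSE3 (real code would check at run time), and it falls back to a scalar tail so that no 16-byte load reaches past the end of the source buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#define HAVE_X86_SIMD 1
#include <tmmintrin.h>              /* SSSE3: _mm_shuffle_epi8 (pshufb) */
#endif

/* GCC/Clang: enable SSSE3 for this one function without a global flag. */
#if defined(HAVE_X86_SIMD) && defined(__GNUC__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

/* Hypothetical helper: truncating conversion of `samples` packed s24le
   samples at src into s16le samples at dst. */
TARGET_SSSE3
static void convert_s24le_to_s16le(const uint8_t *src, uint8_t *dst, size_t samples)
{
    size_t i = 0;
    assert(src != NULL && dst != NULL);

#ifdef HAVE_X86_SIMD
    /* Middle and high byte of 4 samples go to the low 8 bytes (lo)
       or to the high 8 bytes (hi); all other output bytes are zeroed. */
    const __m128i lo = _mm_setr_epi8(1, 2, 4, 5, 7, 8, 10, 11,
                                     -1, -1, -1, -1, -1, -1, -1, -1);
    const __m128i hi = _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                     1, 2, 4, 5, 7, 8, 10, 11);

    /* One iteration converts 8 samples but reads src[3i .. 3i+27], i.e.
       28 bytes, so at least 10 samples must remain to stay in bounds. */
    for (; i + 10 <= samples; i += 8) {
        __m128i a = _mm_loadu_si128((const __m128i *)(src + 3 * i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + 3 * i + 12));
        a = _mm_shuffle_epi8(a, lo);                    /* pshufb */
        b = _mm_shuffle_epi8(b, hi);                    /* pshufb */
        _mm_storeu_si128((__m128i *)(dst + 2 * i),
                         _mm_or_si128(a, b));           /* por */
    }
#endif

    /* Scalar tail (and non-x86 fallback): copy bytes 1 and 2 of each sample. */
    for (; i < samples; i++) {
        dst[2 * i]     = src[3 * i + 1];
        dst[2 * i + 1] = src[3 * i + 2];
    }
}
```

Note that the count here is in samples rather than bytes, which differs from the `len` parameter of the code in the question.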
AMD CPUs (e.g. Steamroller) also have pshufb on the same port as punpck, while booleans can run on either of the other 2 vector ports, so it's the same situation. Shuffles are higher latency than on Intel, but throughput is still 1 per cycle.
If you want proper rounding instead of truncation, add something like 2^7 to the samples before truncation. (Probably requiring some sign-adjustment.) If you want dithering, you need something even more complex, and should search for it or look for a library implementation. Audacity is open source, so you could look at how they do it.