I have been trying to figure out the best way to use AMD64 SIMD instructions to implement a lerp over large sets of u8 values, but I can't work out the right instructions without requiring all of the SIMD extensions.
The formula I am working with right now is:
u8* a;
u8* b;
u8* result;
size_t count;
u16 total;
u16 progress;
u32 invertedProgress = total - progress;
for(size_t i = 0; i < count; i++){
    result[i] = (u8)((b[i] * progress + a[i] * invertedProgress) / total);
}
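For checking any vectorized version against, the formula above can be wrapped in a small self-contained function. This is only a sketch: the name `lerp_u8_scalar` is made up for illustration, and it assumes `total` is non-zero and `progress <= total`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for the formula above.
   Assumes total != 0 and progress <= total. */
static void lerp_u8_scalar(const uint8_t *a, const uint8_t *b, uint8_t *result,
                           size_t count, uint16_t total, uint16_t progress)
{
    /* Weight applied to a[i]; the weight applied to b[i] is progress. */
    const uint32_t inverted = (uint32_t)total - progress;

    for (size_t i = 0; i < count; i++) {
        /* 255 * 65535 is below 2^24, so the sum fits in 32 bits. */
        uint32_t blended = (uint32_t)b[i] * progress + (uint32_t)a[i] * inverted;
        result[i] = (uint8_t)(blended / total);
    }
}
```

Note that with `progress <= total` the sum is at most `255 * total`, so it only fits in an unsigned 16-bit lane when `total` is 257 or less. For larger `total` values a 16-bit SIMD version would overflow where this 32-bit version does not.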
I am thinking it would look something like:
u8* a;
u8* b;
u8* result;
size_t count;
u16 total;
u16 progress;
__m128i mmxZero;
__m128i mmxProgress;
__m128i mmxTotal;
__m128i mmxInvertedProgress;
__m128i mmxProductA;
__m128i mmxProductB;
mmxZero = _mm_xor_si128(mmxZero, mmxZero); // Is there a clear?
mmxProgress = Fill with progress;
mmxTotal = Fill with total;
// total - progress, in 16-bit lanes
mmxInvertedProgress = _mm_sub_epi16(mmxTotal, mmxProgress);
for(size_t i = 0; i < count; i += 8){
    mmxProductA = load A;
    // u8 -> u16
    mmxProductA = _mm_unpacklo_epi8(mmxProductA, mmxZero);
    mmxProductB = load B;
    // u8 -> u16
    mmxProductB = _mm_unpacklo_epi8(mmxProductB, mmxZero);
    // a * (total - progress)
    mmxProductA = _mm_mullo_epi16(mmxProductA, mmxInvertedProgress);
    // b * progress
    mmxProductB = _mm_mullo_epi16(mmxProductB, mmxProgress);
    // a * (total - progress) + b * progress
    mmxProductA = _mm_add_epi16(mmxProductA, mmxProductB);
    // (a * (total - progress) + b * progress) / total
    mmxProductA = _mm_div_epi16(mmxProductA, mmxTotal);
    mmxProductA = saturated u16 -> u8;
    store result = mmxProductA;
}
There are a couple of things here that I could not find while digging around in the guide, mostly related to loading and storing values.
I know there are newer instructions that can process more data at once, but this initial implementation is supposed to work on older chips.
For this example I am also ignoring alignment and the potential for a buffer overrun; I figured those were a little out of scope for the question.
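The placeholder steps in the sketch above (clear, fill, load, narrow, store) all have SSE2 intrinsics, so they work on any AMD64 chip. Below is a minimal illustration of those steps only, not a full lerp: the helper name `scale8_u8` is hypothetical, and it assumes the divisor is fixed at 256 so the division becomes a shift, which is not the general `total` from the question.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] = (src[i] * factor) >> 8 for 8 bytes.
   Assumes factor <= 256 so every product fits in an unsigned 16-bit lane. */
static void scale8_u8(const uint8_t *src, uint8_t *dst, uint16_t factor)
{
    const __m128i zero = _mm_setzero_si128();           /* the "clear" */
    const __m128i f    = _mm_set1_epi16((short)factor); /* fill all 8 lanes */

    __m128i v = _mm_loadl_epi64((const __m128i *)src);  /* load 8 bytes, no alignment needed */
    v = _mm_unpacklo_epi8(v, zero);                     /* u8 -> u16 */
    v = _mm_mullo_epi16(v, f);                          /* low 16 bits of each product */
    v = _mm_srli_epi16(v, 8);                           /* divide by 256 */
    v = _mm_packus_epi16(v, zero);                      /* u16 -> u8 with unsigned saturation */
    _mm_storel_epi64((__m128i *)dst, v);                /* store the low 8 bytes */
}
```

Processing 16 bytes per iteration would use `_mm_loadu_si128` / `_mm_storeu_si128` together with `_mm_unpackhi_epi8` for the upper half; the 8-byte form is kept here because it matches the `i += 8` loop in the sketch.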
Good question. As you found out, SSE has no integer divide instruction, and (unlike ARM NEON) it doesn’t have multiplication or FMA for bytes.
Here’s what I usually do instead. The code below splits the vectors into even and odd bytes, uses 16-bit multiplication instructions to scale each half separately, then merges them back into bytes.