I am currently vectorizing some code to store 32-bit pixel data using AVX2 intrinsics. Since the AVX2 registers are 256 bits, I can operate on 8 pixels simultaneously. The code I currently have works to load 8 pixels from one buffer and then store them to another buffer:
// Load 256 bits (8 pixels) from memory into register YMMx
BitmapOctoPixel = _mm256_load_si256((const __m256i*)((PIXEL32*)GameBitmap->Memory + BitmapOffset));
// adjust the colors
// As an example, the YMM0 register currently holds these pixels:
// AARRGGBBAARRGGBB-AARRGGBBAARRGGBB-AARRGGBBAARRGGBB-AARRGGBBAARRGGBB
// YMM0 = FF33281EFF000000-FF33281E00FFFFFF-00FFFFFF00FFFFFF-00FFFFFF00FFFFFF
// store the result into the destination buffer
_mm256_store_si256((__m256i*)((PIXEL32*)gBackBuffer.Memory + MemoryOffset), BitmapOctoPixel);
Now I would like to only move the pixels where the alpha channel (the "AA" component) is 255. I am not trying to do alpha blending. I only want to store the pixels that have 0xFF as an alpha value.
I think I can do this using a mask and the _mm256_maskstore_epi32()
function, but I have not been able to figure it out yet after several hours of trying.
Thanks
First of all, note that
_mm256_maskstore_epi32
is pretty slow on AMD Zen / Zen2, like 19 uops and one per 6 cycle throughput. (https://uops.info/). Masked loads are fine, but masked stores are only efficient on Intel hardware. You might want to blend with the original value and do a full vector store.maskstore uses the high bit of the 32-bit element as the control for store or not.
So you need to create a vector that has that bit set when alpha is exactly ==
0xFF
.Conveniently the 8-bit alpha is already at the top of the 32-bit element, so its high bit is the control bit for the whole 32-bit element. We can just use a packed-8-bit compare for equality to set all bits of the alpha channel (including the high bit) to 0 or 1, according to the whole alpha byte being
0xFF
.maskstore
doesn't care at all about the other bits in the mask, so it doesn't matter that the 8-bit compare result for other parts of the pixel are basically garbage.set1_epi8(-1)
instead ofset1_epi32(0xFF000000)
makes the constant cheaper to create: the compiler can create all-ones by comparing a register to itself, instead of loading the constant from memory. (Godbolt; of course this function will inline in real use cases.)After inlining, the all-ones vector can get hoisted out of a loop.
If exact equality wasn't what you needed, e.g.
alpha >= 0xF0
, you might have had to range-shift to signed (by subtracting or xoring0x80
) before avpcmpgtb
_mm256_cmpgt_epi8. After that adjustment, you could do a dword integer compare to create 32-bit mask elements, so you could use this withvpblendvb
(integer byte-blend).If alpha had been in a different position in the 32-bit element, left-shift before compare.
BTW, if you're storing pixels back where you found them, you could also consider
vblendvps
with the original data before a regular store, instead of maskstore.There is no 32-bit granularity integer blend, so you'd have to
_mm256_castsi256_ps
to keep the compiler happy about using_mm256_blendv_ps
on__m256i
variables.FP blend will cost an extra cycle or 2 of bypass latency on most CPUs, but no throughput cost as long as OoO exec can hide that latency, which is likely when you're working on independent vectors of pixels. But doing it this way saves instructions vs.
vpxor
/vpcmpgtd
to set up forvpblendvb
.Avoiding maskstore is very good on AMD.