Using SIMD, how do I conditionally move only the pixels with an alpha channel value of 255?

341 Views Asked by At

I am currently vectorizing some code to store 32-bit pixel data using AVX2 intrinsics. Since the AVX2 registers are 256 bits, I can operate on 8 pixels simultaneously. The code I currently have works to load 8 pixels from one buffer and then store them to another buffer:

// Load 256 bits (8 pixels) from memory into register YMMx           
BitmapOctoPixel = _mm256_load_si256((const __m256i*)((PIXEL32*)GameBitmap->Memory + BitmapOffset));

// adjust the colors

// As an example, the YMM0 register currently holds these pixels:
//        AARRGGBBAARRGGBB-AARRGGBBAARRGGBB-AARRGGBBAARRGGBB-AARRGGBBAARRGGBB
// YMM0 = FF33281EFF000000-FF33281E00FFFFFF-00FFFFFF00FFFFFF-00FFFFFF00FFFFFF

// store the result into the destination buffer
_mm256_store_si256((__m256i*)((PIXEL32*)gBackBuffer.Memory + MemoryOffset), BitmapOctoPixel);

Now I would like to only move the pixels where the alpha channel (the "AA" component) is 255. I am not trying to do alpha blending. I only want to store the pixels that have 0xFF as an alpha value.

I think I can do this using a mask and the _mm256_maskstore_epi32() function, but I have not been able to figure it out yet after several hours of trying.

Thanks

2

There are 2 best solutions below

1
On BEST ANSWER

First of all, note that _mm256_maskstore_epi32 is pretty slow on AMD Zen / Zen2, like 19 uops and one per 6 cycle throughput. (https://uops.info/). Masked loads are fine, but masked stores are only efficient on Intel hardware. You might want to blend with the original value and do a full vector store.


maskstore uses the high bit of the 32-bit element as the control for store or not.
So you need to create a vector that has that bit set when alpha is exactly == 0xFF.

Conveniently the 8-bit alpha is already at the top of the 32-bit element, so its high bit is the control bit for the whole 32-bit element. We can just use a packed-8-bit compare for equality to set all bits of the alpha channel (including the high bit) to 0 or 1, according to the whole alpha byte being 0xFF. maskstore doesn't care at all about the other bits in the mask, so it doesn't matter that the 8-bit compare result for other parts of the pixel are basically garbage.


void store_opaque_only(void *dst, __m256i pixels)
{
// As an example, the YMM0 register currently holds these pixels:
//        AARRGGBBAARRGGBB-AARRGGBBAARRGGBB-AARRGGBBAARRGGBB-AARRGGBBAARRGGBB
// YMM0 = FF33281EFF000000-FF33281E00FFFFFF-00FFFFFF00FFFFFF-00FFFFFF00FFFFFF

    __m256i opaque = _mm256_cmpeq_epi8(pixels, _mm256_set1_epi8(-1));
    _mm256_maskstore_epi32(dst, opaque, pixels);
}

set1_epi8(-1) instead of set1_epi32(0xFF000000) makes the constant cheaper to create: the compiler can create all-ones by comparing a register to itself, instead of loading the constant from memory. (Godbolt; of course this function will inline in real use cases.)

# gcc10.2 -O3 -march=skylake
store_opaque_only:
    vpcmpeqd        ymm1, ymm1, ymm1           # all-ones
    vpcmpeqb        ymm1, ymm0, ymm1           # opaque =  pixels == -1
    vpmaskmovd      YMMWORD PTR [rdi], ymm1, ymm0
    ret

After inlining, the all-ones vector can get hoisted out of a loop.


If exact equality wasn't what you needed, e.g. alpha >= 0xF0, you might have had to range-shift to signed (by subtracting or xoring 0x80) before a vpcmpgtb _mm256_cmpgt_epi8. After that adjustment, you could do a dword integer compare to create 32-bit mask elements, so you could use this with vpblendvb (integer byte-blend).

If alpha had been in a different position in the 32-bit element, left-shift before compare.

BTW, if you're storing pixels back where you found them, you could also consider vblendvps with the original data before a regular store, instead of maskstore.

There is no 32-bit granularity integer blend, so you'd have to _mm256_castsi256_ps to keep the compiler happy about using _mm256_blendv_ps on __m256i variables.

FP blend will cost an extra cycle or 2 of bypass latency on most CPUs, but no throughput cost as long as OoO exec can hide that latency, which is likely when you're working on independent vectors of pixels. But doing it this way saves instructions vs. vpxor / vpcmpgtd to set up for vpblendvb.

Avoiding maskstore is very good on AMD.

0
On

I'm not sure whether this fully answers your question or not, but this comparison would be compatible with __m256_maskstore_epi32(), where I assume out_ptr points to the location you want to store to:

// As an example, the YMM0 register currently holds these pixels:
//        AARRGGBBAARRGGBB-AARRGGBBAARRGGBB-AARRGGBBAARRGGBB-AARRGGBBAARRGGBB
// YMM0 = FF33281EFF000000-FF33281E00FFFFFF-00FFFFFF00FFFFFF-00FFFFFF00FFFFFF

// compare every 8-bit value against 0xFF; for pixels that have this value in their alpha
// channel, the corresponding byte in alpha_mask will be 0xFF
__m256i mask = _mm256_cmpeq_epi8(BitmapOctoPixel, _mm256_set1_epi8(0xFF));
// now, you can use the masked store directly; the high bit in each 32-bit pixel is used
// to determine whether to do the store
__m256_maskstore_si256((__m256i *) out_ptr, mask, BitmapOctoPixel);

However, this will leave gaps in the output buffer where you have pixels that did not have 0xFF alpha values. Is that what you want? Or do you want to contiguously store all of the pixels that passed the test? In that case, you would want something to the effect of _mm256_mask_compressstoreu_epi32() from AVX512, which is more work to emulate in AVX2.