The code below converts a row from an 8-bit palettized format to 32-bit RGBA.
Before I try to implement it, I would like to know whether the code below is even suited to being optimized with DirectXMath or, alternatively, ARM NEON intrinsics or inline assembly. My first look at the documentation did not reveal anything that would cover the table-lookup part.
void CopyPixels(BYTE *pDst, BYTE *pSrc, int width,
                const BYTE mask, Color* pColorTable)
{
    if (width)
    {
        do
        {
            BYTE b = *pSrc++;
            if (b != mask)
            {
                // Translate to 32-bit RGB value if not masked
                const Color* pColor = pColorTable + b;
                pDst[0] = pColor->Blue;
                pDst[1] = pColor->Green;
                pDst[2] = pColor->Red;
                pDst[3] = 0xFF;
            }
            // Skip to next pixel
            pDst += 4;
        }
        while (--width);
    }
}
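You will need a LUT of 256 * 4 bytes = 1024 bytes. This kind of job is not suited to SIMD at all: each output pixel depends on an independent, data-dependent table read, and that is exactly what most SIMD instruction sets lack. (The exception is the gather instructions that AVX2 adds on Intel's new Haswell core.)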
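As a rough illustration of the 1024-byte table, here is a scalar sketch that precomputes one ready-to-store 4-byte pixel per palette index, so the inner loop does a single 4-byte copy instead of four byte stores. The `Color` layout and the names `BuildPackedTable` and `CopyPixelsPacked` are assumptions made for this example; the question does not show the real `Color` definition.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

typedef std::uint8_t BYTE;

// Assumed layout, for illustration only.
struct Color { BYTE Red, Green, Blue; };

// Fill a 256-entry table (256 * 4 = 1024 bytes) with pixels that are
// already in destination byte order: B, G, R, 0xFF.
void BuildPackedTable(const Color* pColorTable, std::uint32_t* pPacked)
{
    assert(pColorTable != nullptr && pPacked != nullptr);
    for (int i = 0; i < 256; ++i)
    {
        const BYTE bytes[4] = { pColorTable[i].Blue, pColorTable[i].Green,
                                pColorTable[i].Red, 0xFF };
        // memcpy keeps the byte order independent of host endianness.
        std::memcpy(&pPacked[i], bytes, 4);
    }
}

// Same behaviour as CopyPixels, but with one table read and one
// 4-byte copy per unmasked pixel.
void CopyPixelsPacked(BYTE* pDst, const BYTE* pSrc, int width,
                      BYTE mask, const std::uint32_t* pPacked)
{
    for (; width > 0; --width, pDst += 4)
    {
        const BYTE b = *pSrc++;
        if (b != mask)
            std::memcpy(pDst, &pPacked[b], 4);
    }
}
```

The table only needs rebuilding when the palette changes, so its cost is spread over every row of the image. Whether this actually beats the original loop depends on the compiler and target, so it is worth measuring.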
NEON can handle LUTs of at most 32 bytes with VTBL and VTBX, which is far short of the 1024 bytes needed here. Those instructions are more or less meant to work in conjunction with CLZ results, supplying starting values for Newton-Raphson iterations.
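To make the 32-byte limit concrete, here is a plain C++ model of what a four-register VTBL and VTBX do per byte lane on ARMv7 NEON. It is not NEON code, and the names `ModelVtbl4` and `ModelVtbx4` are made up for this sketch. It only mirrors the documented lane semantics: an index of 32 or more yields 0 for VTBL and leaves the destination lane untouched for VTBX.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of VTBL with a 4 x 8-byte (32-byte) table:
// indices 0..31 select a table byte, anything larger produces 0.
void ModelVtbl4(std::uint8_t out[8], const std::uint8_t table[32],
                const std::uint8_t idx[8])
{
    assert(out != nullptr && table != nullptr && idx != nullptr);
    for (int i = 0; i < 8; ++i)
        out[i] = (idx[i] < 32) ? table[idx[i]] : 0;
}

// Scalar model of VTBX: same lookup, but an out-of-range index
// leaves the existing destination byte unchanged.
void ModelVtbx4(std::uint8_t inout[8], const std::uint8_t table[32],
                const std::uint8_t idx[8])
{
    assert(inout != nullptr && table != nullptr && idx != nullptr);
    for (int i = 0; i < 8; ++i)
        if (idx[i] < 32)
            inout[i] = table[idx[i]];
}
```

A 256-entry palette with 4 bytes per entry would need 32 such tables, plus a way to select the right one per lane, which is why the lookup does not map onto these instructions directly.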