Optimizing a scanline conversion function for ARM


Asked by Oliver Weichhold on 27 July 2013 at 05:03

The code below converts a row from an 8-bit palettized format to 32-bit RGBA.

Before I try to optimize it, I would like to know whether the code below is even suited to optimization with DirectXMath or, alternatively, ARM NEON intrinsics or inline assembly. My first look at the documentation did not reveal anything that would cover the table-lookup part.

void CopyPixels(BYTE *pDst, BYTE *pSrc, int width,
  const BYTE mask, Color* pColorTable)
{
  if (width)
  {
    do
    {
      BYTE b = *pSrc++;
      if (b != mask)
      {
        // Translate to 32-bit RGB value if not masked
        const Color* pColor = pColorTable + b;
        pDst[0] = pColor->Blue;
        pDst[1] = pColor->Green;
        pDst[2] = pColor->Red;
        pDst[3] = 0xFF;
      }
      // Skip to next pixel
      pDst += 4;
    }
    while (--width);
  }
}


Peter M On 29 July 2013 at 16:05

I agree with Jake that this isn't a great vector processor problem and may be handled more efficiently by the ARM main pipeline. That doesn't mean you couldn't optimize it in assembly (plain ARMv7, not NEON) for drastically improved results.

In particular, a simple improvement would be to construct your lookup table so that it can be used with a word-sized copy. This means making sure the Color struct follows the 32-bit RGBA format, including the 0xFF as its fourth byte, so that each pixel becomes a single word copy. This could be a significant performance boost with no assembly required, since it takes a single memory fetch rather than three (plus a constant assignment).

void CopyPixels(RGBA32Color *pDst, BYTE const *pSrc, int width,
  const BYTE mask, RGBA32Color const *pColorTable)
{
  if (width)
  {
    do
    {
      BYTE b = *pSrc++;
      if (b != mask)
      {
        // Translate to 32-bit RGB value if not masked
        *pDst = pColorTable[b];
      }
      // Skip to next pixel
      pDst++;
    }
    while (--width);
  }
}
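The word-sized table described above could be built once from the existing palette. The following is a minimal sketch under stated assumptions: the `Color` layout here is a hypothetical stand-in (the question does not show the real struct), and `BuildPackedTable` and `CopyPixelsPacked` are illustrative names, not part of the original code. It keeps the byte order written by the question's loop (blue, green, red, 0xFF).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the question's Color struct; the real layout is not shown. */
typedef struct { uint8_t Red, Green, Blue; } Color;

/* Pack every palette entry once, in the destination byte order B, G, R, 0xFF. */
void BuildPackedTable(uint32_t packed[256], const Color *colors)
{
    for (int i = 0; i < 256; ++i) {
        const uint8_t bytes[4] = { colors[i].Blue, colors[i].Green,
                                   colors[i].Red, 0xFF };
        memcpy(&packed[i], bytes, sizeof bytes); /* byte-wise, so endian-independent */
    }
}

/* One 32-bit table load and one 32-bit store per unmasked pixel. */
void CopyPixelsPacked(uint32_t *dst, const uint8_t *src, size_t width,
                      uint8_t mask, const uint32_t packed[256])
{
    assert(dst != NULL && src != NULL && packed != NULL);
    for (size_t i = 0; i < width; ++i) {
        const uint8_t b = src[i];
        if (b != mask)
            dst[i] = packed[b]; /* masked pixels leave dst[i] untouched */
    }
}
```

Packing with `memcpy` rather than shifts keeps the in-memory byte order identical to the original byte-by-byte stores regardless of endianness. The table only needs rebuilding when the palette changes, so its cost is amortized over every row converted.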

Jake 'Alquimista' LEE On 28 July 2013 at 23:54 (accepted answer)

You will need a LUT of 256 × 4 bytes = 1024 bytes. This kind of job is not suited to SIMD at all (except with the AVX2 gather instructions on Intel's new Haswell core).

NEON can handle LUTs of at most 32 bytes with VTBL and VTBX, but those instructions are more or less meant to be used together with CLZ to look up starting values for Newton-Raphson iterations.
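To make the 32-byte limit concrete, here is a scalar C model of what a four-register VTBL or VTBX lookup does per byte lane. This is an illustrative sketch only: `ModelVtbl` and `ModelVtbx` are hypothetical names, not NEON intrinsics, and the code models the instruction semantics rather than using NEON itself.

```c
#include <assert.h>
#include <stdint.h>

enum { TBL_BYTES = 32, LANES = 8 }; /* four 8-byte D registers of table, 8 lanes */

/* Model of VTBL: an index outside the table yields 0 in that lane. */
void ModelVtbl(uint8_t out[LANES], const uint8_t table[TBL_BYTES],
               const uint8_t idx[LANES])
{
    for (int i = 0; i < LANES; ++i)
        out[i] = (idx[i] < TBL_BYTES) ? table[idx[i]] : 0;
}

/* Model of VTBX: an index outside the table leaves that lane unchanged. */
void ModelVtbx(uint8_t inout[LANES], const uint8_t table[TBL_BYTES],
               const uint8_t idx[LANES])
{
    for (int i = 0; i < LANES; ++i)
        if (idx[i] < TBL_BYTES)
            inout[i] = table[idx[i]];
}
```

A 256-entry palette with 4 bytes per entry needs 1024 bytes, which is 32 times what a single lookup can address, so each pixel's index would have to be resolved across many table chunks. That is why the scalar loop with a word-sized table tends to be the more practical route here.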
