SSE2 convert packed RGB to RGBA pixels (add a 4th 0xFF byte after every 3 bytes)


My application is processing a computationally heavy near realtime workload which I need to speed up as much as possible. The software is written in C++ and only targeting Linux.

My program grabs a 6.4 megapixel RAW data buffer off a specialist astronomical camera which is capable of delivering 25 fps at 3096px x 2080px. This stream is then debayered, in realtime, by using a high quality linear interpolation debayering algorithm. I know that a HQ linear interpolation debayering algorithm is always going to be computationally heavy but there are other areas of my program that I would like to speed up.

Once the stream has been debayered, I need to convert the RGB buffer (created from debayering) into an RGBA buffer because it's my understanding (supported by profiling) that GPUs operate more efficiently on RGBA pixel buffers. However, I'm happy to stand corrected on this.


Initially, I wrote a very simple for loop (below) which, of course, yielded dreadful results.

// both buffers have uint8_t elements
for(int n = 0, m = 0; n < m_Width * m_Height * 4; n+=4, m+=3)
{
     m_display_buffer[n] = in_buffer[m];
     m_display_buffer[n+1] = in_buffer[m+1];
     m_display_buffer[n+2] = in_buffer[m+2];
     m_display_buffer[n+3] = 255;
}

The above code gave me a frame rate of 13 fps. My next experiment was to initialise the buffer with all elements equal to 255 and then use the following code:

uint8_t *dsp = m_display_buffer;
uint8_t *in_8 = (uint8_t*) in_buffer;

for (int n = 0; n < m_Width * m_Height; n++)
{
    *dsp++ = *in_8++;
    *dsp++ = *in_8++;
    *dsp++ = *in_8++;
    dsp++;  // skip the alpha byte, which was pre-initialised to 255
}

The above code significantly sped up the loop; now achieving 23.9 fps running on an i7-7700 laptop. However, running this code on older machines still gives very disappointing frame rates. I know that older machines struggle with debayering but profiling clearly shows that converting to an RGBA buffer is causing significant problems.


I have read that it might be possible to use SSE intrinsics to do this much more efficiently; however, I have zero experience with SSE intrinsics.

I've tried many SSE examples found online but cannot get any of them to work. I would therefore be grateful if somebody experienced with SSE would be able to help me with this problem.

I cannot target SSE any higher than 2 or possibly 3 because my software might be run on much older hardware.
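
To make the constraint concrete, here is a minimal sketch of the kind of SSE2-only routine I am after. It is untuned and not benchmarked, and it uses only byte shifts, ANDs and ORs because `pshufb` needs SSSE3. The function name `rgb_to_rgba_sse2` and its parameters are hypothetical, not part of my code base. It assumes tightly packed 8-bit RGB input and converts 4 pixels per iteration, leaving the last few pixels to a scalar loop so that the 16-byte load never reads past the end of the source buffer.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper: expands packed RGB to RGBA with alpha = 0xFF.
// Uses SSE2 only (no pshufb, which would require SSSE3).
void rgb_to_rgba_sse2(const std::uint8_t* src, std::uint8_t* dst, std::size_t pixels)
{
    // Each mask keeps the three colour bytes of one output pixel.
    const __m128i m0 = _mm_set_epi32(0, 0, 0, 0x00FFFFFF);
    const __m128i m1 = _mm_set_epi32(0, 0, 0x00FFFFFF, 0);
    const __m128i m2 = _mm_set_epi32(0, 0x00FFFFFF, 0, 0);
    const __m128i m3 = _mm_set_epi32(0x00FFFFFF, 0, 0, 0);
    const __m128i alpha = _mm_set1_epi32(static_cast<int>(0xFF000000u));

    std::size_t i = 0;
    // Every iteration loads 16 bytes but consumes only 12, so stop while
    // the whole load still lies inside the source buffer.
    for (; 3 * i + 16 <= 3 * pixels; i += 4) {
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 3 * i));
        // Shift pixel k left by k bytes so it lands at byte offset 4*k.
        const __m128i p0 = _mm_and_si128(v, m0);
        const __m128i p1 = _mm_and_si128(_mm_slli_si128(v, 1), m1);
        const __m128i p2 = _mm_and_si128(_mm_slli_si128(v, 2), m2);
        const __m128i p3 = _mm_and_si128(_mm_slli_si128(v, 3), m3);
        __m128i out = _mm_or_si128(_mm_or_si128(p0, p1), _mm_or_si128(p2, p3));
        out = _mm_or_si128(out, alpha);  // set every 4th byte to 0xFF
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 4 * i), out);
    }

    // Scalar tail for the remaining pixels.
    for (; i < pixels; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0xFF;
    }
}
```

In my case it would be called as `rgb_to_rgba_sse2(in_buffer, m_display_buffer, m_Width * m_Height)`, with no need to pre-fill the display buffer with 255. I don't know whether this would actually beat the pointer loop above, since the conversion may simply be limited by memory bandwidth.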

I would be grateful if somebody would be able to point me in the right direction.
