My application is processing a computationally heavy near realtime workload which I need to speed up as much as possible. The software is written in C++ and only targeting Linux.
My program grabs a 6.4 megapixel RAW data buffer off a specialist astronomical camera which is capable of delivering 25 fps at 3096px x 2080px. This stream is then debayered, in realtime, by using a high quality linear interpolation debayering algorithm. I know that a HQ linear interpolation debayering algorithm is always going to be computationally heavy but there are other areas of my program that I would like to speed up.
Once the stream has been debayered, I need to convert the RGB buffer (created by debayering) into an RGBA buffer because it's my understanding (supported by profiling) that GPUs operate more efficiently on RGBA pixel buffers. However, I'm happy to stand corrected on this.
Initially, I wrote a very simple for loop (below) which, of course, yielded dreadful results.
    // both buffers have uint8_t elements
    for (int n = 0, m = 0; n < m_Width * m_Height * 4; n += 4, m += 3)
    {
        m_display_buffer[n]     = in_buffer[m];
        m_display_buffer[n + 1] = in_buffer[m + 1];
        m_display_buffer[n + 2] = in_buffer[m + 2];
        m_display_buffer[n + 3] = 255;
    }
The above code gave me a frame rate of 13 fps. My next experiment was to initialise the buffer with all elements equal to 255 and then use the following code:
    uint8_t *dsp = m_display_buffer;
    uint8_t *in_8 = (uint8_t*) in_buffer;
    for (int n = 0; n < m_Width * m_Height; n++)
    {
        *dsp++ = *in_8++;   // R
        *dsp++ = *in_8++;   // G
        *dsp++ = *in_8++;   // B
        dsp++;              // skip alpha, which was pre-initialised to 255
    }
The above code significantly sped up the loop; now achieving 23.9 fps running on an i7-7700 laptop. However, running this code on older machines still gives very disappointing frame rates. I know that older machines struggle with debayering but profiling clearly shows that converting to an RGBA buffer is causing significant problems.
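For comparison, here is a minimal sketch of a scalar variant that moves one 32-bit word per pixel instead of three separate bytes, so the destination does not need to be pre-filled with 255. The function name `rgb_to_rgba_words` and its parameters are hypothetical, it assumes a little-endian CPU (true for x86), and I have not measured whether it is actually faster than the loop above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: expands packed RGB to RGBA with alpha = 255.
// Assumes a little-endian CPU, so the top byte of the 32-bit word is the
// fourth byte in memory.
void rgb_to_rgba_words(const uint8_t* src, uint8_t* dst, std::size_t pixels)
{
    if (pixels == 0) return;

    std::size_t i = 0;
    // For every pixel except the last, a 4-byte load stays inside the source
    // buffer because the fourth byte belongs to the next pixel.
    for (; i + 1 < pixels; ++i) {
        uint32_t px;
        std::memcpy(&px, src + 3 * i, 4);   // R, G, B plus one stray byte
        px |= 0xFF000000u;                  // overwrite the stray byte with alpha
        std::memcpy(dst + 4 * i, &px, 4);
    }

    // Last pixel: copy byte by byte so nothing is read past the end of src.
    dst[4 * i + 0] = src[3 * i + 0];
    dst[4 * i + 1] = src[3 * i + 1];
    dst[4 * i + 2] = src[3 * i + 2];
    dst[4 * i + 3] = 255;
}
```

With the buffers from my code this would be called as `rgb_to_rgba_words(in_8, m_display_buffer, m_Width * m_Height)`. The `std::memcpy` calls are there to avoid unaligned or aliasing pointer casts; compilers normally reduce them to plain loads and stores.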
I have read that it might be possible to do this much more efficiently with SSE intrinsics; however, I have zero experience with them.
I've tried many SSE examples found online but cannot get any of them to work. I would therefore be grateful if somebody experienced with SSE could help me with this problem.
I cannot target SSE any higher than 2 or possibly 3 because my software might be run on much older hardware.
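To make the SSE2 constraint concrete, this is a rough sketch of the kind of routine I am after, using only SSE2 intrinsics from `<emmintrin.h>` (no `_mm_shuffle_epi8`, which needs SSSE3). It converts four pixels per iteration by byte-shifting the 16 loaded bytes and masking each shifted copy into its own 32-bit lane. The function name `rgb_to_rgba_sse2` is hypothetical, and this is untuned and unbenchmarked, so treat it as an illustration rather than a known-good solution.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper: expands packed RGB to RGBA (alpha = 255) using SSE2 only.
void rgb_to_rgba_sse2(const uint8_t* src, uint8_t* dst, std::size_t pixels)
{
    // Each mask keeps the low three bytes of one 32-bit lane.
    const __m128i lane0 = _mm_set_epi32(0, 0, 0, 0x00FFFFFF);
    const __m128i lane1 = _mm_set_epi32(0, 0, 0x00FFFFFF, 0);
    const __m128i lane2 = _mm_set_epi32(0, 0x00FFFFFF, 0, 0);
    const __m128i lane3 = _mm_set_epi32(0x00FFFFFF, 0, 0, 0);
    const __m128i alpha = _mm_set1_epi32(static_cast<int>(0xFF000000u));

    std::size_t i = 0;
    // Each iteration loads 16 bytes but consumes only 12, so the vector loop
    // stops while at least 6 pixels (18 bytes) remain in the source buffer.
    for (; i + 6 <= pixels; i += 4) {
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 3 * i));
        // Shifting left by k bytes moves pixel k's RGB to the start of lane k.
        __m128i out = _mm_and_si128(v, lane0);
        out = _mm_or_si128(out, _mm_and_si128(_mm_slli_si128(v, 1), lane1));
        out = _mm_or_si128(out, _mm_and_si128(_mm_slli_si128(v, 2), lane2));
        out = _mm_or_si128(out, _mm_and_si128(_mm_slli_si128(v, 3), lane3));
        out = _mm_or_si128(out, alpha);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 4 * i), out);
    }

    // Scalar tail for the remaining pixels.
    for (; i < pixels; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 255;
    }
}
```

The unaligned load and store intrinsics are used because nothing guarantees that my buffers are 16-byte aligned. Since the conversion is probably limited by memory bandwidth as much as by arithmetic, I don't know how much this would gain on the older machines.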
I would be grateful if somebody would be able to point me in the right direction.