SIMD: convert a uint16_t array to float, work on the floats, then convert back to uint16_t

444 Views Asked by user1273813 At 01 September 2017 at 14:23

I am currently working on a project that manipulates images. To speed up the process (and increase my knowledge), I decided to write some of the basic functions using SIMD instructions.

The scalar version, using plain for loops, is:

uint16_t *A, *B, *C;  // input images A and B, output image C (allocation not shown)
float gAlpha = 0.8f;
float alpha = 0.2f;
for (size_t rw = 0; rw < height; rw++) {
    for (size_t cl = 0; cl < width; cl++) {
        size_t idx = rw * width + cl;  // row-major pixel index
        C[idx] = static_cast<uint16_t>(gAlpha * static_cast<float>(A[idx]) + alpha * static_cast<float>(B[idx]));
    }
}

This loop is probably not perfect, but it does its job and my unit tests give the expected results.

As I said, I am trying to convert these loops to SIMD intrinsics. This is my working code and, as you will see, it is not very pretty. We have access to intrinsics up to AVX2.

size_t n_pixels = height * width;
for (size_t px = 0; px < n_pixels; px += 8) {
    __m128i xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
    __m128i xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
    __m128 ylo = _mm_cvtepi32_ps(xlo);
    __m128 yhi = _mm_cvtepi32_ps(xhi);
    __m256 pxMinFl = _mm256_castps128_ps256(ylo);
    pxMinFl = _mm256_insertf128_ps(pxMinFl, yhi, 1);

    xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
    xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
    ylo = _mm_cvtepi32_ps(xlo);
    yhi = _mm_cvtepi32_ps(xhi);
    __m256 pxMaxFl = _mm256_castps128_ps256(ylo);
    pxMaxFl = _mm256_insertf128_ps(pxMaxFl, yhi, 1);

    __m256 avGain1 = _mm256_set1_ps(gAlpha);
    __m256 avGain2 = _mm256_set1_ps(alpha);

    __m256 prodUp = _mm256_mul_ps(pxMinFl, avGain1);
    __m256 prodBt = _mm256_mul_ps(pxMaxFl, avGain2);
    __m256 pxOutFl = _mm256_add_ps(prodUp, prodBt);

    __m128 ylo_ps = _mm256_castps256_ps128(pxOutFl);
    __m128 yhi_ps = _mm256_extractf128_ps(pxOutFl, 1);
    __m128i xlo_ep = _mm_cvtps_epi32(ylo_ps);
    __m128i xhi_ep = _mm_cvtps_epi32(yhi_ps);      // <- POINT 1

    int* xl = reinterpret_cast<int*>(&xlo_ep);     // <- POINT 2
    for (int i = 0; i < 8; i++) {                  // <- POINT 2
        C[px + i] = static_cast<uint16_t>(xl[i]);  // <- POINT 2
    }
}

There are probably tons of optimizations that could be made to this code, but I have checked that the output of pxOutFl corresponds to the expected values. Where it starts to look like black magic to me is how I have to save the data back into the output array C. First, the code doesn't work if I comment out the line at POINT 1 even though, as you can see, I never use that variable. Second, I would have guessed that there is a better solution than the trick I used to store the data back into the uint16_t array (POINT 2), but I can't find one that works.
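One plausible reading of the POINT 1 behaviour: xl points at xlo_ep, which holds only four 32-bit values, so xl[4] through xl[7] read past the end of that variable. That is undefined behaviour. In practice it may happen to land on xhi_ep when the compiler places the two variables next to each other, which would explain why removing POINT 1 changes the result. A minimal sketch of a well-defined variant, where store8_via_buffer is a hypothetical helper name and only SSE2 is assumed, copies both registers into a buffer before narrowing:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Hypothetical helper (not from the original post): write the eight 32-bit
// results held in two registers to eight consecutive uint16_t slots,
// without reading past the end of either register.
inline void store8_via_buffer(__m128i lo, __m128i hi, uint16_t* dst) {
    assert(dst != nullptr);
    alignas(16) int32_t tmp[8];
    _mm_store_si128(reinterpret_cast<__m128i*>(&tmp[0]), lo);  // lanes 0..3
    _mm_store_si128(reinterpret_cast<__m128i*>(&tmp[4]), hi);  // lanes 4..7
    for (std::size_t i = 0; i < 8; i++) {
        dst[i] = static_cast<uint16_t>(tmp[i]);  // keeps the low 16 bits
    }
}
```

In the loop from the question this would be called as store8_via_buffer(xlo_ep, xhi_ep, &C[px]), which makes the dependency on xhi_ep explicit.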

Could someone point me in the right direction? What am I missing? How could I improve this code?

Thanks in advance!

PS: We use the Intel compiler 2017 from Parallel Studio Professional Edition 2017 on Linux (Fedora 25).


There are 1 best solutions below

Paul R On 01 September 2017 at 14:40

You can re-write all of POINT 2 as:

_mm_storeu_si128((__m128i *)&C[px], xlo_ep);
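Note that xlo_ep holds only the first four results, as 32-bit integers, so a store meant to fill eight uint16_t slots needs both halves narrowed to 16 bits first. A minimal sketch of that step, assuming SSE4.1 is available (it is on any AVX2-capable CPU): store8_packed and the SKETCH_SSE41 macro are hypothetical names, and the per-function target attribute is a GCC/Clang convenience rather than something the original post uses.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang-style per-function target attribute, so the sketch
// builds without -msse4.1; with other compilers enable SSE4.1 in the build.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

// Hypothetical helper: narrow two registers of four int32 lanes each to
// eight uint16_t values (saturating to 0..65535) and write them in one store.
SKETCH_SSE41 inline void store8_packed(__m128i lo, __m128i hi, uint16_t* dst) {
    assert(dst != nullptr);
    // lo fills output lanes 0..3, hi fills output lanes 4..7.
    const __m128i packed = _mm_packus_epi32(lo, hi);
    // Unaligned store: dst does not need 16-byte alignment.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), packed);
}
```

With the variables from the question, the call would be store8_packed(xlo_ep, xhi_ep, &C[px]). Unlike a plain static_cast, the pack saturates: values above 65535 become 65535 and negative values become 0.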

Also note that all instances of _mm_load_si128 should probably be _mm_loadu_si128, since you don't seem to be guaranteeing alignment anywhere.
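Putting these points together, the whole loop could look roughly like the sketch below. It is an illustration under assumptions, not the asker's or the answerer's code: blend_avx2 and SKETCH_AVX2 are hypothetical names, the target attribute is a GCC/Clang convenience, and the AVX2 intrinsic _mm256_cvtepu16_epi32 replaces the unpack/insert sequence. It uses the truncating conversion _mm256_cvttps_epi32 to match static_cast in the scalar loop, whereas _mm_cvtps_epi32 rounds to nearest under the default rounding mode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang-style per-function target attribute, so the sketch
// builds without -mavx2; with other compilers enable AVX2 in the build.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_AVX2 __attribute__((target("avx2")))
#else
#define SKETCH_AVX2
#endif

// Hypothetical function: C[i] = gAlpha * A[i] + alpha * B[i] for n_pixels values.
SKETCH_AVX2 void blend_avx2(const uint16_t* A, const uint16_t* B, uint16_t* C,
                            std::size_t n_pixels, float gAlpha, float alpha) {
    assert(A && B && C);
    const __m256 gain1 = _mm256_set1_ps(gAlpha);  // hoisted out of the loop
    const __m256 gain2 = _mm256_set1_ps(alpha);
    std::size_t px = 0;
    for (; px + 8 <= n_pixels; px += 8) {
        // Unaligned 128-bit loads of eight uint16_t values each.
        const __m128i a16 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(A + px));
        const __m128i b16 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(B + px));
        // Zero-extend to eight int32 lanes, then convert to float.
        const __m256 a = _mm256_cvtepi32_ps(_mm256_cvtepu16_epi32(a16));
        const __m256 b = _mm256_cvtepi32_ps(_mm256_cvtepu16_epi32(b16));
        const __m256 sum = _mm256_add_ps(_mm256_mul_ps(a, gain1),
                                         _mm256_mul_ps(b, gain2));
        // Truncate toward zero like static_cast, then narrow with saturation.
        const __m256i i32 = _mm256_cvttps_epi32(sum);
        const __m128i lo = _mm256_castsi256_si128(i32);
        const __m128i hi = _mm256_extracti128_si256(i32, 1);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(C + px), _mm_packus_epi32(lo, hi));
    }
    for (; px < n_pixels; px++) {  // scalar tail when n_pixels is not a multiple of 8
        C[px] = static_cast<uint16_t>(gAlpha * static_cast<float>(A[px]) +
                                      alpha * static_cast<float>(B[px]));
    }
}
```

A call such as blend_avx2(A, B, C, height * width, gAlpha, alpha) would replace the loop in the question. The scalar tail also removes the requirement that the pixel count be a multiple of 8.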
