I am currently working on a project that manipulates images. To speed up the process (and increase my knowledge), I decided to write some of the basic functions using SIMD instructions.
The code using for loops is
int idx;
uint16_t* A, B, C;
float gAlpha = 0.8;
float alpha = 0.2;
for (size_t rw = 0; rw < height; rw++) {
for (size_t cl = 0; cl < width; cl++) {
idx = rw * width + height;
C[idx] = static_cast<uint16_t>(gAlpha * static_cast<float>(A[idx]) + alpha * static_cast<float>(B[idx]));
}
}
}
This loop is probably not perfect but it makes its job perfectly and my unit test gives me the expected results.
As I said, I am trying to convert these loops using SIMD intrinsic. This is my working code and, as you will see, it is not very pretty... We do have access to intrinsic up to AVX2.
size_t n_pixels = height * width;
for (size_t px = 0; px < n_pixels; px += 8) {
__m128i xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
__m128i xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
__m128 ylo = _mm_cvtepi32_ps(xlo);
__m128 yhi = _mm_cvtepi32_ps(xhi);
__m256 pxMinFl = _mm256_castps128_ps256(ylo);
pxMinFl = _mm256_insertf128_ps(pxMinFl, yhi, 1);
xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
ylo = _mm_cvtepi32_ps(xlo);
yhi = _mm_cvtepi32_ps(xhi);
__m256 pxMaxFl = _mm256_castps128_ps256(ylo);
pxMaxFl = _mm256_insertf128_ps(pxMaxFl, yhi, 1);
__m256 avGain1 = _mm256_set1_ps(gAlpha);
__m256 avGain2 = _mm256_set1_ps(alpha);
__m256 prodUp = _mm256_mul_ps(prodUp, avGain1);
__m256 prodBt = _mm256_mul_ps(prodBt, avGain2);
__m256 pxOutFl = _mm256_add_ps(prodUp, prodBt);
__m128 ylo_ps = _mm256_castps256_ps128(pxOutFl);
__m128 yhi_ps = _mm256_extractf128_ps(pxOutFl, 1);
__m128i xlo_ep = _mm_cvtps_epi32(ylo_ps);
__m128i xhi_ep = _mm_cvtps_epi32(yhi_ps); <- POINT 1
int* xl = reinterpret_cast<int*>(&xlo_ep); <- POINT 2
for (int i=0; i < 8; i++) { <- POINT 2
C[px + i] = static_cast<uint16_t>(xl[i]); <- POINT 2
}
}
There are probably tons of optimization that could be done on this code but I have checked that the output of pxOutFl corresponds to the expected value. Where is start to look like black magic to me is when I looked at how I had to save the data back into the output array C. First of all, the code doesn't work if I comment the line at POINT 1 even if, as you can read, I don't use the variable. Secondly, I would have guessed that there is a better solution than the trick I used to store the data back into the uint16_t array (POINT 2) but I can't find one that is working.
Could someone point me into the correct direction? What am I missing? How could I improve this code?
Thanks in advance!
PS: We use the Intel compiler 2017 for the parallel studio professional edition 2117 on Linux (Fedora 25).
You can re-write all of POINT 2 as:
Also note that all instances of
_mm_load_si128
should probably be_mm_loadu_si128
, since you don't seem to be guaranteeing alignment anywhere.