I am developing a performance critical application which has to be ported into Intel Atom processor which just supports MMX, SSE, SSE2 and SSE3. My previous application had support for SSSE3 as well as AVX now I want to downgrade it to Intel Atom processor(MMX, SSE, SSE2, SSE3).
There is a serious performance downgrade when I replace ssse3 instruction particularly _mm_hadd_epi16
with this code
RegTemp1 = _mm_setr_epi16(RegtempRes1.m128i_i16[0], RegtempRes1.m128i_i16[2],
RegtempRes1.m128i_i16[4], RegtempRes1.m128i_i16[6],
Regfilter.m128i_i16[0], Regfilter.m128i_i16[2],
Regfilter.m128i_i16[4], Regfilter.m128i_i16[6]);
RegTemp2 = _mm_setr_epi16(RegtempRes1.m128i_i16[1], RegtempRes1.m128i_i16[3],
RegtempRes1.m128i_i16[5], RegtempRes1.m128i_i16[7],
Regfilter.m128i_i16[1], Regfilter.m128i_i16[3],
Regfilter.m128i_i16[5], Regfilter.m128i_i16[7]);
RegtempRes1 = _mm_add_epi16(RegTemp1, RegTemp2);
This is the best conversion I was able to come up with for this particular instruction. But this change has seriously affected the performance of the entire program.
Can anyone please suggest a better performance efficient alternative within MMX, SSE, SSE2 and SSE3 instructions to the _mm_hadd_epi16
instruction. Thanks in advance.
If your goal is to take the horizontal sum of 8 16-bit values you can do this with SSE2 like this: