If my understanding is correct,
_mm_movehdup_ps(a)
gives the same result as
_mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 3, 3))?
Is there a performance difference between the two?
`_MM_SHUFFLE` takes the high element first, so `_MM_SHUFFLE(3,3, 1,1)` would do the `movshdup` shuffle.

The main difference is at the assembly level: `movshdup` is a copy-and-shuffle, avoiding a `movaps` to copy the input if the input `a` is still needed later, e.g. as part of a horizontal sum (see Fastest way to do horizontal float vector sum on x86 for an example of how it compiles without a `movaps`, vs. the SSE1 version that uses `shufps`; a sketch of that idiom appears further down).

`movshdup`/`movsldup` can also be a load+shuffle with a memory source operand (`shufps` obviously can't, because it needs the same input twice). On modern Intel CPUs (Sandybridge-family), `movshdup xmm0, [rdi]` decodes to a pure load uop, not micro-fused with an ALU uop, so it doesn't compete for ALU shuffle throughput (port 5) against other shuffles. The load ports contain logic to do broadcast loads (including `movddup` 64-bit broadcast) and `movs[lh]dup` duplication of pairs of elements. More complicated load+shuffles like `vpermilps xmm0, [rdi], 0x12` or `pshufd xmm, [rdi], 0x12` do still decode to multiple uops, possibly micro-fused into a load+ALU uop depending on the uarch.
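As a minimal sketch of that memory-source form (the function name here is mine, not from the question or answer): with an aligned load feeding `_mm_movehdup_ps`, the compiler is free to fold the load into the shuffle instruction's memory operand.

```c
#include <immintrin.h>

// Hypothetical helper: duplicate the odd (high) element of each 64-bit pair,
// reading the vector straight from memory.  With SSE3 enabled, a compiler
// may emit this as a single  movshdup xmm0, [mem]  -- which on
// Sandybridge-family decodes to a pure load uop, no port-5 shuffle uop.
// p must be 16-byte aligned for the legacy-SSE memory-operand form.
__m128 dup_high_pairs(const float *p) {
    return _mm_movehdup_ps(_mm_load_ps(p));
}
```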
Both instructions are the same length: `movshdup` avoids the immediate byte, but `shufps` is an SSE1 instruction so it only has a 2-byte opcode, 1 byte shorter than SSE2 and SSE3 instructions. But with AVX enabled, `vmovshdup` does save a byte, because the opcode-size advantage goes away.

On older CPUs with only 64-bit shuffle units (like Pentium-M and first-gen Core 2 (Merom)), there was a larger performance advantage.
`movshdup` only shuffles within 64-bit halves of the vector. On Core 2 (Merom), `movshdup xmm, xmm` decodes to 1 uop, but `shufps xmm, xmm, i` decodes to 3 uops. (See https://agner.org/optimize/ for instruction tables and a microarch guide.) See also my horizontal sum answer (linked earlier) for more about SlowShuffle CPUs like Merom and K8.
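Since that horizontal sum keeps coming up, here is a sketch of the usual SSE3 idiom (written from memory, not copied from the linked answer); the point is that `movshdup`'s copy-and-shuffle behavior means no `movaps` is needed to preserve `v`:

```c
#include <immintrin.h>

// Sum all four floats in v.  movshdup produces the shuffled copy directly,
// so the compiler never has to emit a movaps to keep v around for the add.
float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);    // [ v1, v1, v3, v3 ]
    __m128 sums = _mm_add_ps(v, shuf);   // [ v0+v1, ..., v2+v3, ... ]
    shuf = _mm_movehl_ps(shuf, sums);    // low half = [ v2+v3, ... ]
    sums = _mm_add_ss(sums, shuf);       // low element = (v0+v1) + (v2+v3)
    return _mm_cvtss_f32(sums);
}
```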
**In C++ with intrinsics**

If SSE3 is enabled, it's a missed optimization if your compiler doesn't optimize `_mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 3, 1, 1))` into the same assembly it would make for `_mm_movehdup_ps(a)`.

Some compilers (like MSVC) don't typically optimize intrinsics, though, so it's up to the programmer to understand the asm implications of avoiding `movaps` instructions by using intrinsics for copy-and-shuffle instructions (like `pshufd` and `movshdup`) instead of shuffles that necessarily destroy their destination register (like `shufps`, and like `psrldq` byte-shifts).

Also, MSVC doesn't let you enable compiler use of SSE3; you only get instructions beyond the baseline SSE2 (or no SIMD) if you use intrinsics for them. Or if you enable AVX, that would allow the compiler to use SSE4.2 and earlier as well, but it still chooses not to optimize. So again, it's up to the human programmer to find optimizations. ICC is similar. Sometimes this can be a good thing if you know exactly what you're doing and are checking the compiler's asm output, because sometimes gcc or clang's optimizations can pessimize your code.
Probably a good idea to compile with clang and see if it uses the same instructions as the intrinsics in your source; it has by far the best shuffle optimizer out of any of the 4 major compilers that support Intel intrinsics, basically optimizing your intrinsics code the same way compilers normally optimize pure C, i.e. just following the as-if rule to produce the same result.
The most trivial example:
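Something like the following (a reconstruction, since the original source listing isn't reproduced here; the function names are my own):

```c
#include <immintrin.h>

// Both functions produce [ a1, a1, a3, a3 ]: duplicate the odd (high)
// element of each 64-bit pair.  _MM_SHUFFLE takes the high element first,
// hence (3,3, 1,1).
__m128 shuffle_version(__m128 a) {
    return _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 3, 1, 1));
}

__m128 movehdup_version(__m128 a) {
    return _mm_movehdup_ps(a);
}
```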
Compiled with gcc/clang/MSVC/ICC on Godbolt, GCC and clang with `-O3 -march=core2` both spot the optimization:
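Both functions should compile to the same single instruction, along these lines (expected output, not captured from the original Godbolt session):

```asm
shuffle_version:
        movshdup xmm0, xmm0
        ret
movehdup_version:
        movshdup xmm0, xmm0
        ret
```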
ICC with `-O3 -march=haswell`, and MSVC with `-O2 -arch:AVX -Gv` (to enable the vectorcall calling convention, instead of passing SIMD vectors by reference), compile the intrinsics as written, so the `_mm_shuffle_ps` version still uses a `shufps`.