Why is clang turning `fabs(double)` into `vandps` instead of `vandpd` (like GCC does)?
Example from Compiler Explorer:

```c
#include <math.h>

double float_abs(double x) {
    return fabs(x);
}
```
clang 12.0.1 -std=gnu++11 -Wall -O3 -march=znver3

```asm
.LCPI0_0:
        .quad   0x7fffffffffffffff      # double NaN
        .quad   0x7fffffffffffffff      # double NaN
float_abs(double):                      # @float_abs(double)
        vandps  xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        ret
```
gcc 11.2 -std=gnu++11 -Wall -O3 -march=znver3

```asm
float_abs(double):
        vandpd  xmm0, xmm0, XMMWORD PTR .LC0[rip]
        ret
.LC0:
        .long   -1
        .long   2147483647
        .long   0
        .long   0
```
(Ironically, GCC uses `vandpd` but defines the constant with 32-bit `.long` chunks, while clang uses `vandps` but defines the constant as two `.quad` halves. Little-endian, `.long -1` followed by `.long 2147483647` is the same `0x7fffffffffffffff` low qword; interestingly GCC leaves the upper 8 bytes zero, which is harmless because only the low double of the result matters.)
TL;DR: Probably because it's easier for the optimizer / code-generator to always do this, instead of special-casing only the legacy-SSE instructions (where it saves code size). There's no performance downside, and the two instructions are architecturally equivalent (i.e. no correctness difference).
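As background for why `fabs` is a bitwise AND at all: IEEE-754 keeps the sign in the top bit of a `double`, so absolute value is just clearing bit 63. A minimal C sketch of what both compilers' asm is doing (the helper name `fabs_via_and` is mine, purely for illustration):

```c
#include <stdint.h>
#include <string.h>

// fabs(double) clears the sign bit (bit 63); that's why both compilers
// load 0x7fffffffffffffff and emit a single 128-bit AND.
double fabs_via_and(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   // type-pun double -> uint64_t safely
    bits &= 0x7fffffffffffffffULL;    // keep exponent + mantissa, drop sign
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

A bitwise AND doesn't interpret the 128 bits as floats or doubles at all, which is why `vandps` vs. `vandpd` can't possibly change the result.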
Probably clang always "normalizes" architecturally equivalent instructions to their `ps` version, because those have a shorter machine-code encoding for the legacy-SSE versions (see the encoding listing below). No existing x86 CPUs have any bypass delay latency for forwarding between `ps` and `pd` instructions¹, so it's always safe to use `[v]andps` between `[v]mulpd` or `[v]fmadd...pd` instructions.

As "What is the point of SSE2 instructions such as orpd?" points out, instructions like `movupd` and `andpd` are completely useless wastes of space that only exist for decoder consistency: a `66` prefix in front of an SSE1 opcode always does the `pd` version of it. It might have been smarter to save some of that coding space for other future extensions, but Intel didn't do that.

Or perhaps the motivation was the future possibility of a CPU that did have separate SIMD-double vs. SIMD-float forwarding domains, since it was early days for Intel's FP SIMD in general when SSE2 was being designed on paper. These days we can say that's unlikely, because FMA units take a lot of transistors, and can apparently be built to share some mantissa-multiplier hardware between one 53-bit mantissa per 64-bit element vs. two 23-bit mantissas per pair of 32-bit elements.
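For concreteness about the code size: these are the standard machine-code encodings of the register-register forms. The legacy-SSE `pd` version is exactly one `66` prefix byte longer, while the VEX versions encode the prefix in the `pp` field and are the same length either way:

```asm
andps   xmm0, xmm1          # 0F 54 C1      (3 bytes, SSE1)
andpd   xmm0, xmm1          # 66 0F 54 C1   (4 bytes: 66 prefix + same opcode)
vandps  xmm0, xmm0, xmm1    # C5 F8 54 C1   (4 bytes, VEX pp=00)
vandpd  xmm0, xmm0, xmm1    # C5 F9 54 C1   (4 bytes, VEX pp=01)
```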
Having separate forwarding domains would probably only be useful if you also had separate execution units for float vs. double math, not sharing transistors, unless you had different input and output ports for different types but the same actual internals? IDK enough about that level of CPU design detail.
There's no advantage to `ps` for the AVX VEX-encoded versions, but also no disadvantage, so it's probably simpler for LLVM's optimizer / code generator to just always do that instead of ever caring about trying to respect the source intrinsics. (Clang / LLVM doesn't in general try to do that, e.g. it freely optimizes shuffle intrinsics into different shuffles. Often this is good, but sometimes it de-optimizes carefully crafted intrinsics when it doesn't know a trick that the author of the intrinsics did.)

e.g. LLVM probably thinks in terms of "FP-domain 128-bit bitwise AND", and knows the instruction for that is `andps` / `vandps`. There's no reason for clang to even know that `vandpd` exists, because there's no case where it would help to use it.
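You can see this normalization even when the source spells out the `pd` intrinsic explicitly; a minimal sketch (compile at `-O3` with the versions from the question, or any recent clang):

```c
#include <immintrin.h>

// Even with the explicitly-double intrinsic, clang emits [v]andps,
// while gcc emits [v]andpd; the intrinsic doesn't pin the mnemonic.
__m128d and_pd_demo(__m128d a, __m128d b) {
    return _mm_and_pd(a, b);
}
```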
Footnote 1: Bulldozer hidden metadata and forwarding between math instructions:
AMD Bulldozer-family has a penalty for nonsensical things like `mulps` -> `mulpd`, i.e. for actual FP math instructions that actually care about the sign/exponent/mantissa components of an FP value (not booleans or shuffles). It basically never makes sense to treat the concatenation of two IEEE binary32 FP values as a binary64, so this isn't a problem that needs to be worked around. It's mostly just something that gives us insight into how the CPU internals might be designed.
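To be concrete about what "nonsensical" means here, code would have to do float math and then feed the same register straight into double math, something like this deliberately meaningless sketch (the cast is free; only the `mulps` -> `mulpd` forwarding hits the penalty):

```c
#include <immintrin.h>

// Deliberately nonsensical: multiply as 4 floats, then reinterpret the
// same 128 bits as 2 doubles and multiply again. Real code never does this.
__m128d float_bits_as_double(__m128 a, __m128 b, __m128d c) {
    __m128 prod = _mm_mul_ps(a, b);              // float math (FMA unit)
    return _mm_mul_pd(_mm_castps_pd(prod), c);   // forwards mulps -> mulpd
}
```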
In the Bulldozer-family section of Agner Fog's microarch guide, he explains that the bypass delay for forwarding between two math instructions that run on the FMA units is 1 cycle lower than if another instruction is in the way. e.g. `addps / orps / addps` has worse latency than `addps / addps / orps`, assuming those three instructions form a dependency chain.

But for a crazy thing like `addps / addpd / orps`, you get extra latency. But not for `addps / orps / addpd`. (`orps` vs. `orpd` never makes a difference here; `shufps` would also be equivalent.)

The likely explanation is that BD kept extra stuff with vector elements, to be reused in that special forwarding case, maybe to avoid some formatting / normalization work when forwarding FMA->FMA. If it's in the wrong format, that optimistic approach has to recover and do the architecturally required thing, but again, that only happens if you actually treat the result of a float FMA/add/mul as doubles, or vice versa.
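A sketch of what such a latency measurement might look like, in the spirit of Agner Fog's tests (this assumes a Bulldozer-family CPU and an external harness timing each loop; it is not his actual test code):

```asm
# Chain 1: addps / addps / orps. The boolean comes after both math ops,
# so the fast FMA-unit -> FMA-unit forwarding applies within the chain.
loop_fast:
        addps   xmm0, xmm1
        addps   xmm0, xmm1
        orps    xmm0, xmm2
        dec     ecx
        jnz     loop_fast

# Chain 2: addps / orps / addps. The orps sits between the two math ops,
# breaking the math->math special case and adding a cycle of bypass latency.
loop_slow:
        addps   xmm0, xmm1
        orps    xmm0, xmm2
        addps   xmm0, xmm1
        dec     ecx
        jnz     loop_slow
```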
`addps` could forward to a shuffle like `unpcklpd` without delay, so it's not evidence of 3 separate bypass networks, or any justification for the use (or existence) of `andpd` / `orpd`.