Why does clang emit a 32-bit float ps instruction for the absolute value of a 64-bit double?


Why is clang turning fabs(double) into vandps instead of vandpd (like GCC does)?


Example from Compiler Explorer:

#include <math.h>

double float_abs(double x) {
    return fabs(x);
}

clang 12.0.1 -std=gnu++11 -Wall -O3 -march=znver3

.LCPI0_0:
        .quad   0x7fffffffffffffff              # double NaN
        .quad   0x7fffffffffffffff              # double NaN
float_abs(double):                          # @float_abs(double)
        vandps  xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        ret

gcc 11.2 -std=gnu++11 -Wall -O3 -march=znver3

float_abs(double):
        vandpd  xmm0, xmm0, XMMWORD PTR .LC0[rip]
        ret
.LC0:
        .long   -1
        .long   2147483647
        .long   0
        .long   0

Ironically, GCC uses vandpd but defines the constant with 32-bit .long chunks (and leaves the upper 8 bytes zero, which is fine since only the low element matters for a scalar double), while clang uses vandps but defines the constant as two .quad halves.
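Either way, the mask implements the same trick: fabs on an IEEE-754 binary64 just clears bit 63, the sign bit. A minimal C sketch of that bit manipulation (fabs_bits is a made-up name for illustration; compilers typically compile the library fabs and this pattern to the same single AND):

#include <stdint.h>
#include <string.h>

double fabs_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  // type-pun without strict-aliasing UB
    bits &= 0x7fffffffffffffffULL;   // clear bit 63, the sign bit
    memcpy(&x, &bits, sizeof x);
    return x;
}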

1 Answer

TL:DR: Probably because it's easier for the optimizer / code-generator to always do this, instead of doing it only for legacy-SSE encodings (where it saves code size). There's no performance downside, and the instructions are architecturally equivalent (i.e. no correctness difference).


Probably clang always "normalizes" architecturally equivalent instructions to their ps version, because the legacy-SSE encoding of the ps version is shorter: e.g. andps xmm0, xmm1 is 0F 54 C1 (3 bytes), while andpd xmm0, xmm1 is 66 0F 54 C1 (4 bytes, the same opcode behind a 66 prefix).

No existing x86 CPUs have any bypass delay for forwarding between ps and pd instructions (footnote 1), so it's always safe to use [v]andps between [v]mulpd or [v]fmadd...pd instructions.
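For example, intrinsics code like the following sketch mixes a pd multiply with a ps bitwise AND; the cast intrinsics are a pure type-system formality that generate no instructions, and the mix costs nothing on any current CPU. (mul_then_abs is a made-up name for illustration.)

#include <immintrin.h>

// Sketch: [v]mulpd feeding [v]andps; the casts compile to nothing.
__m128d mul_then_abs(__m128d a, __m128d b) {
    __m128d prod = _mm_mul_pd(a, b);
    __m128  mask = _mm_castsi128_ps(_mm_set1_epi64x(0x7fffffffffffffffLL));
    return _mm_castps_pd(_mm_and_ps(_mm_castpd_ps(prod), mask));
}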

As What is the point of SSE2 instructions such as orpd? points out, instructions like movupd and andpd are completely useless wastes of space that only exist for decoder consistency: a 66 prefix in front of an SSE1 opcode always does the pd version of it. It might have been smarter to save some of that coding space for other future extensions, but Intel didn't do that.

Or perhaps the motivation was the future possibility of a CPU that did have separate SIMD-double vs. SIMD-float domains; it was early days for Intel's FP SIMD in general when SSE2 was being designed on paper. These days we can say that's unlikely, because FMA units take a lot of transistors, and can apparently be built to share some mantissa-multiplier hardware between one 53-bit significand per 64-bit element vs. two 24-bit significands per two 32-bit elements.

Having separate forwarding domains would probably only be useful if you also had separate execution units for float vs. double math that didn't share transistors. Unless you had different input and output ports for different types but the same actual internals? IDK enough about that level of CPU design detail.


There's no advantage to ps for the AVX VEX-encoded versions (vandps and vandpd are the same length, since the 66 prefix folds into the VEX pp field), but also no disadvantage, so it's probably simpler for LLVM's optimizer / code generator to always do that instead of ever caring about trying to respect the source intrinsics. (Clang / LLVM doesn't in general try to do that; e.g. it freely optimizes shuffle intrinsics into different shuffles. Often this is good, but sometimes it de-optimizes carefully crafted intrinsics when it doesn't know a trick that the author of the intrinsics did.)

e.g. LLVM probably thinks in terms of "FP-domain 128-bit bitwise AND", and knows the instruction for that is andps / vandps. There's no reason for clang to even know that vandpd exists, because there's no case where it would help to use it.
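A quick way to see that canonicalization (a sketch; both functions below are architecturally equivalent ways to spell the same FP-domain AND, and clang should emit the same vandps for both, while GCC keeps vandpd for the first):

#include <immintrin.h>

// 128-bit FP-domain bitwise AND, spelled with the pd intrinsic ...
__m128d and_via_pd(__m128d a, __m128d b) {
    return _mm_and_pd(a, b);
}

// ... and spelled with the ps intrinsic plus free casts.
__m128d and_via_ps(__m128d a, __m128d b) {
    return _mm_castps_pd(_mm_and_ps(_mm_castpd_ps(a), _mm_castpd_ps(b)));
}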


Footnote 1: Bulldozer's hidden metadata and forwarding between math instructions:
AMD Bulldozer-family has a penalty for nonsensical sequences like mulps -> mulpd, i.e. between FP math instructions that actually interpret the sign/exponent/mantissa fields of an FP value (unlike booleans or shuffles).

It basically never makes sense to treat the concatenation of two IEEE binary32 FP values as a binary64, so this isn't a problem that needs to be worked around. It's mostly just something that gives us insight into how the CPU internals might be designed.

In the Bulldozer-family section of Agner Fog's microarch guide, he explains that the bypass delay for forwarding between two math instructions that run on the FMA units is 1 cycle lower than if another instruction is in the way. e.g. addps / orps / addps has worse latency than addps / addps / orps, assuming those three instructions form a dependency chain.

But for a crazy sequence like addps / addpd / orps, you get extra latency; not so for addps / orps / addpd. (orps vs. orpd never makes a difference here, and shufps would also be equivalent.)

The likely explanation is that Bulldozer kept extra metadata alongside vector elements, reused in that special FMA->FMA forwarding case to skip some formatting / normalization work. If the metadata is for the wrong format, that optimistic approach has to recover and do the architecturally required thing; but again, that only happens if you actually treat the result of a float FMA/add/mul as doubles, or vice versa.

addps could forward to a shuffle like unpcklpd without delay, so it's not evidence of 3 separate bypass networks, or any justification for the use (or existence) of andpd / orpd.