Why does MinGW GCC use x87 80bit FP library code for atan2, cos, exp & sin?

109 Views Asked by At

I have a curious problem porting working numerical code from Intel 2023 & MSC Visual C++ 2022. The code compiled with GCC is perfectly accurate (too accurate) since some library calls are working in full 80bit floating point precision - notably sqrt, sin and cos. I can verify this by tracing into the library calls with gdb using TUI to disassemble the library code execution.

It also shows up in the benchmark timings since x87 atan2, cos, exp & sin are all ~100 cycles and sqrt is ~80 cycles. The corresponding timings on SSE/AVX2 code are under 50 and mostly around 20-30 cycles.

Curiously tan, atan are compiled using AVX2. But cos, sin, sqrt and atan2 are using legacy x87 code inside a GCC system library. I have tried this with both the 32 bit port and the 64 bit version and have the same problem in both. I'm a novice on GCC so I may have overlooked something. I'm using the default MinGW port version 13.1.0 (MinGW-W64 i686-ucrt-posix-dwarf) on Windows which may have its own peculiarities.

As an aside I just noticed that MSC 2022 sometimes codes x87 sqrt even with all gofaster optimisations and AVX2 code enabled since that is also an outlier in the benchmark timings that I hadn't noticed before. Intel compiles it to the native sqrtsd and is very much faster as a result. I went back to MSC x86 for inline assembler to confirm benchmark timings for the x87 trig instructions.

Compiler options are:
gcc -c -O3 -Ofast -march=native -mavx2 -mfpmath=sse benchmark.cpp

Linking is taking whatever the GCC default system libraries are and that seems to be the problem - my code or any system code that inlines generates SSE or AVX2 code so that tan and atan are OK but any that generate a library call are executing x87 instructions for the transcendental function in full 80 bit precision. I think it might be related to this thread which is the closest I could find:

Why does MinGW-w64 floating-point precision depend on winpthreads version?

I want to force it to use a different FP library using entirely AVX2 or SSE2 code or recompile the existing ones with something like the -march=native -mavx2 and -Ofast. Speed is more important than precise standards compliance here. There is a lot of trig involved.

It is entirely possible that I have linked the wrong default library in GCC. I have a sort of work around based on incorporating BSD trig library functions sourcecode but that isn't elegant.

I could post a code sample for the benchmark if someone wants to try it on their system but it would be a bit longer than seems normal here. I'm hoping that someone already knows the answer...

These are benchmarks in machine cycles for the primitive trig operations with AVX2 code gen MSC 2022 vs GCC 13.0.1. It was run on an Intel i5-12600 with maximum optimisation on both compilers.

MSC GCC
atan 12 13
atan2 27 122
log 11 76
exp 11 136
sin 14 115
cos 13 117
sincos 19 127
tan 18 20

The ones using x87 code stand out a mile ~+100 cycles compared to where they should be.

I want to get the code using the right floating point library when compiled -O3 -Ofast -mavx2

Here is minimal sample code that shows my problem and a snapshot of the disassembly in GDB which shows how sin has become x87 FSIN etc. The other way to test if you are similarly afflicted is to benchmark sin(x) and tan(x) in your favourite profiler if tan timing is ~20 cycles and sin ~40 then you are OK (tan ~2x faster than sin). Any trig function 100+ cycles and it is slow x87 code.

#include "stdio.h"
#include "math.h"

// Toy use of sin & tan to see if they compile using SSE2 or x87

int main(int argc, char* argv[]){
double x, y;
if (argc>1) x = atof(argv[1]); else x = 3.1415926535/2;
y = sin(x);
printf("sin of %g is %18.10g\n", x, y);
x = x/2;
y = tan(x);
printf("tan of %g is %18.10g\n", x, y);
}

Disassembly of sin routine using x87 code from GDB (tan is OK)

 <__sinl_internal>        fldt   (%rdx)                                                                
 <__sinl_internal+2>      fsin                                                                       
 <__sinl_internal+4>      fnstsw %ax                                                                
 <__sinl_internal+6>      test   $0x400,%eax                                                        
 <__sinl_internal+11>     jne    0x7ff656332e6b                              
 <__sinl_internal+13>     mov    %rcx,%rax                                                          
 <__sinl_internal+16>     movq   $0x0,0x8(%rcx)                                                     
 <__sinl_internal+24>     fstpt  (%rcx)                                                             
 <__sinl_internal+26>     ret              

I am fairly convinced now that @emacsdrivesmenuts is right and I have to rebuild the default system maths library with the correct FP optimisation under Mingw to fix this but I have no idea how to do that!

Thanks for any enlightenment!

0

There are 0 best solutions below