Problem with floating point compiler-optimization for double-double extended precision class in C++

Question

Problem with floating point compiler-optimization for double-double extended precision class in C++

240 Views Asked by Paul Aner At 27 November 2023 at 18:31

I implemented a "double-double" class in C++. A "double-double" uses two doubles to increase precision. A number is treated as number = hi + lo, though the sum is never actually calculated because hi + lo == hi. Here is a shortened version of my code:

class doubledouble
{
public:
    double hi, lo;

    doubledouble()
    {
        hi = 0.0;
        lo = 0.0;
    }

    doubledouble quickTwoSum(const double a, const double b) const
    {
        int old;
        doubledouble out;
        double s = a + b;
        double e = b - (s - a);
        out.hi = s;
        out.lo = e;
        return (out);
    }

    doubledouble twoSum(const double a, const double b) const
    {
        double s = a + b;
        double v = s - a;
        double e = (a - (s - v)) + (b - v);
        doubledouble out;
        out.hi = s;
        out.lo = e;
        return (out);
    }

    doubledouble operator+(const doubledouble& in) const
    {
        doubledouble s, t;
        s = twoSum(in.hi, this->hi);
        t = twoSum(in.lo, this->lo);
        s.lo += t.hi;
        s = quickTwoSum(s.hi, s.lo);
        s.lo += t.lo;
        s = quickTwoSum(s.hi, s.lo);
        return(s);
    }

    doubledouble split(const double a) const
    {
        const double split = (1 << 12) + 1;
        double t = a * split;
        doubledouble out;
        out.hi = t - (t - a);
        out.lo = a - out.hi;
        return (out);
    }

    doubledouble twoProd(const double a, const double b) const
    {
        double p = a * b;
        doubledouble aS = split(a);
        doubledouble bS = split(b);
        double err = ((aS.hi * bS.hi - p) + aS.hi * bS.lo + aS.lo * bS.hi) + aS.lo * bS.lo;
        aS.hi = p;
        aS.lo = err;
        return (aS);
    }

    doubledouble operator*(const doubledouble& in) const
    {
        doubledouble p;
        p = twoProd(this->hi, in.hi);
        p.lo += this->hi * in.lo + this->lo * in.hi;
        p = quickTwoSum(p.hi, p.lo);
        return(p);
    }
};

My problem is that this code works in debug-mode, but in release-mode I only get normal double-precision. I am using the MSVC compiler and if I put a #pragma optimize("", off) before the class and a corresponding on behind it the code works (very slow of course).

I am pretty sure the problem lies in things like t - (t - a) which mathematically is simply equal to a, so the compiler probably optimizes it like that. To narrow down the problem, I tried putting #pragma optimize around certain functions. But even if I put every single function inside the class between pragmas, the code does not work correctly.

Either way, using #pragma optimize is not very useful anyway. The code is slow and it is not portable. My first question is: Why does #pragma optimize not work inside for specific functions inside the class? That would be really useful to narrow down the problem.

Second question: How could I trick/convince the compiler to not optimize certain expressions like t - (t - a)? Using a temp-variable would probably not work, because the compiler would optimize that away.

Edit

As a reproducible example I used this code:

double a = 1.23456789012345;
const double split = (1 << 12) + 1;
double t = a * split;
double hi = t - (t - a);
double lo = a - hi;
   
std::cout << "hi: " << hi << ", lo: " << lo << std::endl;

Output in both cases (release- and debug-mode):

hi: 1.23457, lo: -3.48832e-13

So t - (t - a) does not seem to be the problem.

As for a "simple" example (if I find a simpler one, I will edit) that produces the wrong behavior - I calculated the mandelbrot-set with zooms greater than 10^14. In debug mode I got perfect results, in release mode I get pixelation. I used this function to create the images:

void CalcMandelCPU_doubledouble(std::atomic<unsigned int>* RowCounter)
    {
        std::complex<doubledouble> Z, C;

        int Y = (*RowCounter)++;
        while (Y < ResY)
        {
            for (int X = 0; X < ResX; X++)
            {
                C.real(minX + (X + 0.5) * (maxX - minX) / ResX);
                C.imag(minY + (Y + 0.5) * (maxY - minY) / ResY);

                int i;
                Z = 0;
                for (i = 0; i < MaxIter; i++)
                {
                    Z = Z * Z + C;

                    if (Z.real() * Z.real() + Z.imag() * Z.imag() > 4.0)
                        break;
                }

                if (i == MaxIter)
                {
                    Result[Y * ResX + X] = 0;
                }
                else
                {
                    Result[Y * ResX + X] = i + 1 - log(log(Z.real() * Z.real() + Z.imag() * Z.imag())) / log(2.0);
                }
            }
            Y = (*RowCounter)++;
        }
    }

Define MaxIter, ResX and ResY. minX, maxX, minY and maxY will give you the (zoomed) view. Examples for the view-variables (as doubledoubles) :

minX.hi = -1.9997740601362948, minX.lo = 9.2915246622385581e-17

maxX.hi = -1.9997740601362861, maxX.lo = 7.5150963188138267e-17

minY.hi = -3.2900430992572130e-09, minY.lo = 1.7490722630808694e-25

maxY.hi = -3.2900375437016574e-09, maxY.lo = 1.0427104313278710e-25

Original Q&A

There are 2 best solutions below

**Paul Aner** · Answer 1 · 2023-11-28T09:28:40.877000

OK, here is the solution:

Switching from fp:fast to fp:strict as mentioned by multiple users in the comments solves the problem. However, I did not want to change global optimization settings to not effect the rest of my code. The solution was to put volatile before certain variables (if there is a better solution, please comment). Indeed only two changes were necessary:

Function twoSum:

doubledouble twoSum(const double a, const double b) const
    {
        volatile double s = a + b;
        volatile double v = s - a;
        double e = (a - (s - v)) + (b - v);
        doubledouble out;
        out.hi = s;
        out.lo = e;
        return (out);
    }

... and function split:

doubledouble split(const double a) const
    {
        const double split = (1 << 26) + 1;
        volatile double t = a * split;
        doubledouble out;
        out.hi = t - (t - a);
        out.lo = a - out.hi;
        return (out);
    }

Thank you to the commenters.

**Eric Postpischil** · Answer 2 · 2023-11-28T12:20:08.880000

Write only single operations, using a cast or assignment to isolate each assignment.

The C++ standard allows an implementation to use excess precision in expression evaluation but requires casts and assignments to use the nominal precision. So double x = t - (t - a); may be implemented as if it had infinite precision, and then it is mathematically equal to a, so optimization may reduce it to a. However double x = t - static_cast<double>(t - a); must produce a double value for the cast. double t0 = t - a, x = t - t0; also must produce a double value for t0.

(Technically, the implementation could use excess precision for the subtraction, including rounding, and then round it to double, allowing the possibility of a double rounding error, but I have not heard of this happening.)

Note this assumes standard-conforming behavior. If the compiler has been licensed to perform additional optimizations, as with a switch telling it to be looser with math than the standard specifies, the methods in this answer may be insufficient.

Some compiler details

Mainstream compilers are generally fine when compiling for x86-64 which implies using SSE2 for floating point so the hardware format is IEEE754 64-bit double. Same for most other modern ISAs like PowerPC and ARM / AArch64. So math should be IEEE compliant as long as you avoid options like MSVC /fp:fast or GCC/Clang -ffast-math.

GCC's default (with the default -std=gnu++20 or whatever) is -fexcess-precision=fast, so you'll want -fexcess-precision=standard, or -std=c++20 which implies that, if you're compiling for 32-bit x86 with x87 floating-point. (SSE2 doesn't have this problem). MSVC targeting x86 with x87 FP sets the FPU precision-control bits so all results are rounded to a 53-bit mantissa (so it's like double except for the exponent range), avoiding the need for store/reload to actually round to double. (Or if you linked at Direct3D library, it'll set the precision-control to 24-bit, float mantissa width, according to Bruce Dawson's excellent articles from 2012 about FP math in general and MSVC/Windows in particular.)

GCC's default is -ffp-contract=fast, which will allow it to contract t - a into fma(a, scale, -a) even across statements when FMA is available (like -march=x86-64-v3 when compiling for x86-64).
MSVC's default is #pragma fp_contract off; clang's default is -ffp-contract=on (within one expression, not across statements, like ISO C #pragma STDC FP_CONTRACT on (also supported by some C++ compilers)). (GCC doesn't support on, only off and fast, so GCC's on is equivalent to off: the only way to stop its optimizer contracting across statements is to disable it entirely.)

Classic ICC's default is -fp-model fast=1, so for this you'd need -fp-model precise. See also slides from a 2012 talk about FP math optimization in ICC and GCC. The LLVM-based ICX doesn't have the same options, being more like Clang.

Other compilers default to MSVC /fp:precise or gcc/clang -fno-fast-math.

Problem with floating point compiler-optimization for double-double extended precision class in C++

There are 2 best solutions below

Some compiler details

Related Questions in C++

Related Questions in VISUAL-C++

Related Questions in FLOATING-POINT

Related Questions in COMPILER-OPTIMIZATION

Related Questions in DOUBLE-DOUBLE-ARITHMETIC

Trending Questions

Popular # Hahtags

Popular Questions