Exceeding the range of long double and big floating point numbers


Problem statement: I am working on code that computes very large numbers, so I easily exceed the maximum value of "long double". Below is the part of the code that generates the big numbers:

int n;
long double summ;
long double a[1761], b[1761], c[1761]; // 1D arrays of long double, indexed from 1

a[1] = 1;
b[1] = 1;
c[1] = 1;
summ = 1 + c[1];
for (n = 2; n <= 1760; n++) {
    a[n] = n * n;
    b[n] = n;
    c[n] = c[n-1] * a[n-1] / b[n]; // Let us assume we have this kind of operation
    summ = summ + c[n];            // So basically, summ = 1+c[1]+c[2]+c[3]+...+c[1760]
}

The final value of summ and the values c[n] are then used to evaluate the ratio c[n]/summ for every integer n. So, just after the above loop, I do:

for (n = 1; n <= 1760; n++) {
    c2[n] = c[n] / summ; // here summ equals 1+c[1]+c[2]+c[3]+...+c[1760]
}

Output: If we print n, c[n] and summ, we obtain inf from n=1756 onward because we exceed the range of long double:

n            c[n]            summ
1752     2.097121e+4917  2.098320e+4917
1753     3.672061e+4920  3.674159e+4920
1754     6.433452e+4923  6.437126e+4923
1755     1.127785e+4927  1.128428e+4927
1756     inf             inf
1757     inf             inf
1758     inf             inf
1759     inf             inf
1760     inf             inf

Of course, if there is an overflow for c[n] and summ, I cannot evaluate the quantity of interest, which is c2[n].

Questions: Does anyone see a solution for this? How do I need to change the code to get finite numerical values (for arbitrary n)? I will most likely need to go to very big numbers (n can be much larger than 1760).

Proposition: I know that the GNU Multiple Precision Arithmetic Library (GMP) might be useful, but honestly I ran into too many difficulties trying to use it (this is outside my field), so if there is an easier way to solve this, I would be glad to read it. Otherwise, I will be forever grateful if someone could apply GMP or any other method to solve the above-mentioned problem.


There are 2 best solutions below

Fe2O3 On

I ain't no mathematician. This is what I wrote, with the results below. It works with the base-10 logarithms of the values instead of the values themselves. Looks to me that the exponent, at least, is keeping up with your long double results using only my feeble double...

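#include <stdio.h>
#include <math.h>

int main() {
    int n;
    double la[1800], lb[1800], lc[1800]; // base-10 logs of a[], b[], c[]

    // a[1] = b[1] = c[1] = 1, so their logs are 0
    // (without this, lc[1] and la[1] would be read uninitialized)
    la[1] = lb[1] = lc[1] = 0.0;

    for( n = 2; n <= 1760; n++ ) {
        lb[n] = log10(n);                  // b[n] = n
        la[n] = lb[n] + lb[n];             // a[n] = n*n
        lc[n] = lc[n-1] + la[n-1] - lb[n]; // c[n] = c[n-1]*a[n-1]/b[n]

        printf( "%4d:  %.16lf\n", n, lc[n] );
    }
    return 0;
}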
/* omitted for brevity */
1750:  4910.8357954121602000
1751:  4914.0785853634488000
1752:  4917.3216235537839000
1753:  4920.5649098413542000
1754:  4923.8084440845114000
1755:  4927.0522261417700000 <<=== Take note, please.
1756:  4930.2962558718036000
1757:  4933.5405331334487000
1758:  4936.7850577857016000
1759:  4940.0298296877190000
1760:  4943.2748486988194000

EDIT (Butterfly edition)
Below is a pretty simple iterative function involving one single-precision and one double-precision floating point value. The purpose is to demonstrate that iterative calculations are exceedingly sensitive to initial conditions. While it seems obvious that the extra bits of the double will "hold on", remaining closer to the results one would get with infinite precision, the compounding discrepancy between these two versions demonstrates that "demons lurking in small places" will likely remain hidden in the fantastically tiny gaps between finite representations of what is infinite.

Just a bit of fun for a rainy day.

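#include <stdio.h>
#include <math.h>

int main() {
    // same literal, stored once in single and once in double precision
    float  fpi = 3.1415926535897932384626433832;
    double dpi = 3.1415926535897932384626433832;

    double thresh = 10e-8; // i.e. 1e-7

    for( int i = 0; i < 1000; i++ ) {
        fpi = fpi * 1.03f;
        dpi = dpi * 1.03f;
        double diff = fabs( dpi - fpi );

        // report each time the gap grows by another factor of ten
        if( diff > thresh) {
            printf( "%3d: %25.16lf\n", i, diff );
            thresh *= 10.0;
        }
    }
    return 0;
}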
  8:        0.0000001229991486
 35:        0.0000010704333473
 90:        0.0000100210180918
192:        0.0001092634900033
229:        0.0010121794607585
312:        0.0100316228017618
367:        0.1002719746902585
453:        1.0056506423279643
520:       10.2658853083848950
609:      103.8011477291584000
667:     1073.9984381198883000
736:    10288.9632129669190000
807:   101081.5514678955100000
886:  1001512.2135009766000000
966: 10473883.3271484370000000
chtz On

NOTE: This does not do exactly what the OP wants. I'll leave this answer here in case someone has a similar problem.


As long as your final result and all initial values are not out of range, you can very often re-arrange your terms to avoid any overflow. In your case, if you actually just want to know c2[n] = c[n]/sum[n], where sum[n] is the running sum up to and including c[n], you can re-write this as follows:

c2[n] = c[n]/sum[n] 
      = c[n]/(sum[n-1] + c[n])                        // def. of sum[n]
      = 1.0/(sum[n-1]/c[n] + 1.0)                
      = 1.0/(sum[n-1]/(c[n-1] * a[n-1] / b[n]) + 1.0) // def. of c[n]
      = 1.0/(sum[n-1]/c[n-1] * b[n] / a[n-1] + 1.0)
      = a[n-1]/(1/c2[n-1] * b[n]  + a[n-1])           // def. of c2[n-1]
      = (a[n-1]*c2[n-1]) / (b[n] + a[n-1]*c2[n-1])

Now in the final expression neither the numerator nor the denominator grows out of range, and in fact c2 slowly converges towards 1. If the values in your question are the actual values of a[n] and b[n], you may even find a closed-form expression for c2[n] (I did not check it).

To check that the re-arrangement works, you can compare it with your original formula (godbolt-link, only printing the last values): https://godbolt.org/z/oW8KsdKK6

Btw: Unless you later need all values of c2 again, there is actually no need to store any intermediate value inside an array.