Floating point serialization/parsing rounding


I have the following floating point number stored in memory (float, not double)

0b00111101111001000010000111001000

which, according to the floating-point converter, converts to

0.111392557621002197265625

I serialize it, rounding to the 9th digit, and store it as the following string:

0.111392558

I understood that, if I now parse the string using stof, I should expect some precision loss, which would imply that the parsed value is not guaranteed to be the original value, that is

0.111392557621002197265625

The fact is that I'm doing some unit testing, and I consistently find that my expectation is never met: the original and parsed values are always the same.
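
For reference, a minimal sketch of the round trip (not my exact unit test; the 9-digit formatting via snprintf is just illustrative, and it assumes C++14 and an IEEE-754 binary32 float):

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

int main() {
    // The bit pattern from above, reinterpreted as a float.
    std::uint32_t bits = 0b00111101111001000010000111001000u;
    float original;
    std::memcpy(&original, &bits, sizeof original);

    // Serialize with 9 significant digits ("%.9g" gives "0.111392558").
    char buf[32];
    std::snprintf(buf, sizeof buf, "%.9g", original);

    // Parse the string back and compare the bit patterns.
    float parsed = std::stof(buf);
    std::uint32_t parsedBits;
    std::memcpy(&parsedBits, &parsed, sizeof parsedBits);

    std::printf("%s -> %s\n", buf, bits == parsedBits ? "same bits" : "different bits");
}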

Are my expectations right?

There are 3 answers below

Eric Postpischil (best answer)

Supposing you mean nine significant digits, not just nine digits after the decimal point, then the farthest apart two consecutive nine-significant-digit decimal numerals can be is one part in 10^8. This occurs when the first digit is 1 and the ninth digit changes by 1.

The closest together two consecutive 24-digit binary numerals can be is one part in 2^24. This occurs when all the digits are 1 and we add 1 at the last digit (carrying the number up to the next power of two).

Consider, in general, rounding a number x to a numerical format using a round-to-nearest method. Suppose x is between two values representable in the format, a and b. If a is nearer to x than b is, we round x to a. If b is nearer, we round to b. In either case, the rounding error is less than half the distance between a and b, since we chose the nearer value. If x is exactly in the middle, the rounding error is half the distance. So, with round-to-nearest, the rounding error is at most half the distance between representable values. Thus, we have a Rounding Lemma: If x rounds to x', then |x'−x| ≤ ½U, where U is the distance between adjacent representable numbers bracketing x.

Now consider more specifically rounding a 24-digit binary numeral x to a nine-digit decimal numeral. Let D be the distance between nine-digit decimal numerals near x, and let B be the distance between 24-digit binary numerals near x. From above, we know D < B. Rounding x to a nine-digit decimal numeral yields some number x', and we know from the Rounding Lemma that |x'−x| ≤ ½D. Since D < B, |x'−x| < ½B. This may also be expressed as −½B < x−x' < ½B, which implies x'−½B < x < x'+½B.

Now let x'' be the number we get by rounding x' back to a 24-digit binary numeral. To satisfy the Rounding Lemma, |x'−x''| ≤ ½B, which means −½B ≤ x''−x' ≤ ½B, which implies x'−½B ≤ x'' ≤ x'+½B.

Now we have both x'−½B < x < x'+½B and x'−½B ≤ x'' ≤ x'+½B, which means both x and x'' are in the same interval from x'−½B to x'+½B, except that x'' can be on an endpoint of the interval and x cannot. The length of that interval is B, and 24-bit binary numerals are spaced B apart, so one of two things is true: either there is exactly one 24-bit binary numeral inside the interval, or there are two 24-bit binary numerals, one at each endpoint of the interval. The latter is impossible, since then there would be no 24-bit binary numeral inside the interval to be x. So there is only one 24-bit binary numeral in the interval, and both x and x'' must be that numeral. Therefore x'' = x.

Thus, rounding a 24-bit binary numeral x to a nine-digit decimal numeral x' and then back to a 24-bit binary numeral must produce x.
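
For a concrete check, C++ exposes this nine-significant-digit figure as std::numeric_limits<float>::max_digits10 (assuming float is IEEE-754 binary32, as on common platforms); the sketch below, which is illustrative rather than part of the proof, formats a few sample floats with that many digits and confirms they round-trip:

#include <cstdio>
#include <limits>
#include <string>

int main() {
    // max_digits10 is the number of significant decimal digits that
    // guarantees a binary -> decimal -> binary round trip: 9 for binary32.
    static_assert(std::numeric_limits<float>::max_digits10 == 9, "binary32 needs 9 digits");

    const float samples[] = {0.111392557621002197265625f, 0.1f, 3.14159265f,
                             std::numeric_limits<float>::min(),
                             std::numeric_limits<float>::max()};
    for (float x : samples) {
        char buf[64];
        std::snprintf(buf, sizeof buf, "%.*g",
                      std::numeric_limits<float>::max_digits10, x);
        float back = std::stof(buf);
        std::printf("%-16s %s\n", buf, back == x ? "round-trips" : "LOST");
    }
}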

Converting to an eight-digit decimal numeral is not always sufficient to restore the original 24-digit binary numeral. We can see this in several ways:

  • In the interval [10, 11), there are 10^6 = 1,000,000 eight-digit decimal numerals (with form 10.dddddd, where each d is any decimal digit), but there are 2^20 = 1,048,576 24-digit binary numerals (with form 1010.bbbbbbbbbbbbbbbbbbbb₂, where each b is any binary digit). Therefore, converting these decimal numerals to 24-digit binary numerals can produce only 1,000,000 results, so at least 48,576 of the binary numerals cannot be produced.
  • The spacing between eight-digit decimal numerals can reach one part in 10,000,000, which is not fine enough to distinguish 24-bit binary numerals with spacings of one part in 16,777,216.
  • 134,217,704 and 134,217,696 are both representable in the IEEE-754 binary32 format commonly used for float, and, when converted to an eight-significant-digit decimal numeral, they both round to 134,217,700, so converting back to binary can only produce one of those numbers, so one of them does not survive the round-trip, as the sketch below illustrates.
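
A quick sketch of that last case (the %.8g formatting is just one way to get eight significant digits; it prints in exponent form, but the value is the same 134,217,700):

#include <cstdio>
#include <string>

int main() {
    // Two adjacent binary32 values; their spacing in this range is 8.
    float a = 134217696.0f;
    float b = 134217704.0f;

    // Format both with only 8 significant digits.
    char sa[32], sb[32];
    std::snprintf(sa, sizeof sa, "%.8g", a);
    std::snprintf(sb, sizeof sb, "%.8g", b);

    // Both produce the same string, so at most one of the two values
    // can survive the decimal -> binary conversion.
    std::printf("%s %s -> parse back to %s value\n", sa, sb,
                std::stof(sa) == std::stof(sb) ? "the same" : "a different");
}
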
Christian Stieber

Well, even the website you mention shows what's going on -- but it appears you haven't quite worked out what it's telling you.

The important part is that floating point numbers are binary. On that website, enter "0.5" into "decimal representation" and observe that the fields below show nice numbers -- no errors, and a binary value with lots of 0s at the end.

Now enter "0.4" instead and observe how the other numbers turn into a mess. Why is that?

0.5 is 1/2, or 2^-1 -- it's a power of 2, which is what binary values are all about: you need to be able to express the number as a sum of such power-of-two values. 2^-2 would be 0.25, 2^-3 would be 0.125 -- whether or not you want to believe me, you can't get to exactly 0.4 with these numbers, hence the "mess" shown by the tool.

The important part to take home is that, in binary, the fractional part of a number doesn't really work the way you're used to. While "1/3" in decimal produces a number with infinitely many digits, binary representation has a lot more of these cases.
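
A quick way to see this without the website (a small sketch; the 25-digit print width is arbitrary and just chosen to show the full stored values):

#include <cstdio>

int main() {
    // 0.5 is a power of two, so binary32 stores it exactly;
    // 0.4 is not representable, so the nearest binary32 value is stored instead.
    std::printf("0.5f stored as %.25f\n", 0.5f);   // 0.5000000000000000000000000
    std::printf("0.4f stored as %.25f\n", 0.4f);   // 0.4000000059604644775390625
}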

As a side note, this is why COBOL offers decimal datatypes -- it was meant for business calculations, and you can't have money calculations come up with slightly different results than expected... like, in our example, $0.40 should not turn into an ugly mess.

Let's approach it from a different direction: enter your binary value into the tool and observe that it's translated to "0.11139256". Now enter your other value, "0.111392558", into the tool and observe that it produces the exact same binary value... you're already in a range where your float doesn't have enough bits to distinguish between the two. So all the numbers you are providing lead to the same binary representation... so they are all identical.
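
A small sketch confirming this with std::stof (the memcpy bit inspection is just one way to compare the underlying encodings):

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Return the raw binary32 encoding of a float.
static std::uint32_t bits_of(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

int main() {
    float a = std::stof("0.11139256");    // the value the converter displays
    float b = std::stof("0.111392558");   // the 9-digit serialized string
    std::printf("0x%08x vs 0x%08x -> %s\n", bits_of(a), bits_of(b),
                bits_of(a) == bits_of(b) ? "same float" : "different floats");
}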

stevea

If you are converting to a decimal string like "0.111392557621002197265625" without an exponent, then 9 digits after the decimal point are certainly insufficient, as the smallest positive 'normal' IEEE 754 single is "0.0000000000000000000000000000000000000117549" (0x00800000, 1.17549435e-38), Eric mentions an even smaller 'subnormal' limit in the comments, and these are distinguishable from zero. If instead you are using an exponent in the string representation, like "1.1754944e-38", then 9 significant digits should be sufficient for a correct round trip. The C test code below checks nearly all positive, non-special-case (±Inf, NaN, zero) IEEE 754 singles.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Type-pun between a float and its raw bit pattern. */
union fugly {
    float f;
    uint32_t u;
};


int main()
{
    union fugly before, after;
    char fstr[42];
    uint32_t i;

    /* Walk bit patterns from the smallest positive normal float (0x00800000)
       up to 0x7efffffe; this skips zero, subnormals, infinities and NaNs
       (and stops a little short of FLT_MAX). */
    for (i = 0x00800000; i < 0x7effffff; i++) {
        before.u = i;
        gcvt(before.f, 9, fstr);          /* format with 9 significant digits */
        after.f = strtof(fstr, NULL);     /* parse the string back to a float */
        if (before.u != after.u) {
            printf("0x%08x: 0x%08x  ?=? 0x%08x\n",
                   i, before.u, after.u);
            printf("          :       %.12f  ?=? %.12f\n",
                   before.f, after.f);
        }
    }
    return 0;
}

Result (no rounding issues!):

$ cc -O3 float.c -o  float && time ./float

real    7m57.205s
user    7m56.574s
sys 0m0.009s

Reducing the string precision (argument 2 of gcvt()) from 9 to 8 produces many rounding errors in the low bit.

$ cc -O3 float.c -o  float && time ./float
0x03aa242d: 0x03aa242d  ?=? 0x03aa242e
          :       0.000000000000  ?=? 0.000000000000
0x03aa2437: 0x03aa2437  ?=? 0x03aa2438
          :       0.000000000000  ?=? 0.000000000000
0x03aa2441: 0x03aa2441  ?=? 0x03aa2440
          :       0.000000000000  ?=? 0.000000000000
0x03aa244a: 0x03aa244a  ?=? 0x03aa244b
          :       0.000000000000  ?=? 0.000000000000
0x03aa2454: 0x03aa2454  ?=? 0x03aa2455
          :       0.000000000000  ?=? 0.000000000000
0x03aa245e: 0x03aa245e  ?=? 0x03aa245d
          :       0.000000000000  ?=? 0.000000000000
...

Of course the routines and test I am using differ from yours, but it's not too onerous to test all single floats.

FWIW, my laptop can test all ~2^31 (minus the special cases) representations in ~10 minutes. It might be worth testing your code against such corner cases.