Java and floating point arithmetic


Given the code

public static final float epsilon = 0.00000001f;

public static final float a [] = {
        -180.0f,
        -180.0f + epsilon * 2,
        -epsilon * 2
};

The array a is initialized as follows:

[-180.0, -180.0, -2.0E-8]

instead of the desired

[-180.0, X, Y]

How can I tune epsilon to achieve the desired result?


1) I want float rather than double, to stay consistent with previously written code.
2) I do not want -179.99999998 or any other particular number for X. I want X > -180.0, with X as close as possible to -180.0.
3) I want Y to be as close as possible to 0 while still being a float.
4) I want -180.0 < X < Y.

In my initial post I did not specify precisely what I wanted. Patricia Shanahan guessed it by suggesting Math.ulp.


2
On BEST ANSWER

As recommended in prior answers, the best solution is to use double. However, if you want to work with float, you need to take into account its available precision in the region of interest. This program replaces your literal epsilon with the value of the least significant bit of 180f (its ulp, or unit in the last place), which Math.ulp returns:

import java.util.Arrays;

public class Test {
  public static final float epsilon = Math.ulp(-180f);

  public static final float a [] = {
          -180.0f,
          -180.0f + epsilon * 2,
          -epsilon * 2
  };

  public static void main(String[] args) {
    System.out.println(Arrays.toString(a));
  }

}

Output:

[-180.0, -179.99997, -3.0517578E-5]
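Note that epsilon * 2 places X two float steps above -180.0. If the goal is the float immediately above -180.0, and the negative float closest to zero, one possible sketch uses Math.nextUp and Float.MIN_VALUE from the standard library. The class name NearestFloats and the field names X and Y are made up for illustration and do not come from the original answer:

```java
import java.util.Arrays;

public class NearestFloats {
  // Smallest float strictly greater than -180.0f (one ulp above it).
  public static final float X = Math.nextUp(-180.0f);

  // Negative float closest to zero (a subnormal value).
  public static final float Y = -Float.MIN_VALUE;

  public static final float[] a = { -180.0f, X, Y };

  public static void main(String[] args) {
    System.out.println(Arrays.toString(a));
    // Check the ordering asked for in the question: -180.0 < X < Y < 0.
    System.out.println(-180.0f < X && X < Y && Y < 0.0f);
  }
}
```

Whether a subnormal Y is acceptable depends on how the array is used later; -Float.MIN_NORMAL would be the closest normal value instead.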
1
On

Try the double type. If that is not enough for you, note that Java has no "long double"; java.math.BigDecimal is the usual option for higher precision.

13
On

Although the value 0.00000001f is representable as a float to within its precision, the value -180f + 0.00000001f * 2 (-179.99999998) is not. float has a 24-bit significand, which gives only about 7 significant decimal digits, and -179.99999998 requires 11. So the low-order bits of the sum are rounded away by the addition, and the result ends up being exactly -180.0f.
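A small sketch of this effect, comparing the increment with the spacing between adjacent floats near -180. The class name AbsorbedAddition and the helper isAbsorbed are hypothetical names chosen for this example:

```java
public class AbsorbedAddition {
  // Returns true when adding step to n leaves n unchanged after float rounding.
  static boolean isAbsorbed(float n, float step) {
    return n + step == n;
  }

  public static void main(String[] args) {
    float n = -180.0f;
    float step = 0.00000001f * 2;  // the question's epsilon * 2
    float ulp = Math.ulp(n);       // spacing between adjacent floats near -180

    System.out.println(isAbsorbed(n, step));  // true: the increment is rounded away
    System.out.println(step < ulp / 2);       // true: less than half the spacing
    System.out.println(isAbsorbed(n, ulp));   // false: one full ulp changes the value
  }
}
```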

Just for the fun of it, here are those values in bits (n = -180.0f):

           sign
           | exponent       significand
           - -------- -----------------------
epsilon  = 0 01100100 01010111100110001110111
epsilon2 = 0 01100101 01010111100110001110111
n        = 1 10000110 01101000000000000000000
result   = 1 10000110 01101000000000000000000

The result ends up being bit-for-bit the same as the original -180.0f.
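The bit patterns above can be reproduced with Float.floatToIntBits. The following is a minimal sketch; the class name FloatBits and the helper bits are invented for this example:

```java
public class FloatBits {
  // Formats a float as "sign exponent significand" from its IEEE 754 bit pattern.
  static String bits(float f) {
    int raw = Float.floatToIntBits(f);
    // Left-pad to 32 binary digits so positive values keep their leading zeros.
    String s = String.format("%32s", Integer.toBinaryString(raw)).replace(' ', '0');
    return s.substring(0, 1) + " " + s.substring(1, 9) + " " + s.substring(9);
  }

  public static void main(String[] args) {
    float epsilon = 0.00000001f;
    float n = -180.0f;
    System.out.println("epsilon  = " + bits(epsilon));
    System.out.println("epsilon2 = " + bits(epsilon * 2));
    System.out.println("n        = " + bits(n));
    System.out.println("result   = " + bits(n + epsilon * 2));
  }
}
```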

If you use double, that problem goes away, because you aren't exceeding double's ~15 digits of precision.
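For comparison, a short sketch of the same addition in double. The class name DoubleStep and the helper changes are hypothetical. The sum is the double nearest to -179.99999998, not necessarily that exact decimal value:

```java
public class DoubleStep {
  // Returns true when adding step to n produces a value different from n.
  static boolean changes(double n, double step) {
    return n + step != n;
  }

  public static void main(String[] args) {
    double n = -180.0;
    double step = 0.00000001 * 2;  // same increment as in the question, as a double

    System.out.println(changes(n, step));    // true: the increment survives
    System.out.println(Math.ulp(n) < step);  // true: double spacing near -180 is far finer
    System.out.println(n + step);
  }
}
```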