Floating-point precision problem on ARM FPU (neon-vfpv3) on i.MX6 Sabre Lite

Hi, I am running a simple program to test the floating-point unit on an i.MX6 Sabre Lite board:

    double z = 2.2250738585072014e-308;
    double x = 3.0594765554474019e-308;
    double ans = x-z;

Here ans comes out as zero, while on x86 it is non-zero (8.344026969402e-309#DEN). My guess is that this is a precision problem, i.e. the ARM FPU on the Cortex-A9 (i.MX6 Sabre) doesn't support such calculations, but so far I have been unable to verify this. I am using the following build flags for compilation:

-mfloat-abi=hard -mfpu=neon-vfpv3 

I have searched other answers, and they all seem to say that NEON only supports single-precision floating point on AArch32. However, the Cortex-A9 document at https://developer.arm.com/documentation/ddi0409/i/preface/about-this-book says that although SIMD is single precision only, VFPv3 does support double-precision floating point, so I am confused about what the issue is here. The generated assembly code is as follows:

    ; double z = 2.2250738585072014e-308;
    1009baa8:   mov     r2, #0
    1009baac:   mov     r3, #1048576    ; 0x100000
    1009bab0:   strd    r2, [r11, #-12]
    ; double x = 3.0594765554474019e-308;
    1009bab4:   mov     r2, #0
    1009bab8:   mov     r3, #1441792    ; 0x160000
    1009babc:   strd    r2, [r11, #-20] ; 0xffffffec
    ; double ans = x-z;
    1009bac0:   vldr    d17, [r11, #-20]        ; 0xffffffec
    1009bac4:   vldr    d16, [r11, #-12]
    1009bac8:   vsub.f64        d16, d17, d16
    1009bacc:   vstr    d16, [r11, #-28]        ; 0xffffffe4

The instruction that does the subtraction is vsub.f64 d16, d17, d16. Is vsub.f64 a VFP instruction?

There are 2 best solutions below

2
On

The VFP on ARM is fully IEEE-754 compliant, and thus I doubt it delivers wrong results when it comes to sub-normal numbers.

My guess is that you passed the wrong arguments to printf.

The easiest way to find out is to check the register or the memory that contains ans.


Edit:

I ran the following test function on my Nexus S (Cortex-A8):

    double dsub(double a, double b)
    {
        return a-b;
    }

    ans = dsub(3.0594765554474019e-308, 2.2250738585072014e-308);


ans: 8.3440269694020052E-309

The Cortex-A8 is the very first of the Cortex-A series and has the weakest VFP (VFP Lite), so if it gets this subtraction right, the Cortex-A9 should as well.

I think you are doing something wrong when checking the result. (The machine code is fine)

2
On

EDIT: (This is not a final and complete answer; it is currently under investigation.)

I finally found the answer in the GCC documentation for the -mfpu compiler flag: NEON doesn't fully implement the IEEE 754 standard, and I guess that is the reason for the loss of precision.

If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC’s auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.

Source : https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html (see -mfpu documentation)

The result of my calculation, 8.344026969402e-309, is a denormal (DEN) number, so it is treated as zero by NEON, unlike on fully IEEE 754 compliant FPUs.