Subtracting two IEEE 754 numbers and I am not getting the correct result


I am currently writing a compiler to test my programming ability, and my target architecture doesn't have a floating-point unit. To account for this, I am adding functions to my standard library that handle floating-point calculations with bitwise operations. All the floating-point values are 32 bits: 1 sign bit, 8 bits for the exponent, and 23 bits for the mantissa. Detailed below are all the steps I use for the subtraction operation.

As a note, the code I currently have in the standard library works for numbers of the same sign, and even works in some cases for numbers with different signs.

The test case I have that is failing is associated with the following subtraction operation:

50.0 - 92.0 = -42.0

As stated above the operation I am trying to solve is 50.0 - 92.0 which should equal -42.0.

Step one should be to convert both the numbers into binary:

The converted numbers are as follows:

        Sign  Exp      Mantissa                Binary Scientific Notation
  50.0 = 0|10000100|10010000000000000000000 = 1.10010000000000000000000x2^5
  92.0 = 0|10000101|01110000000000000000000 = 1.01110000000000000000000x2^6

Step two is to raise the exponent of 50.0 so that the exponent 5 becomes 6. Therefore we need to shift its bits 1 place to the right to account for the increase in the exponent.

1.10010000000000000000000x2^5 becomes 0.11001000000000000000000x2^6
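As a rough sketch of this alignment step, assuming each significand is held as a 24-bit integer with the implicit leading 1 made explicit (so 50.0 is 0xC80000 and 92.0 is 0xB80000). The name `align` is made up for illustration, and bits shifted out on the right are simply dropped here, which a real implementation would track for rounding.

```python
def align(sig_a, exp_a, sig_b, exp_b):
    """Shift the significand with the smaller exponent right so both share one exponent.

    Returns (sig_a, sig_b, common_exponent). Shifted-out bits are discarded
    (a simplification: no guard/round/sticky bits are kept).
    """
    if exp_a < exp_b:
        sig_a >>= exp_b - exp_a
        exp_a = exp_b
    elif exp_b < exp_a:
        sig_b >>= exp_a - exp_b
    return sig_a, sig_b, exp_a

# 50.0 = 1.1001 x 2^5, 92.0 = 1.0111 x 2^6
sig50, sig92, exp = align(0xC80000, 5, 0xB80000, 6)
print(f"{sig50:024b} x 2^{exp}")   # 50.0 with the binary point after the first bit
print(f"{sig92:024b} x 2^{exp}")
```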

Step three is to get the two's complement of the second value, because we are subtracting 92.0, not adding it.

1.01110000000000000000000x2^6 inverted is 0.10001111111111111111111x2^6
0.10001111111111111111111x2^6 + 1 is 0.10010000000000000000000x2^6

The final step is to add the Mantissas together

  0.11001000000000000000000x2^6
+ 0.10010000000000000000000x2^6
_______________________________
  1.01011000000000000000000x2^6

Now this final bit is where I get a bit confused, because the final result of -42 in IEEE 754 format is

       Sign Exp          Mantissa
-42.0 = 1|10000100|01010000000000000000000

And obviously the Mantissa

01010000000000000000000 is not
01011000000000000000000

Does anyone have some insight as to what I am doing wrong? Thanks

Answer by Eric Postpischil

You have not used enough bits to handle two’s complement correctly, and you have not handled a negative result.

In complementing the positive 1.01110000000000000000000₂×2^6, you got 0.10010000000000000000000₂×2^6. The result should be a negative number, but a leading 0 in two’s complement indicates a positive number. In other words, your complement operation overflowed the format.

If you prefix a leading 0 and then complement, you will have 10.10010000000000000000000₂×2^6, and you will add this to 00.11001000000000000000000₂×2^6, which has also had a 0 prefixed. Then the sum is 11.01011000000000000000000₂×2^6.

The leading bit is 1, indicating the result is negative. So you can complement it again to see the absolute value, 00.10101000000000000000000₂×2^6, meaning the result is −00.10101000000000000000000₂×2^6.

Finally, you normalize this to −1.01010000000000000000000₂×2^5, which is −42.
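The steps above can be replayed with plain integers. This is a sketch only, assuming a 25-bit working width (2 integer bits plus 23 fraction bits, i.e. the extra leading 0 described above) and operands that are already aligned to the same exponent; `subtract_aligned` and `twos_complement` are illustrative names.

```python
WIDTH = 25                      # assumed working width: 2 integer bits + 23 fraction bits
MASK = (1 << WIDTH) - 1
SIGN_BIT = 1 << (WIDTH - 1)     # top bit acts as the two's complement sign
UNIT_BIT = 1 << 23              # position of the "1." bit of a normalized significand

def twos_complement(v):
    """Negate v within WIDTH bits."""
    return (-v) & MASK

def subtract_aligned(a, b, exponent):
    """Compute a - b for aligned significands; return (negative, magnitude, exponent)."""
    total = (a + twos_complement(b)) & MASK
    negative = bool(total & SIGN_BIT)
    magnitude = twos_complement(total) if negative else total
    # Normalize: shift left until the unit bit is set, lowering the exponent each time.
    while magnitude and not (magnitude & UNIT_BIT):
        magnitude <<= 1
        exponent -= 1
    return negative, magnitude, exponent

A = 0b00_11001000000000000000000   # 50.0 as 00.11001 x 2^6
B = 0b01_01110000000000000000000   # 92.0 as 01.0111  x 2^6

negative, magnitude, exponent = subtract_aligned(A, B, 6)
value = (-1) ** negative * magnitude / 2**23 * 2**exponent
print(f"sign={int(negative)} significand={magnitude:025b} exponent={exponent} value={value}")
```

With only 24 bits the complement of 92.0 loses its sign bit, which is the overflow described at the start of this answer.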

Notes

This explanation is not an endorsement of using two’s complement. Implementing a direct subtractor may be preferred.
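For comparison, here is one way a direct (sign-magnitude) subtractor could look. It is a minimal sketch under loud assumptions: both inputs are positive, normal binary32 values; shifted-out bits are truncated rather than rounded; and subnormal results, infinities and NaNs are not handled. All function names are hypothetical.

```python
import struct

def float_to_bits(x):
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(b):
    return struct.unpack(">f", struct.pack(">I", b))[0]

def sub_positive(a_bits, b_bits):
    """Return the binary32 bit pattern of a - b for positive, normal inputs."""
    # For positive floats, comparing the bit patterns compares the values.
    negative = a_bits < b_bits
    if negative:
        a_bits, b_bits = b_bits, a_bits        # ensure a >= b, remember the sign
    exp_a = (a_bits >> 23) & 0xFF
    exp_b = (b_bits >> 23) & 0xFF
    sig_a = (a_bits & 0x7FFFFF) | 0x800000     # restore the implicit leading 1
    sig_b = (b_bits & 0x7FFFFF) | 0x800000
    sig_b >>= exp_a - exp_b                    # align the smaller operand (truncating)
    diff = sig_a - sig_b                       # never negative, so no complementing
    if diff == 0:
        return 0                               # +0.0
    while not (diff & 0x800000):               # normalize (assumes exponent stays > 0)
        diff <<= 1
        exp_a -= 1
    return (int(negative) << 31) | (exp_a << 23) | (diff & 0x7FFFFF)

result = sub_positive(float_to_bits(50.0), float_to_bits(92.0))
print(f"{result:032b} -> {bits_to_float(result)}")
```

Because the larger magnitude is always the minuend, the sign is decided up front and the significand subtraction cannot overflow.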

“Significand” is the preferred term for the fraction part of a floating-point number. “Mantissa” is an old term for the fraction part of a logarithm. Significands are linear (if the number increases by a factor of 1.2, the significand increases by a factor of 1.2, unless an exponent threshold is crossed), whereas mantissas are logarithmic (adding to the mantissa multiplies the value represented).