A simple floating-point addition x+y in gmpy2 with precision 4 (i.e. IEEE mantissa width 3) and 3 bits for the exponent (emax=3, emin=-4), for x = mpfr('0.75') and y = mpfr('0.03125'), incorrectly gives mpfr('0.75') as the result when it should be mpfr('0.8125'). Note that 0.03125 is a subnormal number for this reduced-precision format.
Edit: Terminal interaction extracted from link and included for future reference.
>>> "{0:.10Df}".format(mpfr('0.75')+mpfr('0.03125'))
'0.7500000000'
>>> get_context()
context(precision=4, real_prec=Default, imag_prec=Default,
round=RoundToNearest, real_round=Default, imag_round=Default,
emax=3, emin=-4,
subnormalize=True,
trap_underflow=False, underflow=False,
trap_overflow=False, overflow=False,
trap_inexact=False, inexact=True,
trap_invalid=False, invalid=False,
trap_erange=False, erange=False,
trap_divzero=False, divzero=False,
trap_expbound=False,
allow_complex=False)
>>>
Disclaimer: I maintain gmpy2.
I believe it is a bug with creating subnormals from a string. I think it is fixed in the development code but I won't be able to test until later. I'll update this answer later.
Update
The problem is not related to creating a subnormal from a string. In this case, the subnormal value is created properly. In gmpy2 2.0.x, there is a rare bug when converting a string to a subnormal. The simplest work-around is to convert the input to an mpq type first; i.e. mpfr(mpq('0.03125')).
The actual problem is the default rounding mode. The intermediate sum, 0.78125 (0.11001 in binary), is exactly halfway between two 4-bit values: 0.75 (0.1100) and 0.8125 (0.1101). The default rounding mode of RoundToNearest selects the rounded value with a final bit of 0, which is 0.75. If you change the rounding mode to RoundUp, you get the expected result.
One last comment: the values of precision, emax and emin are slightly different between the IEEE standards and the MPFR library. If e is the exponent size and p is the precision (in IEEE terms), then precision should be p+1, emax should be 2**(e-1) and emin should be 4-emax-precision. This doesn't impact your question since it only changes emax.
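To make the tie concrete without depending on gmpy2, here is a minimal pure-Python sketch using exact fractions. The helper names round_to_precision and mpfr_context_params are hypothetical and written only for illustration; they are not part of gmpy2 or MPFR. The rounding helper handles positive values only and ignores the exponent range (emax/emin) and subnormalization, which is enough here because the sum lies in the normal range.

```python
from fractions import Fraction


def round_to_precision(value, precision, mode="nearest-even"):
    """Round a positive Fraction to `precision` significant bits.

    Illustrative only: exponent limits and subnormals are ignored.
    """
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    # Find e such that 2**e <= value < 2**(e + 1).
    e = value.numerator.bit_length() - value.denominator.bit_length()
    if Fraction(2) ** e > value:
        e -= 1
    # Spacing between neighbouring representable values at this exponent.
    ulp = Fraction(2) ** (e - precision + 1)
    scaled = value / ulp
    lower = scaled.numerator // scaled.denominator  # floor, in units of ulp
    remainder = scaled - lower
    if mode == "nearest-even":
        # On an exact tie, keep the candidate whose last bit is 0.
        if remainder > Fraction(1, 2) or (remainder == Fraction(1, 2) and lower % 2 == 1):
            lower += 1
    elif mode == "up":
        if remainder > 0:
            lower += 1
    else:
        raise ValueError("unknown mode: " + mode)
    return lower * ulp


def mpfr_context_params(e, p):
    """Map IEEE exponent size e and precision p to (precision, emax, emin),
    using the formulas given in the answer above."""
    precision = p + 1
    emax = 2 ** (e - 1)
    emin = 4 - emax - precision
    return precision, emax, emin


total = Fraction("0.75") + Fraction("0.03125")  # exactly 25/32 = 0.78125
print(float(round_to_precision(total, 4)))             # 0.75
print(float(round_to_precision(total, 4, mode="up")))  # 0.8125
print(mpfr_context_params(3, 3))                       # (4, 4, -4)
```

With e = 3 and p = 3 the formulas give emax = 4 rather than the emax = 3 used in the question, while precision and emin are unchanged.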