Given two IEEE-754 double-precision floating-point numbers a and b, I want to get the exact quotient a/b rounded to an integer towards zero.
A C99 program to do that could look like this:
#include <fenv.h>
#include <math.h>
#pragma STDC FENV_ACCESS on
double trunc_div(double a, double b) {
int old_mode = fegetround();
fesetround(FE_TOWARDZERO);
double result = a/b; // rounding occurs here
fesetround(old_mode);
return trunc(result);
}
#include <stdio.h>
int main() {
// should print "6004799503160662" because 18014398509481988 / 3 = 6004799503160662.666...
printf("%.17g", trunc_div(18014398509481988.0, 3.0));
}
Now suppose I only have access to the nearest-ties-to-even rounding mode: I could be using GCC with optimizations, compiling for a microcontroller, or having to make it work in JavaScript.
What I've tried is to compute a/b with the provided rounding, truncate, and compensate if the magnitude of the result is too large:
double trunc_div(double a, double b) {
double result = trunc(a/b);
double prod = result * b;
if (a > 0) {
if (prod > a || (prod == a && mul_error(result, b) > 0)) {
result = trunc(nextafter(result, 0.0));
}
}
else {
if (prod < a || (prod == a && mul_error(result, b) < 0)) {
result = trunc(nextafter(result, 0.0));
}
}
return result;
}
The helper function mul_error computes the exact multiplication error (using Veltkamp-Dekker splitting):
// Return the 26 most significant bits of a.
// Assume fabs(a) < 1e300 so that the multiplication doesn't overflow.
double highbits(double a) {
double p = 0x8000001L * a;
double q = a - p;
return p + q;
}
// Compute the exact error of a * b.
double mul_error(double a, double b) {
if (!isfinite(a*b)) return -a*b;
int a_exp, b_exp;
a = frexp(a, &a_exp);
b = frexp(b, &b_exp);
double ah = highbits(a), al = a - ah;
double bh = highbits(b), bl = b - bh;
double p = a*b;
double e = ah*bh - p; // The following multiplications are exact.
e += ah*bl;
e += al*bh;
e += al*bl;
return ldexp(e, a_exp + b_exp);
}
Can the compensation fail for some inputs (for example, due to overflow or underflow)?
Is there a faster way?
Edit: Changed the first line of mul_error from … return a*b to … return -a*b;. This fixes the cases where a = ±∞; finite inputs were OK.
Thanks to Eric Postpischil for catching the error.
Edit: If a, b are finite and non-zero and the division a/b overflows, I'd like to match IEEE-754 division in round-to-zero mode, which returns the maximum finite double-precision number ±(2¹⁰²⁴ − 2⁹⁷¹).
Edit: The functions frexp and ldexp can be called only when necessary.
That's a 30% speedup on doubles a, b with uniformly random bits.
double mul_error(double a, double b) {
if (!isfinite(a*b)) return -a*b;
double A = fabs(a), B = fabs(b);
// bounds from http://proval.lri.fr/gallery/Dekker.en.html
if (A>0x1p995 || B>0x1p995 || (A*B!=0 && (A*B<0x1p-969 || A*B>0x1p1021))) {
// ... can overflow/underflow: use frexp, ldexp
} else {
// ... no need for frexp, ldexp
}
}
Maybe ldexp is always unnecessary because we only need to know how mul_error compares to 0.
Edit: Here's how to do it if you have 128-bit integers available. (It's slower than the original version.)
double trunc_div(double a, double b) {
typedef uint64_t u64;
typedef unsigned __int128 u128;
if (!isfinite(a) || !isfinite(b) || a==0 || b==0) return a/b;
int sign = signbit(a)==signbit(b) ? +1 : -1;
int ea; u64 ua = frexp(fabs(a), &ea) * 0x20000000000000;
int eb; u64 ub = frexp(fabs(b), &eb) * 0x20000000000000;
int scale = ea-53 - eb;
u64 r = ((u128)ua << 53) / ub; // integer division truncates
if (r & 0xFFE0000000000000) { r >>= 1; scale++; } // normalize
// Scale<0 means that we have fractional bits. Shift them out.
double d = scale<-63 ? 0 : scale<0 ? r>>-scale : ldexp(r, scale);
// Return the maximum finite double on overflow.
return sign * (isfinite(d) ? d : 0x1.fffffffffffffp1023);
}
Consider the exact remainder
r=frem(a,b).We know that
a = b*n + rfor some integer n, with r between -b/2 and b/2.And
a/b = n + r/bwithr/bbetween -1/2 and 1/2 (/ is exact division here).We can imagine 2 cases when
float(a/b)would round to an upper integer part:float(n+r/b)=nnitself is too big to be represented as floating pointAn example of 1st case is
In this case,
n=1416003655831andfloat(a/b)rounds up to n, the residue-r/bbeing smaller thanulp(n).Note that testing for
a > 0 && fma(result,b,-a) > 0is OK, but adjusting withnextafter(result,0.0)is not in this case, it would lead to an non integer result1416003655830.999755859375. We should rather takeresult-1whentrunc(a/b) < 2^53.For example of 2nd case take:
We have n being 2^54+2, the exact mid point between a and nextafter(2,2*a)
With positive remainder r,
trunc(float(a/b))will be rounded up to a+4.And the discussion about sign of
rshown in 1st case does not work here, so it can't be generalized...Note that second case can always reduce to first case by appropriate scaling:
But this has no practical interest, first case still require adjusting the result for case of round-up.
So, the adjustment can fail to answer an integer as we saw in 1st example, and this answer doesn't show a faster way, there's probably not much to gain.