I was looking at implementing the following computation, where divisor
is nonzero and not a power of two
unsigned multiplier(unsigned divisor)
{
unsigned shift = 31 - clz(divisor);
uint64_t t = 1ull << (32 + shift);
return t / div;
}
in a manner that is efficient for processors that lack 64-bit integer and floating-point instructions, but may have 32-bit fused multiply-add (such as GPUs, which also will lack division).
This calculation is useful for finding "magic multipliers" involved in optimizing division, when the divisor is known ahead of time, to a multiply-high instruction followed by a bitwise shift. Unlike code used in compilers and reference code in libdivide, it finds largest such multiplier.
One additional twist is that in the application I was looking at, I anticipated that divisor
will almost always be representable in float
type. Therefore, it would make sense to have an efficient "fast path" that will handle those divisors, and a size-optimized "slow path" that would handle the rest.
The solution I came up with performs long division with remainder that is specialized for this particular scenario (dividend is a power of two) in 6 or 8 FMA operations on the "fast path", and then performs a binary search with 8 iterations on the "slow path".
The following program performs exhaustive testing of the proposed solution (needs about 1-2 minutes on an FMA-capable CPU).