FMA is a fused multiply-add instruction. The fmaf(float x, float y, float z) function in glibc is implemented with the vfmadd213ss instruction. I want to know how this instruction is implemented. My understanding is:
- Add the exponents of x and y.
- Multiply the mantissas of x and y.
- Normalize the result of x*y, but do not round it yet.
- Compare the exponent of z with that of the product and shift the mantissa of the value with the smaller exponent.
- Add the mantissas and normalize the result again.
- Round to nearest (RN).
The current x86-64 architecture implements so-called FMA3 variants: Since a fused-multiply add operation requires three source operands, if the instruction implementing it has only three operands in total, one must specify which of the source operands is also the destination:
vfmadd132ss, vfmadd213ss, vfmadd231ss. (The digits indicate which source operands form the product and the addend: vfmadd213ss, for example, computes src2*src1 + src3, overwriting operand 1.) In terms of mathematical functionality these instructions are all equivalent, and compute a*b+c with a single rounding, where a, b, and c are IEEE-754 binary32 (single-precision) operands, which in programming languages of the C and C++ families are typically mapped to float.

The top-level algorithmic outline provided in the question is correct. The code below demonstrates how one can implement all the necessary details under the restrictions that IEEE-754 floating-point exceptions are turned off (i.e. the code provides the masked responses prescribed by the IEEE-754 standard) and that subnormal support is turned on (many platforms, including x86-64, also support non-standard "flush-to-zero" and "denormals-are-zero" modes). When more than one NaN operand is supplied to FMA, any one of them, or a canonical NaN, may form the basis of the result. In the code below I simply matched the behavior of the Xeon W-2133 CPU in my workstation; adjustments may be needed for other processors.
The code below is a garden-variety FMA emulation, coded for reasonable performance and reasonable clarity. If a platform offers a CLZ (count leading zeros) instruction or a related primitive, it is helpful to interface with it through intrinsics. Correct rounding depends crucially on proper tracking of the round and sticky bits. In hardware, these are typically two actual bits, but for a software emulation it is often convenient to use an entire unsigned integer (rndstk in the code below), whose most significant bit represents the round bit and whose remaining lower-order bits, taken collectively (i.e. ORed together), represent the sticky bit.

For practical (faster) fmaf() emulation one typically relies on performing the intermediate computation in IEEE-754 binary64 (double precision). This gives rise to tricky double-rounding issues, and not all implementations in common open-source libraries handle them correctly. The best way known from the literature is to use a special rounding mode, rounding to odd, for the intermediate computation. See:

Sylvie Boldo and Guillaume Melquiond, "Emulation of a FMA and correctly rounded sums: Proved algorithms using rounding to odd," IEEE Transactions on Computers, Vol. 57, No. 4, April 2008, pp. 462-471.
Rigorously testing an FMA implementation, whether in hardware or software, is a hard problem. Because the search space is huge, simply using massive amounts (say, tens of billions) of random test vectors is going to provide only a "smoke" test, useful for demonstrating that the implementation is not hopelessly broken. Below I am adding pattern-based tests that are capable of exercising a number of corner cases. Nonetheless the code below should be considered only lightly tested.
If you plan to use the FMA emulation in any kind of professional capacity, I highly recommend investing serious time into ensuring functional correctness; there are frankly too many broken FMA emulations out there already. For industrial-strength implementations, hardware vendors employ mechanically-checked mathematical proofs of correct operation. This book by a practitioner with extensive experience provides a good overview of how that works in practice:
David M. Russinoff, Formal Verification of Floating-Point Hardware Design: A Mathematical Approach, Springer 2019.
There are also professional test suites for exercising numerous corner cases, for example the FPgen floating-point test generator from the IBM Research Labs in Haifa. They used to make a collection of single-precision test vectors freely available on their website, but I cannot seem to find them anymore.