I am writing a function library to provide all conventional operators and functions for signed-integer types s0128, s0256, s0512, s1024 and floating-point types f0128, f0256, f0512, f1024.
I am writing the s0128, s0256, s0512, s1024 multiply routines now, but am getting erroneous results that confuse me. I assumed I could cascade multiplies with the 64-bit imul rcx instruction (which produces a 128-bit result in rdx:rax) in the same way I can with unsigned operands and the mul rcx instruction... but the answers with imul are wrong.
I suspect there is some trick to make this work, maybe mix imul and mul instructions - or something. Or is there some reason one cannot implement larger multiplies with signed multiply instructions?
So you understand the technique, I'll describe the smallest version, for s0128 operands.
                         arg2.1   arg2.0    : two 64-bit parts of s0128 operand
                         arg1.1   arg1.0    : two 64-bit parts of s0128 operand
                         ---------------
                  0     out.rdx  out.rax    : output of arg1.0 * arg2.0
            out.rdx     out.rax             : output of arg1.0 * arg2.1
            -------------------------
              out.2       out.1    out.0    : sum the above intermediate results
            out.rdx     out.rax             : output of arg1.1 * arg2.0
            -------------------------
              out.2       out.1    out.0    : sum the above intermediate results
Each time the code multiplies two 64-bit values, it generates a 128-bit result in rdx:rax. Each time the code generates a 128-bit result, it sums that result into an accumulating triple of 64-bit registers with addq, adcq, adcq instructions (where the final adcq instruction only adds zero to ensure any carry flag gets propagated).
When I multiply small negative numbers by small positive numbers as a test, the result is negative, but there are one or two non-zero bits at the bottom of the upper 64-bit value in the 128-bit s0128 result. This implies to me that something isn't quite right with propagation in multiprecision signed multiplies.
Of course the cascade is quite a bit more extensive for s0256, s0512, s1024.
What am I missing? Must I convert both operands to unsigned, perform unsigned multiply, then negate the result if one (but not both) of the operands was negative? Or can I compute multiprecision results with the imul signed multiply instruction?
When you build an extended precision signed multiply out of smaller multiplies, you end up with a mixture of signed and unsigned arithmetic.
In particular, if you break a signed value in half, you treat the upper half as signed, and the lower half as unsigned. The same is true for extended precision addition, in fact.
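To make the split concrete, here is a minimal C sketch using 32-bit values and 16-bit halves. The helper names `high_half`, `low_half`, and `join_halves` are illustrative assumptions, and the code assumes the usual two's-complement behavior when converting an out-of-range value to `int16_t`.

```c
#include <stdint.h>

/* High half: carries the sign, so it is treated as a signed 16-bit value.
   Assumes two's-complement wraparound on the narrowing conversion. */
static int16_t high_half(int32_t v)
{
    return (int16_t)((uint32_t)v >> 16);
}

/* Low half: plain bits with no sign of their own, so it is unsigned. */
static uint16_t low_half(int32_t v)
{
    return (uint16_t)((uint32_t)v & 0xFFFFu);
}

/* Recombine: value = signed_high * 2^16 + unsigned_low. */
static int32_t join_halves(int16_t hi, uint16_t lo)
{
    return (int32_t)hi * 65536 + (int32_t)lo;
}
```

For example, -5 splits into a high half of -1 and a low half of 0xFFFB, and -1 * 65536 + 65531 gives back -5. Treating the low half as signed instead would give -1 * 65536 + (-5), which is wrong.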
Consider this arbitrary example, where AH and AL represent the high and low halves of A, and BH and BL represent the high and low halves of B. (Note: these aren't meant to represent x86 register halves, just halves of a multiplicand.) The L terms are unsigned and the H terms are signed.

                      AH : AL
                 x    BH : BL
        ---------------------
                      AL * BL     (unsigned x unsigned)
                 AL * BH          (unsigned x signed)
                 AH * BL          (signed x unsigned)
            AH * BH               (signed x signed)

The AL * BL product is unsigned because both AL and BL are unsigned. Therefore, it gets zero extended when you promote it to the full precision of the result.

The AL * BH and AH * BL products mix signed and unsigned values. The resulting product is signed, and that needs to be sign extended when you promote it to the full precision of the result.

The AH * BH product is signed, and it already occupies the most significant part of the result, so it needs no further extension.

The following C code demonstrates a 32×32 multiply implemented in terms of 16×16 multiplies. The same principle applies when building 128×128 multiplies out of 64×64 multiplies.
This pattern extends even if you break the multiplicands into more than two pieces. That is, only the most-significant piece of a signed number gets treated as signed. All of the other pieces are unsigned. Consider this example, which divides each multiplicand into 3 pieces:
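One way to sketch the three-piece case in C is to multiply two signed 24-bit values out of 8-bit pieces. The function name `mul24x24_from_8` and the choice of 24-bit operands held in `int32_t` are assumptions made for illustration; inputs are expected to lie in the signed 24-bit range, and the conversion to `int8_t` assumes two's-complement wraparound.

```c
#include <stdint.h>

/* Signed 24x24 -> 48 multiply from 8x8 partial products.
   Only the most-significant piece of each operand is signed. */
int64_t mul24x24_from_8(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    int32_t pa[3], pb[3];

    pa[0] = (int32_t)(ua & 0xFFu);                 /* unsigned */
    pa[1] = (int32_t)((ua >> 8) & 0xFFu);          /* unsigned */
    pa[2] = (int8_t)((ua >> 16) & 0xFFu);          /* signed   */

    pb[0] = (int32_t)(ub & 0xFFu);                 /* unsigned */
    pb[1] = (int32_t)((ub >> 8) & 0xFFu);          /* unsigned */
    pb[2] = (int8_t)((ub >> 16) & 0xFFu);          /* signed   */

    int64_t result = 0;
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            /* Each partial product keeps its sign when widened to 64 bits,
               then is weighted by 2^(8*(i+j)). */
            int64_t partial = (int64_t)(pa[i] * pb[j]);
            result += partial * ((int64_t)1 << (8 * (i + j)));
        }
    }
    return result;
}
```

Of the nine partial products here, four involve only unsigned pieces, four mix a signed piece with an unsigned one, and one is signed × signed.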
Because of all the mixed-signedness and sign extension fun, it's often just easier to implement a signed × signed multiply as an unsigned × unsigned multiply, and conditionally negate at the end if the signs of the multiplicands differ. (And, in fact, when you get to the extended precision float, as long as you stay in sign-magnitude form like IEEE-754, you won't have to deal with signed multiply.)
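A minimal sketch of the unsigned-multiply-then-negate approach, again at 32×32 to 64 bits for brevity. The name `mul32x32_via_unsigned` is an illustrative assumption, and the single `uint64_t` multiply stands in for whatever unsigned extended precision routine you already have.

```c
#include <stdint.h>

/* Signed multiply done as: take magnitudes, multiply unsigned,
   negate the product if exactly one operand was negative. */
int64_t mul32x32_via_unsigned(int32_t a, int32_t b)
{
    int negative = (a < 0) != (b < 0);

    /* Unsigned negation yields the magnitude, and is well defined
       even for INT32_MIN (which has no positive int32_t counterpart). */
    uint32_t ua = (a < 0) ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = (b < 0) ? 0u - (uint32_t)b : (uint32_t)b;

    uint64_t magnitude = (uint64_t)ua * ub;   /* unsigned x unsigned */

    return negative ? -(int64_t)magnitude : (int64_t)magnitude;
}
```

The magnitude never exceeds 2^62, so converting it to `int64_t` and negating it cannot overflow.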
This assembly gem shows how to negate extended precision values efficiently. (The gems page is a little dated, but you may find it interesting / useful.)
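For reference, plain two's-complement negation of a multi-word value can be sketched in C as below. This is not the technique from the linked gem, just the straightforward invert-and-add-one approach; the name `negate_limbs` and the least-significant-word-first layout are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Negate a multi-word two's-complement value in place.
   limbs[0] is the least-significant 64-bit word. */
void negate_limbs(uint64_t *limbs, size_t count)
{
    uint64_t carry = 1;                      /* the "+1" of two's complement */
    for (size_t i = 0; i < count; i++) {
        uint64_t inverted = ~limbs[i];
        limbs[i] = inverted + carry;
        carry = (limbs[i] < inverted);       /* 1 only if the add wrapped */
    }
}
```

Negating the two-word value 1 this way yields all ones in both words, which is -1 as a 128-bit signed value.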