Is it really efficient to use the Karatsuba algorithm in 64-bit x 64-bit multiplication?


I'm working with AVX2 and need to compute a 64-bit x 64-bit -> 128-bit widening multiplication and get the 64-bit high part as fast as possible. Since AVX2 has no such instruction, is it reasonable for me to use the Karatsuba algorithm for efficiency and speed?

4 Answers

Accepted Answer (score 14)

No. On modern architectures the crossover at which Karatsuba beats schoolbook multiplication is usually somewhere between 8 and 24 machine words (i.e. between 512 and 1536 bits on x86_64). For fixed sizes the threshold is at the smaller end of that range, and the newer ADCX/ADOX instructions likely pull it in somewhat further for scalar code, but 64x64 is still far too small to benefit from Karatsuba.

Answer (score 5)

It's hard to tell without trying, but it might be faster to just use the AMD64 MUL instruction, which does 64x64=128 with the same throughput as most AVX2 instructions (but not vectorized). The drawback is that you need to move the operands to general-purpose registers if they were in YMM registers, giving something like LOAD + MUL + STORE for a single 64x64=128.

If you can vectorize Karatsuba with AVX2, try both AVX2 and MUL and see which is faster. If you can't vectorize, a single MUL will probably be faster; and if you can avoid the loads and stores to general-purpose registers, a single MUL will definitely be faster.

Both MUL and AVX2 instructions can take a memory operand with no loss of throughput, which may save one explicit load in the MUL case.
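
For illustration, here is a minimal sketch of that LOAD + MUL + STORE round trip, assuming a GCC/Clang-style compiler with unsigned __int128 support on x86-64 (the helper name is hypothetical, not a library function):

#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: high 64 bits of four 64x64 products, computed
// by round-tripping the YMM lanes through general-purpose registers.
static inline __m256i mulhi_epu64_scalar(__m256i a, __m256i b) {
    uint64_t va[4], vb[4], hi[4];
    _mm256_storeu_si256((__m256i *)va, a);   // spill YMM lanes (STORE)
    _mm256_storeu_si256((__m256i *)vb, b);
    for (int i = 0; i < 4; i++) {
        // one scalar 64x64=128 MUL per lane
        unsigned __int128 p = (unsigned __int128)va[i] * vb[i];
        hi[i] = (uint64_t)(p >> 64);
    }
    return _mm256_loadu_si256((const __m256i *)hi);  // reload (LOAD)
}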

Answer (score 19)

It's highly unlikely that AVX2 will beat the mulx instruction, which does 64bx64b to 128b in one instruction. The one exception I'm aware of is large multiplications using floating-point FFT.

However, if you don't need exactly 64bx64b to 128b, you could consider 53bx53b to 106b using double-double arithmetic.

Multiplying four pairs of 53-bit numbers a and b to get four 106-bit products takes only two instructions:

__m256d p = _mm256_mul_pd(a, b);        // rounded product (high part)
__m256d e = _mm256_fmsub_pd(a, b, p);   // exact rounding error: a*b - p

This gives four 106-bit products (each as a hi/lo pair of doubles, with p + e equal to a*b exactly) in two instructions, compared to one 128-bit product per mulx instruction.
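
A self-contained version of the idea, assuming a compiler targeting AVX2 and FMA (e.g. built with -mavx2 -mfma); the operand values below are arbitrary examples:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    // Operands must fit in a double's 53-bit significand for the
    // p + e == a*b identity to hold exactly.
    __m256d a = _mm256_set_pd(9007199254740991.0, 123456789.0, 7.0, 3.0);
    __m256d b = _mm256_set_pd(3.0, 987654321.0, 11.0, 5.0);

    __m256d p = _mm256_mul_pd(a, b);        // rounded product (high part)
    __m256d e = _mm256_fmsub_pd(a, b, p);   // exact rounding error (low part)

    double hi[4], lo[4];
    _mm256_storeu_pd(hi, p);
    _mm256_storeu_pd(lo, e);
    for (int i = 0; i < 4; i++)
        printf("lane %d: hi = %.17g, lo = %.17g\n", i, hi[i], lo[i]);
    return 0;
}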

Answer (score 7)

Do not listen to the naysayers. It MIGHT pay off under a number of assumptions:

  1. You have a large number of products to compute, enough to saturate the SIMD pipeline. This is necessary to hide the latency of the multiplications.
  2. You are willing to write hand-optimized SIMD intrinsics, obviously.
  3. You care about aggregate throughput, not latency.

Karatsuba trades one multiplication for roughly three extra additions: a 64x64 product built from 32-bit halves needs four 32x32 partial products with the schoolbook method, but only three with Karatsuba. Since the SIMD unit typically has more ports capable of addition/subtraction than multiplication, and since vector additions usually have higher throughput, cutting the number of products by 25% can free enough resources to absorb the much cheaper additions.

Now this of course will be highly dependent on the specifics of the CPU's SIMD pipeline; a good place to check is Agner Fog's instruction tables.

Also bear in mind that adding the two 32-bit halves for the middle product can yield 33-bit values, which strictly speaking aren't representable in a 32-bit lane. There are ways around this issue, but they come at a cost. However, if you only need the high bits, you can modify the algorithm slightly: compute the product of the differences of the halves instead of their sums. The difference has the same parity as the sum, so the discarded bit cancels out.
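
To make the arithmetic concrete, here is a scalar sketch of the three-multiply decomposition using the product-of-differences form of the middle term (all names are mine; a vectorized version would replace each 32x32 multiply with _mm256_mul_epu32 and track the carries in SIMD):

#include <stdint.h>

// Illustrative scalar Karatsuba: 64x64 -> 128 from three 32x32 -> 64
// multiplies (schoolbook would need four). The middle term uses
// |a_hi - a_lo| * |b_hi - b_lo| to keep the arithmetic unsigned.
static void mul64x64_karatsuba(uint64_t a, uint64_t b,
                               uint64_t *hi, uint64_t *lo) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t p_hi = (uint64_t)a_hi * b_hi;            // multiply #1
    uint64_t p_lo = (uint64_t)a_lo * b_lo;            // multiply #2

    // a_hi*b_lo + a_lo*b_hi == p_hi + p_lo -/+ |a_hi-a_lo|*|b_hi-b_lo|,
    // subtracted when the two differences have the same sign.
    uint64_t da = a_hi >= a_lo ? a_hi - a_lo : a_lo - a_hi;
    uint64_t db = b_hi >= b_lo ? b_hi - b_lo : b_lo - b_hi;
    int same_sign = (a_hi >= a_lo) == (b_hi >= b_lo);
    uint64_t dd = da * db;                            // multiply #3

    // The true middle term can exceed 64 bits (the "33-bit" issue),
    // so track the carry out of the 64-bit accumulator explicitly.
    uint64_t mid = p_hi + p_lo;
    uint64_t carry = mid < p_hi;                      // carry from the sum
    if (same_sign) {
        carry -= dd > mid;    // borrow; never underflows, middle term >= 0
        mid -= dd;
    } else {
        uint64_t t = mid + dd;
        carry += t < mid;
        mid = t;
    }

    // result = p_hi*2^64 + (carry*2^64 + mid)*2^32 + p_lo
    *lo = p_lo + (mid << 32);
    *hi = p_hi + (mid >> 32) + (carry << 32) + (*lo < p_lo);
}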