This question came up when reviewing the WebAssembly SIMD proposal for extended multiplication.
To support older hardware, we need to support SSE2, and the only vector multiplication operation for 32-bit integers there is pmuludq. (Signed pmuldq was only added in SSE4.1.)
(Non-widening is relatively easy: shuffle to feed 2x pmuludq and take the low halves of the 4 results to emulate SSE4.1 pmulld.)
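For reference, the non-widening emulation mentioned in the parenthetical could look roughly like the following. This is only a sketch: the function name mullo_epi32_sse2 is made up for illustration, and the particular shuffle constants are one of several ways to arrange the lanes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Hypothetical helper: low 32 bits of each 32x32 product, SSE2 only
   (the operation SSE4.1 provides as pmulld). */
static __m128i mullo_epi32_sse2(__m128i a, __m128i b) {
    /* pmuludq multiplies 32-bit lanes 0 and 2 into two 64-bit products. */
    __m128i even = _mm_mul_epu32(a, b);
    /* Shuffle lanes 1 and 3 into positions 0 and 2, then multiply again. */
    __m128i odd = _mm_mul_epu32(_mm_shuffle_epi32(a, _MM_SHUFFLE(3, 3, 1, 1)),
                                _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 3, 1, 1)));
    /* Move the low half of each 64-bit product into lanes 0 and 1. */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    /* Interleave back to the order [p0, p1, p2, p3]. */
    return _mm_unpacklo_epi32(even, odd);
}
```

The low 32 bits of a product are the same for signed and unsigned inputs, which is why the unsigned pmuludq is sufficient here.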
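The signed high half of a product can be derived from the unsigned high half, with all arithmetic taken modulo 2^32:
mulhs(a, b) = mulhu(a, b) - (a < 0 ? b : 0) - (b < 0 ? a : 0)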
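A scalar check of this identity may make it easier to trust. The sketch below is illustrative only; the names mulhu32, mulhs32_from_mulhu and mulhs32_ref are invented here, and it assumes the usual two's-complement conversion from uint32_t to int32_t that mainstream compilers implement.

```c
#include <stdint.h>

/* High 32 bits of the unsigned 32x32 -> 64 product. */
static uint32_t mulhu32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Signed high half derived from the unsigned one via the identity above.
   The subtractions wrap modulo 2^32, which is what the identity relies on. */
static int32_t mulhs32_from_mulhu(int32_t a, int32_t b) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t hi = mulhu32(ua, ub);
    if (a < 0) hi -= ub;
    if (b < 0) hi -= ua;
    return (int32_t)hi;  /* assumes two's-complement wraparound on conversion */
}

/* Reference: high 32 bits of the true signed 64-bit product. */
static int32_t mulhs32_ref(int32_t a, int32_t b) {
    return (int32_t)((uint64_t)((int64_t)a * b) >> 32);
}
```

The reason it works: reinterpreting a negative a as unsigned adds 2^32 to it, so the unsigned product picks up an extra 2^32 * b, which lands entirely in the high half. The same holds with the roles of a and b swapped.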
Using that, two signed double-width products can be computed like this:
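A minimal sketch of one way to apply the identity with SSE2 intrinsics follows. It is an illustration under stated assumptions, not necessarily the exact sequence intended here: the function name mul_epi32_sse2 is hypothetical, and the sign masks are built with arithmetic shifts instead of being loaded from memory.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: signed 32x32 -> 64 products of lanes 0 and 2,
   i.e. the behaviour of SSE4.1 pmuldq, using only SSE2. */
static __m128i mul_epi32_sse2(__m128i a, __m128i b) {
    /* Unsigned 64-bit products of 32-bit lanes 0 and 2. */
    __m128i prod = _mm_mul_epu32(a, b);
    /* All-ones in each 32-bit lane where the input is negative. */
    __m128i a_neg = _mm_srai_epi32(a, 31);
    __m128i b_neg = _mm_srai_epi32(b, 31);
    /* (a < 0 ? b : 0) + (b < 0 ? a : 0), per 32-bit lane, modulo 2^32. */
    __m128i corr = _mm_add_epi32(_mm_and_si128(a_neg, b),
                                 _mm_and_si128(b_neg, a));
    /* Move the correction from lanes 0 and 2 into the high halves of the
       64-bit lanes; the unused odd-lane values are shifted out. */
    corr = _mm_slli_epi64(corr, 32);
    /* Subtract from the high halves only; the low halves are unchanged. */
    return _mm_sub_epi64(prod, corr);
}
```

Because this version derives its masks with shifts, it needs no constant load, so its operation count and its behaviour in cold code may differ from the listing that the remarks after this point were written about.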
That saves a couple of operations over the other proposal, but it's very close, and it now includes a load, which could be bad if this code is cold (the load is then likely to miss in cache).