Consider that you want to calculate the low 128-bits of the result of multiplying a 64-bit and 128-bit unsigned number, and that the largest multiplication you have available is the C-like 64-bit multiplication which takes two 64-bit unsigned inputs and returns the low 64-bits of the result.
How many multiplications are needed?
Certainly you can do it with eight: break all the inputs up into 32-bit chunks and use your 64-bit multiplication to do the 4 * 2 = 8 required full-width 32*32->64 multiplications, but can one do better?
Of course the algorithm should do only a "reasonable" number of additions or other basic arithmetic on top of the multiplications (I'm not interested in solutions that re-invent multiplication as an addition loop and hence claim "zero" multiplications).
Four, but it starts to get a little tricky.
Let a and b be the numbers to be multiplied, with a0 and a1 being the low and high 32 bits of a, respectively, and b0, b1, b2, b3 being 32-bit groups of b, from low to high respectively.
The desired result is the remainder of (a0 + a1•232) • (b0 + b1•232 + b2•264 + b3•296) modulo 2128.
We can rewrite that as (a0 + a1•232) • (b0 + b1•232) + (a0 + a1•232) • (b2•264 + b3•296) modulo 2128.
The remainder of the latter term modulo 2128 can be computed as a single 64-bit by 64-bit multiplication (whose result is implicitly multiplied by 264).
Then the former term can be computed with three multiplications using a carefully implemented Karatsuba step. The simple version would involve a 33-bit by 33-bit to 66-bit product which is not available, but there is a trickier version that avoids it:
The last line contains only one multiplication; the other two pseudo-multiplications are just conditional negations. Absolute-difference and conditional-negate are annoying to implement in pure C, but it could be done.