How to split a 128-bit xmm register into two 64-bit quadwords?
I have a very large number in xmm1 and want to get the higher quadword to r9 and the lower quadword to r10, or RAX and RDX.
movlpd and movhpd only work from reg to mem or vice versa.
SSE2 (baseline for x86-64) has instructions for moving data directly between XMM and integer registers (without bouncing through memory). It's easy for the low element of a vector: MOVD or MOVQ. To extract higher elements, you can just shuffle the element you want down to the low element of a vector.
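
As a rough illustration of the movq-plus-shuffle approach, here is a minimal sketch using C intrinsics instead of hand-written asm. It assumes an x86-64 compiler that provides `<emmintrin.h>`; the helper name `split_u128` is made up for this example, and the compiler is free to pick different instructions than the ones named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: split a 128-bit vector into two 64-bit integers. */
static void split_u128(__m128i v, uint64_t *lo, uint64_t *hi)
{
    /* Low quadword: typically a single movq r64, xmm. */
    *lo = (uint64_t)_mm_cvtsi128_si64(v);

    /* High quadword: shuffle it down to the low element first
       (typically punpckhqdq), then movq it out. */
    __m128i high = _mm_unpackhi_epi64(v, v);
    *hi = (uint64_t)_mm_cvtsi128_si64(high);
}
```

In asm terms this corresponds to a movq for the low half, then punpckhqdq followed by another movq for the high half.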
SSE4.1 also added insert/extract for sizes other than 16-bit (e.g. PEXTRQ). Other than code-size, it's not actually faster than a separate shuffle and movq on any existing CPUs, but it means you don't need any extra tmp registers.
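
The PEXTRQ route can be sketched the same way. This assumes GCC or Clang (the `target` attribute is a compiler extension, used here so the file builds without `-msse4.1`) and a CPU that supports SSE4.1; `high_qword_pextrq` is a hypothetical name.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>

/* Hypothetical helper: extract the high quadword without a temporary vector.
   Assumes GCC/Clang and an SSE4.1-capable CPU. */
static __attribute__((target("sse4.1")))
uint64_t high_qword_pextrq(__m128i v)
{
    /* Element index 1 selects bits 127:64; typically compiles to pextrq r64, xmm, 1. */
    return (uint64_t)_mm_extract_epi64(v, 1);
}
```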
PUNPCKHQDQ is the most efficient way to do this. It's fast even on old CPUs with slow shuffles for element-sizes smaller than 64-bit, like 65nm Core2 (Merom/Conroe). See my horizontal sum answer for more details about that. PUNPCKHQDQ doesn't have an immediate operand, and is only SSE2, so it's only 4 bytes of code-size.
To preserve the original value of xmm0, use pshufd with a different destination. pshufd can also swap the high and low halves in place, or do any other dword shuffle you want.
As for movlpd / movhpd: there's no point in ever using them. Use movlps / movhps instead, because they're shorter and no CPUs care about float vs. double.
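
Returning to the pshufd option: a small sketch of the copy-and-shuffle form, which leaves the source vector untouched. It assumes an x86-64 compiler with `<emmintrin.h>`; `high_qword_pshufd` is a hypothetical name, and the instruction named in the comment is what compilers usually emit, not a guarantee.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Hypothetical helper: get the high quadword while leaving v unmodified. */
static uint64_t high_qword_pshufd(__m128i v)
{
    /* Dword order 1,0,3,2 swaps the two 64-bit halves (immediate 0x4E).
       pshufd writes a separate destination, so v keeps its value. */
    __m128i swapped = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    return (uint64_t)_mm_cvtsi128_si64(swapped);
}
```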
You can use movhlps xmm1, xmm0 to extract the high half of xmm0 into another register, but mixing FP shuffles with integer-vector operations will cause bypass delays on some CPUs (specifically Intel Nehalem). Also beware that movhlps merges into its destination, so the dependency on the old value of xmm1 can cause a latency bottleneck.
Definitely prefer pshufd for this in general. But you could use movhlps if you're tuning for a specific CPU like Core2, where movhlps is fast and runs in the integer domain, and pshufd is slow.
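
For completeness, the movhlps variant might look like this with intrinsics. This is only a sketch under the same assumptions as before (x86-64, `<emmintrin.h>`); `high_qword_movhlps` is a hypothetical name, and an optimizing compiler may well replace the FP shuffle with an integer one.

```c
#include <emmintrin.h>  /* SSE2; pulls in the SSE1 _mm_movehl_ps as well */
#include <stdint.h>

/* Hypothetical helper: extract the high quadword with an FP-domain shuffle. */
static uint64_t high_qword_movhlps(__m128i v)
{
    /* The casts only reinterpret bits; they generate no instructions. */
    __m128 f = _mm_castsi128_ps(v);

    /* movhlps dst, src copies src[127:64] into dst[63:0].
       Using the same vector for both operands avoids depending on
       the stale contents of an unrelated register. */
    __m128 high = _mm_movehl_ps(f, f);

    return (uint64_t)_mm_cvtsi128_si64(_mm_castps_si128(high));
}
```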