How to split an XMM 128-bit register into two 64-bit integer registers?


How to split a 128-bit xmm register into two 64-bit quadwords?

I have a very large number in xmm1 and want to get the higher quadword to r9 and lower quadword to r10, or RAX and RDX.

movlpd or movhpd only work between an XMM register and memory (in either direction), not between registers.


SSE2 (baseline for x86-64) has instructions for moving data directly between XMM and integer registers (without bouncing through memory). It's easy for the low element of a vector: MOVD or MOVQ. To extract higher elements, you can just shuffle the element you want down to the low element of a vector.

SSE4.1 also added insert/extract for sizes other than 16-bit (e.g. PEXTRQ). It saves code size, but it's not actually faster than a separate shuffle plus movq on any existing CPU. The other advantage is that you don't need an extra tmp register.

# SSE4.1
movq    rax, xmm0       # low qword
pextrq  rdx, xmm0, 1    # high qword
# 128b result in rdx:rax, ready for use with div r64 for example.
# (But watch out for #DE on overflow)
# also ready for returning as a __int128_t in the SystemV x86-64 ABI

# SSE2
movq       r10, xmm0
punpckhqdq xmm0, xmm0    # broadcast the high half of xmm0 to both halves
movq       r9,  xmm0

PUNPCKHQDQ is the most efficient way to do this. It's fast even on old CPUs with slow shuffles for element-sizes smaller than 64-bit, like 65nm Core2 (Merom/Conroe). See my horizontal sum answer for more details about that. PUNPCKHQDQ doesn't have an immediate operand, and is only SSE2, so it's only 4 bytes of code-size.

To preserve the original value of xmm0, use pshufd with a different destination register. pshufd can also swap the high and low halves in-place, or do any other rearrangement of 32-bit elements.
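
As a rough sketch of the pshufd immediates involved (written with intrinsics so it can be compiled and checked; `high_qword`, `swap_halves` and `demo_pshufd` are hypothetical names): 0xEE copies the high qword into both halves, and 0x4E swaps the two halves. In asm the non-destructive version would be something like pshufd xmm1, xmm0, 0xEE followed by movq r9, xmm1.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* imm8 0xEE selects source dwords 3,2,3,2 (high to low):
   the high qword lands in both halves of the result. */
uint64_t high_qword(__m128i v)
{
    __m128i t = _mm_shuffle_epi32(v, 0xEE);    /* pshufd tmp, v, 0xEE */
    return (uint64_t)_mm_cvtsi128_si64(t);     /* movq r64, tmp */
}

/* imm8 0x4E selects source dwords 1,0,3,2 (high to low):
   the two 64-bit halves trade places. */
__m128i swap_halves(__m128i v)
{
    return _mm_shuffle_epi32(v, 0x4E);         /* pshufd dst, v, 0x4E */
}

void demo_pshufd(void)
{
    __m128i v = _mm_set_epi64x(0x1122334455667788LL, 0x0123456789abcdefLL);
    assert(high_qword(v) == 0x1122334455667788ULL);

    __m128i s = swap_halves(v);
    assert((uint64_t)_mm_cvtsi128_si64(s) == 0x1122334455667788ULL);
    assert(high_qword(s) == 0x0123456789abcdefULL);

    /* The source vector is unchanged by either shuffle. */
    assert((uint64_t)_mm_cvtsi128_si64(v) == 0x0123456789abcdefULL);
}
```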


movlpd or movhpd ...

There's no point in ever using them. Use movlps / movhps instead, because they're shorter and no CPUs care about float vs. double.

You can use movhlps xmm1, xmm0 to extract the high half of xmm0 into another register, but mixing FP shuffles with integer-vector operations will cause bypass delays on some CPUs (specifically Intel Nehalem). Also beware that movhlps only writes the low half of its destination, so it depends on the old value of xmm1, which can create a latency bottleneck.

Definitely prefer pshufd for this in general. But you could use movhlps if you're tuning for a specific CPU like Core2 where movhlps is fast and runs in the integer domain, and pshufd is slow.
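
To make the movhlps dependency concrete, here is a small sketch using the `_mm_movehl_ps` intrinsic (`movhlps_like` and `demo_movhlps` are hypothetical names). The high half of the result comes from the old destination value, which is why the instruction has to wait for xmm1.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_movehl_ps and the casts */
#include <stdint.h>

/* Models movhlps dst, src:
   result low qword  = high qword of src
   result high qword = high qword of dst (passed through unchanged) */
__m128i movhlps_like(__m128i dst, __m128i src)
{
    __m128 r = _mm_movehl_ps(_mm_castsi128_ps(dst), _mm_castsi128_ps(src));
    return _mm_castps_si128(r);
}

void demo_movhlps(void)
{
    __m128i dst = _mm_set_epi64x(0x0a0b0c0d0e0f1011LL, 0x1213141516171819LL);
    __m128i src = _mm_set_epi64x(0x1122334455667788LL, 0x0123456789abcdefLL);
    __m128i r = movhlps_like(dst, src);

    /* The low qword is the high half of src, ready for movq. */
    assert((uint64_t)_mm_cvtsi128_si64(r) == 0x1122334455667788ULL);

    /* The high qword still holds the old high half of dst. */
    __m128i hi = _mm_unpackhi_epi64(r, r);
    assert((uint64_t)_mm_cvtsi128_si64(hi) == 0x0a0b0c0d0e0f1011ULL);
}
```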