I'm currently working in C/C++, and I have a `uint64_t`. I need to do a bitwise rotation on the top 32 bits and the bottom 32 bits separately. So, for example, if my input is
```
0000 0000 0000 0000 0000 0000 0000 1101 0000 0000 0000 0000 0000 0000 0000 0111
```
and I need to rotate 2 bits to the right, the proper output is
```
0100 0000 0000 0000 0000 0000 0000 0011 1100 0000 0000 0000 0000 0000 0000 0001
```
The obvious method is to make a temporary 32-bit number and do the rotation operations on that separately, but is there a different, efficient way of doing this?
The canonical way to do a rotate when your language only offers shift instructions is by combining the results of two shifts. For example, to perform a rotation by 2 to the right, you can use:
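For a plain 32-bit value, the combined-shift idiom looks like this (a minimal sketch; the function name is mine):

```c
#include <stdint.h>

/* Rotate a 32-bit value right by 2: the low 2 bits wrap around to the top. */
uint32_t rotr32_2(uint32_t x)
{
    return (x >> 2) | (x << 30);
}
```

If the rotate amount were a variable `n`, the usual way to write this without undefined behavior at `n == 0` is `(x >> n) | (x << (-n & 31))`.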
Many compilers will recognize this idiom as a rotate and will issue an actual rotate machine instruction (`ror` on x86) if the underlying platform supports it.

You can extend the idea in a straightforward way to do your two-32-bit-rotates-within-a-64-bit-word SWAR operation, using masking to avoid contamination between the two 32-bit halves¹.
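Concretely, the masked SWAR version might look like this (a sketch assuming a fixed rotate amount of 2; the function and variable names are mine):

```c
#include <stdint.h>

/* Rotate each 32-bit half of a 64-bit value right by 2, independently.
 * A plain 64-bit shift lets bits cross between the halves, so each shift
 * result is masked before the two are combined. */
uint64_t rotr2_both(uint64_t x)
{
    /* Low 30 bits of each half, shifted into place. */
    uint64_t shifted = (x >> 2) & 0x3FFFFFFF3FFFFFFFULL;
    /* The 2 wrapped-around bits at the top of each half. */
    uint64_t wrapped = (x << 30) & 0xC0000000C0000000ULL;
    return shifted | wrapped;
}
```

On the question's example input, `rotr2_both(0x0000000D00000007)` yields `0x40000003C0000001`, matching the desired output above.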
Of course the compiler isn't going to be able to recognize this and use a rotate since CPUs don't offer an instruction that does this, so this produces the following reasonable assembly:
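The original listing isn't reproduced here; as an illustration only, code of roughly this shape (a hand-written sketch of what such compilers emit, not verbatim output from any particular compiler) matches the analysis that follows:

```asm
rotr2_both:
        movabs  rcx, 0x3FFFFFFF3FFFFFFF   ; mask for the shifted parts
        movabs  rdx, 0xC0000000C0000000   ; mask for the wrapped bits
        mov     rax, rdi
        shr     rax, 2                    ; shift both halves right together
        shl     rdi, 30                   ; move the wrap bits into position
        and     rax, rcx                  ; discard cross-half contamination
        and     rdi, rdx                  ; keep only the wrap bits
        or      rax, rdi                  ; combine the two masked results
        ret
```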
I've optimized this for latency, and with perfect scheduling it could take as little as 3 cycles on modern x86. It has 4 different critical paths: two each of `shift -> and -> or` and `movabs -> and -> or`. In a loop, the constant loads could be hoisted, but the latency is still 3 (since the other critical paths remain). The total uop count (excluding `ret`) is 8, and the throughput on modern x86 could be as good as 2 cycles/iteration, because all the instructions can issue across many execution units.

The results don't really depend on the compiler - I checked `icc`, `gcc` and `clang`, and they all generate essentially identical code. This approach generalizes well to similar operations on other subword sizes (e.g., rotating all the 16-bit words in a 64-bit value). It doesn't work as well if you want to use different shift amounts for each sub-word, though (but based on your example, it doesn't seem like you do).

Let's compare it to the union-based approach suggested by maxihatop. I modified that code slightly to rotate right rather than left, and to fix the rotate amount at 2:
How does it look when compiled to assembly on x86? Well, now the results are really compiler-dependent. Sticking with gcc, we get (comments mine):
GCC has recognized the operation as a rotate and has issued two `ror` instructions to rotate each half. Unfortunately, it also takes ten additional instructions just to isolate each half of the union in preparation for the `ror`, and to move the results back into the right position.

Furthermore, it has unnecessarily made the bottom and top rotates dependent on each other². Overall this results in an 8-cycle dependency chain, by my count - so much slower, latency-wise, than the solution above. I count 12 uops total; in a loop, this may execute, at best, at 3 cycles/iteration.
clang 3.9 is a bit more intelligent. Here's what it produces:
Like gcc, it's using `ror` for the lower DWORD, but it uses a mix of shifts for the upper DWORD, and it is smarter about combining the results and keeping the computations independent. It's still doing some dumb stuff (what's up with the slow `lea` versus a simple-and-fast `or`?). The critical path (for the upper DWORD) is 5 cycles, and I counted 9 uops.

On the other hand, icc 17 produces pretty bad code too:
For some reason it is using two `shld reg, reg, i` instructions with both regs the same, which is really just a `rol`. Not sure why - the `shld` instructions have mostly always been slower than, or occasionally tied with, `ror`. On Haswell and Skylake they have a latency of 3 and can issue on only one port, while `ror` has a latency of 1 and can issue on two ports. There was a brief time around Sandy Bridge where `shld` was potentially better - it could issue with latency 1 on two ports, versus one port for `ror`. So maybe that's it. Let's try with `-mtune=haswell`:

Yup, that was it. So the Intel code isn't too bad - a critical path of 6 by my count, and 10 uops.
My best effort using `ror` by hand is as follows:

It's straightforward - use two `ror` instructions to rotate the top and bottom DWORDs, plus two shifts to isolate the top DWORD in `eax` and move it back, and an `or` to combine them. The latency is actually worse than the shift+mask solution, at 4 cycles, but it has only 7 uops, 1 fewer than shift-and-mask.

You could also try to combine the approaches, e.g., using shift+mask for the top DWORD and `ror` for the bottom one, but I didn't come up with anything better than the above, mostly because doing the shift+mask approach for only the top DWORD isn't much faster than doing it for the whole thing.

In summary, assuming you aren't actually going to write assembly, the original C shift+mask approach I showed above has the shortest latency and the smallest uop count (outside of the hand-rolled assembly), and should do well across various compilers, even without `ror` detection. It doesn't depend on the quality of a compiler's support for optimizing union access, which, as we can see above, varies wildly.

Much of the assembly-level analysis was x86-centric, but most of it would apply to other platforms as well, with minor differences depending on the speed of loading large constants, the ability to access 32-bit sub-registers of a 64-bit register, and so on.
¹ Here and everywhere in this answer I'm making the implicit assumption that the amount to rotate both halves is the same. That's consistent with the OP's example. If the amounts can be different, some of the solutions change.
² In particular, the `or rdi, rax` which inserts the bottom DWORD result makes the remainder of the function, which processes the high DWORD, depend on the first half. The `or` is in fact pretty pointless, since there is already a final `or rax, rdx` to combine the results. It would have been easy to keep the results independent and then combine them at the end - many of the masking and combining operations that gcc emits are essentially redundant.