I'm currently working in C/C++, and I have a `uint64_t`. I need to do a bitwise rotation on the top 32 bits and the bottom 32 bits separately. So, for example, if my input is
```
0000 0000 0000 0000 0000 0000 0000 1101 0000 0000 0000 0000 0000 0000 0000 0111
```
and I need to rotate 2 bits to the right, the proper output is
```
0100 0000 0000 0000 0000 0000 0000 0011 1100 0000 0000 0000 0000 0000 0000 0001
```
The obvious method is to make a temporary 32-bit number and do the rotation operations on that separately, but is there a different, efficient way of doing this?
The canonical way to do a rotate when your language only offers shift instructions is by combining the results of two shifts. For example, to perform a rotation by 2 to the right, you can use:
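For a plain 32-bit value, the combined-shift idiom looks like this (a minimal sketch; the function name is mine):

```c
#include <stdint.h>

/* Rotate a 32-bit value right by 2: the low 2 bits wrap around to the top. */
uint32_t rotr32_2(uint32_t x)
{
    return (x >> 2) | (x << 30);
}
```

If the rotate amount were a variable `n`, the usual way to write this without undefined behavior at `n == 0` is `(x >> n) | (x << (-n & 31))`.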
Many compilers will recognize this idiom as a rotate and will issue an actual rotate machine instruction (`ror` on x86) if the underlying platform supports it.

You can extend the idea in a straightforward way to do your two-32-bit-rotates-within-a-64-bit-word SWAR operation, using masking to avoid contamination between the two 32-bit halves¹.
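Concretely, the masked SWAR version might look like this (a sketch assuming a fixed rotate amount of 2; the function and variable names are mine):

```c
#include <stdint.h>

/* Rotate each 32-bit half of a 64-bit value right by 2, independently.
 * A plain 64-bit shift lets bits cross between the halves, so each shift
 * result is masked before the two are combined. */
uint64_t rotr2_both(uint64_t x)
{
    /* Low 30 bits of each half, shifted into place. */
    uint64_t shifted = (x >> 2) & 0x3FFFFFFF3FFFFFFFULL;
    /* The 2 wrapped-around bits at the top of each half. */
    uint64_t wrapped = (x << 30) & 0xC0000000C0000000ULL;
    return shifted | wrapped;
}
```

On the question's example input, `rotr2_both(0x0000000D00000007)` yields `0x40000003C0000001`, matching the desired output above.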
Of course the compiler isn't going to be able to recognize this and use a rotate since CPUs don't offer an instruction that does this, so this produces the following reasonable assembly:
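The original listing isn't reproduced here; as an illustration only, code of roughly this shape (a hand-written sketch of what such compilers emit, not verbatim output from any particular compiler) matches the analysis that follows:

```asm
rotr2_both:
        movabs  rcx, 0x3FFFFFFF3FFFFFFF   ; mask for the shifted parts
        movabs  rdx, 0xC0000000C0000000   ; mask for the wrapped bits
        mov     rax, rdi
        shr     rax, 2                    ; shift both halves right together
        shl     rdi, 30                   ; move the wrap bits into position
        and     rax, rcx                  ; discard cross-half contamination
        and     rdi, rdx                  ; keep only the wrap bits
        or      rax, rdi                  ; combine the two masked results
        ret
```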
I've optimized this for latency, and with perfect scheduling it could take as little as 3 cycles on modern x86. It has 4 different critical paths: two each of `shift -> and -> or` and `movabs -> and -> or`. In a loop, the constant loads could be hoisted, but the latency is still 3 (since the other critical paths remain). The total uop count (excluding `ret`) is 8, and the throughput on modern x86 could be as good as 2 cycles/iteration, because all the instructions can issue across many execution units.

The results don't really depend on the compiler - I checked `icc`, `gcc` and `clang`, and they all generate essentially identical code. This approach generalizes well to similar operations on other subword sizes (e.g., rotating all the 16-bit words in a 64-bit value). It doesn't work as well if you want to use different shift amounts for each sub-word, though (but based on your example, it doesn't seem like you do).

Let's compare it to the union-based approach suggested by maxihatop. I modified that code slightly to rotate right rather than left, and to fix the rotate amount at 2:
How does it look when compiled to assembly on x86? Well, now the results are really compiler-dependent. Sticking with gcc, we get (comments mine):
GCC has recognized the operation as a rotate and has issued two `ror` instructions to rotate each half. Unfortunately, it also takes ten additional instructions just to isolate each half of the union in preparation for the `ror`, and to move the results back into the right position.

Furthermore, it has unnecessarily made the bottom and top rotates dependent on each other². Overall this results in an 8-cycle dependency chain, by my count - so much slower, latency-wise, than the solution above. I count 12 uops total; in a loop, this may execute, at best, at 3 cycles/iteration.
clang 3.9 is a bit more intelligent. Here's what it produces:
Like gcc, it's using `ror` for the lower DWORD, but it uses a mix of shifts for the upper DWORD, and it is smarter about combining the results and keeping the computations independent. It's still doing some dumb stuff (what's up with the slow `lea` versus a simple-and-fast `or`?). The critical path (for the upper DWORD) is 5 cycles, and I counted 9 uops.

On the other hand, icc 17 produces pretty bad code too:
For some reason it is using two `shld reg, reg, i` instructions with both regs the same, which is really just a `rol`. Not sure why - the `shld` instructions have mostly always been slower than, or occasionally tied with, `ror`. On Haswell and Skylake they have a latency of 3 and can issue on only one port, while `ror` has a latency of 1 and can issue on two ports. There was a brief time around Sandy Bridge where `shld` was potentially better - it could issue with latency 1 on two ports, versus one port for `ror`. So maybe that's it. Let's try with `-mtune=haswell`:

Yup, that was it. So the Intel code isn't too bad - a critical path of 6 by my count, and 10 uops.
My best effort using `ror` by hand is as follows:

It's straightforward - use two `ror` instructions to rotate the top and bottom DWORDs, plus two shifts to isolate the top DWORD in `eax` and move it back, and an `or` to combine them. The latency is actually worse than the shift+mask solution, at 4 cycles, but it has only 7 uops, 1 fewer than shift-and-mask.

You could also try to combine the approaches, e.g., using shift+mask for the top DWORD and `ror` for the bottom one, but I didn't come up with anything better than the above, mostly because doing the shift+mask approach for only the top DWORD isn't much faster than doing it for the whole thing.

In summary, assuming you aren't actually going to write assembly, the original C shift+mask approach I showed above has the shortest latency and the smallest uop count (outside of the hand-rolled assembly), and should do well across various compilers, even without `ror` detection. It doesn't depend on the quality of a compiler's support for optimizing union access, which, as we can see above, varies wildly.

Much of the assembly-level analysis was x86-centric, but most of it would apply to other platforms as well, with minor differences depending on the speed of loading large constants, the ability to access 32-bit sub-registers of a 64-bit register, and so on.
¹ Here and everywhere in this answer I'm making the implicit assumption that the amount to rotate both halves is the same. That's consistent with the OP's example. If the amounts can be different, some of the solutions change.
² In particular, the `or rdi, rax` which inserts the bottom DWORD result makes the remainder of the function, which processes the high DWORD, depend on the first half. The `or` is in fact pretty pointless, since there is already a final `or rax, rdx` to combine the results. It would have been easy to keep the results independent and then combine them at the end - many of the masking and combining operations that gcc emits are essentially redundant.