Branchless comparison in x86_64

559 Views Asked by At

Recently I took the course of discrete math. The professor tells us that branching is slower than branchless.

AFAIK modern CPUs use pipeline to increase the efficiency, so breaching in CPUs are essentially assuming a result of previous instructions, do the operations base on it until the previous instructions finished then compare the assumption and actual result.

If match then the result based on assumption are accepted, else the operations will be redone using the actual value previous produced.

My questions are:

  1. The real slow part is when assumption mismatch the product of previous, isn't it ?

1-1. If CPUs always guessed right, then it's actually faster right?

  1. The following is x == y in x86_64 asm:

2-1. Branching

XeqY1:
    xor     rax, rax ; clear rax
    cmp     rdi, rsi ; compare X and Y
    sete    al       ; set al = 1 if equal

2-2. Branchless

XeqY2:
    mov     rax, rdi ; 
    sub     rax, rsi ; 
    sub     rsi, rdi ; 
    xor     rax, rsi ; tmp = (X-Y)^(Y-X)
    not     rax      ; invert tmp
    shr     rax, 63  ; get sign bit

2-3. For above two, 2-1 only takes about 3 Cycle. 2-2 takes about 6 cycle, which is doubled of branching version. Does it really works faster? It's weird to me that 2-1 will be slower than 2-2 (according to professor).

EDIT 13:53

Thanks for the replies!!

I didn't know that SETcc and CMOVcc was branchless, what if change 2-1 to:

XeqY1:
    cmp     rdi, rsi
    jne     .L0
    mov     rax, 1
.L0:
    mov     rax, 0

Does this change make it non-branchless? If so, will it be faster or slower to 2-2?

0

There are 0 best solutions below