It's well known that we can use the CMOV instruction to write branchless code, but I was wondering if I'm writing the equivalent of x = cond ? 1 : 2, should I prefer
CMOVE rax, 1 #1a
CMOVNE rax, 2 #1b
or
MOV rax, 1 #2a
CMOVNE rax, 2 #2b
In theory the first one can execute pretty much in parallel while the second one is slower because of a data dependency. But I'm not sure how it is in reality.
The second one seems better.
First, note that
CMOVccdoes not have an immediate form; the second operand must also be a register (or memory). And soCMOVcc rax, rbxactually has three input dependencies;rax,rbx, andflags.raxis an input dependency because ifccis false, then the output value inraxmust equal the value it had before. So the instruction will always stall until all three of the inputs are ready.You might imagine a "conditional dependency" where the instruction need not stall on
raxifccis true, but I don't believe any currently existing machine can actually do that. Generally, a conditional move is treated as an arithmetic/logic instruction, kind of similar toadc rax, rbx: it computes some function ofrax,rbxand the flags, and leaves the result inrax. You could think of that function as something likerax = (~mask & rax) | (mask & rbx).(This is one of the main disadvantages of branchless code: it always has to wait for both results to be ready. A branch may seem worse, but if it's correctly predicted, then it only waits on the result that is actually needed. Linus Torvalds wrote a famous rant about this.)
So the first example would look more completely like
(I know we should be doing it all with 32-bit registers to save the REX prefixes, but it's just an example.)
And we can now see several problems.
The
cmovs must wait forrbxandrcxto be ready, but that's probably not an issue; the immediatemovs have no input dependency at all, so they could have been executed out-of-order long ago.More seriously, the second
cmovhas an input dependency on the first one's output, viarax. So the twocmovs must actually execute serially and not in parallel.And worse, the first
cmovhas a false dependency on whatever the previous value ofraxwas. If some earlier code is doing something slow withrax, this code will stall until that other code is done, even though the value left inraxby the earlier code should be totally irrelevant to this snippet.The second alternative would look like:
This avoids most of those issues. Now the only real input dependency is the flags produced by the
cmp(and thus on whatever the inputs tocmpare). As before, the immediatemovs toraxandrbxhave no input dependencies and can be done in advance. And there is no false dependency on the previous value ofraxbecausemov rax, 1definitively wipes it out. As a final bonus, this version is one instruction shorter.