With Intel AVX, you can write branchless code: instead of branching between case0 and case1, you compute both cases and blend the results based on a condition.
AVX does this 8 ways at once for float using the vblendvps instruction (vblendps is the variant that takes its mask as an immediate).
You can also do this in a scalar way, without vectors, using the x86 instruction CMOVcc, which performs a move conditionally.
NOTE: ARM has CSEL and NEON has VBSL.
Can RISCV64 do a scalar conditional move like this, so that you do not have to branch for code such as the following?
a = c ? x : y;
As I understand it, RISCV implementations are in-order, so they would benefit even more than x86 from not having to branch. (x86 can at least reorder instructions, and even execute branches speculatively, to hide latency.)
The closest thing I can find for branchless operation on RISCV is SLT (Set Less Than), but that sets the destination to 1 or 0, which would then need multiplications. Wouldn't it be more useful for SLT to produce -1 or 0 instead, so that we can AND with it?
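For what it's worth, the 1-or-0 result does not need a multiplication: negating it gives the -1 or 0 mask asked about above. Here is a minimal C sketch of that idea. The helper name `select_mask` is made up for illustration, and it assumes the usual two's-complement wraparound when converting back to a signed type.

```c
#include <stdint.h>

/* Hypothetical helper: returns x when cond is nonzero, otherwise y,
   without a branch in the source. */
static int32_t select_mask(int cond, int32_t x, int32_t y)
{
    /* (cond != 0) is 1 or 0; unsigned negation turns 1 into all-ones
       (i.e. -1) and leaves 0 as 0. */
    uint32_t mask = -(uint32_t)(cond != 0);

    /* Keep x where the mask is set, y where it is clear. */
    return (int32_t)(((uint32_t)x & mask) | ((uint32_t)y & ~mask));
}
```

A compiler is free to turn this back into a branch, so the generated assembly is worth checking.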
UPDATE
When doing:
int foo(int a, int b, int x, int y)
{
return a < b ? x : y;
}
I tried a poor-man's version of branchless code using SLT. I am not sure I did it completely right. Using a bitmask computed as 0 - condition (where condition is 0 or 1), I came up with:
branchless:
SLT t0,a0,a1
SUB t0,zero,t0
NOT t1,t0
AND t0,a2,t0
AND t1,a3,t1
OR a0,t0,t1
RET
.size branchless, .-branchless
as the branchless version of:
branched:
BGE a0,a1,.L2
MV a3,a2
.L2:
MV a0,a3
RET
.size branched, .-branched
I wonder whether I used too many instructions for this. I measured the branching version to be slightly faster than the non-branching one on random data, but not by much.
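On the instruction count: one possible way to shave an operation off the six-instruction sequence above is the XOR form of the select, `y ^ ((x ^ y) & mask)`, which needs only the mask, one XOR, one AND, and a final XOR. The following is a sketch in C; `foo_xor` is a hypothetical name, and whether a given compiler emits exactly five instructions for it is an assumption that should be verified.

```c
#include <stdint.h>

/* Hypothetical variant of foo(): returns x if a < b, else y. */
static int32_t foo_xor(int32_t a, int32_t b, int32_t x, int32_t y)
{
    uint32_t mask = -(uint32_t)(a < b);          /* slt, then negate */
    uint32_t diff = (uint32_t)x ^ (uint32_t)y;   /* bits where x and y differ */

    /* mask all-ones: y ^ (x ^ y) == x.  mask zero: y ^ 0 == y. */
    return (int32_t)((uint32_t)y ^ (diff & mask));
}
```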
Update: see sh1's answer for the current situation: there's a conditional-zero instruction, like `cmov` from `x0`. The full `cmov` was dropped from the planned discussions before extension B made it to v1.0 (and extension B was split into some separate parts). An article has some details and links on the situation as of mid 2023.
Current compilers no longer support `b` as a single-letter extension name either.
The proposed RISC-V extension B includes `cmov` (with 4 operands: 3 inputs and a separate destination!). (Version 0.93 was current when the rest of this answer was written.)
I think David Patterson (one of the lead architects behind Berkeley RISC and RISC-V) really dislikes `cmov` (along with short-vector SIMD like SSE/AVX) and thinks CPUs should specially handle "hammock" branches (that jump forward over a single instruction like a move) if they want to do that. Something like that. So this seems to be a case of philosophical purity getting in the way of including useful instructions. (AArch64 is a much more pragmatic design, still being RISC in the ways that matter for a high-performance implementation.)
And/or perhaps there is a desire to limit instructions to at most 2 inputs, if there aren't any other 3-input instructions. That means a scalar pipeline only needs 2 register read ports, not 3, if it strictly follows this restriction. (That also means no add-with-carry, making extended-precision math quite a pain for numbers wider than 2 registers, when you have to deal with carry-in and carry-out to the same add operation.)
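To illustrate the add-with-carry point, here is a rough C sketch of multi-word addition where each carry has to be recovered with an unsigned compare (the kind of thing `sltu` does) instead of coming from a flag. The function name `add_limbs` and the little-endian limb layout are assumptions made for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: r = a + b over n 64-bit limbs, least significant
   limb first. Returns the final carry-out (0 or 1). */
static uint64_t add_limbs(uint64_t *r, const uint64_t *a,
                          const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = a[i] + b[i];
        uint64_t c1 = s < a[i];     /* carry out of a + b */
        uint64_t t  = s + carry;
        uint64_t c2 = t < s;        /* carry out of adding the carry-in */
        r[i] = t;
        carry = c1 | c2;            /* both cannot be set at once */
    }
    return carry;
}
```

Every limb past the first costs two adds, two compares, and an OR, compared with a single add-with-carry on ISAs that have one.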
You can emulate `cmov` as you say with a mask for AND/ANDnot/OR, but that would take quite a few instructions and is usually not worth it except possibly on wide and deep out-of-order machines, where the amount of work discarded by a branch miss is a lot bigger. (`mask = (c == 0) - 1;` which you can do with `sltiu` / `addi reg, reg, -1` to turn 0 into -1 and 1 into 0.)
You kind of have it backwards in terms of which kind of microarchitecture benefits more from CMOV, although there are potential benefits either way. An in-order machine already kind of has to wait at a conditional branch for the condition to resolve, vs. an out-of-order machine treating control dependencies very differently from data dependencies. As discussed in "gcc optimization flag -O3 makes code slower than -O2", data dependencies through `cmov` can create a loop-carried dependency chain that's a bigger bottleneck than highly predictable branches.
There are some out-of-order exec RISC-V designs, maybe even some that are open-source. For example, Erik Eidt linked The Berkeley Out-of-Order Machine (BOOM).
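For comparison with the mask emulation, the conditional-zero approach mentioned in the update at the top of this answer needs only two conditional-zero operations and an OR. The C below is just a model of the semantics. The names `czero.eqz` and `czero.nez` are my recollection of the ratified Zicond extension rather than something stated above, so treat them as an assumption and check the spec.

```c
#include <stdint.h>

/* Model of a "zero if condition equals zero" instruction
   (assumed to correspond to czero.eqz). */
static int64_t czero_eqz(int64_t val, int64_t cond)
{
    return cond == 0 ? 0 : val;
}

/* Model of a "zero if condition is nonzero" instruction
   (assumed to correspond to czero.nez). */
static int64_t czero_nez(int64_t val, int64_t cond)
{
    return cond != 0 ? 0 : val;
}

/* c ? x : y built from the two models plus an OR. */
static int64_t select_czero(int64_t c, int64_t x, int64_t y)
{
    int64_t t0 = czero_eqz(x, c);   /* x if c != 0, else 0 */
    int64_t t1 = czero_nez(y, c);   /* y if c == 0, else 0 */
    return t0 | t1;                 /* exactly one side is nonzero-capable */
}
```

Each of these operations has only two register inputs, which fits the 2-read-port restriction discussed above.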
Extension B: where they put all the fun instructions they left out
The RISC-V extension B proposal has a conditional move, along with scalar min/max, popcount, leading/trailing zero count, bitfield insert/extract, two-register shifts, and a bunch of more esoteric stuff. https://five-embeddev.com/riscv-bitmanip/draft/bext.html#conditional-move-cmov
Looking at the list of proposed instructions, it's amazing what got left out of baseline RISC-V, like sign-extension of narrow integers (currently requires slli/srai) if it's not already guaranteed by the calling convention or a load instruction, and standard stuff like popcount and leading/trailing zero count that most ISAs have.
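As a concrete picture of the slli/srai idiom, sign-extending the low 8 bits of a 32-bit value looks roughly like this in C. `sext8` is a made-up name, and the code assumes that right-shifting a negative signed value is an arithmetic shift, which C leaves implementation-defined but mainstream compilers provide.

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend the low 8 bits of v to 32 bits. */
static int32_t sext8(uint32_t v)
{
    /* Shift the byte up to the top (like slli by 24), then shift it back
       down with sign fill (like srai by 24). */
    return (int32_t)(v << 24) >> 24;
}
```

With the proposed `sext.b`, the same operation would be a single instruction.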
Godbolt shows clang 12.0 using `cmov`, `min`, and `sext.b`. In that clang version, `-O3 -Wall -menable-experimental-extensions -march=rv32gcb0p93` was the magic incantation to do that. Extension B 0.93 is enabled by the `b0p93` part of the string. (Extension B isn't finalized, and I don't know what version clang 14.0 was looking for; its error message wasn't helpful, and just plain `-march=rv32gcb` didn't get the compiler to actually use `cmov`.)