RISCV branchless coding


On Intel AVX, there is a possibility of branchless code. Instead of branching for case 0 or case 1, you can compute both cases and blend the results based on a condition.

AVX does this 8-wide for float using the vblendps / vblendvps instructions (the latter takes the blend mask from a register, which is what you need for a runtime condition).
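
For reference, a minimal intrinsics sketch of that idea (the function name is mine; _mm256_blendv_ps selects per lane based on the sign bit of the mask):

#include <immintrin.h>

// per-lane select: lanes where cond has its sign bit set take x, the rest take y
__m256 blend_select(__m256 cond, __m256 x, __m256 y)
{
    return _mm256_blendv_ps(y, x, cond);
}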

You can also do this in a scalar way, without a vector, using the x86 instruction CMOVcc which performs a move operation, conditionally.

NOTE: ARM has CSEL and NEON has VBSL.

Can RISCV64 do a scalar move like this, so that you do not have to branch for

a = c ? x : y;

As I understand it, RISC-V implementations are in-order, so they would benefit even more than x86 from not having to branch. (The latter can at least shuffle some instructions around, and even speculate past branches, to hide latency.)

The closest thing I can find w.r.t. branchless operation for RISC-V is SLT (Set Less Than), but that sets the destination to 1 or 0, which would then need multiplications? Wouldn't it be more useful to have SLT set it to -1 or 0 instead, so that we can AND with it?

UPDATE

When doing:

int foo(int a, int b, int x, int y)
{
    return a < b ? x : y;
}

I tried a poor-man's version of branchless selection using SLT. I am not sure I did it completely right; using 0 - condition(0|1) as the bitmask, I came up with:

branchless:
    SLT t0,a0,a1      # t0 = (a < b) ? 1 : 0
    SUB t0,zero,t0    # t0 = (a < b) ? -1 : 0  (all-ones bitmask when a < b)
    NOT t1,t0         # t1 = inverted bitmask
    AND t0,a2,t0      # t0 = (a < b) ? x : 0
    AND t1,a3,t1      # t1 = (a < b) ? 0 : y
    OR  a0,t0,t1      # a0 = (a < b) ? x : y
    RET
    .size   branchless, .-branchless

as the branchless version of:

branched:
    BGE a0,a1,.L2     # if (a >= b), keep y (already in a3)
    MV  a3,a2         # a < b: take x instead
.L2:
    MV  a0,a3         # return x or y
    RET
    .size   branched, .-branched

I wonder if I used too many instructions for this. On random data I measured the branching version to be slightly faster than the non-branching one, but not by much.


There are 2 best solutions below

Answer by Peter Cordes (accepted):

Update: see sh1's answer for the current situation: there's a conditional-zero instruction, like a cmov with one source hard-wired to x0. The full cmov was dropped from the plans before extension B made it to v1.0 (and extension B was split into separate sub-extensions). An article has some details and links on the situation as of mid-2023.

Current compilers no longer support b as a single-letter extension name either.


The proposed RISC-V extension B includes cmov (with 4 operands: 3 inputs and a separate destination!). (Version 0.93 was current when the rest of this answer was written.)

I think David Patterson (one of the lead architects behind Berkeley RISC and RISC-V) really dislikes cmov (along with short-vector SIMD like SSE/AVX) and thinks CPUs should specially handle "hammock" branches (that jump forward over a single instruction like a move) if they want to do that sort of thing. So this seems to be a case of philosophical purity getting in the way of including useful instructions. (AArch64 is a much more pragmatic design, still being RISC in the ways that matter for a high-performance implementation.)

And/or perhaps a desire to limit instructions to at most 2 inputs, if there aren't any other 3-input instructions. That means a scalar pipeline only needs 2 register read ports, not 3, if it strictly follows this restriction. (That also means no add-with-carry, making extended-precision math quite a pain for numbers wider than 2 registers, when you have to deal with carry-in and carry-out to the same add operation.)
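
For illustration, the easy 2-register case on RV64 looks like this (registers chosen by me); chaining more limbs is where it gets painful, because each middle limb needs both a carry-in and a carry-out around the same add:

# 128-bit add on RV64: (a1:a0) += (a3:a2)
add  a0, a0, a2      # add the low halves
sltu t0, a0, a2      # carry-out: the sum wrapped around iff it is below an addend
add  a1, a1, a3      # add the high halves
add  a1, a1, t0      # propagate the carry into the high half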

You can emulate cmov, as you say, with a mask for AND/ANDnot/OR, but that takes quite a few instructions and is usually not worth it, except possibly on wide and deep out-of-order machines where the amount of work discarded by a branch miss is a lot bigger. (mask = (c == 0) - 1, which you can compute with sltiu / addi reg,reg,-1 to turn 0 into -1 and 1 into 0.)
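
A sketch of that emulation, assuming c in a0, x in a1, y in a2 (about 6 instructions where a cmov would be 1):

sltiu t0, a0, 1      # t0 = (c == 0) ? 1 : 0
addi  t0, t0, -1     # t0 = all-ones if c is non-zero, else 0
and   t1, a1, t0     # t1 = c ? x : 0
not   t0, t0         # invert the mask
and   t0, a2, t0     # t0 = c ? 0 : y
or    a0, t1, t0     # a0 = c ? x : y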

You kind of have it backwards in terms of which kind of microarchitecture benefits more from CMOV, although there are potential benefits either way. An in-order machine already more or less has to wait at a conditional branch for the condition to resolve, whereas an out-of-order machine treats control dependencies very differently from data dependencies. As discussed in gcc optimization flag -O3 makes code slower than -O2, data dependencies through cmov can create a loop-carried dependency chain that's a bigger bottleneck than highly predictable branches.
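
A made-up example of that pattern: a branchless min-reduction, where every iteration's result feeds the next select, so with cmov (or czero) the selects form a serial dependency chain, while a branchy version only has to be predicted correctly to run ahead:

int min_elem(const int *a, int n)
{
    int best = a[0];
    for (int i = 1; i < n; i++)
        best = (a[i] < best) ? a[i] : best;   // select feeds the next iteration
    return best;
}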

There are some out-of-order exec RISC-V designs, maybe even some that are open-source. For example, Erik Eidt linked The Berkeley Out-of-Order Machine (BOOM).


Extension B: where they put all the fun instructions they left out

The RISC-V extension B proposal has a conditional move, along with scalar min/max, popcount, leading/trailing zero count, bitfield insert/extract, two-register shifts, and a bunch of more esoteric stuff. https://five-embeddev.com/riscv-bitmanip/draft/bext.html#conditional-move-cmov

Looking at the list of proposed instructions, it's amazing what got left out of baseline RISC-V, like sign-extension of narrow integers (currently requires slli/srai) if it's not already guaranteed by the calling convention or a load instruction, and standard stuff like popcount and leading/trailing zero count that most ISAs have.

Godbolt shows clang 12.0 using cmov, min, and sext.b. In that clang version, -O3 -Wall -menable-experimental-extensions -march=rv32gcb0p93 was the magic incantation to do that. Extension B 0.93 is enabled by the b0p93 part of the string. (Extension B isn't finalized, and I don't know what version clang 14.0 was looking for; its error message wasn't helpful, and just plain -march=rv32gcb didn't get the compiler to actually use cmov.)

//  -march=rv32gcb0p93 includes extension b 0.93 (0p93)

int sel(int x, int y, int c){
    return c ? x : y;
}
# extension B  clang
        cmov    a0, a2, a0, a1
        ret

# baseline gcc11.3  (clang and GCC12 waste several mv instructions)
        bne     a2,zero,.L2
        mv      a0,a1
.L2:
        ret

int min(int x, int y, int c){
    return (x<y) ? x : y;
}
# extension B  clang
        min     a0, a0, a1
        ret

# baseline gcc
        ble     a0,a1,.L5
        mv      a0,a1
.L5:
        ret

int sext(int c){
    return (signed char)c;
}
# extension B  clang
        sext.b  a0, a0
        ret

# baseline gcc
        slli    a0,a0,24
        srai    a0,a0,24
        ret
Answer by sh1:

OK, cmov didn't make it.

Right now you'll need to look at the Zicond extension to get the instructions czero.eqz and czero.nez. These return either the first input or zero, depending on whether or not the last input is zero.

For example:

int cmov(bool c, int x, int y) {
    return c ? x : y;
}

gives:

cmov(bool, int, int):                             # @cmov(bool, int, int)
        czero.nez       a2, a2, a0      # a2 = c ? 0 : y
        czero.eqz       a0, a1, a0      # a0 = c ? x : 0
        or      a0, a0, a2              # a0 = c ? x : y
        ret

Obviously this looks a lot better when one of the operands is a constant zero, which is fairly common. Or, if you're looking at something like c ? x : (x + y), that can be rewritten as x + (c ? 0 : y).
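
A sketch of that case, with the same argument order as above (c in a0, x in a1, y in a2); this is one plausible Zicond sequence, not necessarily what any given compiler emits:

int sel_add(bool c, int x, int y) {
    return c ? x : (x + y);    // same as x + (c ? 0 : y)
}

        czero.nez       a2, a2, a0      # a2 = c ? 0 : y
        addw            a0, a1, a2      # a0 = x + (c ? 0 : y)
        ret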

To enable this optimisation in clang right now requires: -menable-experimental-extensions -march=rv64gc_zicond1p0

Once everything is settled I suppose this will become: -march=rv64gc_zicond

If you already have a setting for -march=, just tack _zicond1p0 on the end of it.

In the SIMD space (-march=rv64gcv) you have __riscv_vmerge_*() intrinsics.

What did survive in the B extension family is min/max. You get access to these with -march=rv64gc_zbb, and aside from the obvious uses, you can sometimes refactor things to use them as masking operations.
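
For example, a clamp (my own example) can become two instructions with Zbb instead of two compare-and-branch sequences:

int clamp(int v, int lo, int hi) {
    v = (v < lo) ? lo : v;
    v = (v > hi) ? hi : v;
    return v;
}

# with -march=rv64gc_zbb, roughly:
        max     a0, a0, a1
        min     a0, a0, a2
        ret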