For 64-bit registers, there is the CMOVcc A, B instruction, which writes B to A only if condition cc is satisfied:
; Do rax <- rdx iff rcx == 0
test rcx, rcx
cmove rax, rdx
However, I wasn't able to find anything equivalent for AVX. I still want to move depending on the value of RFLAGS, just with larger operands:
; Do ymm1 <- ymm2 iff rcx == 0
test rcx, rcx
cmove ymm1, ymm2 ; (invalid)
Is there an AVX equivalent for cmov? If not, how can I achieve this operation in a branchless way?
Given this branchy code (which will be efficient if the condition predicts well):
We can do it branchlessly by creating a 0 / -1 vector based on the compare condition, and blending on it. Some optimizations vs. the other answer:
vmovd/q xmm, reg can only run on a single execution port on Intel: port 5, the same one needed by vector shuffles like vpbroadcastq ymm, xmm.
As well as saving 1 total instruction, it makes some of them cheaper (less competition for the same execution port, e.g. scalar xor isn't SIMD at all) and off the critical path (xor-zeroing). And in a loop, you can prepare a zeroed vector outside the loop.
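As a rough scalar model of the mask-and-blend idea (plain C, not the vector code itself), the sketch below builds a 0 / -1 mask from an equality test and uses it to select between two 4-qword arrays standing in for ymm1 and ymm2. The helper names `mask_if_equal` and `blend4` are hypothetical, and the xor-then-compare-against-zero step is an assumption about how the mask is produced.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: all-ones if c == d, else 0.
   Assumed to mirror "xor the inputs, then compare against zero". */
static uint64_t mask_if_equal(uint64_t c, uint64_t d) {
    uint64_t diff = c ^ d;                 /* 0 iff c == d */
    return (uint64_t)0 - (uint64_t)(diff == 0);
}

/* Hypothetical helper: dst[i] = mask ? src[i] : dst[i] for 4 qwords,
   i.e. one 256-bit vector with the same mask broadcast to every lane. */
static void blend4(uint64_t dst[4], const uint64_t src[4], uint64_t mask) {
    for (size_t i = 0; i < 4; i++)
        dst[i] = (src[i] & mask) | (dst[i] & ~mask);
}
```

Calling `blend4(a, b, mask_if_equal(x, y))` copies `b` over `a` only when `x == y`, with no branch on the condition itself.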
Destroying the old RCX means you might need a mov, but this is still worth it.
A condition like rcx > rdx (unsigned) could be done with cmp rdx, rcx / sbb rax, rax to materialize a 0 / -1 integer (which you can broadcast without needing vpcmpeqq).
A signed-greater-than condition is more of a pain; you might end up wanting 2x vmovq for vpcmpgtq, instead of cmp / setg / vmovd / vpbroadcastb. Especially if you don't have a convenient register to setg into to avoid a possible false dependency. setg al / read EAX isn't a problem for partial register stalls: CPUs new enough to have AVX2 don't rename AL separately from the rest of RAX. (Only Intel ever did that, and doesn't in Haswell.) So anyway, you could just setcc into the low byte of one of your cmp inputs.
Note that vblendvps and vblendvpd only care about the high bit of each dword or qword element. If you have two correctly sign-extended integers, and subtracting them won't overflow, c - d will be directly usable as your blend control, just broadcast that. FP blends between integer SIMD instructions like vpaddd have an extra 1 cycle of bypass latency on input and output, on Intel CPUs with AVX2 (and maybe similar on AMD), but the instruction you save will also have latency.
With unsigned 32-bit numbers, you're likely to have them already zero-extended to 64-bit in integer regs. In that case, sub rcx, rdx could set the MSB of RCX identically to how cmp ecx, edx would set CF. (And remember that the FLAGS condition for jb / cmovb is CF == 1.)
But if your inputs are already 64-bit, and you don't know that their range is limited, you'd need a 65-bit result to fully capture a 64-bit subtraction result.
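To make the carry and sign-bit tricks concrete, here is a minimal scalar sketch in C. It only models what the instructions compute; the function names (`sbb_mask`, `blend_on_msb`, `sub_control`) are hypothetical and chosen for illustration.

```c
#include <stdint.h>

/* Models "cmp a, b / sbb r, r": the result is all-ones iff the compare
   sets CF, which happens exactly when a < b as unsigned integers. */
static uint64_t sbb_mask(uint64_t a, uint64_t b) {
    return (uint64_t)0 - (uint64_t)(a < b);
}

/* Models one qword lane of a variable blend that inspects only the
   most significant bit of the control element. */
static uint64_t blend_on_msb(uint64_t keep, uint64_t take, uint64_t ctrl) {
    return (ctrl >> 63) ? take : keep;
}

/* For 32-bit unsigned values zero-extended to 64 bits, the 64-bit
   difference has its MSB set exactly when c < d, so it can serve
   directly as a blend control. */
static uint64_t sub_control(uint32_t c, uint32_t d) {
    return (uint64_t)c - (uint64_t)d;
}
```

For example, `blend_on_msb(x, y, sub_control(c, d))` yields `y` when `c < d` and `x` otherwise. The same shortcut is not safe for full-range 64-bit inputs, because the borrow would land in a 65th bit.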
That's why the condition for jl is SF != OF, not just a-b < 0, because a-b is done with truncating math. And the condition for jb is CF == 1 (instead of the MSB).
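A small C sketch can show why the sign of the truncated difference is not enough. It recomputes SF and OF by hand for a 64-bit subtraction; the helper names are hypothetical, and the OF formula is the usual rule for subtraction (the operands have different signs and the result's sign differs from the first operand's).

```c
#include <stdbool.h>
#include <stdint.h>

/* SF: the sign bit of the wrapped (truncated) 64-bit difference a - b. */
static bool sf_of_sub(int64_t a, int64_t b) {
    uint64_t diff = (uint64_t)a - (uint64_t)b;
    return (diff >> 63) != 0;
}

/* OF: signed overflow of a - b. */
static bool of_of_sub(int64_t a, int64_t b) {
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t diff = ua - ub;
    return (((ua ^ ub) & (ua ^ diff)) >> 63) != 0;
}

/* The jl condition: SF != OF. */
static bool less_than_via_flags(int64_t a, int64_t b) {
    return sf_of_sub(a, b) != of_of_sub(a, b);
}
```

With `a = INT64_MIN` and `b = 1`, the wrapped difference is `INT64_MAX`, so SF alone reports "not less", while SF != OF correctly reports that `a < b`.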