I found 2 branchless functions that find the maximum of two numbers in python, and compared them to an if statement and the built-in max function. I thought the branchless or the built-in functions would be the fastest, but the fastest was the if-statement function by a large margin. Does anybody know why this is? Here are the functions:
If-statement (2.16 seconds for 25000 operations):
def max1(a, b):
if a > b:
return a
return b
Built-in (4.69 seconds for 25000 operations):
def max2(a, b):
return max(a, b)
Branchless 1 (4.12 seconds for 25000 operations):
def max3(a, b):
return (a > b) * a + (a <= b) * b
Branchless 2 (5.34 seconds for 25000 operations):
def max4(a, b):
diff = a - b
return a - (diff & diff >> 31)
Your expectations about branching vs. branchless code apply to low-level languages like assembly and C. Branchless code can be faster in low-level languages because it prevents slowdowns caused by branch prediction misses. (Note: this means branchless code can be faster, but it will not necessarily be.)
Python is a high-level language. Assuming you are using the CPython interpreter: for every bytecode instruction you execute, the interpreter has to branch on the kind of opcode, and typically many other things. For example, even the simple
<operator requires a branch to check for the<opcode, another branch to check whether the object's class implements a__lt__method, more branches to check whether the right-hand-side value is of a valid type for the comparison to be performed, and probably several other branches. Even your so-called "branchless" code will in practice result in a lot of branching for these reasons.Because Python is so high-level, each bytecode instruction is actually doing quite a lot of work compared to a single machine-code instruction. So the performance of simple code like this will mainly depend on how many bytecode instructions have to be interpreted:
max1function has to do three loads of local variables, a comparison, a conditional jump and a return. That's six.max2function does two loads of local variables, one load of a global variable (referencing the built-inmax), and also makes a function call; that requires passing arguments, and is relatively expensive compared to other bytecode instructions. On top of that, the built-in function itself has to do the same work as your ownmax1function, so no wondermax2is slower.max3function does six loads of local variables, two comparisons, two multiplications, one addition, and one return. That's twelve instructions, so we should expect it to take about twice as long asmax1.max4does five loads of local variables, one store to a local variable, one load of a constant, two subtractions, one bitshift, one bitwise "and", and one return. That's twelve instructions again.That said, note that if we compare your
max1with the built-in functionmaxdirectly, instead of yourmax2which has an extra function call, yourmax1function is still a bit faster than the built-inmax. This is probably because the built-inmaxaccepts a variable number of arguments, which may involve building a tuple of arguments, and the built-inmaxfunction also has a branch to check if it was called with a single iterable argument (e.g.max([3, 1, 4, 2])), and handle that case differently; yourmax1function doesn't do those things.