I am struggling to understand how much optimizing the Python compiler actually does. I ran a few tests; for example, consider this really simple script:
def to_optimize():
    for i in range(10000):
        a = 1
        b = 2
        c = a + b
        print(c)
When I import this function from another script and disassemble it, I get:
>>> dis.dis(optimization.to_optimize)
  2           0 LOAD_GLOBAL              0 (range)
              2 LOAD_CONST               1 (10000)
              4 CALL_FUNCTION            1
              6 GET_ITER
        >>    8 FOR_ITER                28 (to 38)
             10 STORE_FAST               0 (i)

  3          12 LOAD_CONST               2 (1)
             14 STORE_FAST               1 (a)

  4          16 LOAD_CONST               3 (2)
             18 STORE_FAST               2 (b)

  5          20 LOAD_FAST                1 (a)
             22 LOAD_FAST                2 (b)
             24 BINARY_ADD
             26 STORE_FAST               3 (c)

  6          28 LOAD_GLOBAL              1 (print)
             30 LOAD_FAST                3 (c)
             32 CALL_FUNCTION            1
             34 POP_TOP
             36 JUMP_ABSOLUTE            8
        >>   38 LOAD_CONST               0 (None)
             40 RETURN_VALUE
As you can see, almost no optimization at all is applied. I think a piece of code like this would be very easy to optimize further. I know that Python compiles on the fly (at import time), so I do not expect the .pyc files created during a normal execution to be heavily optimized, but even when I pre-compile the script with python3 -m compileall the result is the same, and there is no flag that enables additional optimization (-O and -OO only strip assert statements and docstrings).
Is this really the maximum optimization achievable for a Python script? Is this a limitation of the language itself?
The default implementation of Python is CPython, which is a relatively slow interpreter. The compilation step is meant to produce bytecode so that subsequent interpreter runs are faster (the source does not have to be parsed again and again), not to produce optimized code. Python was designed mainly to execute glue code and scripts, clearly not numerically-intensive code written as pure-Python expressions, so it only optimizes very few things. Indeed, optimizations would slow down start-up time, and since Python is a very dynamic language, only very few basic optimizations are possible during bytecode generation. For example, an imported module (import is a dynamic operation) can redefine how objects are added (typically by defining __add__), producing different results.
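To make that dynamism concrete, here is a small sketch (the class name is made up for illustration): because + is dispatched through __add__ at run time, the compiler cannot assume in general that a + b means integer addition.

class Weird:
    """A type whose '+' deliberately does not behave like integer addition."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # '+' means "multiply" here, just to show that the meaning of the
        # operator is only decided at run time, not at compile time.
        return Weird(self.value * other.value)

a = Weird(1)
b = Weird(2)
print((a + b).value)  # 2 rather than 3: the compiler cannot fold this away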
That being said, AFAIK, this is not possible for the int type, so constant propagation on the integers in this case would be valid; it is simply a missed optimization. Still, the range and print names can be rebound to user-specified functions, so the loop cannot be optimized out by any compile-time method.

There are other implementations of Python meant to optimize the code at runtime, typically using a just-in-time (JIT) compiler. This is the case for PyPy and Pyston. Such implementations are generally much faster for basic numerical code since the JIT can optimize expressions and translate them to native code at runtime. That being said, PyPy does not support all modules. For numerical code like this, an embedded JIT like Numba can be even faster (close to native C/C++ code), though it mainly supports NumPy (as of now).
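As a rough illustration of the rebinding point, assuming the question's module is importable as optimization:

import optimization  # the module containing to_optimize from the question

captured = []
# Rebinding the module-level name 'print': the LOAD_GLOBAL inside
# to_optimize() now finds this function instead of the builtin.
optimization.print = lambda value: captured.append(value)

optimization.to_optimize()
print(len(captured))  # 10000, so the loop body really had to run every time

And here is a minimal Numba sketch, assuming Numba is installed (it is a third-party package, not part of CPython); the function name is just an example:

from numba import njit

@njit
def numeric_loop(n):
    total = 0
    for i in range(n):
        total += i
    return total

numeric_loop(10)             # the first call triggers JIT compilation
print(numeric_loop(10_000))  # later calls run as compiled native code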
Note that the CPython developers recently decided to significantly optimize the interpreter, so basic optimizations like this one might be added soon. Still, the resulting code will be far slower than what PyPy can already do.