Test code:
```
import numpy as np
import pandas as pd

COUNT = 1000000
df = pd.DataFrame({
    'y': np.random.normal(0, 1, COUNT),
    'z': np.random.gamma(50, 1, COUNT),
})

%timeit df.y[(10 < df.z) & (df.z < 50)].mean()
%timeit df.y.values[(10 < df.z.values) & (df.z.values < 50)].mean()
%timeit df.eval('y[(10 < z) & (z < 50)].mean()', engine='numexpr')
```
The output on my machine (a fairly fast x86-64 Linux desktop with Python 3.6) is:
```
17.8 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.44 ms ± 502 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
46.4 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
I understand why the second line is a bit faster (it ignores the Pandas index). But why is the `eval()` approach using `numexpr` so slow? Shouldn't it be faster than at least the first approach? The documentation sure makes it seem like it would be: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html
From the investigation presented below, it looks like the unspectacular reason for the worse performance is "overhead".
Only a small part of the expression

`y[(10 < z) & (z < 50)].mean()`

is done via the `numexpr` module. `numexpr` doesn't support indexing, thus we can only hope for `(10 < z) & (z < 50)` to be sped up; everything else is mapped to pandas operations.

However, `(10 < z) & (z < 50)` is not the bottleneck here, as can easily be seen: `df.y[mask]` takes the lion's share of the running time.

We can compare the profiler output for `df.y[mask]` and `df.eval('y[mask]')` to see what makes the difference. When I use the following script:
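(The original script isn't reproduced here; the following is a minimal reconstruction that matches the description — the row count and the repeat counts are assumptions, not necessarily the original values.)

```python
# run.py -- reconstructed sketch of the benchmark script (not the original);
# COUNT and the repeat counts are assumptions chosen to give measurable runtimes.
import numpy as np
import pandas as pd

COUNT = 1000000
df = pd.DataFrame({
    'y': np.random.normal(0, 1, COUNT),
    'z': np.random.gamma(50, 1, COUNT),
})

def run_direct(repeats=10):
    # plain pandas: build the boolean mask, then index via Series.__getitem__
    for _ in range(repeats):
        df.y[(10 < df.z) & (df.z < 50)].mean()

def run_eval(repeats=10):
    # the same expression routed through DataFrame.eval
    # (engine='numexpr' in the original; left at the default here so the
    # sketch also runs where numexpr is not installed)
    for _ in range(repeats):
        df.eval('y[(10 < z) & (z < 50)].mean()')

run_direct()
run_eval()
```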
and run it with `python -m cProfile -s cumulative run.py` (or `%prun -s cumulative <...>` in IPython), I can see the following profiles.

For the direct call of the pandas functionality:
We can see that almost 100% of the time is spent in `series.__getitem__`, without any overhead.

For the call via `df.eval(...)`, the situation is quite different: once again about 7 seconds are spent in `series.__getitem__`, but there are also about 6 seconds of overhead, for example about 2 seconds in `frame.py:2861(eval)` and about 2 seconds in `expr.py:461(visit_Subscript)`.

I did only a superficial investigation (see more details further below), but this overhead doesn't seem to be merely constant, but at least linear in the number of elements in the series. For example, there is `method 'copy' of 'numpy.ndarray' objects`, which means that data is copied (it is quite unclear why this would be necessary per se).

My take-away from it: using `pd.eval` has advantages as long as the evaluated expression can be evaluated with `numexpr` alone. As soon as this is not the case, there may be no more gains but rather losses due to the quite large overhead.

Using `line_profiler` (here I use the `%lprun` magic, after loading it with `%load_ext line_profiler`, for the function `run()`, which is more or less a copy of the script above) we can easily find where the time is lost in `Frame.eval`:

Here we can see where the additional 10% are spent:
and `_get_index_resolvers()` can be drilled down to `Index._to_embed`:

where the `O(n)` copying happens.
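As a closing illustration of the take-away (a sketch; the `mask`/`result` names and the smaller `COUNT` are mine): only the mask computation is `numexpr`-eligible, so splitting the expression by hand gives the same result while making the pandas-only part explicit.

```python
import numpy as np
import pandas as pd

COUNT = 100000  # smaller than in the question, just for illustration
df = pd.DataFrame({
    'y': np.random.normal(0, 1, COUNT),
    'z': np.random.gamma(50, 1, COUNT),
})

# The only numexpr-eligible piece is the mask; df.eval can compute it
# (pandas falls back to the 'python' engine if numexpr is missing):
mask = df.eval('(10 < z) & (z < 50)')

# The indexing and the reduction are plain pandas either way, so the
# full expression cannot beat this two-step version:
result = df.y[mask].mean()

# Same answer as pushing the whole expression through eval:
assert abs(result - df.eval('y[(10 < z) & (z < 50)].mean()')) < 1e-12
```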