Improve performance on processing a big pandas dataframe


I have a big pandas dataframe (1 million rows), and I need better performance in my code to process this data.

My code is below, and a profiling analysis is also provided.

Header of the dataset:

key_id, date, par1, par2, par3, par4, pop, price, value

For each key_id, there is one row for each of the 5,000 possible dates.

There are 200 key_id * 5,000 dates = 1,000,000 rows.

Using four variables val1, ..., val4, I compute a value for each row. I then want to extract the top 20 dates with the best value for each key_id, and compute the popularity of the set of variables used.

In the end, I want to find the variables which optimize this popularity.

from itertools import product

def compute_value_col(dataset, val1=0, val2=0, val3=0, val4=0):
    # linear combination of the four parameters, added to the price
    dataset['value'] = dataset['price'] + val1 * dataset['par1'] \
        + val2 * dataset['par2'] + val3 * dataset['par3'] \
        + val4 * dataset['par4']

    return dataset

def params_to_score(dataset, top=10, val1=0, val2=0, val3=0, val4=0):
    dataset = compute_value_col(dataset, val1, val2, val3, val4)
    # sort by key then value, keep the `top` best (smallest) values per key_id
    dataset = dataset.sort(['key_id','value'], ascending=True)
    dataset = dataset.groupby('key_id').head(top).reset_index(drop=True)
    # popularity of this parameter set
    return dataset['pop'].sum()

def optimize(dataset, top):
    # brute-force grid search over the four parameters
    for i,j,k,l in product(xrange(10),xrange(10),xrange(10),xrange(10)):
        print i, j, k, l, params_to_score(dataset, top, 10*i, 10*j, 10*k, 10*l)

optimize(my_dataset, 20)

I need to improve the performance of this code.

Here is the %prun output after running params_to_score 49 times:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       98    2.148    0.022    2.148    0.022 {pandas.algos.take_2d_axis1_object_object}
       49    1.663    0.034    9.852    0.201 <ipython-input-59-88fc8127a27f>:150(params_to_score)
       49    1.311    0.027    1.311    0.027 {method 'get_labels' of 'pandas.hashtable.Float64HashTable' objects}
       49    1.219    0.025    1.223    0.025 {pandas.algos.groupby_indices}
       49    0.875    0.018    0.875    0.018 {method 'get_labels' of 'pandas.hashtable.PyObjectHashTable' objects}
      147    0.452    0.003    0.457    0.003 index.py:581(is_unique)
      343    0.193    0.001    0.193    0.001 {method 'copy' of 'numpy.ndarray' objects}
        1    0.136    0.136   10.058   10.058 <ipython-input-59-88fc8127a27f>:159(optimize)
      147    0.122    0.001    0.122    0.001 {method 'argsort' of 'numpy.ndarray' objects}
      833    0.112    0.000    0.112    0.000 {numpy.core.multiarray.empty}
       49    0.109    0.002    0.109    0.002 {method 'get_labels_groupby' of 'pandas.hashtable.Int64HashTable' objects}
       98    0.083    0.001    0.083    0.001 {pandas.algos.take_2d_axis1_float64_float64}
       49    0.078    0.002    1.460    0.030 groupby.py:1014(_cumcount_array)

I think I could split the big dataframe into smaller dataframes by key_id to improve the sort time: since I only want the top 20 dates with the best value for each key_id, sorting by key is just there to separate the different keys.
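
A rough, untested sketch of what I mean (params_to_score_split is just an illustrative name): group once by key_id, then reuse the per-key frames for every set of parameters, instead of sorting the full 1,000,000-row dataframe each time.

groups = {key: grp for key, grp in my_dataset.groupby('key_id')}

def params_to_score_split(groups, top=20, val1=0, val2=0, val3=0, val4=0):
    total = 0.0
    for key, grp in groups.items():
        # same value computation as compute_value_col, restricted to one key_id
        value = grp['price'] + val1 * grp['par1'] + val2 * grp['par2'] \
            + val3 * grp['par3'] + val4 * grp['par4']
        # rows with the `top` smallest values for this key
        idx = value.nsmallest(top).index
        total += grp.loc[idx, 'pop'].sum()
    return total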

I would appreciate any advice on how to improve the efficiency of this code, as I need to run params_to_score thousands of times.

EDIT: @Jeff

Thanks a lot for your help!

I tried using nsmallest instead of sort & head, but strangely it is 5-6 times slower when I benchmark the following two functions:

def to_bench1(dataset):
    dataset = dataset.sort(['key_id','value'], ascending=True)
    dataset = dataset.groupby('key_id').head(50).reset_index(drop=True)
    return dataset['pop'].sum()

def to_bench2(dataset):
    dataset = dataset.set_index('pop')
    dataset = dataset.groupby(['key_id'])['value'].nsmallest(50).reset_index()
    return dataset['pop'].sum()

On a sample of ~100,000 rows, to_bench2 runs in about 0.5 seconds, while to_bench1 takes only 0.085 seconds on average.

After profiling to_bench2, I notice many more isinstance calls than before, but I do not know where they come from...


There is 1 answer below.


The way to make this significantly faster is as follows.

Create some sample data

In [148]: df = DataFrame({'A' : range(5), 'B' : [1,1,1,2,2] })

Define the value computation like your compute_value_col:

In [149]: def f(p):
    return DataFrame({ 'A' : df['A']*p, 'B' : df.B })
   .....: 

These are the cases (you probably want a list of tuples here), i.e. the Cartesian product of all of the parameter combinations that you want to feed into the above function; a sketch for the question's four real parameters follows the toy example below.

In [150]: parms = [1,3]
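
For the four real parameters from the question, a sketch of building that list of tuples (parm_tuples is just an illustrative name):

from itertools import product

# all (val1, val2, val3, val4) combinations used in the question's optimize()
parm_tuples = [(10*i, 10*j, 10*k, 10*l)
               for i, j, k, l in product(range(10), repeat=4)]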

Create a new data frame that has the full set of values, keyed by each of the parms. This is basically a broadcasting operation.

In [151]: df2 = pd.concat([ f(p) for p in parms ],keys=parms,names=['parm','indexer']).reset_index()

In [155]: df2
Out[155]: 
   parm  indexer   A  B
0     1        0   0  1
1     1        1   1  1
2     1        2   2  1
3     1        3   3  2
4     1        4   4  2
5     3        0   0  1
6     3        1   3  1
7     3        2   6  1
8     3        3   9  2
9     3        4  12  2

Here's the magic. Group by whatever columns you want, including parm as the first one (or possibly multiple ones). Then do a partial sort (this is what nlargest does); this is more efficient than sort & head (well, it depends a bit on the group density). Sum at the end (again by the groupers that we care about, as you are doing a 'partial' reduction). A sketch applying the same pattern to the question's columns follows the output below.

In [153]: df2.groupby(['parm','B']).A.nlargest(2).sum(level=['parm','B'])
Out[153]: 
parm  B
1     1     3
      2     7
3     1     9
      2    21
dtype: int64
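
Applied to the question's columns, a hedged, untested sketch of the same pattern could look like this (names such as value_for, parm_tuples and all_values are illustrative; it assumes "best value" means smallest value and that you want to maximize the summed pop, as in the question's code, and the full 10**4-combination grid would likely have to be processed in chunks to keep memory in check):

import pandas as pd
from itertools import product

def value_for(dataset, vals):
    # same computation as compute_value_col, keeping only the columns needed later
    v1, v2, v3, v4 = vals
    out = dataset[['key_id', 'pop']].copy()
    out['value'] = dataset['price'] + v1 * dataset['par1'] + v2 * dataset['par2'] \
        + v3 * dataset['par3'] + v4 * dataset['par4']
    return out

# small illustrative grid; the real grid from the question has 10**4 tuples
parm_tuples = [(10*i, 10*j, 10*k, 10*l)
               for i, j, k, l in product(range(2), repeat=4)]

# broadcast: one block of rows per parameter tuple, keyed by an integer id
all_values = pd.concat([value_for(my_dataset, p) for p in parm_tuples],
                       keys=range(len(parm_tuples)),
                       names=['parm', 'indexer']).reset_index()

# top-20 smallest values per (parm, key_id), then summed popularity per parm
scores = (all_values.set_index('pop')
                    .groupby(['parm', 'key_id'])['value']
                    .nsmallest(20)
                    .reset_index()
                    .groupby('parm')['pop']
                    .sum())

best_vals = parm_tuples[scores.idxmax()]  # the (val1, val2, val3, val4) maximizing pop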