Multi Column DDPLY/R function in Pandas/Python


I have the following statement in R:

library(plyr)
filteredData <- ddply(data, .(ID1, ID2), businessrule)

I am trying to use Python and Pandas to duplicate the action. I have tried...

data['judge'] = data.groupby(['ID1','ID2']).apply(lambda x: businessrule(x))

But this raises the error:

 incompatible index of inserted column with frame index

There is 1 solution below


The error message can be reproduced with

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['ID1', 'ID2', 'val'])
df['new'] = df.groupby(['ID1', 'ID2']).apply(lambda x: x.values.sum())
# TypeError: incompatible index of inserted column with frame index

It is likely that your code raises an error for the same reason this toy example does. The right-hand side is a Series with a 2-level MultiIndex:

ID1  ID2
0    1       3
3    4      12
6    7      21
9    10     30
dtype: int64

df['new'] = ... tells Pandas to assign this Series to a column in df. But df has a single-level index:

   ID1  ID2  val
0    0    1    2
1    3    4    5
2    6    7    8
3    9   10   11

Because the single-level index is incompatible with the 2-level MultiIndex, the assignment fails. In general, assigning the result of groupby/apply to a column of df is only correct when the columns or levels you group by also happen to be valid index keys in the original DataFrame, df.
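
If the goal really is a new column on the original rows, the result has to be indexed like df, or joined on the key columns. The following is a minimal sketch using the toy df from above; the per-group sum of val is only an illustrative stand-in for whatever the real business rule computes, and the names per_group, group_sum and joined are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['ID1', 'ID2', 'val'])

# transform returns a Series indexed like df (one value per row),
# so the column assignment aligns without error.
df['new'] = df.groupby(['ID1', 'ID2'])['val'].transform('sum')

# Alternatively, keep the per-group Series with its 2-level MultiIndex
# and join it back onto df using the key columns.
per_group = df.groupby(['ID1', 'ID2'])['val'].sum().rename('group_sum')
joined = df.join(per_group, on=['ID1', 'ID2'])
```

transform only fits rules that yield one value per row or one value per group; for anything that reshapes or filters the group, the join (or a separate result variable) is the safer route.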

Instead, assign the result to a new variable, just as the R code does:

filteredData = data.groupby(['ID1','ID2']).apply(businessrule)

Note that lambda x: businessrule(x) can be replaced with businessrule.
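
One difference from ddply is worth keeping in mind: ddply returns a flat data frame with ID1 and ID2 as ordinary columns, while groupby/apply puts the group keys in the index. Calling reset_index() afterwards gives a closer match. The sketch below assumes a hypothetical businessrule that summarises a val column; both the sample data and the rule are invented for illustration and are not the asker's actual code.

```python
import pandas as pd

# Hypothetical sample data; the real `data` is not shown in the question.
data = pd.DataFrame({
    'ID1': [1, 1, 2, 2],
    'ID2': ['a', 'a', 'b', 'c'],
    'val': [10, 20, 30, 40],
})

def businessrule(group):
    # Hypothetical rule: return one row of summary values per group.
    return pd.Series({'n': len(group), 'total': group['val'].sum()})

# Selecting the non-key columns explicitly keeps the key columns out of
# `group`; the keys end up in the result's MultiIndex instead.
grouped = data.groupby(['ID1', 'ID2'])[['val']].apply(businessrule)

# reset_index turns the (ID1, ID2) index levels back into columns,
# which is the shape ddply would return.
filteredData = grouped.reset_index()
```

If businessrule returns a DataFrame per group rather than a Series, the result's index gains an extra level, so the reset_index step may need adjusting.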