Using groupby and filters on a dataframe

175 Views Asked by At

I have a dataframe with both string and integer values.

Attaching a sample data dictionary to understand the dataframe that I have:

data = {
'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
'col4': [45,23,78,56,12,34,87,54,43,89,43,12]
'col5': [3,4,6,4,3,2,4,3,5,3,4,6]
}

I need to extract data as under:

  • Max value from col4
  • Grouped by col1
  • Filtered out col3 from the result if value is Y
  • Filter col5 from the result to show only values not more than 5.

So I tried something and faced following problems.

1- I used following method to find max value from the data. But I am not able to find max value from each group.

print(dataframe['col4'].max()) #this worked to get one max value
print(dataframe.groupby('col1').max() #this doesn't work

Second one doesn't work for me as that returns maximum value for col2 as well. I need the result to have col2 value against the max row under each group.

2- I am not able to apply filter on both col3 (str) and col5 (int) in one command. Any way to do that?

print(dataframe[dataframe['col3'] != 'Y' & dataframe['col5'] < 6]) #generates an error

The output that I am expecting through this is:

    col1  col2 col3  col4  col5
0     A    10    X    45     3
3     B    10    X    56     4
6     C    10    X    87     4
10    D    20    X    43     4
#
# 78 is max in group A, but ignored as col5 is 6 (we need < 6)
# Similarly, 89 is max in group D, but ignored as col3 is Y.

I apologize if I am doing something wrong. I am quite new to this.

Thank you.

2

There are 2 best solutions below

1
Grzegorz On

I'm not a python developer, but im my opinion you do it in a wrong way. You shoud have a list of structure insted of structure of list. Then you can start workin on such list.

This is an example solution, so probably it coud be done im much smootcher way:

data = {
'col1': ['A','A','A','B','B','B','C','C','C','D','D','D'],
'col2': [10,20,30,10,20,30,10,20,30,10,20,30],
'col3': ['X','X','X','X','Y','X','X','X','Y','Y','X','X'],
'col4': [45,23,78,56,12,34,87,54,43,89,43,12],
'col5': [3,4,6,4,3,2,4,3,5,3,4,6]
}

newData = [];

for i in range(len(data['col1'])):
    newData.append({'col1' : data['col1'][i], 'col2' : data['col2'][i], 'col3' : data['col3'][i], 'col4' : data['col4'][i], 'col5' : data['col5'][i]})

withoutY = list(filter(lambda d: d['col3'] != 'Y', newData))
lessThen5 = list(filter(lambda d: d['col5'] < 5, withoutY))
values = set(map(lambda d: d['col1'], lessThen5))
groupped = [[d1 for d1 in lessThen5 if d1['col1']==d2] for d2 in values]

result = [];
for i in range(len(groupped)):
    result.append(max(groupped[i], key = lambda g: g['col4']))

sortedResult = sorted(result, key = lambda r: r['col1'])

print (sortedResult)

result:

[
{'col1': 'A', 'col2': 10, 'col3': 'X', 'col4': 45, 'col5': 3}, 
{'col1': 'B', 'col2': 10, 'col3': 'X', 'col4': 56, 'col5': 4}, 
{'col1': 'C', 'col2': 10, 'col3': 'X', 'col4': 87, 'col5': 4}, 
{'col1': 'D', 'col2': 20, 'col3': 'X', 'col4': 43, 'col5': 4}
]
0
Grzegorz On

Ok, I didn't actually notice. So i was try something like this:

#fd is a filtered data
fd=data.query('col3 != "Y"').query('col5 < 6')
# or fd=data[data.col3 != 'Y'][data.col5 < 6]
#m is max for col4 grouped by col1
m=fd.groupby('col1')['col4'].max()

This will group by col1 and get max from col4, but in result we have 2 colums (col1 and col4). I don't know what do you want to achieve. If you want to have all line, here is the code:

result=fd[lambda x: x.col4 == m.get(x.col1).values]

You need to be careful, because you not alway will have one line for "col1". E.g. For data

data = pd.DataFrame({
    'col1': ['A','A','A','A','B','B','B','B','C','C','C','D','D','D'],
    'col2': [20,10,20,30,10,20,20,30,10,20,30,10,20,30],
    'col3': ['X','X','X','X','X','X','Y','X','X','X','Y','Y','X','X'],
    'col4': [45,45,23,78,45,56,12,34,87,54,43,89,43,12],
    'col5': [1,3,4,6,1,4,3,2,4,3,5,3,4,6]})

Result will be:

   col1  col2 col3  col4  col5
0     A    20    X    45     1
1     A    10    X    45     3
5     B    20    X    56     4
8     C    10    X    87     4
12    D    20    X    43     4

Additionally if you want to have normal index instead of ..., 8, 9 12, you could use "where" instead of "query"