More efficient way to search through Pandas groups

Question

164 Views Asked by sierra_papa At 14 July 2022 at 12:35

I want to get the list of values from col2 that belong to the same groupId, given a corresponding value in col1. A col1 value can belong to multiple groups, and in that case only the top-most group should be considered (group 2 but not group 3 in my example). Col1 values are always identical within the same groupId.

groupId	col1	col2
2	a	10
1	b	20
2	a	30
1	b	40
3	a	50
3	a	60
1	b	70

My current solution takes over 30s for a df with 2000 rows and 32 values to search for in col1 ('a' in this case):

group_id_groups = df.groupby('groupId') 

for group_id, group in group_id_groups:
    col2_values = list(group[group['col1'] == 'a']['col2'])
    if col2_values:
        print(col2_values)
        break


result: [10, 30]

There are 4 best solutions below

buhtz On 14 July 2022 at 12:58

I am still not sure what you want. Does this help you? I am sure that pandas.DataFrame.groupby() is your friend here.

Full code

#!/usr/bin/env python3
import pandas as pd

# initial data
df = pd.DataFrame({
    'groupId': [2, 1, 2, 1, 3, 3, 1],
    'col1': list('ababaab'),
    'col2': range(10, 80, 10)
})
print(df)

g = df.groupby(['groupId', 'col1']).agg(list)
print(g)

result = g.loc[(2, 'a')]
print(result)

Step by step

Your initial data in df looks like this

    groupId col1  col2
 0        2    a    10
 1        1    b    20
 2        2    a    30
 3        1    b    40
 4        3    a    50
 5        3    a    60
 6        1    b    70

Then you simply group your data by your two "search columns". The result per group is stored as a list.

g = df.groupby(['groupId', 'col1']).agg(list)

The result:

                      col2
groupId col1              
1       b     [20, 40, 70]
2       a         [10, 30]
3       a         [50, 60]

Now you can do your search:

result = g.loc[(2, 'a')]

That gives you

col2    [10, 30]
Name: (2, a), dtype: object
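
The lookup above needs the groupId up front. If only the col1 value is known, one option is to take a cross-section of the grouped result. This is a minimal sketch, not part of the original answer: it assumes "top-most" means the smallest groupId (groupby sorts the group keys by default), and it groups into a Series rather than a one-column DataFrame. The names lists, groups_with_a and top_most are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'groupId': [2, 1, 2, 1, 3, 3, 1],
    'col1': list('ababaab'),
    'col2': range(10, 80, 10)
})

# One list of col2 values per (groupId, col1) pair.
# With the default sort=True the index is ordered by groupId.
lists = df.groupby(['groupId', 'col1'])['col2'].agg(list)

# Keep only the groups whose col1 value is 'a'; the result is indexed by groupId.
groups_with_a = lists.xs('a', level='col1')

# The first entry belongs to the smallest groupId that contains 'a'.
top_most = groups_with_a.iloc[0]
print(top_most)
```

Building lists once and reusing it for every search value avoids regrouping the frame for each lookup.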

I'mahdi On 14 July 2022 at 13:06

You can use pandas.groupby with agg(list), then search for what you want with .loc and return the first match.

>>> grp = df.groupby(['groupId', 'col1']).agg(list).reset_index()

>>> grp.loc[grp['col1'].eq('a'), 'col2'].to_list()[0]
[10, 30]

>>> grp.loc[grp['col1'].eq('a'), 'col2']
1    [10, 30]
2    [50, 60]
Name: col2, dtype: object

rpanai On 14 July 2022 at 13:06

It seems to me that you mostly need to create a mask without using a groupby.

import pandas as pd

# data
data = {'groupId': {0: '2', 1: '1', 2: '2', 3: '1', 4: '3', 5: '3', 6: '1'},
 'col1': {0: 'a', 1: 'b', 2: 'a', 3: 'b', 4: 'a', 5: 'a', 6: 'b'},
 'col2': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50, 5: 60, 6: 70}}

df = pd.DataFrame(data)

# First group where condition is satisfied
first_group = df[df["col1"].eq("a")].iloc[0]["groupId"]

# Output
df[df["col1"].eq("a") &
  df["groupId"].eq(first_group)]["col2"].to_list()

And the output is [10, 30] as expected.
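
Note that .iloc[0] picks the group of the first matching row, which is not necessarily the group with the smallest groupId. The following sketch uses a hypothetical, reordered frame (not the data from the question) to show where the two readings of "top-most" differ. Which one is wanted depends on the question's intent, so treat both as assumptions.

```python
import pandas as pd

# Hypothetical data: group 3 appears before group 2 in row order.
df = pd.DataFrame({
    'groupId': [3, 3, 2, 2],
    'col1': ['a', 'a', 'a', 'a'],
    'col2': [50, 60, 10, 30],
})

mask = df['col1'].eq('a')

# Reading 1: the group of the first matching row.
first_seen = df.loc[mask, 'groupId'].iloc[0]
by_first_seen = df.loc[mask & df['groupId'].eq(first_seen), 'col2'].to_list()

# Reading 2: the matching group with the smallest groupId.
smallest = df.loc[mask, 'groupId'].min()
by_smallest = df.loc[mask & df['groupId'].eq(smallest), 'col2'].to_list()

print(by_first_seen)  # [50, 60]
print(by_smallest)    # [10, 30]
```

On the data from the question both readings give [10, 30], because group 2 is both the first group seen and the smallest groupId containing 'a'.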

Chris On 14 July 2022 at 12:54 (Accepted Answer)

The sort parameter of groupby defaults to True, so the groups come back ordered by groupId and the first group is the topmost one. You can change col_to_search to 'b' to get the other answer.

import pandas as pd
df = pd.DataFrame({'groupId': [2, 1, 2, 1, 3, 3, 1],
 'col1': ['a', 'b', 'a', 'b', 'a', 'a', 'b'],
 'col2': [10, 20, 30, 40, 50, 60, 70]})

col_to_search = 'a'
(
    df.loc[df['col1'].eq(col_to_search)]
    .groupby('groupId')['col2']
    .apply(list)
    .iloc[0]
)

Output

[10, 30]
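
Since the question mentions 32 values to search for, the same idea can be applied to all of them in one pass instead of once per value. This is a sketch only: it assumes "top-most" means the smallest groupId, and the names wanted, subset, top_group and result are illustrative rather than taken from the answer.

```python
import pandas as pd

df = pd.DataFrame({'groupId': [2, 1, 2, 1, 3, 3, 1],
                   'col1': ['a', 'b', 'a', 'b', 'a', 'a', 'b'],
                   'col2': [10, 20, 30, 40, 50, 60, 70]})

# All col1 values to look up at once.
wanted = ['a', 'b']

# Keep only the rows whose col1 value is being searched for.
subset = df[df['col1'].isin(wanted)]

# For every row, the smallest groupId that shares its col1 value.
top_group = subset.groupby('col1')['groupId'].transform('min')

# Keep the rows in that top-most group and collect col2 per col1 value.
result = (
    subset[subset['groupId'].eq(top_group)]
    .groupby('col1')['col2']
    .agg(list)
)
print(result.to_dict())  # {'a': [10, 30], 'b': [20, 40, 70]}
```

Everything here is a vectorised filter plus two groupby calls, so no Python-level loop over groups or search values is needed.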