How to select rows from Pandas dataframe after using groupby?


From the following data frame, how can a list of owners with more than two pets be obtained without resorting to copying/pasting or intermediate data structures?

df = pd.DataFrame([['Jack', 'fuzzy',12], ['Jack', 'furry',13], ['Joe', 'scratchy',3], ['Jack', 'chirpy',40], ['Jeff', 'slithery',9], ['Jack', 'swimmy',1], ['Joe', 'feathery',14], ['Joe', 'oinky',11], ['Jack', 'stampy',1]],
                  columns=['Owner', 'Pet', 'Age'])
print(df)


  Owner       Pet  Age
0  Jack     fuzzy   12
1  Jack     furry   13
2   Joe  scratchy    3
3  Jack    chirpy   40
4  Jeff  slithery    9
5  Jack    swimmy    1
6   Joe  feathery   14
7   Joe     oinky   11
8  Jack    stampy    1

Getting a boolean series matching the condition is easy:

df.groupby('Owner').count()['Pet']>2

Owner
Jack     True
Jeff    False
Joe      True
Name: Pet, dtype: bool

Actually distilling the matches (Jack and Joe) can be done by copy-pasting the groupby statement:

df.groupby('Owner').count()['Pet'][df.groupby('Owner').count()['Pet']>2] 

Owner
Jack    5
Joe     3
Name: Pet, dtype: int64

But this is a pain if the conditional statement is long, because every change has to be repeated. The only other way discovered so far is to throw the series back into a data frame and use query(), but this feels impossibly hackish:

pd.DataFrame(df.groupby('Owner').count()['Pet']).query('Pet > 2')

       Pet
Owner     
Jack     5
Joe      3

Is there a better way than these?

There are 2 answers below.

BEST ANSWER

You could use the .loc indexer with a filtering function.

>>> df.groupby('Owner').Pet.count().loc[lambda p: p > 2]
Owner
Jack    5
Joe     3
Name: Pet, dtype: int64

Alternatively, you could use the compress method.

>>> df.groupby('Owner').Pet.count().compress(lambda p: p > 2)
Owner
Jack    5
Joe     3
Name: Pet, dtype: int64
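Note that Series.compress was deprecated in pandas 0.24 and removed in 1.0, so on current pandas the plain boolean-indexing equivalent (or the .loc lambda above) is the durable form. A sketch using the question's data:

```python
import pandas as pd

# Same example frame as in the question.
df = pd.DataFrame(
    [['Jack', 'fuzzy', 12], ['Jack', 'furry', 13], ['Joe', 'scratchy', 3],
     ['Jack', 'chirpy', 40], ['Jeff', 'slithery', 9], ['Jack', 'swimmy', 1],
     ['Joe', 'feathery', 14], ['Joe', 'oinky', 11], ['Jack', 'stampy', 1]],
    columns=['Owner', 'Pet', 'Age'])

counts = df.groupby('Owner').Pet.count()
result = counts[counts > 2]  # boolean indexing; works on every pandas version
print(result)
```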
SECOND ANSWER

Option 1
Use pd.factorize and np.bincount

f, u = pd.factorize(df.Owner.values)
b = np.bincount(f)
m = b > 2
u[m]

array(['Jack', 'Joe'], dtype=object)

Or produce a series

pd.Series(b[m], u[m])

Jack    5
Joe     3
dtype: int64
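A closely related one-liner worth mentioning: value_counts does essentially the same factorize-and-bincount work internally and returns the counts sorted descending, so chaining the .loc filter onto it yields the same owners. A sketch:

```python
import pandas as pd

# Same example frame as in the question.
df = pd.DataFrame(
    [['Jack', 'fuzzy', 12], ['Jack', 'furry', 13], ['Joe', 'scratchy', 3],
     ['Jack', 'chirpy', 40], ['Jeff', 'slithery', 9], ['Jack', 'swimmy', 1],
     ['Joe', 'feathery', 14], ['Joe', 'oinky', 11], ['Jack', 'stampy', 1]],
    columns=['Owner', 'Pet', 'Age'])

# Count rows per owner, then keep only counts above the threshold.
result = df['Owner'].value_counts().loc[lambda s: s > 2]
print(result)
```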

Option 2
Use the same groupby twice

2.1
Funky lambda

(lambda p: p[p > 2])(df.groupby('Owner').Pet.count())

Owner
Jack    5
Joe     3
Name: Pet, dtype: int64

2.2
pipe
I'd rather use @Mitch's answer than this one.

df.groupby('Owner').Pet.count().pipe(lambda p: p[p > 2])

Owner
Jack    5
Joe     3
Name: Pet, dtype: int64

Timing
Code Below

# Multiples of minimum runtime: Smaller is better.
#
       pir1      pir2      pir3      mch1      mch2
10      1.0  2.984347  2.907198  2.422435  2.736712
30      1.0  3.396997  3.464083  3.023355  3.353150
100     1.0  3.646931  3.053890  2.586377  2.859365
300     1.0  4.541890  4.037132  3.054388  3.323939
1000    1.0  2.529670  2.438109  2.214494  2.415056
3000    1.0  3.212312  3.739621  3.062538  2.969489
10000   1.0  2.923211  2.807983  2.970712  2.637492
30000   1.0  2.790350  2.830328  2.978083  2.719900

[figure: log-log plot of the timing results above]

from timeit import timeit

import numpy as np
import pandas as pd

def pir1(d, c):
    f, u = pd.factorize(d.Owner.values)
    b = np.bincount(f)
    m = b > c
    return pd.Series(b[m], u[m])

pir2 = lambda d, c: (lambda p: p[p > c])(d.groupby('Owner').Pet.count())
pir3 = lambda d, c: d.groupby('Owner').Pet.count().pipe(lambda p: p[p > c])
mch1 = lambda d, c: d.groupby('Owner').Pet.count().loc[lambda p: p > c]
mch2 = lambda d, c: d.groupby('Owner').Pet.count().compress(lambda p: p > c)

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir3 mch1 mch2'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    c = 2 * i
    for j in res.columns:
        stmt = '{}(d, c)'.format(j)
        setp = 'from __main__ import d, c, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=10)

res.div(res.min(1), 0)

res.plot(loglog=True)
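One closing note beyond the timings: if what the title asks for is the original rows rather than the per-owner counts, GroupBy.filter keeps every row of each group whose predicate is true, which answers that form of the question directly. A sketch with the question's data:

```python
import pandas as pd

# Same example frame as in the question.
df = pd.DataFrame(
    [['Jack', 'fuzzy', 12], ['Jack', 'furry', 13], ['Joe', 'scratchy', 3],
     ['Jack', 'chirpy', 40], ['Jeff', 'slithery', 9], ['Jack', 'swimmy', 1],
     ['Joe', 'feathery', 14], ['Joe', 'oinky', 11], ['Jack', 'stampy', 1]],
    columns=['Owner', 'Pet', 'Age'])

# Keep all rows whose owner has more than two pets.
rows = df.groupby('Owner').filter(lambda g: len(g) > 2)
print(rows)
```

This drops Jeff's single row and keeps all of Jack's and Joe's rows in their original order.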