Group dataframe and sample n rows with equal probability between groups


I have a pandas dataframe like this:

     ID  Value
0     a     2
1     a     4
2     b     6
3     c     8
4     c    10
5     c    12

I would like to sample equally from the ID groups. I know I can group the dataframe by ID and specify the number of rows to sample from each group, like this:

df.groupby("ID").sample(n=2, replace=True)

However, I only want the probability of sampling from each group to be the same, not necessarily the exact same number of rows from each group.
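For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame shown above and runs the per-group call from the question. The `random_state=0` seed and the name `per_group` are my additions for repeatability, not part of the original question.

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({"ID": ["a", "a", "b", "c", "c", "c"],
                   "Value": [2, 4, 6, 8, 10, 12]})

# Fixed number of rows per group: every ID contributes exactly n=2 rows,
# drawn with replacement so that the single-row group "b" can still supply two
per_group = df.groupby("ID").sample(n=2, replace=True, random_state=0)

# Count how many rows each ID contributed
print(per_group["ID"].value_counts())
```

This is the behaviour the question wants to relax: the row count per group is fixed rather than random.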


Answer by mozway (best answer):

If you want to sample N rows in total, with each group roughly equally likely to be drawn from, you can oversample within each group and then sample again from the pooled result:

import math

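N = 4  # total number of rows wanted

# draw ceil(N / number of groups) rows from every group (with replacement),
# then keep N rows at random from that balanced pool
out = (df.groupby('ID')
         .sample(n=math.ceil(N/df['ID'].nunique()), replace=True)
         .sample(N)
      )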

Example output:

  ID  Value
2  b      6
2  b      6
4  c     10
1  a      4

With N = 10:

  ID  Value
0  a      2
2  b      6
5  c     12
3  c      8
1  a      4
5  c     12
2  b      6
1  a      4
1  a      4
2  b      6

Proportion of rows per ID with N = 100:

ID
b    0.34
a    0.33
c    0.33
Name: proportion, dtype: float64
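
The proportions can be checked with something like the sketch below. It assumes the same example frame as the question; the seeds (`random_state=0`) and the names `pool` and `proportions` are only there for illustration. The exact figures will vary with the seed and the pandas version.

```python
import math
import pandas as pd

df = pd.DataFrame({"ID": ["a", "a", "b", "c", "c", "c"],
                   "Value": [2, 4, 6, 8, 10, 12]})

N = 100
n_groups = df["ID"].nunique()

# Step 1: balanced pool with ceil(N / n_groups) rows per ID (34 each here)
pool = df.groupby("ID").sample(n=math.ceil(N / n_groups),
                               replace=True, random_state=0)

# Step 2: trim the pool down to exactly N rows
out = pool.sample(N, random_state=0)

# Share of the final sample that came from each ID
proportions = out["ID"].value_counts(normalize=True)
print(proportions)
```

Because the pool holds the same number of rows for every ID and only the overshoot is discarded, the per-group counts in the result can differ by at most that overshoot (2 rows here). So this gives near-equal counts rather than independent equal-probability draws.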

Answer by Jamie:

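This can be done using frac instead of n in your sample call. For example, to sample 50% of the rows in each ID group (note that the number of rows drawn then scales with the size of each group, so larger groups contribute more rows):

newdf = df.groupby("ID").sample(frac=0.5, replace=True)
display(newdf)  # display() is available in Jupyter; use print(newdf) elsewhere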
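
Since frac scales the draw with group size, it may not give each ID the same chance of being picked. One possible alternative, which is not part of either answer above, is to weight each row by the inverse of its group size and sample from the whole frame with `DataFrame.sample(weights=...)`. A rough sketch, where `row_weights`, the large `N` and the seed are my own illustrative choices:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["a", "a", "b", "c", "c", "c"],
                   "Value": [2, 4, 6, 8, 10, 12]})

N = 3000  # hypothetical sample size, large so the proportions settle

# Weight of a row = 1 / (number of rows sharing its ID), so the weights
# within every ID add up to the same total
row_weights = 1 / df["ID"].map(df["ID"].value_counts())

# DataFrame.sample normalizes the weights to sum to 1 before drawing
out = df.sample(n=N, replace=True, weights=row_weights, random_state=0)

proportions = out["ID"].value_counts(normalize=True)
print(proportions)
```

Here every draw lands in each ID with probability 1/3 regardless of how many rows that ID has, and the number of rows per group is left to chance, which seems closer to what the question describes.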