Getting interquartile range and median from pandas groupby, zero-padding for all unmentioned dates

4.6k Views Asked by At

I have a dataframe like (except mine is very large):

user1      user2   day   hour  quantity
-------------------------------------
Alice      Bob      1     12     250
Alice      Bob      1     13     250
Bob        Carol    1     10     20
Alice      Bob      4     1      600
.
.
.

...then suppose I get the following groupby and aggregation (by user1, user2 and day):

user1      user2   day   quantity
---------------------
Alice      Bob      1      500
                    4      600
Bob        Carol    1      20
                    3      100

where the day should go from 0-364 (365 days). What I want is the interquartile range (and median) of the counts for each user for all the days -- except that the zeroes aren't counted.

Life would have been easier if I had explicit zeroes for all excluded days:

user1    user2    day   quantity
---------------------
Alice    Bob      1      500
                  2      0
                  3      0
                  4      600
.....
Bob      Carol    1      20
                  2      0
                  3      100
...

... because then I could do df.reset_index().agg({'quantity':scipy.stats.iqr}) but I'm working with a very large dataframe (the example above is a dummy one), and reindexing with zeroes is just not possible.

I have an idea how to do it: since I know there are 365 days, then I should just pad the rest of the numbers by zeroes:

Alice-Bob: [500,600] + (365-2) * [0]

and get the scipy.stats.iqr (and median) of that. However, this would involve iterating over all user1-user2 pairs. From experience, that takes a lot of time.

Is there any vectorized solution to this? I also have to get the median, too, and I think the same approach should hold.

1

There are 1 best solutions below

7
On BEST ANSWER

To take advantage of zeros without putting them in dataframe you can use something like this:

test = df.groupby(['user1', 'user2', 'day'])['quantity'].mean().reset_index()\
         .groupby(['user1', 'user2'])\
         .agg({'day': lambda x: tuple(x), 'quantity': lambda x: tuple(x)})\
         .reset_index()

def med_from_tuple(row):
    # starts with everything zero, and replaces some with the nonzero values in the dataframe
    z = np.zeros(365)
    np.put(z, row['day'], row['quantity'])
    return np.median(z)

test['example'] = test.apply(lambda x: med_from_tuple(x), axis=1)

This would create the median of quantity as if there were zeros in the dataframe.

test
#   user1  user2     day    quantity   example
#0  Alice    Bob  (1, 4)  (250, 600)       0.0
#1    Bob  Carol    (1,)       (20,)       0.0