I have a dataframe like (except mine is very large):
user1 user2 day hour quantity
-------------------------------------
Alice Bob 1 12 250
Alice Bob 1 13 250
Bob Carol 1 10 20
Alice Bob 4 1 600
.
.
.
...then suppose I get the following groupby and aggregation (by user1
, user2
and day
):
user1 user2 day quantity
---------------------
Alice Bob 1 500
4 600
Bob Carol 1 20
3 100
where the day should go from 0-364 (365 days). What I want is the interquartile range (and median) of the counts for each user for all the days -- except that the zeroes aren't counted.
Life would have been easier if I had explicit zeroes for all excluded days:
user1 user2 day quantity
---------------------
Alice Bob 1 500
2 0
3 0
4 600
.....
Bob Carol 1 20
2 0
3 100
...
... because then I could do df.reset_index().agg({'quantity':scipy.stats.iqr})
but I'm working with a very large dataframe (the example above is a dummy one), and reindexing with zeroes is just not possible.
I have an idea how to do it: since I know there are 365 days, then I should just pad the rest of the numbers by zeroes:
Alice-Bob: [500,600] + (365-2) * [0]
and get the scipy.stats.iqr
(and median) of that. However, this would involve iterating over all user1-user2
pairs. From experience, that takes a lot of time.
Is there any vectorized solution to this? I also have to get the median, too, and I think the same approach should hold.
To take advantage of zeros without putting them in dataframe you can use something like this:
This would create the median of quantity as if there were zeros in the dataframe.