It seems that a lot of the operations I do in Pandas have computational-time costs that grow faster than linearly with the amount of data I have (e.g. 1 row takes 1 second, 2 rows take 2.2 seconds, 4 rows take 6 seconds, etc.).
Why don't computational costs increase linearly with the amount of data I have? For example, this function I wrote:
def fractrips1brand(trip):
    # Count the distinct "inside good" Item_Ids for THIS specific consumer
    art = trip[trip.Item_Id.isin(insidegood)].Item_Id.nunique()
    output = pd.Series({'numinsidegoods': art})
    return output

gr_TILPS = TILPStemp.groupby('uniqueid')
output = gr_TILPS.apply(fractrips1brand)
Seems to exhibit such costs.
Why is it not O(n)?
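For reference, this is roughly how the scaling could be measured (a sketch, assuming TILPStemp, insidegood and fractrips1brand are defined as above; the slice sizes are only illustrative):

import time

for nrows in (1000, 2000, 4000, 8000):
    subset = TILPStemp.head(nrows)
    start = time.time()
    subset.groupby('uniqueid').apply(fractrips1brand)
    print(nrows, 'rows:', round(time.time() - start, 3), 'seconds')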
It's quite common for functions to have greater-than-linear time complexity. Sorting, for example, has O(n log n) complexity. groupby sorts the keys by default, so this call has at least O(n log n) complexity. You can turn the sorting off by passing sort=False to groupby.
In Pandas 0.15 and older versions, Series.nunique (source) calls Series.value_counts (source), which also sorts the values by default. So this is another function call with O(n log n) complexity. Since this one occurs in fractrips1brand, the total complexity of gr_TILPS.apply(fractrips1brand) is at least O(m n log n), where m is the number of groups and n is the number of rows in a group.
Update: in the next release of Pandas (version 0.16.0), Series.nunique should be significantly faster.
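As an aside, here is a sketch of one way to compute the same per-consumer counts without calling nunique inside a Python-level apply, by filtering first and using the groupby's built-in nunique (assuming the same TILPStemp and insidegood as in the question):

# Keep only the "inside good" rows, then count distinct Item_Ids per consumer.
inside = TILPStemp[TILPStemp.Item_Id.isin(insidegood)]
numinsidegoods = inside.groupby('uniqueid').Item_Id.nunique()

# Consumers with no inside-good rows disappear after filtering;
# reindex over all ids and fill with 0 to keep a row for each of them.
numinsidegoods = numinsidegoods.reindex(TILPStemp.uniqueid.unique(), fill_value=0)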