Apply a function with multiple arguments to a large Pandas dataframe efficiently


My dataframe (1,957,046 rows x 4 columns) lists baby names by year, gender, and count, as follows:

Year Name Gender Count
1880 A F 1
1880 B M 5
1880 C F 2
... ... ... ...
2018 X M 7
2018 Y F 4
2018 Z M 2

I am trying to create a new column called "popularity," defined as the number of babies with a given name and gender per million babies born in that year. E.g., if Mary (F) is recorded 50,000 times in 1950 and 2,000,000 babies were born that year, then its popularity score is 25,000 (per million).
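As a quick check of the definition, the arithmetic for the Mary example can be written out as below. This is only a sketch; the helper name per_million is made up for illustration and is not part of the original code.

```python
def per_million(name_count: int, year_total: int) -> int:
    """Babies with a given name and gender per million births in that year.

    Hypothetical helper used only to illustrate the definition.
    """
    return round(name_count / year_total * 1e6)


# The example from the text: 50,000 Marys out of 2,000,000 births in 1950.
mary_1950 = per_million(50_000, 2_000_000)
```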

This is what I've tried.

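def popularity(Year, Name, Gender):
    # Babies with this exact name and gender in the given year
    count = df[(df['Year'] == Year) & (df['Name'] == Name) & (df['Gender'] == Gender)].Count.sum()
    # All babies recorded in the given year
    total = df[df['Year'] == Year].Count.sum()
    return round((count / total) * 1e6)

# Called once per row, so the whole dataframe is filtered again for every row
df['popularity'] = df.apply(lambda x: popularity(x.Year, x.Name, x.Gender), axis=1)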

I also tried 'df.groupby(...)', but I could not get it to work because the function takes multiple arguments. Even if I truncate the data to df[0:100], the apply call takes 15+ seconds to run, which extrapolates to roughly 84 hours of runtime for the entire df.

Any suggestions for economizing would be appreciated.
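One direction that might help, sketched here under assumptions: because the score depends only on per-group sums, it can probably be computed with groupby(...).transform('sum') instead of a row-wise apply, so each group is summed once rather than refiltering the whole dataframe per row. The small dataframe below is hypothetical sample data with the same four columns as in the question, and the intermediate names year_total and name_count are illustrative.

```python
import pandas as pd

# Hypothetical sample with the same columns as the dataframe in the question.
df = pd.DataFrame({
    "Year":   [1880, 1880, 1880, 2018, 2018, 2018],
    "Name":   ["A", "B", "C", "X", "Y", "Z"],
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "Count":  [1, 5, 2, 7, 4, 2],
})

# Total babies recorded per year, broadcast back onto every row of that year.
year_total = df.groupby("Year")["Count"].transform("sum")

# Count per (Year, Name, Gender) combination, also broadcast onto each row.
# If every combination appears only once, this is the same as df["Count"].
name_count = df.groupby(["Year", "Name", "Gender"])["Count"].transform("sum")

# Babies per million for each row, rounded to a whole number.
df["popularity"] = (name_count / year_total * 1e6).round().astype("int64")
```

Both transform calls return a Series aligned with the original index, so the division is element-wise and no Python-level function runs per row. Whether this is fast enough on the full 1,957,046 rows has not been measured here.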
