Replace NaN values with user-defined values in categorical variables

Consider a categorical variable in a pandas DataFrame, where some of the entries are NaN, e.g.

User  Name
1     Joe
2     NaN
3     Joe
4     Mark
5     NaN
6     Joe

I would like to fill the NaN values using a user-defined function that preserves the frequency of the names, i.e., one that samples from ['Joe', 'Mark'] with weights [0.75, 0.25] respectively. Drawing the values is easy with random.choices, but how can I insert them into the DataFrame with pandas?

Best answer, by ALollz

Use value_counts with normalize=True to get the weights, then set the null values with loc.

import numpy as np

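p = df.Name.value_counts(normalize=True)  # Series of probabilities (NaN is excluded by default)
m = df.Name.isnull()                      # Boolean mask of the rows to fill

np.random.seed(42)  # Fixed seed so the example is reproducible
# Draw one name per missing entry, weighted by its observed frequency
rand_fill = np.random.choice(p.index, size=m.sum(), p=p)
#array(['Joe', 'Mark'], dtype=object)

df.loc[m, 'Name'] = rand_fill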

#   User  Name
#0     1   Joe
#1     2   Joe
#2     3   Joe
#3     4  Mark
#4     5  Mark
#5     6   Joe
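
For reuse, the same idea can be wrapped in a small helper. This is only a sketch, not part of the original answer: the function name fill_na_by_frequency and its seed parameter are made up for illustration, and it swaps in NumPy's Generator API (np.random.default_rng) for the global np.random.seed call, so the values drawn for a given seed may differ from the output shown above.

```python
import numpy as np
import pandas as pd


def fill_na_by_frequency(series, seed=None):
    """Hypothetical helper: fill missing entries by sampling the observed
    values in proportion to how often each one occurs."""
    probs = series.value_counts(normalize=True)  # NaN is excluded by default
    mask = series.isna()
    filled = series.copy()
    # Nothing to fill, or nothing to sample from: return the copy unchanged
    if not mask.any() or probs.empty:
        return filled
    rng = np.random.default_rng(seed)
    # Draw one replacement per missing entry, weighted by observed frequency
    filled.loc[mask] = rng.choice(
        probs.index.to_numpy(), size=int(mask.sum()), p=probs.to_numpy()
    )
    return filled


# Rebuild the example frame from the question
df = pd.DataFrame({
    "User": [1, 2, 3, 4, 5, 6],
    "Name": ["Joe", np.nan, "Joe", "Mark", np.nan, "Joe"],
})
df["Name"] = fill_na_by_frequency(df["Name"], seed=42)
```

Returning a filled copy rather than modifying the column in place leaves the original Series untouched until the result is assigned back, and the early return avoids calling choice on an empty set of candidates when the column is entirely NaN.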