I have the following data:
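import numpy as np
import pandas as pd
col = ['black', 'black', 'black', 'grey', 'white', 'grey', 'grey', 'nan', 'grey', 'black', 'black', 'red', 'nan', 'nan', 'nan', 'nan', 'black', 'black', 'white']
dd = pd.DataFrame({'color': col})
dd.replace('nan', np.nan, inplace=True)  # np.nan, since the np.NaN alias was removed in NumPy 2.0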
dd.sample(5)
Out[1]:
color
8 grey
14 NaN
7 NaN
2 black
9 black
These are the proportions of each color in the column:
dd.color.value_counts(normalize=True)
Out[2]:
black 0.500000
grey 0.285714
white 0.142857
red 0.071429
Now I want to fill the null values according to the proportions above, so that about 50% of the null values become black, 29% grey, 14% white and 7% red.
You can randomly assign values according to those probabilities using numpy.random.choice (https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html).
STEP #1. Calculate the probability of each value.
STEP #2. Assign values to the NaN entries by sampling with those probabilities, so that the filled values follow the original distribution.
This approach works regardless of the number of values.
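The two steps above can be sketched roughly as follows. This is a minimal example rather than a definitive implementation: it rebuilds the `dd` frame from the question, and the `probs`, `mask` and `rng` names as well as the seed value are assumptions added for illustration. It uses `numpy.random.default_rng(...).choice`, the newer Generator counterpart of `numpy.random.choice`, which accepts the same `size` and `p` arguments.

```python
import numpy as np
import pandas as pd

col = ['black', 'black', 'black', 'grey', 'white', 'grey', 'grey', 'nan', 'grey',
       'black', 'black', 'red', 'nan', 'nan', 'nan', 'nan', 'black', 'black', 'white']
dd = pd.DataFrame({'color': col})
dd['color'] = dd['color'].replace('nan', np.nan)

# STEP 1: proportion of each non-null value (value_counts skips NaN by default)
probs = dd['color'].value_counts(normalize=True)

# STEP 2: draw one replacement per NaN, weighted by those proportions
rng = np.random.default_rng(0)  # fixed seed is an assumption, only for reproducibility
mask = dd['color'].isna()
dd.loc[mask, 'color'] = rng.choice(
    probs.index.to_numpy(),   # candidate values: black, grey, white, red
    size=mask.sum(),          # one draw per missing entry
    p=probs.to_numpy(),       # probabilities from STEP 1
)
```

Note that the draws are random, so the filled values match the original proportions only in expectation. With just five missing entries, an exact 50/29/14/7 split is not possible anyway.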