Filling null values based on the proportion of the categories in that column

Question

Filling null values based on the proportion of the categories in that column

184 Views Asked by saad1s At 28 July 2025 at 02:10

I have the following data

col=['black','black','black','grey','white','grey','grey','nan','grey','black','black','red','nan','nan','nan','nan','black','black','white']
dd=pd.DataFrame({'color':col})
dd.replace('nan',np.NaN,inplace=True)
dd.sample(5)
Out[1]: 
    color
8    grey
14    NaN
7     NaN
2   black
9   black

The following is the proportion of each color in the column

dd.color.value_counts(normalize=True)
Out[2]: 
black    0.500000
grey     0.285714
white    0.142857
red      0.071429

Now I want to fill the null values based on these proportions above. So 50% of null values will become black, 28% grey,14% white and 7.1% red

Original Q&A

There are 2 best solutions below

Panda Kim On 27 December 2022 at 02:22

cond1 = dd['color'].isnull()
dd.loc[cond1, 'color'] = dd[~cond1].sample(cond1.sum())['color'].values

as data size increases, probability converges. run following code for checking result

import numpy as np
col=['black','black','black','grey','white','grey','grey','grey','black','black','red',np.nan,np.nan,np.nan,np.nan,'black','black','white']
dd=pd.DataFrame(col * 10000, columns=['color']) # *10000 for large data
cond1 = dd['color'].isnull()
dd.loc[cond1, 'color'] = dd[~cond1].sample(cond1.sum())['color'].values
dd.value_counts(normalize=True)

result:

color
black    0.500444
grey     0.284111
white    0.143944
red      0.071500
dtype: float64

**Park** · Accepted Answer

You can randomly assign the value based on the probability using numpy.random.choice (https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html)

STEP #1. Calculate the probabilities for each values.

STEP #2. Assign the values to NaN based on the probability so that the values are assigned as the original distribution.

This way will work regardless of the number of values.

import pandas as pd
import numpy as np

col = ['black','black','black','grey','white','grey','grey','nan','grey','black','black','red','nan','nan','nan','nan','black','black','white']
dd = pd.DataFrame({'color': col})
dd.replace('nan', np.NaN, inplace=True)
print(dd.color.value_counts(normalize=True))
"""
black    0.500000
grey     0.285714
white    0.142857
red      0.071429
Name: color, dtype: float64
"""

probability = list(dd.color.value_counts(normalize=True).values)
dd['color'] = dd['color'].apply(lambda x: np.random.choice(['black', 'grey', 'white', 'red'], 1, replace=False, p=probability)[0] if pd.isna(x) else x)

print(dd.color.value_counts(normalize=True))
"""
black    0.526316
grey     0.315789
white    0.105263
red      0.052632
Name: color, dtype: float64
"""

Filling null values based on the proportion of the categories in that column

There are 2 best solutions below

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in FILLNA

Related Questions in PROPORTIONS

Trending Questions

Popular # Hahtags

Popular Questions