Not filling NaN values in df when using mode of df

Question

Not filling NaN values in df when using mode of df

36 Views Asked by zazoupile At 14 February 2024 at 13:50

I have this dataframe for which I want to fill in the nan value:

quant = ['satisfaction_level', 'last_evaluation', 'average_monthly_hours']
cat = ['number_projects', 'time_spend_company', 'work_accident', 'promotion_last_5_years', 'position', 'salary']
X_train[quant] = X_train[quant].fillna(X_train[quant].median())
X_train[cat] = X_train[cat].fillna(X_train[cat].mode())

X_test[quant] = X_test[quant].fillna(X_test[quant].median())
X_test[cat] = X_test[cat].fillna(X_test[cat].mode())

Even when I use .loc instead, it does not fill the NaN and if there are not two modes as the mode()[0] returns an error.

However this does not fill in the categorical nan values though the documentation states that the value parameter can be a dataframe as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

Other methods such as doing it by column works.

Any idea ?

Original Q&A

There are 1 best solutions below

**DataJanitor** · Accepted Answer · 2024-02-14T14:46:21.107000

The reason is that .mode() returns a DataFrame, so you need to use .mode().iloc[0]:

for column in cat:
    X_train[column].fillna(X_train[column].mode().iloc[0], inplace=True)
    X_test[column].fillna(X_test[column].mode().iloc[0], inplace=True)

This .mode().iloc[0] method will not give an error if there are not multiple modes. The mode() function always returns a DataFrame, even if there is only one mode. The .iloc[0] simply selects the first mode from this DataFrame. So, if there is only one mode, .mode().iloc[0] will return that mode. If there are multiple modes, .mode().iloc[0] will return the first one. When you use .mode()[0] on a pandas.DataFrame, it can lead to issues if there are not multiple modes. This is because .mode() returns a DataFrame, and when you use [0], it tries to access the first column of this DataFrame, not the first row.

I tested this snipped with the following code:

import pandas as pd
import numpy as np

df_train = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.11, 0.72, 0.37, np.nan],
    'last_evaluation': [0.53, 0.86, 0.88, 0.87, 0.52, 0.50],
    'average_monthly_hours': [157, np.nan, 272, 223, 159, 153],
    'number_projects': [2, 5, 7, 5, 2, np.nan],
    'time_spend_company': [3, 6, 4, 5, 3, 3],
    'work_accident': [0, 0, 0, 0, 0, 1],
    'promotion_last_5_years': [0, 0, 0, 0, 0, 0],
    'position': ['sales', 'sales', 'sales', 'sales', 'sales', np.nan],
    'salary': ['low', 'medium', 'medium', 'low', 'low', 'low']
})

df_test = pd.DataFrame({
    'satisfaction_level': [0.41, 0.10, 0.92, 0.89, 0.42, 0.45],
    'last_evaluation': [0.50, 0.77, 0.85, 0.91, 0.53, np.nan],
    'average_monthly_hours': [160, 255, 262, 234, 158, 145],
    'number_projects': [2, 6, 5, 5, 2, 2],
    'time_spend_company': [3, 4, 5, 5, 3, 2],
    'work_accident': [0, 0, 0, 0, 1, 0],
    'promotion_last_5_years': [0, 0, 0, 0, 0, 0],
    'position': ['sales', 'sales', 'sales', 'sales', 'sales', 'sales'],
    'salary': ['low', 'medium', 'medium', 'low', 'low', np.nan]
})

quant = ['satisfaction_level', 'last_evaluation', 'average_monthly_hours']
cat = ['number_projects', 'time_spend_company', 'work_accident', 'promotion_last_5_years', 'position', 'salary']

# Fill NaN values in quantitative columns with median
df_train[quant] = df_train[quant].fillna(df_train[quant].median())
df_test[quant] = df_test[quant].fillna(df_test[quant].median())

# Fill NaN values in categorical columns with mode
for column in cat:
    df_train[column].fillna(df_train[column].mode().iloc[0], inplace=True)
    df_test[column].fillna(df_test[column].mode().iloc[0], inplace=True)

print(df_train)
print(df_test)

	satisfaction_level	last_evaluation	average_monthly_hours	number_projects	time_spend_company	work_accident	position	salary
0	0.38	0.53	157	2	3	0	sales	low
1	0.8	0.86	159	5	6	0	sales	medium
2	0.11	0.88	272	7	4	0	sales	medium
3	0.72	0.87	223	5	5	0	sales	low
4	0.37	0.52	159	2	3	0	sales	low
5	0.38	0.5	153	2	3	1	sales	low

	satisfaction_level	last_evaluation	average_monthly_hours	number_projects	time_spend_company	work_accident	position	salary
0	0.41	0.5	160	2	3	0	sales	low
1	0.1	0.77	255	6	4	0	sales	medium
2	0.92	0.85	262	5	5	0	sales	medium
3	0.89	0.91	234	5	5	0	sales	low
4	0.42	0.53	158	2	3	1	sales	low
5	0.45	0.77	145	2	2	0	sales	low

Not filling NaN values in df when using mode of df

There are 1 best solutions below

Related Questions in PANDAS

Related Questions in DATAFRAME

Related Questions in NAN

Related Questions in FILLNA

Trending Questions

Popular # Hahtags

Popular Questions