Outliers are still present in the data even after removing them with the IQR method


I have found the outliers in my data using the box plot method.

[Image: box plot before applying the IQR method]

file1.shape
# (457, 11)

I have applied the IQR method to the data.

q1, q2, q3 = file1['Salary'].quantile([0.25, 0.5, 0.75])
IQR = q3 - q1
f_data = file1[(file1['Salary'] > lower_bound) & (file1['Salary'] < upper_bound)]

And I removed a few data points.

f_data.shape
# (420, 11)

However, after reviewing the filtered data using a box plot, I still found a few outliers in my data.

[Image: box plot after applying the IQR method]

What should I do now? Do I have to apply the IQR method again to the filtered data?

The Salary data is right-skewed; its skewness is around 1.5.

Or should I reduce the skewness, for example with a log or power transformation?


There are 2 best solutions below


I think you are using the IQR method incorrectly. Compute the lower and upper bounds from the quartiles first:

q1, q3 = file1['Salary'].quantile([0.25, 0.75])
IQR = q3 - q1
lower_bound = q1 - 1.5 * IQR
upper_bound = q3 + 1.5 * IQR

then

f_data = file1[(file1['Salary'] > lower_bound) & (file1['Salary'] < upper_bound)]

should work.
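
As a self-contained sketch of the steps above, assuming a DataFrame with a numeric `Salary` column: the helper name `iqr_filter`, the parameter `k`, and the sample values are made up for illustration and are not from the question.

```python
import pandas as pd


def iqr_filter(df, column, k=1.5):
    """Keep rows whose value in `column` lies strictly inside the IQR fences."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    mask = (df[column] > lower_bound) & (df[column] < upper_bound)
    return df[mask]


# Made-up sample data: one salary is far above the rest.
sample = pd.DataFrame({"Salary": [30, 32, 35, 36, 38, 40, 41, 45, 200]})
filtered = iqr_filter(sample, "Salary")
print(sample.shape, filtered.shape)
```

Wrapping the bounds and the filter in one function makes it harder to filter with bounds that were never computed or that came from a different DataFrame.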

0
On

Double-check your IQR calculation: Make sure the lower and upper bounds are computed correctly from the quartiles (q1, q3) and the IQR. When you check the filtered data, make sure the quartiles come from the filtered data and not from the original data.

Reapply the IQR method to the filtered data: If you still see outliers, you can apply the IQR method again to the already filtered data (f_data). This may remove the remaining outliers.

q1, q3 = f_data['Salary'].quantile([0.25, 0.75])
IQR = q3 - q1
lower_bound = q1 - 1.5 * IQR
upper_bound = q3 + 1.5 * IQR

f_data = f_data[(f_data['Salary'] > lower_bound) & (f_data['Salary'] < upper_bound)]
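
A box plot computes its quartiles from whatever data it is given. After one pass the filtered data has its own, usually narrower, fences, so points near the edge can show up as new outliers. The sketch below repeats the filter until a pass removes nothing. It uses synthetic right-skewed data, not the asker's file, and the helper name `iqr_bounds` is hypothetical.

```python
import numpy as np
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Synthetic right-skewed "salaries" (an assumption, for illustration only).
rng = np.random.default_rng(0)
data = pd.DataFrame({"Salary": rng.lognormal(mean=10, sigma=0.6, size=457)})

sizes = [len(data)]  # row count after each pass
current = data
while True:
    lower, upper = iqr_bounds(current["Salary"])
    kept = current[(current["Salary"] > lower) & (current["Salary"] < upper)]
    if len(kept) == len(current):
        break  # this pass removed nothing, so no points lie outside the fences
    current = kept
    sizes.append(len(current))

print(sizes)
```

On skewed data each pass tends to trim more of the right tail, so repeated filtering can discard values that are legitimate. It is worth checking how many rows each pass removes before adopting this approach.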

Transform the skewed data: Since your Salary data is right-skewed, transforming the data using methods like logarithm or power transformations might help in reducing the skewness. This can sometimes make the distribution more symmetric and assist in outlier removal.

Example using log transformation

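import numpy as np

f_data = f_data.copy()  # work on a copy so pandas does not warn about assigning to a filtered slice
f_data['Salary'] = np.log1p(f_data['Salary'])  # log(1 + x) compresses the long right tail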
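
To check whether the transformation helps, one option is to compare the skewness before and after, and to compute the IQR fences on the log scale. This is a minimal sketch on synthetic right-skewed data, not the asker's file. The variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "salaries" (an assumption, for illustration only).
rng = np.random.default_rng(0)
salary = pd.Series(rng.lognormal(mean=10, sigma=0.6, size=457), name="Salary")

log_salary = np.log1p(salary)  # log(1 + x); invert with np.expm1

print("skew before:", salary.skew())
print("skew after: ", log_salary.skew())

# IQR fences computed on the log scale, where the distribution is more symmetric.
q1, q3 = log_salary.quantile([0.25, 0.75])
iqr = q3 - q1
inside = (log_salary > q1 - 1.5 * iqr) & (log_salary < q3 + 1.5 * iqr)

kept = salary[inside]  # rows kept, still in the original units
print(len(salary), len(kept))
```

Filtering with a mask built on the log scale while keeping the original column avoids overwriting Salary, so the values stay interpretable in their original units.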