How to Remove outlier from DataFrame using IQR?

41.9k Views Asked by At

I Have Dataframe with a lot of columns (Around 100 feature), I want to apply the interquartile method and wanted to remove the outlier from the data frame.

I am using this link stackOverflow

But the problem is nan of the above method is working correctly,

As I am trying like this

Q1 = stepframe.quantile(0.25)
Q3 = stepframe.quantile(0.75)
IQR = Q3 - Q1
((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()

it is giving me this

((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()
Out[35]: 
Day                      0
Col1                     0
Col2                     0
col3                     0
Col4                     0
Step_Count            1179
dtype: int64

I just wanted to know that, What I will do next so that all the outlier from the data frame will be removed.

if i am using this

def remove_outlier(df_in, col_name):
q1 = df_in[col_name].quantile(0.25)
q3 = df_in[col_name].quantile(0.75)
iqr = q3-q1 #Interquartile range
fence_low  = q1-1.5*iqr
fence_high = q3+1.5*iqr
df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
return df_out

re_dat = remove_outlier(stepframe, stepframe.columns)

I am getting this error

ValueError: Cannot index with multidimensional key

in this line

    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
4

There are 4 best solutions below

0
On BEST ANSWER

You can use:

np.random.seed(33454)
stepframe = pd.DataFrame({'a': np.random.randint(1, 200, 20), 
                          'b': np.random.randint(1, 200, 20),
                          'c': np.random.randint(1, 200, 20)})

stepframe[stepframe > 150] *= 10
print (stepframe)

Q1 = stepframe.quantile(0.25)
Q3 = stepframe.quantile(0.75)
IQR = Q3 - Q1

df = stepframe[~((stepframe < (Q1 - 1.5 * IQR)) |(stepframe > (Q3 + 1.5 * IQR))).any(axis=1)]

print (df)
      a    b     c
1   109   50   124
3   137   60  1990
4    19  138   100
5    86   83   143
6    55   23    58
7    78  145    18
8   132   39    65
9    37  146  1970
13   67  148  1880
15  124  102    21
16   93   61    56
17   84   21    25
19   34   52   126

Details:

First create boolean DataFrame with chain by |:

print (((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))))
        a      b      c
0   False   True  False
1   False  False  False
2    True  False  False
3   False  False  False
4   False  False  False
5   False  False  False
6   False  False  False
7   False  False  False
8   False  False  False
9   False  False  False
10   True  False  False
11  False   True  False
12  False   True  False
13  False  False  False
14  False   True  False
15  False  False  False
16  False  False  False
17  False  False  False
18  False   True  False
19  False  False  False

And then use DataFrame.any for check at least one True per row and last invert boolean mask by ~:

print (~((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).any(axis=1))
0     False
1      True
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10    False
11    False
12    False
13     True
14    False
15     True
16     True
17     True
18    False
19     True
dtype: bool

invert solution with changed conditions - < to >= and > to <=, chain by & for AND and last filter by all for check all Trues per rows

print (((stepframe >= (Q1 - 1.5 * IQR)) & (stepframe <= (Q3 + 1.5 * IQR))).all(axis=1))
0     False
1      True
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10    False
11    False
12    False
13     True
14    False
15     True
16     True
17     True
18    False
19     True
dtype: bool


df = stepframe[((stepframe >= (Q1 - 1.5 * IQR))& (stepframe <= (Q3 + 1.5 * IQR))).all(axis=1)]
0
On

This is a rewrite of jezrael's accepted solution in a slightly simplified form and as a function that accepts both DataFrames and Series and an argument for determining the threshold. The simplification will give a slight performance improvement on big data.

Specifically,

  • 1.5 * (Q3 - Q1) is computed only once
  • my solution computes the within_mask right away, instead of first creating a without_mask and then negating it
  • it replaces the comparison operators with the corresponding methods (.ge() and .le() for the within-mask case), doing away with the parentheses
import pandas as pd
import numpy as np

def remove_outliers(
        df_or_s: pd.DataFrame | pd.Series,
        iqr_coefficient: float = 1.5
) -> pd.DataFrame | pd.Series:
    Q1 = df_or_s.quantile(0.25)
    Q3 = df_or_s.quantile(0.75)
    distance = iqr_coefficient * (Q3 - Q1)
    lower_fence, upper_fence = Q1 - distance, Q3 + distance
    within_mask = df_or_s.ge(lower_fence) & df_or_s.le(upper_fence)
    if isinstance(df_or_s, pd.DataFrame):
        return df_or_s[within_mask.all(axis=1)]
    else:
        # alternatively, for a Series we can use .between() since pandas 1.3.0
        # within_mask = df_or_s.between(lower_fence, upper_fence, inclusive="both")
        return df_or_s[within_mask]


def test_frame(iqr_coefficient=1.5):
    np.random.seed(33454)
    stepframe = pd.DataFrame(
        {'a': np.random.randint(1, 200, 20),
         'b': np.random.randint(1, 200, 20),
         'c': np.random.randint(1, 200, 20)}
        )
    stepframe[stepframe > 150] *= 10
    print(stepframe)
    return remove_outliers(stepframe, iqr_coefficient)


def test_series(iqr_coefficient=1.5):
    np.random.seed(33454)
    stepseries = pd.Series(np.random.randint(1, 200, 20))
    stepseries[stepseries > 150] *= 10
    print(stepseries)
    return remove_outliers(stepseries, iqr_coefficient)

test_frame() yields the same result as the accepted answer, test_series() only the first column as a Series.

1
On

Before using this Function make sure that your DATAFRAME is in the "df" variable because i build this function in this way. If you have different name of DATAFRAME variable then replace "df" with your DATAFRAME Variable name and that's it.

Why i build lengthy function?? -> Because i want every information like Outliers, Index of Outliers so i can see how it's functioning with Lower and Higher fence.

def outliers(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    Lower_fence = Q1 - (1.5*IQR)
    print(f"Lower fence is = {Lower_fence}")

    Higher_fence = Q3 + (1.5*IQR)
    print(f"Higher fence is = {Higher_fence}")
      
    #here i'm taking all Outliers and appending this in Variable "Outlier".
    Outlier =[]
    for i in data:
        if i < Lower_fence:
            Outlier.append(i)
            data.drop(data==i)
        elif i > Higher_fence:
            Outlier.append(i)


    #With the help of "index" function here we are getting all the indexes of Lower_fence and Higher_fence

    Index_Outlier = df[data < Lower_fence ].index  
    Index_Outlier = df[data > Higher_fence].index


    #Here we are converting all the "Outliers" and "Index_Outliers" into string just to see all the data in One line
    #If you do print(Outliers) or print(Outliers_index) you will get every element of data in New Line.

    print(f"\nOutliers = {', '.join([str(x) for x in Outlier])}")
    print(f"\nOutliers_INDEX = {', '.join([str(x) for x in Index_Outlier ])}")

    #here we are seeing before and after shape.

    print(f'\nBEFORE dropping Outlier we have rows = {df.shape[0]}, and columns = {df.shape[1]}')

    df.drop(Index_Outlier,inplace=True)

    print(f'AFTER dropping Outlier we have rows = {df.shape[0]}, and columns = {df.shape[1]}')
0
On

You forgot to write the name of your column in the quotation (['col_name']).

The correct one is:

df_out = df_in.loc[(df_in['col_name'] > fence_low) & (df_in['col_name'] < fence_high)]