I have a DataFrame with a lot of columns (around 100 features). I want to apply the interquartile range (IQR) method to remove the outliers from the DataFrame.
I am using this Stack Overflow link.
But the problem is that none of the methods there work correctly for me.
I am trying this:
Q1 = stepframe.quantile(0.25)
Q3 = stepframe.quantile(0.75)
IQR = Q3 - Q1
((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()
It gives me this:
((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()
Out[35]:
Day 0
Col1 0
Col2 0
col3 0
Col4 0
Step_Count 1179
dtype: int64
I just want to know what I should do next so that all the outliers are removed from the DataFrame.
If I use this:
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
re_dat = remove_outlier(stepframe, stepframe.columns)
I get this error:
ValueError: Cannot index with multidimensional key
on this line:
df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
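You can filter the rows with a boolean mask built from the same Q1, Q3 and IQR values.
Details:
First create a boolean DataFrame by chaining the two outlier conditions with | (OR). Then use DataFrame.any with axis=1 to check for at least one True per row, and finally invert the boolean mask with ~ so that only rows without any outlier are kept.
Alternatively, invert the conditions: change < to >= and > to <=, chain them with & (AND), and filter with DataFrame.all with axis=1 to check that all values in a row are True.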
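The two variants described above can be sketched roughly as follows. The sample data here is hypothetical and only stands in for stepframe (the real frame is not shown in the question), and the names outlier_mask, inlier_mask, cleaned and cleaned_alt are illustrative.

```python
import pandas as pd

# Hypothetical sample data standing in for stepframe; 500 is an obvious outlier.
stepframe = pd.DataFrame({
    "Col1": [1, 2, 3, 4, 5, 6],
    "Step_Count": [10, 12, 11, 13, 12, 500],
})

# Per-column quartiles and interquartile range (each is a Series indexed by column name).
Q1 = stepframe.quantile(0.25)
Q3 = stepframe.quantile(0.75)
IQR = Q3 - Q1

# Variant 1: mark outlier cells, then drop every row that contains at least one.
outlier_mask = (stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))
cleaned = stepframe[~outlier_mask.any(axis=1)]

# Variant 2: inverted conditions; keep rows where every cell lies within the fences.
inlier_mask = (stepframe >= (Q1 - 1.5 * IQR)) & (stepframe <= (Q3 + 1.5 * IQR))
cleaned_alt = stepframe[inlier_mask.all(axis=1)]

print(cleaned)
```

Both variants give the same result when the data has no missing values. With NaN cells they differ: comparisons against NaN are False, so the first variant keeps such rows and the second drops them.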