Problem while trying to replace outliers in pandas


Okay, so I've been trying to clean data for a machine learning project. I'm using the z-score for outlier detection. The dataset contains different types of glass (Type 1 to 7), and I want to go through each glass type, find the outliers, and replace them with the mean sodium content for that type of glass (the "Na" column). The weird thing is that the code works for glass Types 1 and 2, but when it gets to Type 3 it raises a ValueError. Does anyone know what the problem might be?

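import numpy as np
from scipy import stats

# DataFrame is the loaded glass dataset (has "Na" and "Type" columns)
z = stats.zscore(DataFrame.Na)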
threshold = 1.99

for t in DataFrame.Type.unique():
    z = stats.zscore(DataFrame.Na[DataFrame.Type==t])
    print([DataFrame.Na[DataFrame.Type==t][(np.abs(z) > threshold)]])
    DataFrame.Na[DataFrame.Type==t] = DataFrame.Na[DataFrame.Type==t].replace([DataFrame.Na[DataFrame.Type==t][(np.abs(z) > threshold)]],np.mean(DataFrame.Na[DataFrame.Type==t]))

And the output is:

[17    14.36
21    14.77
Name: Na, dtype: float64]
[70     14.86
105    11.45
106    10.73
108    14.43
110    11.23
111    11.02
Name: Na, dtype: float64]
[149    12.16
Name: Na, dtype: float64]

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

KeyError: 0

Does anyone know what could be wrong here? If you need any additional information, I will provide it. I've been thinking about this for about 2 hours and I don't have a clue...


There are 2 best solutions below

1
On

I can't comment so I'll post my comment as an answer.

Are you trying to detect "outliers" or "outliners"? I'm not just being pedantic here, as they are different statistical concepts.

1
On

What is happening is that somewhere you are trying to set the value at row 0 in a DataFrame that does not have a row 0. Try breaking up your long lines and printing the intermediate results to the console; you'll likely find the error that way.
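For what it's worth, here is a minimal sketch of one way to do the per-type replacement with a single boolean mask and `.loc`, which sidesteps both the chained assignment behind the SettingWithCopyWarning and any positional lookup of row 0. The function name `replace_outliers_with_group_mean` and the sample values are made up for illustration; only the "Na" and "Type" column names and the 1.99 threshold come from the question. The z-score is computed with the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd


def replace_outliers_with_group_mean(df, value_col="Na", group_col="Type", threshold=1.99):
    """Return a copy of df in which per-group z-score outliers in value_col
    are replaced by the mean of their group (hypothetical helper)."""
    out = df.copy()
    grouped = out.groupby(group_col)[value_col]

    # Per-row group statistics, aligned with the original index labels.
    group_mean = grouped.transform("mean")
    group_std = grouped.transform(lambda s: s.std(ddof=0))

    # Per-group z-score; a group with zero spread gives NaN, which never exceeds the threshold.
    z = (out[value_col] - group_mean) / group_std
    mask = z.abs() > threshold

    # One label-based assignment instead of chained indexing.
    out.loc[mask, value_col] = group_mean[mask]
    return out


# Made-up sample whose index does not contain the label 0.
sample = pd.DataFrame(
    {
        "Type": [1] * 8 + [3] * 3,
        "Na": [13.0, 13.1, 12.9, 13.0, 13.0, 13.1, 12.9, 20.0, 13.4, 13.5, 13.6],
    },
    index=range(100, 111),
)
cleaned = replace_outliers_with_group_mean(sample)
print(cleaned)
```

As in the question, the group mean here still includes the outliers themselves; if that is not what you want, compute the mean from the rows where `mask` is False instead.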