Remove rows from dataset in python


I'm trying to take the rows that are classified as outliers and remove them from the original dataset, but I can't make it work. Do you know what's going wrong? When I run the following code, I get this error: "ValueError: Index data must be 1-dimensional"

#identify outliers
pred = iforest.fit_predict(x)
outlier_index = np.where(pred==-1)
outlier_values = x.iloc[outlier_index]
#remove from dataset (dataset = x)
x_new = x.drop([outlier_values])


There are 2 best solutions below

Best answer

outlier_values is a DataFrame, not a flat list of index labels. drop expects index labels, so wrapping a DataFrame in a list and passing it to drop raises the ValueError.

What you need to do is extract the index labels from the outlier_values DataFrame:

index_list = outlier_values.index.values.tolist() 

This gives a flat list of index labels, which you can then drop from x.
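Put together, the approach looks roughly like the sketch below. The DataFrame contents and the hard-coded pred array are made up for illustration; pred stands in for the output of iforest.fit_predict(x), where -1 marks an outlier. The index is deliberately non-default to show that drop works on labels, not positions.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for x, with a non-default index
x = pd.DataFrame(
    {"a": [1.0, 2.0, 100.0, 3.0, -50.0], "b": [0.5, 0.7, 9.0, 0.6, -8.0]},
    index=[10, 11, 12, 13, 14],
)

# Stand-in for iforest.fit_predict(x): -1 marks an outlier, 1 an inlier
pred = np.array([1, 1, -1, 1, -1])

# Positions of the outlier rows (np.where returns a tuple, hence the [0])
outlier_index = np.where(pred == -1)[0]

# DataFrame holding only the outlier rows
outlier_values = x.iloc[outlier_index]

# Flat list of index labels, which is what drop expects
index_list = outlier_values.index.values.tolist()

# Original dataset without the outlier rows
x_new = x.drop(index_list)
```

Here index_list holds the labels 12 and 14, and x_new keeps the rows labelled 10, 11 and 13.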

Second answer

Try this

#identify outliers
pred = iforest.fit_predict(x)

# np.where returns a tuple of ndarrays; take its first element to get the row positions
outlier_index = np.where(pred==-1)[0] 

outlier_values = x.iloc[outlier_index]

#remove from dataset (dataset = x)
x_new = x.drop(outlier_values.index)

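In your case, you can skip outlier_values and drop the outlier rows directly. drop expects index labels, so convert the positions to labels first (with the default integer index the two are identical):

#identify outliers
pred = iforest.fit_predict(x)
outlier_index = np.where(pred==-1)[0]
x_new = x.drop(x.index[outlier_index])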
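As an alternative to drop (not what either answer above uses), a boolean mask built from pred avoids the positions-versus-labels question entirely. This is a minimal sketch: the data is invented, and pred is hard-coded as a stand-in for iforest.fit_predict(x).

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for x, with string labels as the index
x = pd.DataFrame({"a": [1.0, 2.0, 100.0, 3.0]}, index=["r0", "r1", "r2", "r3"])

# Stand-in for iforest.fit_predict(x): -1 marks an outlier, 1 an inlier
pred = np.array([1, 1, -1, 1])

# Keep only the rows that were not flagged as outliers
x_new = x[pred != -1]

# The flagged rows, if you still want to inspect them
outliers = x[pred == -1]
```

Because the mask is applied by position, this works whatever the index of x looks like, as long as pred has one entry per row of x.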