The KNN model I am using always comes back with 100% accuracy, but it shouldn't


I am just getting into machine learning and am working with classification models. Currently I am using a mushroom classification dataset (the class is poisonous or edible). The issue is that, while I am following the most basic procedure I have seen everyone else use, my model only ever returns a perfect classification. This is the code I am using to create my model.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # fit on the training split, then score on the held-out test split
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)
    score = accuracy_score(y_test, y_preds)

This returns an accuracy score of 1.0 and a confusion matrix showing no confusion at all (100% correctly predicted values), and the result does not change if I change k or the test size. Even setting the test size to 50% came back the same.
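A quick way to check whether this is an artifact of one particular split is cross-validation. This is a minimal sketch, reusing the X and y defined further down:

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # score the same model across 5 different folds of the data
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
    print(scores)  # if every fold is near 1.0, the data itself is the cause, not the split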

I have cleaned the data to the best of my ability and the data is entirely one-hot encoded. I think this may be a factor, but I am not sure. Below is the code I used to prep the data: first I filled the missing values, then encoded the ordinal columns. Any input is appreciated!


    qmarks = df.loc[df['Stalk Root'].str.contains(r'\?')]  # missing values are '?' here
    mode = df['Stalk Root'].mode()  # most common value is 'b'
    df_enc = df.replace('?', 'b')  # replace all question marks with the most common value

    # encode the ordinal columns as integers
    df_enc['Ring Number'] = df_enc['Ring Number'].replace({'n': 0, 'o': 1, 't': 2}).astype(int)
    df_enc['Gill Spacing'] = df_enc['Gill Spacing'].replace({'c': 0, 'w': 1, 'd': 2}).astype(int)

    # binarise the target, then one-hot encode the remaining categorical columns
    df_enc['Poisonous'] = (df_enc['Poisonous'] == 'p').astype(int)
    df_enc = pd.get_dummies(df_enc)

And then I split the data as shown:


    from sklearn.model_selection import train_test_split

    y = df_enc.iloc[:, 0:1]
    X = df_enc.iloc[:, 2:-1]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

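Since I suspect the encoding, a check along these lines would rule out the target leaking into the features (a sketch; the 'Poisonous' name filter is just illustrative):

    # sanity check: no dummy column derived from the label should appear in X
    leaked = [c for c in X.columns if 'Poisonous' in str(c)]
    print(leaked)  # expect an empty list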

I have tried changing a lot of the variables, but this isn't the first dataset that has done this to me. It happened with a linear regression model on a different dataset a while back, and I couldn't figure out what was wrong there either. I imagine the data encoding, the train/test split, or user error is at fault, but I don't know how to go about fixing it. I am sure, however, that the data is split correctly, that the train and test splits are not the same, and that the dataset has a relatively even distribution of each class. Please help!
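For what it's worth, a check like this can confirm whether identical feature rows appear in both splits (a sketch, assuming the X_train/X_test from above):

    # count test rows whose exact feature values also occur in the training set
    train_rows = set(map(tuple, X_train.to_numpy()))
    shared = sum(tuple(row) in train_rows for row in X_test.to_numpy())
    print(shared, 'of', len(X_test), 'test rows also occur in the training set')

Even a correct, non-overlapping split can share identical feature rows if the dataset contains duplicates, and an identical neighbour will always vote for the right class.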

Edit: Added data prep code


1 Answer

brschultze

Assuming there are no errors in the code such that y_test simply equals y_preds, I would analyse the dataset to try to understand whether the result makes sense.

Remember that the k-nearest-neighbours algorithm picks the most common class among the k observations closest to the point. So, if the classes are already well separated, it is entirely possible that the k closest observations are always of the correct class.

Imagine an XY plane with two clusters really far apart from each other; in that case, k-nearest neighbours will almost always return 100% accuracy.
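Here is a minimal sketch of that situation (the centres are arbitrary, chosen only to keep the two clusters far apart):

    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # two clusters on an XY plane, far apart relative to their spread
    X, y = make_blobs(n_samples=500, centers=[(-10, -10), (10, 10)],
                      cluster_std=1.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    print(accuracy_score(y_te, model.predict(X_te)))  # prints 1.0

The mushroom dataset is a well-known example of this: a handful of features (odor in particular) separate the two classes almost perfectly, so 100% test accuracy is the expected result rather than a bug.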