I am just getting into machine learning and am working with classification models. Currently I am using a mushroom classification dataset (the class is poisonous or edible). The issue is that, while I am following the most basic procedure I see everyone else use, my model only ever returns a perfect classification. This is the code I am using to create my model:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_preds = model.predict(X_test)
score = accuracy_score(y_test, y_preds)
This returns an accuracy score of 1.0 and a confusion matrix showing no confusion at all (100% of values predicted correctly), and nothing changes if I vary k or the test size; even setting the test size to 50% came back the same.
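For example, a sweep like the one below (a rough sketch of what I tried; X and y are prepared as shown further down) prints 1.0 for every combination:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Sweep over test sizes and k values; every run reports the same perfect score
for test_size in (0.2, 0.3, 0.5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=True)
    for k in (1, 5, 15, 51):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print(test_size, k, accuracy_score(y_test, model.predict(X_test)))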
I have cleaned the data to the best of my ability and the data is entirely one-hot encoded. I suspect the encoding may be playing a part in this, but I am not sure. Below is the code I used to prep the data: first I filled the missing values, then encoded the ordinal columns. Any input is appreciated!
import pandas as pd

qmarks = df.loc[df['Stalk Root'].str.contains(r'\?')]  # missing values are '?' in this column
mode = df['Stalk Root'].mode()[0]  # the most common value is 'b'
df_enc = df.replace('?', mode)  # replace all question marks with the most common value
df_enc['Ring Number'] = df_enc['Ring Number'].replace({'n': 0, 'o': 1, 't': 2}).astype(int)
df_enc['Gill Spacing'] = df_enc['Gill Spacing'].replace({'c': 0, 'w': 1, 'd': 2}).astype(int)
df_enc['Poisonous'] = (df_enc['Poisonous'] == 'p').astype(int)  # binary-encode the label
df_enc = pd.get_dummies(df_enc)  # one-hot encode the remaining categorical columns
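A quick way to confirm the label survived the encoding as a single numeric column (the 'Poisonous_e' / 'Poisonous_p' names below are just what pandas would generate if the string label were still present, not columns from my data):

# The label should remain one numeric column; if it were still a string,
# get_dummies would have split it into e.g. 'Poisonous_e' / 'Poisonous_p'
print(df_enc['Poisonous'].dtype)                                 # int64
print([c for c in df_enc.columns if c.startswith('Poisonous')])  # ['Poisonous'] only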
And then I split the data as shown:
from sklearn.model_selection import train_test_split

y = df_enc.iloc[:, 0:1]   # the label column
X = df_enc.iloc[:, 2:-1]  # the feature columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
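And this is the kind of check that convinced me the two splits do not overlap:

# The train and test indices should be disjoint and sum to the full dataset
print(X_train.index.intersection(X_test.index).empty)  # True: no shared rows
print(len(X_train), len(X_test), len(X))               # 70/30 split of the full data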
I have tried changing a lot of the variables, but this isn't the first dataset that has done this to me: it happened with a linear regression model on a different dataset a while back, and I couldn't figure out what was wrong there either. I imagine the data encoding, the train/test split, or user error is at fault, but I don't know how to go about fixing it. I am sure, however, that the data is split accurately (see the check above), that the train and test splits do not share rows, and that the dataset has a relatively even distribution of each class. Please help!
Edit: Added data prep code
Assuming there are no errors in the code (i.e. that y_test is not somehow being compared against itself, and that the label has not leaked into the features), I would analyse the dataset to see whether the result actually makes sense.
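For instance, a quick check that the target (or a one-hot copy of it) has not slipped into X (a sketch, assuming the X and y from the question):

import numpy as np

# Rule out target leakage: no feature column should be a copy (or inverse copy) of the label
target = y.values.ravel()
leaky = [c for c in X.columns
         if np.array_equal(X[c].astype(int).values, target)
         or np.array_equal(X[c].astype(int).values, 1 - target)]
print(leaky)  # anything listed here would explain a perfect score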
Remember that the k-nearest-neighbors algorithm picks the most common class among the k observations closest to the query point. So, if the classes are already well separated in feature space, it is entirely possible that the k closest observations are always from the correct class. The UCI mushroom dataset in particular is known to be almost perfectly separable (the odor feature predicts the class nearly on its own), so a perfect score there is not necessarily a bug.
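One way to convince yourself the perfect score is a property of the data rather than of one lucky split is to cross-validate (a sketch, again using the X and y from the question):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 10-fold cross-validation: if every fold scores ~1.0, the separation is real,
# not an artifact of one particular train/test split
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y.values.ravel(), cv=10)
print(scores.mean(), scores.std())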
Imagine an XY plane with two clusters very far apart from each other; in that case k-nearest neighbors will almost always return 100% accuracy.
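You can reproduce this with synthetic data; here is a self-contained sketch using scikit-learn's make_blobs:

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Two well-separated clusters: a perfect KNN score here is correct, not a bug
X_demo, y_demo = make_blobs(n_samples=1000, centers=[(-10, -10), (10, 10)],
                            cluster_std=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print(accuracy_score(y_te, knn.predict(X_te)))  # prints 1.0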