I'm trying to run a k-nearest-neighbors prediction on the Letter Recognition dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Letter+Recognition).
I split the data into training and test sets and checked the accuracy with no issues, but calling clf.predict() on a new example raises an error. Can anyone shed light on why I'm getting this error? I read up on the curse of dimensionality on the sklearn site, but I'm having trouble actually fixing my code.
My code so far is as follows:
import pandas as pd
import numpy as np
from sklearn import preprocessing, cross_validation, neighbors
df = pd.read_csv('KMeans_letter_recog.csv')
X = np.array(df.drop(['Letter'], 1))
y = np.array(df['Letter'])
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size = 0.2) #20% data used
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test) #test
print(accuracy) #this works fine
example = np.array([7,4,3,2,4,5,3,6,7,4,2,3,5,6,8,4])
example = X.reshape(len(example), -1)
prediction = clf.predict(example)
print(prediction) #error
df.head() produces:
   Letter  x-box  y-box  box_width  box_height  on_pix  x-bar_mean  \
0       T      2      8          3           5       1           8
1       I      5     12          3           7       2          10
2       D      4     11          6           8       6          10
3       N      7     11          6           6       3           5
4       G      2      1          3           1       1           8

   y-bar_mean  x2bar_mean  y2bar_mean  xybar_mean  x2y_mean  xy2_mean  \
0          13           0           6           6        10         8
1           5           5           4          13         3         9
2           6           2           6          10         3         7
3           9           4           6           4         4        10
4           6           6           6           6         5         9

   x-ege  xegvy  y-ege  yegvx
0      0      8      0      8
1      2      8      4     10
2      3      7      3      9
3      6     10      2      8
4      1      7      5     10
My error output is as follows:
Traceback (most recent call last):
File "C:\Users\jai_j\Desktop\Python Projects\K Means ML.py", line 31, in <module>
prediction = clf.predict(example)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\neighbors\classification.py", line 145, in predict
neigh_dist, neigh_ind = self.kneighbors(X)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\neighbors\base.py", line 381, in kneighbors
for s in gen_even_slices(X.shape[0], n_jobs)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 758, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 608, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 109, in apply_async
result = ImmediateResult(func)
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 326, in __init__
self.results = batch()
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\jai_j\Desktop\Python Projects\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "sklearn\neighbors\binary_tree.pxi", line 1294, in sklearn.neighbors.kd_tree.BinaryTree.query (sklearn\neighbors\kd_tree.c:11325)
ValueError: query data dimension must match training data dimension
Thank you in advance for any help. I'll keep searching for an answer in the meantime.
Your problems are that you are not reshaping example, and that you are reshaping to incorrect dimensions. You are reshaping your X array to be (16, N), where N is the number of observations in X.

As a result, when you try to predict on example, you end up using your classifier to predict on X reshaped to have N columns, instead of the 16 columns it was trained on.

It seems you want to predict on your single example, so you should reshape it instead of X. Presumably, you want example = example.reshape(1, -1) instead of example = X.reshape(len(example), -1).

Initially, you create example with shape (16,). You should reshape it to be (1, 16), by using (1, -1) as the dimensions. This will result in an array with shape (1, 16), which fits your classifier.

To be clear, try changing your code to this:
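A minimal sketch of that change follows. It rests on two assumptions: a recent scikit-learn, where train_test_split lives in sklearn.model_selection rather than the removed cross_validation module, and synthetic random data standing in for KMeans_letter_recog.csv so that the snippet runs on its own. Only the reshape line differs in substance from the code in the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the letter data (an assumption, not the real CSV):
# 200 observations, 16 integer features in the range 0..15, 4 letter labels.
rng = np.random.RandomState(0)
X = rng.randint(0, 16, size=(200, 16))
y = rng.choice(list("ABCD"), size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

example = np.array([7, 4, 3, 2, 4, 5, 3, 6, 7, 4, 2, 3, 5, 6, 8, 4])

# What the original line produced: X reshaped to (16, N), i.e. N columns.
wrong = X.reshape(len(example), -1)

# The fix: reshape the example itself into one row of 16 features.
example = example.reshape(1, -1)

prediction = clf.predict(example)
print(wrong.shape, example.shape)  # (16, 200) (1, 16)
print(prediction)
```

With the real data, keep your own pd.read_csv and column-dropping lines; the only line that has to change is the reshape.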