I want to use k-means clustering to classify a testing data set based on a training data set; both have 3 classes (1, 2 and 3).
How would I classify the testing data set using e.g. k=10 clusters in kmeans (e.g. using Matlab)? I know that I can use k=3 and then use nearest neighbour to label each point by its nearest cluster centre, but I am not sure what I would do for values other than k=3. How would you label each of those 10 clusters?
Thanks
It is a little unclear what exactly you want to do, but here is an outline based on what I understand.
Clustering is normally applied to unlabelled data: you use it either to gain insight into the data or as a pre-processing step.
However, if you want to cluster the data and then assign a class id to a new datapoint based on its nearest cluster center, you can do the following.
First, select k by bootstrapping or another method, for example using silhouette coefficients. Once you have the cluster centers, check which center is closest to the new datapoint and assign the class id accordingly. In such cases you might also want to use the Rand index or the adjusted Rand index to assess the cluster quality.
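To make the outline concrete, here is a minimal sketch in Python (used purely for illustration; the same steps carry over to Matlab). It fills in one step the outline leaves open, and that step is an assumption rather than the only option: each cluster is given the majority class of the training points that fall into it, so any k larger than 3 still maps back onto the 3 classes. The toy data, the function names, and the deterministic farthest-first seeding are all illustrative assumptions chosen to keep the example reproducible; a real run would more typically use random or k-means++ seeding.

```python
import math
from collections import Counter

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def farthest_first(points, k):
    """Deterministic seeding (an assumption for reproducibility, not k-means++)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=100):
    """Plain Lloyd iterations; returns the k cluster centers."""
    centers = farthest_first(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Mean of each group; an empty group keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def label_clusters(points, labels, centers):
    """Majority training label per cluster (None if a cluster got no points)."""
    votes = [Counter() for _ in centers]
    for p, y in zip(points, labels):
        votes[nearest(p, centers)][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

def classify(point, centers, cluster_labels):
    """Class of the nearest cluster center that actually has a label."""
    labeled = [j for j, y in enumerate(cluster_labels) if y is not None]
    best = min(labeled, key=lambda j: math.dist(point, centers[j]))
    return cluster_labels[best]

# Toy training set: three well-separated 3x3 grids, one per class.
blob_centers = {1: (0, 0), 2: (10, 0), 3: (0, 10)}
train_x, train_y = [], []
for label, (cx, cy) in blob_centers.items():
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            train_x.append((cx + dx, cy + dy))
            train_y.append(label)

centers = kmeans(train_x, k=6)  # more clusters than classes
cluster_labels = label_clusters(train_x, train_y, centers)

test_x = [(0.3, -0.4), (9.5, 0.2), (0.1, 10.4)]
predictions = [classify(p, centers, cluster_labels) for p in test_x]
print(predictions)  # [1, 2, 3]
```

With k=6 on three classes, several clusters end up sharing a class label, which is the same situation as k=10. If a cluster contains a mix of classes, the majority vote hides the minority points, so it is worth checking how pure the clusters are against the training labels before trusting the predictions.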