How to choose multiple initial centers for k-means clustering in R


I want to run the buckshot algorithm in R, which combines HAC (hierarchical agglomerative clustering) with k-means clustering. To do that, I want to supply several seeds as the initial center of each k-means cluster. For example, one cluster has three seeds. This is my code:

iris data k-means

iristr <- read.csv("iristr.CSV", header = TRUE)
str(iristr)
iristr.m <- as.matrix(iristr[,1:4])
km <- kmeans(iristr.m, centers = 3)
km
table(km$cluster,iristr$Species)

iris data buckshot

irists <- read.csv("irists.csv", header = TRUE)
str(irists)
irists.m <- as.matrix(irists[,1:4])
dm <- dist(irists.m, method = "euclidean")
hc <- hclust(dm, method = "complete")
plot(hc)
clusterCut <- cutree(hc,3)
clusterCut
i1 <- iristr.m[c(1,4,12),] # seeds (centers) for the first cluster
i1
i2 <- iristr.m[c(2,5,8),] # seeds (centers) for the second cluster
i2
i3 <- iristr.m[c(3,6,7,9,10,11),] # seeds (centers) for the third cluster
i3
buckshot <- kmeans(iristr.m, centers=i1,i2,i3) # problem: only i1 is used as centers
buckshot
table(buckshot$cluster,iristr$Species)

There are 2 best solutions below


Here is an example of applying the k-means clustering algorithm to the iris data.

The feature columns 1 to 4 of the iris data are assigned to the variable x, and the class labels to the variable y.

x = iris[,-5]
y = iris$Species

In the k-means algorithm, the initial cluster centers are chosen at random. Since we know that there are 3 species in the data, the number of clusters can be set to 3. Because the starting centers are random, nstart can be set to 10, meaning that 10 different random sets of initial centers are tried and the run with the lowest total within-cluster sum of squares (WCSS, the sum of squared Euclidean distances from each point to its cluster center) is kept as the final result. You can give nstart a higher value to make kmeans try more random sets of initial centers.

kc <- kmeans(x, centers = 3, nstart = 10)
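As a rough illustration of what nstart changes, the sketch below runs kmeans with one start and with ten starts and prints both total WCSS values, then recomputes the WCSS by hand. The seed value and the variable names single, multi, and wcss are my own choices for the example, and the two printed values may well be equal on a given run.

```r
x <- iris[, -5]
set.seed(42)  # arbitrary seed, only so the example is reproducible

# One random start versus the best of ten random starts
single <- kmeans(x, centers = 3, nstart = 1)
multi  <- kmeans(x, centers = 3, nstart = 10)
print(c(single = single$tot.withinss, multi = multi$tot.withinss))

# Recompute the total WCSS by hand: squared distance from each point
# to the center of the cluster it was assigned to
wcss <- sum((as.matrix(x) - multi$centers[multi$cluster, ])^2)
print(all.equal(wcss, multi$tot.withinss))
```

The hand computation is only there to show what tot.withinss measures; in practice you would read it straight from the kmeans result.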

To gauge the error, compare the clustering result with the species labels in the iris data.

table(y,kc$cluster)
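If you want a single number instead of reading the table, one possible summary is cluster purity: for each cluster, count the points of its most common species and divide the total by the number of points. This is my own addition rather than part of kmeans, and the seed and the names tab and purity are assumptions for the sketch.

```r
x <- iris[, -5]
y <- iris$Species
set.seed(42)  # arbitrary seed, only so the example is reproducible
kc <- kmeans(x, centers = 3, nstart = 10)

# Rows are species, columns are cluster labels
tab <- table(y, kc$cluster)

# For each cluster (column), take the count of its most common species,
# then divide the sum of those counts by the number of points
purity <- sum(apply(tab, 2, max)) / sum(tab)
print(purity)
```

Cluster labels are arbitrary, which is why the sketch takes the maximum per column rather than comparing label 1 with the first species.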

Finally, the result is visualized by plotting sepal length on the x-axis and sepal width on the y-axis (you can choose other features).

plot(x[c("Sepal.Length", "Sepal.Width")], col = kc$cluster)
points(kc$centers[,c("Sepal.Length", "Sepal.Width")], col = 1:3, pch=23, cex=3)



This uses kmeans with the animation package. I am not sure it will help you, but it offers a solution for anyone else who searches this topic while using that package.

#data setup
M = matrix(c(2, 2, 4, 1, 8, 3, 7, 2, 5, 9, 4, 2, 3, 1, 2, 6), ncol = 2, byrow = TRUE)
rownames(M) = c('A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8')
colnames(M) = c('x', 'y')
#Matrix of initial centers: A2 = (4, 1), A5 = (5, 9), and (8, 7); note that A8 in M is (2, 6), not (8, 7)
A = matrix(c(4, 1, 5, 9, 8, 7), ncol = 2, byrow = TRUE)
colnames(A) = c('x', 'y')


library(animation)
oopt = ani.options(interval = 5)
#pass A for centers argument
kmeans.ani(M, centers = A)
help(kmeans.ani)

First Plot of kmeans.ani showing centers
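For the buckshot setup in the question, the same idea should work with base kmeans: centers accepts a single matrix with one row per cluster, so several seeds for one cluster have to be collapsed into one row, for example by averaging them. Below is a minimal sketch under my own assumptions: it uses the built-in iris data instead of the CSV files, a sample of about sqrt(k * n) rows, an arbitrary seed, and the names x_sample, groups, and seeds, none of which come from the question.

```r
set.seed(1)  # arbitrary seed, only so the sample is reproducible
x <- as.matrix(iris[, 1:4])
k <- 3

# Step 1: draw a random sample of about sqrt(k * n) rows (assumed size)
n_sample <- ceiling(sqrt(k * nrow(x)))
idx <- sample(nrow(x), n_sample)
x_sample <- x[idx, , drop = FALSE]

# Step 2: hierarchical clustering on the sample, cut into k groups
hc <- hclust(dist(x_sample, method = "euclidean"), method = "complete")
groups <- cutree(hc, k = k)

# Step 3: average the seeds of each group, giving one center per cluster
# (rowsum adds up the rows per group; dividing by the group sizes gives means)
seeds <- rowsum(x_sample, groups) / as.vector(table(groups))

# Step 4: run k-means on the full data, starting from those k centers
buckshot <- kmeans(x, centers = seeds)
print(table(buckshot$cluster, iris$Species))
```

With hand-picked seeds like i1, i2, and i3 from the question, the equivalent would be centers = rbind(colMeans(i1), colMeans(i2), colMeans(i3)).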