parallel k-means in R

I am trying to understand how to parallelize some of my code in R. In the following example I want to use k-means to cluster data using 2, 3, 4, 5 and 6 centers, while using 20 iterations. Here is the code:



parallel.function <- function(i) {
    kmeans( X[1:100,100], centers=?? , nstart=i )
}

out <- mclapply( c(5, 5, 5, 5), FUN=parallel.function )

How can we parallelize over the iterations and the centers at the same time? And how do I track the outputs, assuming I want to keep every k-means result across all iterations and centers, just to learn how it is done?


This looked very simple to me at first ... and then I tried it. After a lot of monkey typing and face palming during my lunch break, however, I arrived at this:



mc = mclapply(2:6, function(x,centers)kmeans(x, centers), x=X)

It looks right though I didn't check how sensible the clustering was.

> summary(mc)
     Length Class  Mode
[1,] 9      kmeans list
[2,] 9      kmeans list
[3,] 9      kmeans list
[4,] 9      kmeans list
[5,] 9      kmeans list

On reflection the command syntax seems sensible, although a lot of other stuff that failed seemed reasonable too... The examples in the help documentation are maybe not that great.
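To keep track of which fit belongs to which number of centers, one option is to name the result list by k. This is only a minimal sketch: X here is made-up random data standing in for your matrix, and n_cores is a hypothetical setting (mclapply can only fork on Unix-alikes, so it falls back to 1 on Windows).

```r
library(parallel)

# Assumption: stand-in data; replace with your own numeric matrix.
set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2)

# Hypothetical core count; forking is not available on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

ks <- 2:6
fits <- mclapply(ks, function(k) kmeans(X, centers = k), mc.cores = n_cores)
names(fits) <- paste0("k", ks)

# Every kmeans object is kept, so any component can be pulled out later.
wss <- sapply(fits, function(f) f$tot.withinss)
fits$k3$size   # cluster sizes for the 3-center fit
```

Naming the list means you never have to remember that element 2 was the 3-center run.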

Hope it helps.

EDIT: As requested, here is the same thing over two variables, nstart and centers:

(pars = expand.grid(i=1:3, cent=2:4))

  i cent
1 1    2
2 2    2
3 3    2
4 1    3
5 2    3
6 3    3
7 1    4
8 2    4
9 3    4

# zikes horrible
# appending an empty list turns each row of pars into a list with elements i and cent
pars2 = apply(pars, 1, append, list())
mc = mclapply(pars2, function(x, pars) kmeans(x, centers=pars$cent, nstart=pars$i), x=X)

> summary(mc)
      Length Class  Mode
 [1,] 9      kmeans list
 [2,] 9      kmeans list
 [3,] 9      kmeans list
 [4,] 9      kmeans list
 [5,] 9      kmeans list
 [6,] 9      kmeans list
 [7,] 9      kmeans list
 [8,] 9      kmeans list
 [9,] 9      kmeans list

How'd you like them apples?
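A somewhat less horrible way to walk the same grid is mcmapply, which feeds one value of each parameter to every call. Treat this as a sketch under assumptions rather than the definitive way: X is again made-up stand-in data and n_cores is a hypothetical setting.

```r
library(parallel)

# Assumption: stand-in data; replace with your own numeric matrix.
set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2)

# Hypothetical core count; forking is not available on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

pars <- expand.grid(i = 1:3, cent = 2:4)

# One kmeans call per row of pars; SIMPLIFY = FALSE keeps a plain list.
fits <- mcmapply(
  function(cent, i) kmeans(X, centers = cent, nstart = i),
  cent = pars$cent, i = pars$i,
  SIMPLIFY = FALSE, mc.cores = n_cores
)

# Store a summary next to the parameters that produced it.
pars$tot.withinss <- sapply(fits, function(f) f$tot.withinss)
```

Since fits is in the same order as the rows of pars, fits[[j]] is always the result for pars[j, ].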


You may use parallel to try K-Means from different random starting points on multiple cores.

The code below is an example. (K = the K in K-means, N = number of random starting points, C = number of cores you would like to use)

suppressMessages( library("Matrix") )
suppressMessages( library("irlba") )
suppressMessages( library("stats") )
suppressMessages( library("cluster") )
suppressMessages( library("fpc") )
suppressMessages( library("parallel") )

#Calculate KMeans results
calcKMeans <- function(matrix, K, N, C){
  #Run in parallel from several random starting points (using C cores);
  #each core gets N %/% C starts, so N should be at least C
  results <- mclapply(rep(N %/% C, C), FUN=function(nstart) kmeans(matrix, K, iter.max=15, nstart=nstart), mc.cores=C)
  #Find the solution with the smallest total within-cluster sum of squares
  tmp <- sapply(results, function(r){r[['tot.withinss']]})
  km <- results[[which.min(tmp)]]
  #return cluster, centers, totss, withinss, tot.withinss, betweenss, size
  km
}

runKMeans <- function(fin_uf, K, N, C, 
                      #fout_center, fout_label, fout_size, 
                      fin_record=NULL, fout_prediction=NULL){
  uf = read.table(fin_uf)
  km = calcKMeans(uf, K, N, C)
  #write.table(km$cluster, file=fout_label, row.names=FALSE, col.names=FALSE)
  #write.table(km$centers, file=fout_center, row.names=FALSE, col.names=FALSE)
  #write.table(km$size, file=fout_size, row.names=FALSE, col.names=FALSE)
  km
}

Hope it helps!
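If you are on Windows (where mclapply cannot fork), or N is not a multiple of C, the same idea can be sketched with a socket cluster. This is not the code above, just a variant under assumptions: split_starts, fit_one and calcKMeansSock are hypothetical helper names, and the seed 123 is arbitrary.

```r
library(parallel)

# Hypothetical helper: split N starts over at most C workers, remainder included.
split_starts <- function(N, C) {
  starts <- rep(N %/% C, C)
  extra <- N %% C
  if (extra > 0) starts[seq_len(extra)] <- starts[seq_len(extra)] + 1
  starts[starts > 0]   # drop workers that would get no starts
}

# Hypothetical worker function; everything it needs is passed as an argument.
fit_one <- function(nstart, data, K) {
  kmeans(data, K, iter.max = 15, nstart = nstart)
}

calcKMeansSock <- function(data, K, N, C) {
  starts <- split_starts(N, C)
  cl <- makeCluster(length(starts))
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, 123)   # independent, reproducible RNG stream per worker
  results <- parLapply(cl, starts, fit_one, data = data, K = K)
  # Keep the fit with the smallest total within-cluster sum of squares.
  tot <- sapply(results, function(r) r$tot.withinss)
  results[[which.min(tot)]]
}

km <- calcKMeansSock(as.matrix(iris[, 1:4]), K = 3, N = 10, C = 2)
km$size
```

Starting a socket cluster has more overhead than forking, so this only pays off when each worker has a fair amount of work to do.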


There's a CRAN package called knor, derived from a research paper, that improves performance by using a memory-efficient variant of Elkan's pruning algorithm. It's an order of magnitude faster than everything in these answers.

library(knor)

iris.mat <- as.matrix(iris[,1:4])
k <- length(unique(iris[, dim(iris)[2]])) # Number of unique classes
nthread <- 4
kms <- Kmeans(iris.mat, k, nthread=nthread)