How often does each pair appear together over a large number of cluster solutions?


To evaluate the stability of a classification/clustering solution, I am running 1,000 bootstraps of the algorithm on my data. Over these classification outcomes I would like to count how often each pair of observations occurs in the SAME cluster. I am clustering about 250 observations, which gives about 31k such pairs.
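The pair count can be checked with a one-line sketch, assuming exactly 250 observations and unordered pairs (`n_obs` and `n_pairs` are names introduced here for illustration):

```r
# Number of unordered pairs among 250 observations: 250 * 249 / 2
n_obs <- 250
n_pairs <- choose(n_obs, 2)
n_pairs  # 31125
```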

This R code generates a synthetic data set:

set.seed(1)
ID <- paste0("ID", 1:250)
# Three example cluster solutions, each assigning labels 1 to 5 at random
cluster1 <- sample(1:5, 250, replace=TRUE)
cluster2 <- sample(1:5, 250, replace=TRUE)
cluster3 <- sample(1:5, 250, replace=TRUE)

df <- data.frame(ID, cluster1, cluster2, cluster3)

You will see that ID3 and ID4 appear in the same cluster twice.
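The count for a single pair can be checked directly on this small data set. This is a minimal sketch, assuming the columns are named `cluster1` to `cluster3` as above; `cluster_cols` and `co_count` are names introduced here for illustration, and the value obtained depends on the random number generator of the R version in use.

```r
# Rebuild the synthetic data so this block runs on its own.
set.seed(1)
ID <- paste0("ID", 1:250)
cluster1 <- sample(1:5, 250, replace = TRUE)
cluster2 <- sample(1:5, 250, replace = TRUE)
cluster3 <- sample(1:5, 250, replace = TRUE)
df <- data.frame(ID, cluster1, cluster2, cluster3)

# One logical matrix per solution (TRUE where a pair shares a label),
# summed over the three solutions.
cluster_cols <- df[, c("cluster1", "cluster2", "cluster3")]
co_count <- Reduce(`+`, lapply(cluster_cols, function(cl) outer(cl, cl, "==")))
dimnames(co_count) <- list(as.character(df$ID), as.character(df$ID))

# Number of solutions in which ID3 and ID4 share a cluster.
co_count["ID3", "ID4"]
```

The diagonal of `co_count` always equals the number of solutions, since every observation shares a cluster with itself.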

As with all classifications, the integer used to denote cluster membership is arbitrary.
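Because the labels are arbitrary, any comparison has to rest on co-membership rather than on the label values themselves. A small sketch of that point, using two made-up labelings (`a` and `b` are illustrative and not part of the data above):

```r
# Two labelings of the same five observations; only the integers differ.
a <- c(1, 1, 2, 3, 3)
b <- c(2, 2, 3, 1, 1)

# Co-membership matrices: TRUE where two observations share a label.
co_a <- outer(a, a, "==")
co_b <- outer(b, b, "==")

identical(co_a, co_b)  # TRUE
```

Counting pairs through such matrices therefore gives the same result however each run happens to number its clusters.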

Best answer

Since my problem isn't too large, I used simple nested loops, the kind of code I would just as easily write in C.

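set.seed(1)

# pairs.matrix[i, j] (for i < j) counts the runs in which observations i and j share a cluster
pairs.matrix <- matrix(0, 250, 250)
for (s in 1:1000) {
  # Stand-in for the result of one bootstrap clustering run
  cluster <- sample(1:5, 250, replace=TRUE)
  for (i in 1:(length(cluster)-1))
    for (j in (i+1):length(cluster))
      if (cluster[i] == cluster[j]) pairs.matrix[i,j] <- pairs.matrix[i,j] + 1
}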
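The two inner loops can also be replaced by a single `outer()` call per run. This is a sketch of an alternative, not the code used above; it assumes the labels of all runs are stored as columns of a matrix, and `labels`, `n_obs`, `n_boot` and `pairs.prop` are names introduced here for illustration.

```r
set.seed(1)
n_obs <- 250
n_boot <- 1000

# Stand-in for real bootstrap output: one column of labels per run.
labels <- replicate(n_boot, sample(1:5, n_obs, replace = TRUE))

pairs.matrix <- matrix(0, n_obs, n_obs)
for (s in seq_len(n_boot)) {
  # outer() compares every pair at once; adding the logical matrix
  # increments the count wherever the two labels match.
  pairs.matrix <- pairs.matrix + outer(labels[, s], labels[, s], "==")
}

# Zero the lower triangle and diagonal so only pairs with i < j are kept.
pairs.matrix[lower.tri(pairs.matrix, diag = TRUE)] <- 0

# Proportion of runs in which each pair lands in the same cluster.
pairs.prop <- pairs.matrix / n_boot
```

Dividing by the number of runs turns the counts into proportions, which are easier to compare across experiments with different numbers of bootstraps.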