How to get the members of clusters from R's hclust/heatmap.2


I have the following code that performs hierarchical clustering and plots the result as a heatmap.

library(gplots)
set.seed(538)
# generate data
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
# the actual data is much larger than the above

# perform hierarchical clustering and plot heatmap
test <- heatmap.2(y)

This plots a heatmap with row and column dendrograms.

What I want to do is get the cluster members at each level of the hierarchy in the plot, yielding something like:

Clust 1: g3-g2-g4
Clust 2: g2-g4
Clust 3: g4-g7
etc
Cluster last: g1-g2-g3-g4-g5-g6-g7-g8-g9-g10

Is there a way to do it?


There are 2 best solutions below


I did have the answer, after all! @zkurtz identified the problem ... the data I was using were different than the data you were using. I added a set.seed(538) statement to your code to stabilize the data.

Use the following code to create a matrix of cluster membership for the row dendrogram. Column k gives the cluster number of each row when the tree is cut into k clusters:

cutree(as.hclust(test$rowDendrogram), 1:dim(y)[1])

This will give you:

    1 2 3 4 5 6 7 8 9 10
g1  1 1 1 1 1 1 1 1 1  1
g2  1 2 2 2 2 2 2 2 2  2
g3  1 2 2 3 3 3 3 3 3  3
g4  1 2 2 2 2 2 2 2 2  4
g5  1 1 1 1 1 1 1 4 4  5
g6  1 2 3 4 4 4 4 5 5  6
g7  1 2 2 2 2 5 5 6 6  7
g8  1 2 3 4 5 6 6 7 7  8
g9  1 2 3 4 4 4 7 8 8  9
g10 1 2 3 4 5 6 6 7 9 10
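
The matrix above gives one partition per number of clusters. If what you want is the member list of every merge in the tree, as in the question, one option is to walk the merge component of the hclust object. The following is only a sketch: node_members is a hypothetical helper (not part of gplots or stats), and it assumes heatmap.2 was called with its defaults, which cluster the rows with hclust(dist(y)). To work from the plotted tree itself, as.hclust(test$rowDendrogram) could be passed instead; the merges may then be numbered in a different order.

```r
# Sketch: list the members of every internal node (merge) of a row dendrogram.
# Assumption: default heatmap.2 row clustering, i.e. hclust(dist(y)).
set.seed(538)
y <- matrix(rnorm(50), 10, 5,
            dimnames = list(paste("g", 1:10, sep = ""), paste("t", 1:5, sep = "")))
hc <- hclust(dist(y))

# node_members is a hypothetical helper, not a function from any package.
node_members <- function(hc) {
  n_merges <- nrow(hc$merge)
  members <- vector("list", n_merges)
  for (i in seq_len(n_merges)) {
    # In hc$merge, a negative entry -j is the single observation j;
    # a positive entry j refers to the cluster formed at merge step j.
    parts <- lapply(hc$merge[i, ], function(j) {
      if (j < 0) hc$labels[-j] else members[[j]]
    })
    members[[i]] <- unlist(parts)
  }
  members
}

nodes <- node_members(hc)
for (i in seq_along(nodes)) {
  cat("Clust ", i, ": ", paste(nodes[[i]], collapse = "-"), "\n", sep = "")
}
```

With 10 rows there are 9 merges, and the last one contains every row.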

This solution requires computing the cluster structure using a different package:

# Generate data (same seed as in the question, for reproducibility)
set.seed(538)
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
# The new package:
library(nnclust)
# Create the links between all pairs of points with
#   squared Euclidean distance less than threshold
links <- nncluster(y, threshold = 2, fill = 1, give.up = 1)
# Assign a cluster number to each point
clusters <- clusterMember(links, outlier = FALSE)
# Display the points that are "alone" in their own cluster:
print(rownames(y)[is.na(clusters)])
# For each cluster (with at least two points), display the included points.
# which() skips the NA entries, so the indices stay aligned with rownames(y).
for (i in sort(unique(na.omit(clusters)))) print(rownames(y)[which(clusters == i)])

You would probably want to wrap this in a function to make it more user friendly. Note that this gives the clusters at only one level of the dendrogram. To get the clusters at other levels, you would have to vary the threshold parameter.
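
As a rough sketch of what such a function might look like: group_members below is a hypothetical helper, not part of nnclust. It assumes, as the code above does, that unclustered points are marked NA in the membership vector. The membership vector in the example is made up for illustration and is not real nnclust output.

```r
# group_members is a hypothetical helper, not a function from nnclust.
# membership: vector of cluster numbers, NA for points left unclustered
# labels:     character vector of point names, same length as membership
group_members <- function(membership, labels) {
  stopifnot(length(membership) == length(labels))
  list(
    singletons = labels[is.na(membership)],
    clusters   = split(labels, membership)  # split() drops entries whose group is NA
  )
}

# Made-up membership vector, for illustration only
membership <- c(1, 1, NA, 2, 2, 2, NA, 1, 3, 3)
labels <- paste("g", 1:10, sep = "")
res <- group_members(membership, labels)
print(res$singletons)
print(res$clusters)
```

With nnclust, the call would presumably be group_members(clusterMember(links, outlier = FALSE), rownames(y)), repeated for each threshold of interest.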