Clustering a mixed data set in R

I have a mixed data set (has factor and numeric variable types) and I want to do some clustering analysis. This is so that I will be able to study the entries in each cluster to tell what they have in common.

I know that for this type of data set, the distance to use is "Gower distance".

This is what I have done so far:

cluster <- daisy(mydata, metric = c("euclidean", "manhattan", "gower"), 
               stand = FALSE, type = list())
try <- agnes(cluster)
plot(try, hang = -1)

The above gave me a dendrogram but I have 2000 entries in my data and I am unable to identify the individual entries at the end of the dendrogram. Also, I want to be able to extract the clusters from the dendrogram.

There is 1 best solution below

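The daisy function takes a single metric, so pass metric = "gower" on its own; the vector c("euclidean", "manhattan", "gower") shown in the help page only lists the available choices. The function returns a dissimilarity matrix for the (mixed-type) observations.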

To obtain the cluster labels from an agnes result, use the cutree function. The following example uses the mtcars data set:

Preparing the data

The mtcars data frame stores every variable as numeric. However, the variable descriptions make it clear that some of them should not be treated as numeric when clustering. For example, vs (the shape of the engine) should be an unordered factor, while the number of gears should be an ordered factor.

# taken directly from the examples in ?mtcars
mtcars2 <- within(mtcars, {
  vs <- factor(vs, labels = c("V", "S"))
  am <- factor(am, labels = c("automatic", "manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})

Compute the dissimilarity matrix

# daisy(), agnes() and pam() all come from the cluster package.
library(cluster)

# Compute all the pairwise dissimilarities (distances) between observations
# in the data set.
diss_mat <- daisy(mtcars2, metric = "gower")
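As a quick sanity check before clustering, the dissimilarity object can be inspected. The sketch below assumes the cluster package is installed and rebuilds mtcars2 so that it runs on its own; the name diss_full is only illustrative. With Gower's metric, every value should fall between 0 and 1.

```r
library(cluster)

# Rebuild the mixed-type data frame (same recoding as in ?mtcars).
mtcars2 <- within(mtcars, {
  vs <- factor(vs, labels = c("V", "S"))
  am <- factor(am, labels = c("automatic", "manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})

diss_mat <- daisy(mtcars2, metric = "gower")

# Convert to a full square matrix so individual pairs can be looked up
# by row name.
diss_full <- as.matrix(diss_mat)
dim(diss_full)                            # one row and one column per car
range(as.vector(diss_mat))                # Gower values lie in [0, 1]
diss_full["Mazda RX4", "Mazda RX4 Wag"]   # dissimilarity for a single pair
```

Looking up a few pairs that are expected to be similar (or very different) is a cheap way to confirm that the factor recoding had the intended effect.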

Clustering the dissimilarity matrix

# Agglomerative hierarchical clustering on the dissimilarity matrix,
# then cut the tree into k clusters to get one label per observation.
k <- 3
agnes_clust <- agnes(x = diss_mat)
ag_clust <- cutree(agnes_clust, k)


# Clustering the dissimilarity matrix using 
# partitioning around medoids 
pam_clust <- pam(diss_mat, k)

# A comparison of the two clusterings
table(ag_clust, pam_clust = pam_clust$clustering)
#          pam_clust
# ag_clust  1  2  3
#        1  6  0  0
#        2  2 10  2
#        3  0  0 12
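Since the goal in the question is to study what the entries in each cluster have in common, the labels can be attached back to the data. The following is a minimal sketch, assuming the cluster package and the same mtcars2 recoding as above; the names hc, cluster_id, members and cluster_means are only illustrative. Converting with as.hclust() first is optional, but it lets base helpers such as rect.hclust() work on the tree. With thousands of rows, labels = FALSE keeps the dendrogram readable.

```r
library(cluster)

# Rebuild the mixed-type data frame (same recoding as in ?mtcars).
mtcars2 <- within(mtcars, {
  vs <- factor(vs, labels = c("V", "S"))
  am <- factor(am, labels = c("automatic", "manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})

diss_mat <- daisy(mtcars2, metric = "gower")

# Convert the agnes result to an hclust object and cut it into k clusters.
k <- 3
hc <- as.hclust(agnes(diss_mat))
cluster_id <- cutree(hc, k = k)

# List the cars that fall into each cluster.
members <- split(rownames(mtcars2), cluster_id)

# Per-cluster means of the numeric columns, to see what members share.
num_cols <- sapply(mtcars2, is.numeric)
cluster_means <- aggregate(mtcars2[, num_cols],
                           by = list(cluster = cluster_id), FUN = mean)

# Dendrogram without leaf labels, with the k clusters boxed.
plot(hc, labels = FALSE, hang = -1)
rect.hclust(hc, k = k)
```

For the factor columns, something like table(cluster_id, mtcars2$am) gives the corresponding per-cluster breakdown.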

Other packages

A couple of other packages to cluster mixed-type data are CluMix and FD.