I have a mixed data set (has factor and numeric variable types) and I want to do some clustering analysis. This is so that I will be able to study the entries in each cluster to tell what they have in common.
I know that for this type of data set, the distance to use is "Gower distance".
This is what I have done so far:
cluster <- daisy(mydata, metric = c("euclidean", "manhattan", "gower"),
stand = FALSE, type = list())
try <- agnes(cluster)
plot(try, hang = -1)
The above gave me a dendrogram but I have 2000 entries in my data and I am unable to identify the individual entries at the end of the dendrogram. Also, I want to be able to extract the clusters from the dendrogram.
There should be only one metric in the daisy function. The daisy function provides a distance matrix of (mixed-type) observations.
To obtain the cluster labels from the agnes result, one can use the cutree function. See the following example using the mtcars data set.
Preparing the data
The mtcars data frame has all variables on the numerical scale. However, when one looks at the description of the variables, it is apparent that some of them should not be treated as numeric when clustering the data. For example, vs, the shape of the engine, should be an (unordered) factor variable, while the number of gears should be an ordered factor.
Compute the dissimilarity matrix
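The two steps described so far could be sketched as follows. This is a minimal sketch, assuming the cluster package is installed; the copy named df is only illustrative, and converting am to a factor is my own assumption (it is a 0/1 indicator), not something stated above.

```r
library(cluster)  # provides daisy() and agnes()

# Work on a copy so the built-in mtcars data stays untouched
df <- mtcars

# Recode variables that are not truly numeric
df$vs   <- factor(df$vs)                    # engine shape: unordered factor
df$gear <- factor(df$gear, ordered = TRUE)  # number of gears: ordered factor
df$am   <- factor(df$am)                    # assumption: treat transmission as a factor too

# Pass exactly one metric; "gower" handles mixed variable types
d <- daisy(df, metric = "gower")

summary(d)
```

Passing a single string to metric is the fix for the call in the question, which supplied all three choices at once.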
Clustering the dissimilarity matrix
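A sketch of the clustering step, again assuming the cluster package; the data preparation is repeated so the block runs on its own. The value k = 4 is an arbitrary illustrative choice, not a recommendation, and the conversion through as.hclust is one way to hand the agnes result to cutree.

```r
library(cluster)

# Same preparation as before, repeated so this block is self-contained
df <- mtcars
df$vs   <- factor(df$vs)
df$gear <- factor(df$gear, ordered = TRUE)
df$am   <- factor(df$am)   # assumption: 0/1 indicator treated as a factor
d <- daisy(df, metric = "gower")

# Agglomerative hierarchical clustering on the Gower dissimilarities
fit <- agnes(d)

# Dendrogram only (which.plots = 2 skips the banner plot)
plot(fit, which.plots = 2, hang = -1)

# Cut the tree into k groups; k is a hypothetical choice
k <- 4
labels <- cutree(as.hclust(fit), k = k)

# Cluster sizes, and the members of each cluster by row name
table(labels)
split(rownames(df), labels)
```

With 2000 rows the leaf labels of the dendrogram will be unreadable regardless, so listing the members with split(), or attaching labels as a new column to the data, is usually more practical for inspecting what each cluster has in common.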
Other packages
A couple of other packages to cluster mixed-type data are CluMix and FD.