Determining the optimal number of clusters with the daisy function and Gower similarity

I am attempting to cluster the behavioral traits of 250 species into life-history strategies. The trait data consist of both numerical and nominal variables. I am relatively new to R and to cluster analysis, but I believe the best option for computing the distances between these species is the Gower similarity method within the daisy function. 1) Is that the best method?

Once I have these distances, I would like to find significant clusters. I have looked into pvclust and like its ability to give me the strength of the cluster. However, I have not been able to modify the code to accept the distance measurements previously made using daisy. I have unsuccessfully tried to follow the advice given here https://stats.stackexchange.com/questions/10347/making-a-heatmap-with-a-precomputed-distance-matrix-and-data-matrix-in-r/10349#10349 and using the code obtained here http://www.is.titech.ac.jp/~shimo/prog/pvclust/pvclust_unofficial_090824/pvclust.R

2) Can anyone help me to modify the existing code to accept my distance measurements?

3) Or is there another, better way to determine the number of significant clusters?

I thank all in advance for your help.

There are 2 best solutions below

You can use Zahn's algorithm to find the clusters. Basically, it builds a minimum spanning tree over the points and then removes the longest edges, so that the connected components that remain are the clusters.
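
As a rough sketch of this idea in R: cutting a single-linkage dendrogram into k groups gives the same partition as deleting the k - 1 longest edges of the minimum spanning tree, so `hclust` can stand in for an explicit spanning-tree routine. The toy `traits` table and the choice `k = 2` are assumptions made up for illustration. Zahn's original method removes "inconsistent" edges rather than simply the longest ones, so this covers only the simplified version described above.

```r
library(cluster)

# Hypothetical toy trait table (not the asker's data)
traits <- data.frame(
  body_mass   = c(0.02, 0.03, 4.5, 5.1, 0.9, 1.1),
  clutch_size = c(6, 5, 1, 2, 12, 10),
  class       = factor(c("bird", "bird", "mammal", "mammal", "reptile", "reptile"))
)

# Gower dissimilarities for mixed numeric/nominal data
d <- daisy(traits, metric = "gower")

# Single linkage merges groups along minimum-spanning-tree edges
tree <- hclust(d, method = "single")

# Cutting into k groups corresponds to dropping the k - 1 longest MST edges
groups <- cutree(tree, k = 2)
groups
```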


Some comments...

About 1)

It is a good way to deal with different types of data.

You could also create as many new columns in the dataset as there are possible nominal values and put 1/0 where needed. For example, if there are 3 nominal values such as "reptile", "mammal" and "bird", you could replace your initial dataset of 2 columns (numeric, nominal) with a new one of 4 columns (numeric, numeric (representing reptile), numeric (representing mammal), numeric (representing bird)). An instance (23.4, "mammal") would then be mapped to (23.4, 0, 1, 0).

Using this mapping you could work with "normal" distances (be sure to standardize the data so that no column dominates the others due to its big/small values).
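
A minimal sketch of that mapping in base R, assuming a made-up two-column table (`body_mass` and `class` are hypothetical names). `model.matrix` with `- 1` in the formula produces one 0/1 column per factor level, and `scale` standardizes every column before ordinary Euclidean distances are computed.

```r
# Hypothetical two-column dataset: one numeric and one nominal variable
traits <- data.frame(
  body_mass = c(23.4, 0.8, 1.2),
  class     = factor(c("mammal", "reptile", "bird"),
                     levels = c("reptile", "mammal", "bird"))
)

# One 0/1 indicator column per level (the "- 1" drops the intercept)
dummies <- model.matrix(~ class - 1, data = traits)

# Numeric column plus the three indicator columns
encoded <- cbind(body_mass = traits$body_mass, dummies)
encoded[1, ]   # first row: 23.4, 0, 1, 0

# Standardize so that no column dominates, then use Euclidean distance
scaled <- scale(encoded)
d <- dist(scaled)
```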

About 2)

daisy returns an object of class dissimilarity, which you can use directly in other clustering algorithms from the cluster package (so you may not have to implement anything more). For example, the function pam accepts the object returned by daisy directly.
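
For illustration, a small sketch of passing a `daisy` result straight to `pam`. The toy `traits` table and `k = 3` are assumptions and not taken from the question. The nominal variable is stored as a factor so that `daisy` treats it as nominal.

```r
library(cluster)

# Hypothetical toy trait table (not the asker's data)
traits <- data.frame(
  body_mass   = c(0.02, 0.03, 4.5, 5.1, 0.9, 1.1),
  clutch_size = c(6, 5, 1, 2, 12, 10),
  class       = factor(c("bird", "bird", "mammal", "mammal", "reptile", "reptile"))
)

# Gower dissimilarities; the result has class "dissimilarity"
d <- daisy(traits, metric = "gower")

# pam recognises the dissimilarity object, so no raw data matrix is needed
fit <- pam(d, k = 3)

fit$clustering        # cluster label for each species
traits[fit$id.med, ]  # the medoids are actual rows of the data
```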

About 3)

Clusters are really subjective, and most clustering algorithms depend on their initial conditions, so "significant clusters" is a term that some people would not be comfortable using. pam could be useful in your case because its clusters are centered on medoids, which is good for nominal data because the centers remain interpretable. K-means, for example, has the disadvantage that its centroids are not interpretable (what does 1/2 reptile, 1/2 mammal mean?). pam builds clusters centered on actual instances, which is nice for interpretation purposes.
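
If a data-driven hint for the number of clusters is still wanted, one common heuristic (my own suggestion, not a significance test) is to compare the average silhouette width that pam reports for several values of k. The sketch below assumes a made-up toy table and an arbitrary candidate range of 2 to 5.

```r
library(cluster)

# Hypothetical toy trait table (not the asker's data)
traits <- data.frame(
  body_mass   = c(0.02, 0.03, 4.5, 5.1, 0.9, 1.1),
  clutch_size = c(6, 5, 1, 2, 12, 10),
  class       = factor(c("bird", "bird", "mammal", "mammal", "reptile", "reptile"))
)
d <- daisy(traits, metric = "gower")

# Average silhouette width for each candidate k (k must be below n)
candidates <- 2:5
avg_sil <- sapply(candidates, function(k) pam(d, k = k)$silinfo$avg.width)
names(avg_sil) <- candidates

# The k with the largest average silhouette width is one reasonable choice
best_k <- candidates[which.max(avg_sil)]
avg_sil
best_k
```

A higher average silhouette width only indicates better-separated groups under the chosen dissimilarity, so it is worth checking that the resulting medoids make biological sense.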

About pam:

http://en.wikipedia.org/wiki/K-medoids

http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/pam.html