I am using sequence analyses methods in order to measure similarity between different "sequences of spatial use", represented as strings of characters. Here is a theoretical example with three classes (A: City, B: Agriculture, C: Mountain) for two sequences:
t1,t2,........tx Individual 1: A A A B B B C C Individual 2: A B B B A A C C 0 1 1 0 1 1 0 0 = **4**
The distance measure that we use to measure the similarity among the sequences is the hamming distance (i.e. measures how often a character in a sequence needs to be substituted in order to equate the sequences, in the example above 4 characters need to be substituted in order to equate the sequences). Based on our distance matrix (giving the distance, or dissimilarity, of every possible pair of sequences) obtained after calculating the hamming distance a dendrogram has been created using the clustering method of Ward (ward.D2).
Now I would also like to include a good measure of cluster robustness in order to identify relevant clusters. For this I was trying to use pvclust which contains several methods to calculate bootstrap values, however restricted to a number of distance measures. With the unreleased version of pvclust I tried to implement the right distance measure (i.e. hamming distance) and I tried to create a bootstrapping tree. The script is working, but the outcome is not correct. Applied on my dataset using a nboot of 1000, "bp" values are close to 0 and all the other values "au", "se.au", "se.bp", "v", "c", "pchi" are 0, suggesting that the clusters are artefacts.
Here I provide an example script:
The data concerns simulated sequences that are very homogeneous (e.g. continues used of 1 specific state), so that each cluster should certainly be significant. I limited the number of boots to only 10 to limit calculation time.
####################################################################
####Create the sequences####
dfr = data.frame()
a = list(dfr)
b = list(dfr)
c = list(dfr)
d = list(dfr)
data = list(dfr)
for (i in c(1:10)){
set.seed(i)
a[[i]] <- sample(c(rep('A',10),rep('B', 90)))
b[[i]] <- sample(c(rep('B',10),rep('A', 90)))
c[[i]] <- sample(c(rep('C',10),rep('D', 90)))
d[[i]] <- sample(c(rep('D',10),rep('C', 90)))
}
a = as.data.frame(a, header = FALSE)
b = as.data.frame(b, header = FALSE)
c = as.data.frame(c, header = FALSE)
d = as.data.frame(d, header = FALSE)
colnames(a) <- paste(rep('seq_urban'),rep(1:10), sep ='')
colnames(b) <- paste(rep('seq_agric'),rep(1:10), sep ='')
colnames(c) <- paste(rep('seq_mount'),rep(1:10), sep ='')
colnames(d) <- paste(rep('seq_sea'),rep(1:10), sep ='')
data = rbind(t(a),t(b),t(c),t(d))
#####################################################################
####Analysis####
## install packages if necessary
#install.packages(c("TraMineR", "devtools"))
library(TraMineR)
library(devtools)
source_url("https://www.dropbox.com/s/9znkgks1nuttlxy/pvclust.R?dl=0") # url to my dropbox for unreleased pvclust package
source_url("https://www.dropbox.com/s/8p6n5dlzjxmd6jj/pvclust-internal.R?dl=0") # url to my dropbox for unreleased pvclust package
dev.new()
par( mfrow = c(1,2))
## Color definitions and alphabet/labels/scodes for sequence definition
palet <- c(rgb(230, 26, 26, max = 255), rgb(230, 178, 77, max = 255), "blue", "deepskyblue2") # color palet used for the states
s.alphabet <- c("A", "B", "C", "D") # the alphabet of the sequence object
s.labels <- c("country-side", "urban", "sea", "mountains") # the labels of the sequence object
s.scodes <- c( "A", "U", "S", "M") # the states of the sequence object
## Sequence definition
seq_ <- seqdef(data, # data
1:100, # columns corresponding to the sequence data
id = rownames(data), # id of the sequences
alphabet = s.alphabet, states = s.scodes, labels = s.labels,
xtstep = 6,
cpal = palet) # color palet
##Substitution matrix used to calculate the hamming distance
Autocor <- seqsubm(seq_, method = "TRATE", with.missing = FALSE)
# Function with the hamming distance (i.e. counts how often a character needs to be substituted to equate two sequences to each other. Result is a distance matrix giving the distances for each pair of sequences)
hamming <- function(x,...) {
res <- seqdist(x, method = "HAM",sm = Autocor)
res <- as.dist(res)
attr(res, "method") <- "hamming"
return(res)
}
## Perform the bootstrapping using the distance method "hamming"
result <- pvclust(seq_, method.dist = hamming, nboot = 10, method.hclust = "ward")
result$hclust$labels <- rownames(test[,1])
plot(result)
To do this analysis I am using the unreleased version of the R package pvclust, which allows to use your own distance method (in this case: hamming). Does somebody has an idea how to solve this problem?
The aim of
pvclust
is to cluster variables (or attributes) not cases. This is why you have results that do not make sense. You can tryTo test the stability of a clustering of cases, you can use
clusterboot
from packagefpc
. See my answer here: Measuring reliability of tree/dendrogram (Traminer)In your example, you could use:
Using for instance,
k=10
you'll have bad results, because your data really have 4 cluster (by construction).