bootstrapping own built function pvclust does not work

Question

bootstrapping own built function pvclust does not work

552 Views Asked by jedgroev At 19 August 2025 at 05:10

I am using sequence analyses methods in order to measure similarity between different "sequences of spatial use", represented as strings of characters. Here is a theoretical example with three classes (A: City, B: Agriculture, C: Mountain) for two sequences:

                t1,t2,........tx
  Individual 1: A A A B B B C C
  Individual 2: A B B B A A C C
                0 1 1 0 1 1 0 0 = **4**

The distance measure that we use to measure the similarity among the sequences is the hamming distance (i.e. measures how often a character in a sequence needs to be substituted in order to equate the sequences, in the example above 4 characters need to be substituted in order to equate the sequences). Based on our distance matrix (giving the distance, or dissimilarity, of every possible pair of sequences) obtained after calculating the hamming distance a dendrogram has been created using the clustering method of Ward (ward.D2).

Now I would also like to include a good measure of cluster robustness in order to identify relevant clusters. For this I was trying to use pvclust which contains several methods to calculate bootstrap values, however restricted to a number of distance measures. With the unreleased version of pvclust I tried to implement the right distance measure (i.e. hamming distance) and I tried to create a bootstrapping tree. The script is working, but the outcome is not correct. Applied on my dataset using a nboot of 1000, "bp" values are close to 0 and all the other values "au", "se.au", "se.bp", "v", "c", "pchi" are 0, suggesting that the clusters are artefacts.

Here I provide an example script:

The data concerns simulated sequences that are very homogeneous (e.g. continues used of 1 specific state), so that each cluster should certainly be significant. I limited the number of boots to only 10 to limit calculation time.

####################################################################
####Create the sequences#### 
dfr = data.frame()
a = list(dfr)
b = list(dfr)
c = list(dfr)
d = list(dfr)
data = list(dfr)

for (i in c(1:10)){
set.seed(i)
a[[i]] <- sample(c(rep('A',10),rep('B', 90)))
b[[i]] <- sample(c(rep('B',10),rep('A', 90)))
c[[i]] <- sample(c(rep('C',10),rep('D', 90)))
d[[i]] <- sample(c(rep('D',10),rep('C', 90)))
}
a = as.data.frame(a, header = FALSE)
b = as.data.frame(b, header = FALSE)
c = as.data.frame(c, header = FALSE)
d = as.data.frame(d, header = FALSE)

colnames(a) <- paste(rep('seq_urban'),rep(1:10), sep ='')
colnames(b) <- paste(rep('seq_agric'),rep(1:10), sep ='')
colnames(c) <- paste(rep('seq_mount'),rep(1:10), sep ='')
colnames(d) <- paste(rep('seq_sea'),rep(1:10), sep ='')

data = rbind(t(a),t(b),t(c),t(d))
#####################################################################

####Analysis####
## install packages if necessary
#install.packages(c("TraMineR", "devtools")) 
library(TraMineR)
library(devtools)

source_url("https://www.dropbox.com/s/9znkgks1nuttlxy/pvclust.R?dl=0") # url    to my dropbox for unreleased pvclust package
source_url("https://www.dropbox.com/s/8p6n5dlzjxmd6jj/pvclust-internal.R?dl=0") # url to my dropbox for unreleased pvclust package

dev.new()
par( mfrow = c(1,2))
## Color definitions and alphabet/labels/scodes for sequence definition
palet <- c(rgb(230, 26, 26, max = 255), rgb(230, 178, 77, max = 255),     "blue", "deepskyblue2") # color palet used for the states
s.alphabet <- c("A", "B", "C", "D") # the alphabet of the sequence object
s.labels <- c("country-side", "urban", "sea", "mountains") # the labels of    the sequence object
s.scodes <- c( "A", "U", "S", "M") # the states of the sequence object

## Sequence definition
seq_ <- seqdef(data, # data  
                  1:100, # columns corresponding to the sequence data  
                  id = rownames(data), # id of the sequences
                  alphabet = s.alphabet, states = s.scodes, labels = s.labels, 
                  xtstep = 6, 
                  cpal = palet) # color palet 

##Substitution matrix used to calculate the hamming distance
Autocor <- seqsubm(seq_, method = "TRATE", with.missing = FALSE) 

# Function with the hamming distance (i.e. counts how often a character  needs to be substituted to equate two sequences to each other. Result is a  distance matrix giving the distances for each pair of sequences)
hamming <- function(x,...) {
res <- seqdist(x, method = "HAM",sm = Autocor)
res <- as.dist(res)
attr(res, "method") <- "hamming"
return(res)
}

## Perform the bootstrapping using the distance method "hamming"
result <- pvclust(seq_, method.dist = hamming, nboot = 10, method.hclust =  "ward")
result$hclust$labels <- rownames(test[,1])
plot(result)

To do this analysis I am using the unreleased version of the R package pvclust, which allows to use your own distance method (in this case: hamming). Does somebody has an idea how to solve this problem?

Original Q&A

There are 1 best solutions below

**Matthias Studer** · Answer 1

The aim of pvclust is to cluster variables (or attributes) not cases. This is why you have results that do not make sense. You can try

data(iris)
res <- pvclust(iris[, 1:4])
plot(res)

To test the stability of a clustering of cases, you can use clusterboot from package fpc. See my answer here: Measuring reliability of tree/dendrogram (Traminer)

In your example, you could use:

library(fpc)
ham <- seqdist(seq_, method="HAM",sm = Autocor)
cf2 <- clusterboot(as.dist(ham), clustermethod=disthclustCBI, k=4, cut="number", method="ward.D")

Using for instance, k=10 you'll have bad results, because your data really have 4 cluster (by construction).

bootstrapping own built function pvclust does not work

There are 1 best solutions below

Related Questions in SEQUENCE

Related Questions in STATISTICS-BOOTSTRAP

Related Questions in TRAMINER

Related Questions in PVCLUST

Trending Questions

Popular # Hahtags

Popular Questions