I am quite new to sequence analysis and am trying to identify clusters in an aggregated sequence matrix, focusing on the state duration. However, when using method='CHI2' or 'EUCLID' combined with step=1 (but not otherwise), I get the error:
Error in if (SCres > currentSCres) { : missing value where TRUE/FALSE needed
Any ideas why? There are some NaN values in the distance matrix; could they result from the sequences being of different lengths?
Here is what the sequence object and the distance matrix look like:
Sequence
1 a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
2 a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
3 a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c
4 a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e
5 b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
Distance matrix
          1   2         3         4
2       NaN
3 289.92897 NaN
4 141.07472 NaN 263.22855
5  10.22425 NaN 290.10919 141.44473
Code:
library(TraMineR) #version 2.0-13
library(WeightedCluster) #version 1.4
SO = seqdef(DAT,right='DEL')
DM = seqdist(SO, method = "CHI2", step=1, full.matrix = F)
FIT = seqpropclust(SO, diss=DM, maxcluster=8,
                   properties=c("state", "duration", "spell.age", "spell.dur",
                                "transition", "pattern", "AFtransition", "AFpattern", "Complexity"))
The "CHI2" distance between two sequences x and y computed by TraMineR is the sum of the Chi-squared distances between the state distributions over the successive periods of length step. See Studer and Ritschard (2014, p 8).
This means that for step=1 a Chi-squared distance is computed at each position. When one of the sequences has void values at some positions (e.g. the last position in your second sequence), the distance cannot be computed for these positions and we get a NaN value for the CHI2 distance between this sequence and any other sequence.
To avoid that, you can use one of the following workarounds:
1) Set a step value large enough to be sure that each sequence contains at least one non-void element in each period. For your example, the longest sequences are of length 25. To be sure the last period contains non-void elements, you have to set step=5.
2) Drop the columns with void elements.
3) Fill the shorter sequences with missings and consider the missing value as an additional possible state. By default right='DEL' in seqdef, which creates voids. Setting right=NA in seqdef gives missing values instead.
Now, the error reported in the question is NOT an error of seqdist, but of the seqpropclust function from the WeightedCluster library. The error is obviously caused by the NaN in the dissimilarity matrix.
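The three workarounds could look roughly like the sketch below. The data frame DAT is a made-up stand-in that mimics the five sequences shown in the question (your real DAT will differ), and the names SOdrop, SOm, DM1, DM2 and DM3 are just illustrative. The last lines check for NaN before anything is passed on to the clustering step.

```r
library(TraMineR)

# Hypothetical data mimicking the question: sequence 2 is one position shorter.
DAT <- as.data.frame(rbind(
  rep("a", 25),
  c(rep("a", 24), NA),
  c("a", rep("e", 21), rep("c", 3)),
  c("a", rep("e", 24)),
  c("b", rep("a", 24))
), stringsAsFactors = FALSE)

# Default right = "DEL": trailing missing values become voids.
SO <- seqdef(DAT, right = "DEL")

# 1) Use a step large enough that every period holds non-void elements.
DM1 <- seqdist(SO, method = "CHI2", step = 5, full.matrix = FALSE)

# 2) Drop the columns that contain void elements (here, column 25).
SOdrop <- SO[, 1:24]
DM2 <- seqdist(SOdrop, method = "CHI2", step = 1, full.matrix = FALSE)

# 3) Keep trailing missing values as missings and treat them as a state.
SOm <- seqdef(DAT, right = NA)
DM3 <- seqdist(SOm, method = "CHI2", step = 1,
               with.missing = TRUE, full.matrix = FALSE)

# Check that no NaN is left before clustering.
anyNA(DM1)
anyNA(DM2)
anyNA(DM3)
```

Whichever dissimilarity object you keep, pass it together with the matching sequence object (SO, SOdrop or SOm) to seqpropclust, so that the properties are extracted from the same sequences the distances were computed on.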