seqMD: size of substitution cost matrix

105 Views Asked by At

I'm conducting a multichannel sequence analysis with 3 channels, for which I have defined three separate substitution cost matrices as the basis for Optimal Matching.

I get the following error message, when using seqMD:

 > MDcost <- seqMD(channels, method = OM, sm=smatrix, what="cost", norm = "auto", 
 +                 indel = "auto", with.missing = TRUE)
  [!!] 3  domains with  288  sequences
  [>] building MD sequences of combined states... OK
  [>] including missing value as an additional state
  [>] checking 'sm' (size and triangle inequality)
 Error:  [!] size of substitution cost matrix must be 8x8

The substitution cost matrix for the first channel is 7x7. I assume the error is related to this message in the output: including missing value as an additional state But there are no missings in my channels. I used seqdef with left = DEL for sequences in each channel as my sequences are of varying length across my units.

I have no issues when using seqdist for the monochannels:

> ch1.dist.OM.th <- seqdist(ch1.seq, method = "OM", indel = "auto",
+                           sm = ch1.sub.th, norm = "auto")
 [>] 288 sequences with 7 distinct states
 [>] checking 'sm' (size and triangle inequality)
 [>] 265 distinct  sequences 
 [>] min/max sequence lengths: 2/48
 [>] computing distances using the OM maxlength normalized metric
 [>] elapsed time: 0.052 secs

Here, there is no indication of missing values/added states and the substitution matrices work fine. I'm grateful for any hints to solve this problem.

23-03-02 UPDATE Here is the output for three UoO for the 3 channels

> MDseq <- seqMD(channels)
 [!!] 3  domains with  3  sequences
 [>] building MD sequences of combined states... OK
> seqlength(MDseq)
  Length
1     31
2     18
3     17
> MDseq
  Sequence                                                                                                                                                                                                                                                                                                                                                                           
1 P09+R00+G00-P09+R00+G00-P07+R00+G00-P07+R00+G00-P07+R00+G00-P07+R00+G00-
P07+R00+G00-P07+R00+G00-P07+R00+G00-P07+R00+G00-P07+R00+G00-P07+R00+G00-
P07+R05+G00-P07+R05+G00-P07+R05+G00-P07+R05+G00-P07+R05+G00-P07+R05+G00-
P07+R04+G00-P07+R04+G00-P07+R04+G00-P07+R04+G00-P07+R04+G00-P07+R03+G00-
P07+R03+G00-P07+R03+G00-P07+R03+G00-P07+R05+G02-P04+R05+G02-P04+R05+G02-
P04+R05+G02
2 P09+R00+G00-P09+R00+G00-P09+R00+G00-P05+R00+G00-P05+R14+G00-P05+R10+G00-
P05+R10+G00-P05+R10+G00-P05+R10+G00-P05+R10+G00-P05+R09+G00-P05+R09+G00-
P05+R09+G00-P05+R09+G00-P05+R04+G00-P05+R05+G00-P05+R05+G00-P04+R04+G00
3 P09+R00+G00-P07+R00+G00-P07+R00+G00-P07+R00+G00-P07+R00+G00-P07+R00+G00-
P07+R11+G00-P07+R11+G05-P07+R11+G05-P07+R11+G05-P05+R11+G05-P05+R08+G00-
P05+R08+G00-P04+R08+G00-P04+R08+G00-P04+R08+G00-P04+R08+G00
> seqstatl(MDseq)
 [1] "%"           "P04+R04+G00" "P04+R05+G02" "P04+R08+G00" "P05+R00+G00"
 [6] "P05+R04+G00" "P05+R05+G00" "P05+R08+G00" "P05+R09+G00" "P05+R10+G00"
[11] "P05+R11+G05" "P05+R14+G00" "P07+R00+G00" "P07+R03+G00" "P07+R04+G00"
[16] "P07+R05+G00" "P07+R05+G02" "P07+R11+G00" "P07+R11+G05" "P09+R00+G00"
> MDcost <- seqMD(channels, method = OM, sm=smatrix, what="cost", norm = "auto", 
+                 indel = "auto")
 [!!] 3  domains with  3  sequences
 [>] building MD sequences of combined states... OK
 [>] including missing value as an additional state
 [>] checking 'sm' (size and triangle inequality)
Error:  [!] size of substitution cost matrix must be 8x8
> smatrix[[1]]
    P04 P05 P06 P07 P08 P09 P00
P04   0   1   2   3   4   5   6
P05   1   0   1   2   3   4   5
P06   2   1   0   1   2   3   4
P07   3   2   1   0   1   2   3
P08   4   3   2   1   0   1   2
P09   5   4   3   2   1   0   1
P00   6   5   4   3   2   1   0
1

There are 1 best solutions below

1
Gilbert On

This is a bug in seqMD. I cannot run your example because you do not provide the data. However, I think I have identified the issue. The error occurs because of a badly specified test in seqMD, which sets with.missing as TRUE when sequences do not all have the same length as in your data.

This will be fixed within two days in the development version available on R-Forge (now released as v 2.2-7 on the CRAN).

In the meantime, the only workaround I see is to make all sequences of the same length by setting right missings as true missings, i.e., by specifying right=NA in the seqdef commands. I illustrate below with a small example:

ch <- read.table(header=FALSE,text="
                    n n n n c c c c
                    n n c c c c c c
                    n n n n n c c c
                 ")
ch[3,7:8] <- NA


work <- read.table(header=FALSE,text="
                    u w w u u u w w
                    u u u u w w w w
                    w w w u u u w w
                 ")

## turning last two elements of 3rd sequence to NA
work[3,7:8] <- NA

## sequence objects
s.ch <- seqdef(ch, cpal=c("yellow","blue"), right=NA)
s.work <- seqdef(work, cpal=c("pink","green"), right=NA)               

Now we run seqMD, first to show the multidomain (MD) sequences, and then to compute CAT costs:

## multidomain sequences
seqMD(list(s.ch,s.work)) 
#   Sequence                       
# 1 n+u-n+w-n+w-n+u-c+u-c+u-c+w-c+w
# 2 n+u-n+u-c+u-c+u-c+w-c+w-c+w-c+w
# 3 n+w-n+w-n+w-n+u-n+u-c+u-*+*-*+*      

## costs for each of the two channels
cost.ch <- seqcost(s.ch,method="INDELSLOG", with.missing=TRUE)
cost.work <- seqcost(s.work,method="INDELSLOG", with.missing=TRUE)

## CAT costs
seqMD(list(s.ch,s.work), sm=list(cost.ch$sm,cost.work$sm), 
      indel=list(cost.ch$indel,cost.work$indel), what="cost", 
      with.missing=c(TRUE,TRUE))
#          *+*       c+u       c+w       n+u       n+w
# *+* 0.000000 1.8579148 1.8579148 1.8579148 1.8579148
# c+u 1.857915 0.0000000 0.6317059 0.6317059 1.2634118
# c+w 1.857915 0.6317059 0.0000000 1.2634118 0.6317059
# n+u 1.857915 0.6317059 1.2634118 0.0000000 0.6317059
# n+w 1.857915 1.2634118 0.6317059 0.6317059 0.0000000
# attr(,"indel")
# [1] 1.2262089 0.6317059 0.6317059 0.6317059 0.6317059
# attr(,"alphabet")
# [1] "*+*" "c+u" "c+w" "n+u" "n+w"
# attr(,"cweight")
# [1] 1 1