Counting subset membership co-occurrences and group alphabet length


Is there a way in R to count patterns of state co-occurrence in sequences, i.e., to treat each sequence as a group in which element order does not matter? The aim is to find out how often the larger sub-groups occur within even longer groups.

For example, the input data set would look something like this (the real sequences would be up to ~10 columns wide and thousands of rows deep):

a,b,c,d
b,c,d,a
c,d,b,a
a,b,c,d,e
b,c,d,a,e
a,b,c
...

and the result would perhaps show...

abcd, abcd*  

as a set or class, with a count to indicate number of occurrences, with e.g. * indicating a subset or 'membership elsewhere' category and score based on length().

The results would also show...

abcde

as a different and slightly rarer set or class, with a higher score reflecting longer length().

And finally ...

abc*

would have a higher count score, but lower length() score.

Something like TraMineR that works on unordered (disordered?) groups would be excellent. I note there may be issues with computational load, but I'll deal with that (e.g. with some sort of triviality threshold) if I end up having to cut my teeth writing a program.

Best answer

Here is a function that sorts the elements of each sequence alphabetically and then extracts the distinct successive states (DSS) of the sorted sequences, so that each sequence is reduced to the set of states it contains.

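# Reduce each sequence to the sorted set of distinct states it contains
dssort <- function(seqdata){
  # Sort the states within each sequence (row) so that order no longer matters
  ssort <- t(apply(seqdata, MARGIN=1, sort))
  ssort.seq <- seqdef(ssort, states=alphabet(seqdata), labels=stlab(seqdata))
  # Collapse runs of identical states into distinct successive states
  sdss  <- seqdef(seqdss(ssort.seq), missing="%")
  sdss
}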

Using the outcome of this function you can get the frequencies of the different sets of elements that form the sequences. For example, with

library(TraMineR)
data(mvad)
shortlab <- c("EM", "FE", "HE", "JL", "SC", "TR")
mvad.seq <- seqdef(mvad[,17:86], states=shortlab)

set <- dssort(mvad.seq)

seqtab(set, tlim=1:3)

you get

               Freq Percent
EM/1-FE/1        94      13
EM/1-TR/1        84      12
EM/1-JL/1-TR/1   57       8

So you know that 94 sequences contain the states EM and FE and only those two, 84 contain EM and TR and no other state, and 57 contain exactly EM, JL and TR.
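To make the idea concrete on the toy data from the question, here is a minimal base R sketch of the same sort-then-count logic. It does not use TraMineR, and the names `raw`, `to_set`, `set_labels` and `set_freq` are only illustrative.

```r
# Toy data from the question: one comma-separated sequence per element.
raw <- c("a,b,c,d", "b,c,d,a", "c,d,b,a", "a,b,c,d,e", "b,c,d,a,e", "a,b,c")

# Keep the distinct elements of a sequence and sort them, so that the
# order of appearance no longer matters; paste them into one set label.
to_set <- function(x) paste(sort(unique(x)), collapse = "-")
set_labels <- vapply(strsplit(raw, ",", fixed = TRUE), to_set, character(1))

# Frequency of each distinct set, most frequent first.
set_freq <- sort(table(set_labels), decreasing = TRUE)
set_freq
```

Here the first three sequences all collapse to the same label, a-b-c-d, so that set is counted three times regardless of the order in which its elements appear.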

You can also plot the frequent sets with seqfplot(set).

Not sure if this is what you are looking for, but hope it helps.

====

Here is how you can get rid of the useless "/1" duration suffixes:

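# Tabulate in STS format (one state per position, no durations)
tf <- seqtab(set, tlim=1:3, format="STS")
# Extract the frequency table stored as an attribute
freq <- attr(tf,"freq")
# Remove the "-*" fillers from the row names
rownames(freq) <- gsub("-\\*","",rownames(freq))
freq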

that gives

            Freq   Percent
EM-FE         94 13.202247
EM-TR         84 11.797753
EM-JL-TR      57  8.005618
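
The tables above count exact sets only. The question also asked about sets that occur inside longer sequences (the * category). A rough base R sketch of that count follows. It is not part of TraMineR, it checks only the sets actually observed in the data rather than every possible subset, and all names in it (`sets`, `distinct`, `summary_tab`, `within_longer`) are made up for the example.

```r
# Toy data from the question: one comma-separated sequence per element.
raw <- c("a,b,c,d", "b,c,d,a", "c,d,b,a", "a,b,c,d,e", "b,c,d,a,e", "a,b,c")

# Distinct elements of each sequence, and an order-insensitive label.
sets <- lapply(strsplit(raw, ",", fixed = TRUE), unique)
labels <- vapply(sets, function(s) paste(sort(s), collapse = "-"), character(1))

# One representative per distinct observed set.
distinct <- sets[!duplicated(labels)]
names(distinct) <- labels[!duplicated(labels)]

# For each observed set: its size, the number of sequences that match it
# exactly, and the number of longer sequences that contain it.
summary_tab <- do.call(rbind, lapply(names(distinct), function(lab) {
  s <- distinct[[lab]]
  contains <- vapply(sets, function(x) all(s %in% x), logical(1))
  exact <- labels == lab
  data.frame(set = lab, size = length(s), exact = sum(exact),
             within_longer = sum(contains & !exact))
}))
summary_tab
```

On the toy data, a-b-c is matched exactly once but is contained in all five longer sequences, which corresponds to the abc* case in the question: a high count with a low length score. With thousands of rows, the pairwise containment check may become slow, so some minimum set size or frequency threshold would probably be needed.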