Searching for the most common substrings rather than subsequences

I'm trying to search sequences to find the most common substrings (i.e. subsequences in which all events are adjacent). The user guide says the following about its subsequence searching tools:

"The idea of subsequence is an extension of the notion of substring and is described in detail for instance in Elzinga (2008). While a substring of a sequence is necessarily constituted of adjacent symbols, this requirement is relaxed with the notion of subsequence. Thus if x = abac, λ (the empty string), u = b, v = bac and w = bc belong to the set of subsequences of x, while only λ, u = b and v = bac are substrings of x"

Is there a way to turn off that relaxation, and only look at substrings? This is specifically using the seqefsub command. I can't find anything about this in the TraMineR manual, so any help on this is appreciated! Thanks so much, Andrew

There is 1 best solution below

Although TraMineR has no specific function for substrings, you can get substring-like results by playing with time constraints.

For instance, setting maxGap=1 in the constraint argument of seqefsub restricts the search to frequent subsequences whose events occur at successive time points. I illustrate this below with the actcal data that ships with TraMineR.

library(TraMineR)
data(actcal)
## creating a state sequence object
actcal.seq <- seqdef(actcal,13:24,
  labels=c("> 36 hours", "19 to 36 hours", "< 19 hours", "no work"))
## transforming into an event sequence object
actcal.seqe <- seqecreate(actcal.seq, tevent="state")

## frequent subsequences without constraints
fsubs <- seqefsub(actcal.seqe, pMinSupport=.01)

## seqentrans() comes from TraMineRextras; it adds the event count (nevent) used below
library(TraMineRextras)
fsubsn <- seqentrans(fsubs)
## displaying only subsequences with at least 2 events
fsubsn[fsubsn$data$nevent>1]

## Now with the maxGap=1 constraint
cstr <- seqeconstraint(maxGap=1)
fsstr <- seqefsub(actcal.seqe, pMinSupport=.01, constraint=cstr)
fsstrn <- seqentrans(fsstr)
fsstrn[fsstrn$data$nevent>1]

In that example, the gap is measured in time units, so you get subsequences of events occurring at successive time positions. To get subsequences of successive events regardless of the time elapsed between them, build your event sequences with consecutive integers as timestamps, e.g.

id event timestamp
1  A     1   
1  C     2
1  B     3
2  C     1
2  B     2
3  A     1
3  B     2
...
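
As a minimal sketch of that renumbering, assuming your events are stored in a long-format data frame: the data frame raw, its time column and the values in it are made up for illustration, and the 0.5 support threshold is arbitrary. Each time is replaced by its rank within the id, so simultaneous events keep the same timestamp and consecutive events end up exactly one unit apart.

```r
library(TraMineR)

## Hypothetical time-stamped events with irregular gaps (illustrative data only)
raw <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 3),
  event = factor(c("A", "C", "B", "C", "B", "A", "B")),
  time  = c(3, 10, 11, 2, 40, 5, 6)
)

## Sort by id and time, then replace each time by its rank within the id,
## so that consecutive events are exactly one time unit apart
raw <- raw[order(raw$id, raw$time), ]
raw$timestamp <- ave(raw$time, raw$id,
                     FUN = function(t) match(t, sort(unique(t))))

## Build the event sequence object from the renumbered timestamps
eseq <- seqecreate(id = raw$id, timestamp = raw$timestamp, event = raw$event)

## Frequent subsequences restricted to adjacent events
cstr <- seqeconstraint(maxGap = 1)
fsub <- seqefsub(eseq, pMinSupport = 0.5, constraint = cstr)
fsub
```

With timestamps renumbered this way, maxGap=1 no longer depends on how much real time separates two events, only on whether they follow each other directly.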

Hope this helps