Pairwise sequence list matching in R

I have a problem doing sequence alignment/matching in R for lists. To explain: my data are clickstream data, and I have sequences divided into n-grams. The sequences look something like

1. ABDCGHEI... NaNa
2. ACSNa.... NaNa

and so on, where Na stands for "Not available" and is padding needed to make the sequence lengths match. I put all of these sequences in a list in a crude way, like

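library(RWeka)  # provides NGramTokenizer() and Weka_control()

dativec = as.vector(dataseq2)
prova2 = vector("list", length(dativec))  # initialise the list before filling it
for(i in seq_along(dativec)) {
  prova2[[i]] = dativec[i]
}
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
prova3 = lapply(prova2, BigramTokenizer)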

and split them into n-grams; the bigrams, for example, look like this:

[[1]] "A B" "B D" "D C".... "Na Na"
[[2]] "A C" "C S" .... "Na Na"

Now the challenge is: how can I match every bigram of each element of my list against every bigram of the other elements in the list? I tried the Biostrings package, but the function pairwiseAlignment only returns a score for the first bigram of each element, whereas I just need to know whether two bigrams are identical, and I need all comparisons, not just the first elements. The desired result is the percentage of equal sub-n-grams, without any information about positions; I only care about identity. I also tried the setdiff function, but it does not seem to work the way I want.

Edited for more clarity

Best answer

You can use outer:

bigrams <- list(a = c("A B", "B D", "D C", "Na Na"),
                b = c("A C", "C S", "Na Na"))

with(bigrams, outer(a, b, `==`))

##>       [,1]  [,2]  [,3]
##> [1,] FALSE FALSE FALSE
##> [2,] FALSE FALSE FALSE
##> [3,] FALSE FALSE FALSE
##> [4,] FALSE FALSE  TRUE
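
The question asks for a percentage of identical bigrams rather than the full comparison matrix. One way to get there, as a rough sketch only, is to count how many bigrams of one element occur anywhere in another element and repeat that for every pair. The helper name share_matched and the matrix name shares are made up for illustration, and "percentage" is assumed here to mean the share of bigrams in element i that are also present in element j, ignoring position:

```r
# Toy data taken from the answer above; a real list would come from the tokenizer.
bigrams <- list(a = c("A B", "B D", "D C", "Na Na"),
                b = c("A C", "C S", "Na Na"))

# Hypothetical helper: share of the bigrams in x that also occur anywhere in y.
# Positions are ignored; only identity matters.
share_matched <- function(x, y) mean(x %in% y)

# Apply the helper to every ordered pair of list elements.
# Entry [i, j] is the share of bigrams of element i that are found in element j.
n <- length(bigrams)
shares <- matrix(NA_real_, n, n,
                 dimnames = list(names(bigrams), names(bigrams)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    shares[i, j] <- share_matched(bigrams[[i]], bigrams[[j]])
  }
}

shares
```

Note that the padding bigram "Na Na" counts as a match in this sketch. If that is not wanted, each element could be filtered first, for example with x[x != "Na Na"], before computing the shares. The result is also not symmetric, because the denominator is the number of bigrams in the first element of each pair.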