How to find number of variable sites between two DNA sequences

615 Views Asked by At

Is there a way I can find number of variable sites for a pair of nucleotide sequences using R and also which sites are variable ?

For eg.

Seq1=ATGCCCCGTAATGC
Seq2=AGGCAACGTAAAGC

I want the number of variable sites as 5 and the positions 2,3,5,6,12 as variable. Is there a way to do this in R ?

2

There are 2 best solutions below

0
On BEST ANSWER

Not sure whether Biostrings have any benefit on memory optimization. You simply use strsplit for short sequences as:

Seq1="ATGCCCCGTAATGC"
Seq2="AGGCAACGTAAAGC"

a <- unlist(strsplit(Seq1, ""))
b <- unlist(strsplit(Seq2, ""))
which(a!=b)
#[1]  2  5  6 12
2
On

Try Biostrings

library(Biostrings)
s1 <- as.integer(DNAString(Seq1))
s2 <- as.integer(DNAString(Seq2))
which(s1!=s2)
#[1]  2  5  6 12

Benchmarks

 Seq1 <- paste(rep(Seq1,1e6), collapse='')
 Seq2 <- paste(rep(Seq2,1e6), collapse='')

 f1 <- function(){a <- unlist(strsplit(Seq1, ""))
             b <- unlist(strsplit(Seq2, ""))
             which(a!=b)}

 f2 <- function(){s1 <- as.integer(DNAString(Seq1))
             s2 <- as.integer(DNAString(Seq2))
             which(s1!=s2)}

 library(microbenchmark)
 microbenchmark(f1(), f2(), unit='relative', times=20L)
 #Unit: relative
 #expr     min       lq     mean   median       uq      max neval cld
 #f1() 5.06205 5.180931 4.480188 4.633281 4.749703 2.365516    20   b
 #f2() 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000    20  a 

data

Seq1 <- 'ATGCCCCGTAATGC'
Seq2 <- 'AGGCAACGTAAAGC'