How can I improve the speed of my R code regarding shortest substring implementation problem?

101 Views Asked by At

I decided to implement this problem from Rosalind in R. It works great for small sample dataset:

>Rosalind_56
ATTAGACCTG
>Rosalind_57
CCTGCCGGAA
>Rosalind_58
AGACCTGCCG
>Rosalind_59
GCCGGAATAC

however it takes long time when I run it for real dataset (50 sequences with ~1000 characters length). Could you please take a look at my code and give some advice what shall I replace to improve the speed? I have only five minutes to post the answer... Here is what I wrote with some explanation.

library(Biostrings)
library(stringr)

fastaFile = readDNAStringSet("input.fasta")
seq_name = names(fastaFile)
sequence = paste(fastaFile)
df <- data.frame(seq_name, sequence) #read fasta and convert it to simple df

allsubstr <- function(x, n) unique(substring(x, 1:(nchar(x) - n + 1), n:nchar(x))) #helping function, creating all substring of given string

Shortest_substring_R <- function(df){
  
tmp_ss <- df[1,2] #first element will be modified to final result
tmp_df <- df #put everything to temporary df which will be also modified (number of elements)

while(nrow(tmp_df)>1){ #loop until we heave more than one sequence in our temp df
max_ov_all <- '' #it's better to clear it
max_ov <- ''
for(i in 2:nrow(tmp_df)){ #this loop finds the maximum overlap between result and ith sequence (result - shortest substring)
res <- unlist(sapply(1:nchar(tmp_df[i,2]), function(x) allsubstr(tmp_df[i,2],x))) #produce all substrings from ith sequence
ovs <- res[str_detect(tmp_ss, res)] #only substrings of ith sequence which overlaps with first (result) sequence
max_ov <- ovs[which.max(nchar(ovs))] #give me the longest common substring of first (result) and ith sequence
if(nchar(max_ov) > nchar(max_ov_all)){ #check which overlap is the longest (between first (result) and remaining sequences)
  max_ov_all <- max_ov
  ind <- i
}
}
if(str_detect(tmp_ss, tmp_df[ind,2])){ #now three cases, our substring (coming from ith sequence) is inside our result string
  tmp_df <- tmp_df[-ind,] #so we do not have do anything, juest remove it from the df
}else{ #our first (result) sequence may start or end with longest overlap, so modify final result as needed
       #and then of course remove ith sequence (giving longest overlap), because it has been added to aour result sequence
  if(startsWith(tmp_ss, max_ov_all)){
    tmp_ss <- paste(tmp_df[ind,2],sub(max_ov_all, "", tmp_ss), sep='')
    tmp_df <- tmp_df[-ind,]
  }
  if(endsWith(tmp_ss, max_ov_all)){
    tmp_ss <- paste(sub(max_ov_all, "", tmp_ss), tmp_df[ind,2], sep='')
    tmp_df <- tmp_df[-ind,]
  }
}
}
print(tmp_ss)
}

Result:

> Shortest_substring_R(df)
[1] "ATTAGACCTGCCGGAATAC"

I think the main problem is with for() loop, sapply() and very useful functions such as str_detect() what fo you think?

Sample input:

Input = ('seq_name   sequence
1 Rosalind_56 ATTAGACCTG
2 Rosalind_57 CCTGCCGGAA
3 Rosalind_58 AGACCTGCCG
4 Rosalind_59 GCCGGAATAC')
small_df = as.data.frame(read.table(textConnection(Input), header = T, row.names = 1))
small_df
0

There are 0 best solutions below