I modified the answer to this question, Find length of overlap in strings, but I have trouble applying it to big data because the iteration is slow.
How can I improve the function below, which finds the longest common overlap between two strings, anywhere in either string (disregarding case)?
Here is the slow function. It works, but I'd like to replace it with a faster one:
strlcs <- function(str1, str2, type = "lcs") {
  # Make str2 the shorter of the two strings
  if (nchar(str1) < nchar(str2)) {
    x <- str2
    str2 <- str1
    str1 <- x
  }
  x <- strsplit(str2, "")[[1L]]
  n <- length(x)
  # Build the index vectors 1, 1:2, ..., 1:n
  s <- sequence(seq_len(n))
  s <- split(s, cumsum(s == 1L))
  s <- rep(list(s), n)
  # Shift them to every start position, dropping indices past the end
  for (i in seq_along(s)) {
    s[[i]] <- lapply(s[[i]], function(x) {
      x <- x + (i - 1L)
      x[x <= n]
    })
    s[[i]] <- unique(s[[i]])
  }
  s <- unlist(s, recursive = FALSE)
  # Order the candidates so the longest are tried first
  s <- unique(s[order(-lengths(s))])
  i <- 1L
  len_s <- length(s)
  # Test every candidate, including the last one
  while (i <= len_s) {
    lcs <- paste(x[s[[i]]], collapse = "")
    check <- grepl(lcs, str1, fixed = TRUE)
    if (check) {
      if (type == "nchar") {
        return(nchar(lcs))
      } else {
        return(lcs)
      }
    } else {
      i <- i + 1L
    }
  }
}
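For comparison, here is a minimal base R sketch of the same search that skips building the index lists: it generates the candidate substrings of the shorter string one length at a time with substring(), longest first, and stops at the first length that produces a match. The name lcs_substring and the choice to return "" when nothing overlaps are assumptions of this sketch, not part of the function above, and it has not been timed on the big data.

```r
# Sketch: longest common substring by direct enumeration, longest first.
# `lcs_substring` is a hypothetical name, not from the original post.
lcs_substring <- function(str1, str2, type = "lcs") {
  # Make str2 the shorter string, as the original function does
  if (nchar(str1) < nchar(str2)) {
    tmp <- str2
    str2 <- str1
    str1 <- tmp
  }
  n <- nchar(str2)
  best <- ""  # assumption: "" (length 0) when the strings share nothing
  if (n > 0L) {
    for (len in n:1) {
      # All substrings of str2 with exactly `len` characters, left to right
      starts <- seq_len(n - len + 1L)
      cands <- unique(substring(str2, starts, starts + len - 1L))
      hit <- vapply(cands, function(p) grepl(p, str1, fixed = TRUE),
                    logical(1), USE.NAMES = FALSE)
      if (any(hit)) {
        best <- cands[which(hit)[1L]]  # leftmost match of this length
        break
      }
    }
  }
  if (type == "nchar") nchar(best) else best
}

lcs_substring("grant management llc", "miguel grant")               # "grant"
lcs_substring("miguel martinez", "miggy martinez", type = "nchar")  # 9
```

Like the original, this expects the inputs to be lower-cased beforehand if case should be ignored.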
Sample data:
library(data.table)
sampdata <- data.frame(
  str1 = c("Doug Olivas", "GRANT MANAGEMENT LLC", "LUNA VAN DERESH", "wendy t marzardo",
           "AMIN NYGUEN COMPANY LLC", "GERARDO CONTRARAS", "miguel martinez", "albert marks porter"),
  str2 = c("doug olivas", "miguel grant", "LUNA VAN DERESH MANAGEMENT LLC", "marzardo",
           "amin nyguen llc", "gerardo contraras", "miggy martinez", "albert"),
  stringsAsFactors = FALSE
)
### Create sample big data from the previous sample data and apply on a huge DT
samplist <- lapply(seq_len(10000), FUN = function(x) sampdata)
bigsampdata <- rbindlist(samplist)
The above function is NOT optimized for big data.
How do I make the following happen in less than the currently brutal 20+ seconds?
DESIRED FUNCTION APPLIED ON BIG DATA:
system.time(
  bigsampdata$desired_LCSnchar <- sapply(seq_len(nrow(bigsampdata)), FUN = function(x) {
    strlcs(tolower(bigsampdata$str1[x]), tolower(bigsampdata$str2[x]), type = "lcs")
  })
)
   user  system elapsed
 24.290   0.008  24.313
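Since the sample big data is the same 8 rows repeated 10,000 times, one thing that may help regardless of which overlap function is used is to evaluate each distinct (str1, str2) pair only once and map the results back to the rows. The sketch below is hedged: apply_on_unique_pairs and count_calls are hypothetical names, count_calls is only a stand-in for the real overlap function, and the key assumes the unit-separator character never occurs in the data. How much it helps depends on how many duplicated pairs the real data has.

```r
# Sketch: run a pairwise function once per distinct (a, b) pair.
# `apply_on_unique_pairs` and `count_calls` are hypothetical names.
apply_on_unique_pairs <- function(a, b, f) {
  # Assumption: the separator "\x1f" never appears in the strings
  key <- paste(a, b, sep = "\x1f")
  first <- !duplicated(key)
  res_unique <- mapply(f, a[first], b[first], USE.NAMES = FALSE)
  # Map each row back to the result computed for its pair
  res_unique[match(key, key[first])]
}

# Stand-in for the real overlap function; it counts how often it is called
calls <- 0L
count_calls <- function(x, y) {
  calls <<- calls + 1L
  nchar(x) + nchar(y)
}

# Lower-case once, vectorised, instead of once per row
a <- rep(tolower(c("Doug Olivas", "GRANT MANAGEMENT LLC")), times = 1000)
b <- rep(tolower(c("doug olivas", "miguel grant")), times = 1000)

res <- apply_on_unique_pairs(a, b, count_calls)
length(res)  # 2000
calls        # 2
```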
I have found a faster solution using the LCS function in the qualV package:

You can speed it up further by parallelising the mapply with mcmapply.
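As a rough illustration of the mapply to mcmapply swap, here is a sketch using the parallel package that ships with R. pair_fun is a hypothetical stand-in for the real per-pair function (it only checks whether one string contains the other), and the core count of 2 is an arbitrary assumption. mcmapply relies on forking, so on Windows it can only run with mc.cores = 1.

```r
library(parallel)

# Hypothetical stand-in for the real per-pair overlap function
pair_fun <- function(x, y) grepl(tolower(y), tolower(x), fixed = TRUE)

str1 <- c("Doug Olivas", "wendy t marzardo", "albert marks porter")
str2 <- c("doug olivas", "marzardo", "miguel")

# Forking is unavailable on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

serial <- mapply(pair_fun, str1, str2, USE.NAMES = FALSE)
forked <- mcmapply(pair_fun, str1, str2, USE.NAMES = FALSE, mc.cores = n_cores)

identical(serial, forked)  # TRUE
```

The call signature is otherwise the same as mapply, so the swap is mostly a matter of adding mc.cores.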