How can I vectorize longest common substring across data.table columns in R


How can I create a function that EITHER quickly calculates the number of characters in the longest common substring OR returns the longest common substring itself, across TWO OR MORE COLUMNS of a large data.table in R?

I modified the answer to this question: Find length of overlap in strings. However, I have 1.) issues applying it across vectors, as it fails on blanks and other string features when used with sapply to create a new column of results, 2.) issues applying it across more than 2 columns, and 3.) the given answers do not include spaces in potential matches, which I'd like them to. The function is also slow, and I'd like to apply it to big data.

Create Sample Data:

sampdata <- data.frame(
  str1=c("Doug Olivas", "GRANT MANAGEMENT LLC", "LUNA VAN DERESH", "wendy t marzardo", "AMIN NYGUEN COMPANY LLC", "GERARDO CONTRARAS", "miguel martinez","albert marks porter"),
  str2=c("doug olivas", "miguel grant", "LUNA VAN DERESH MANAGEMENT LLC", "marzardo", "amin nyguen llc", "gerardo contraras", "miggy martinez","albert"),
  str3=c("Martin Olivas", "GRANT PROPERTIES", "luna company", "wendy marzardo", "the company of amin nyguen llc", "gerardo c", "miguel t martinez","")
  )

DESIRED FUNCTIONALITY 1 OF A MADE-UP FUNCTION "lcsfoo":

#option type="nchar" to return number of characters INCLUDING SPACES, IGNORING CASE in max common substring
sampdata$desired_LCSnchar <- lcsfoo(sampdata$str1,sampdata$str2,sampdata$str3,type="nchar")

#option type="str" to return the string INCLUDING SPACES, IGNORING CASE of the longest common substring between the columns
sampdata$desired_LCSstr <- lcsfoo(sampdata$str1,sampdata$str2,sampdata$str3,type="str")

#DESIRED RESULTS 1: The above would return the following for the sample data

sampdata$desired_LCSnchar <- c(7,5,5,8,12,9,9,0)
sampdata$desired_LCSstr<- c(" olivas","grant","luna ","marzardo","amin nyguen ","gerardo c"," martinez","")

IDEALLY lcsfoo would also take a variable number of column inputs (i.e. 2 columns here instead of the 3 above):

sampdata$str1str2_LCSnchar <- lcsfoo(sampdata$str1,sampdata$str2,type="nchar")
sampdata$str1str2_LCSstr <- lcsfoo(sampdata$str1,sampdata$str2,type="str")

#DESIRED RESULTS 2: The above would return the following for the sample data

sampdata$str1str2_LCSstr<- c("doug olivas","grant","luna van deresh","marzardo","amin nyguen ","gerardo contraras"," martinez","albert")
sampdata$str1str2_LCSnchar <- c(11,5,15,8,12,17,9,6)

I'd also need the function to work across BIG DATA:

library(data.table)
###Create sample big data from previous sampledata and apply on huge DT
samplist <- lapply(c(1:1000),FUN=function(x){sampdata})
bigsampdata <- rbindlist(samplist)

DESIRED FUNCTION APPLIED ON BIG DATA: 
bigsampdata$desired_LCSnchar <- lcsfoo(bigsampdata$str1,bigsampdata$str2,bigsampdata$str3,type="nchar")
bigsampdata$desired_LCSstr <- lcsfoo(bigsampdata$str1,bigsampdata$str2,bigsampdata$str3,type="str")
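
To make the desired behaviour concrete, here is a minimal base-R sketch of what lcsfoo could look like. It is only an illustration under assumptions, not a tuned solution: the helper name lcs_row is hypothetical, ties between equally long substrings are broken by position in the shortest input string, and NA inputs give NA. It lower-cases each row, takes the shortest string as the reference, and tries its substrings from longest to shortest until one occurs in every other string (spaces are treated like any other character).

```r
# Hypothetical helper: longest common substring of one "row" of strings,
# ignoring case and keeping spaces as ordinary characters.
lcs_row <- function(strings) {
  strings <- tolower(strings)
  if (anyNA(strings)) return(NA_character_)
  lens <- nchar(strings)
  if (any(lens == 0L)) return("")          # a blank can share nothing
  shortest <- which.min(lens)
  ref <- strings[shortest]                 # only substrings of the shortest string can qualify
  others <- strings[-shortest]
  n <- nchar(ref)
  for (len in n:1) {                       # longest candidates first
    starts <- seq_len(n - len + 1L)
    cands <- unique(substring(ref, starts, starts + len - 1L))
    for (cand in cands) {
      # fixed = TRUE: literal matching, so regex metacharacters are safe
      if (all(grepl(cand, others, fixed = TRUE))) return(cand)
    }
  }
  ""
}

# Sketch of lcsfoo: accepts any number of equal-length columns via `...`.
lcsfoo <- function(..., type = c("str", "nchar")) {
  type <- match.arg(type)
  cols <- lapply(list(...), as.character)  # also copes with factor columns
  stopifnot(length(cols) >= 1L, length(unique(lengths(cols))) == 1L)
  n <- length(cols[[1L]])
  out <- vapply(seq_len(n), function(i) {
    lcs_row(vapply(cols, function(col) col[i], character(1)))
  }, character(1))
  if (type == "nchar") nchar(out) else out
}

# Three rows taken from the sample data above
s1 <- c("Doug Olivas", "AMIN NYGUEN COMPANY LLC", "albert marks porter")
s2 <- c("doug olivas", "amin nyguen llc", "albert")
s3 <- c("Martin Olivas", "the company of amin nyguen llc", "")

lcsfoo(s1, s2, s3, type = "str")    # " olivas" "amin nyguen " ""
lcsfoo(s1, s2, s3, type = "nchar")  # 7 12 0
lcsfoo(s1, s2, type = "str")        # "doug olivas" "amin nyguen " "albert"
```

Because the function takes plain vectors, it can be used inside a data.table assignment such as bigsampdata[, desired_LCSstr := lcsfoo(str1, str2, str3, type = "str")]. The row-by-row loop is not truly vectorized, so on big data it would likely help to run it only on the unique combinations of the input columns and join the results back.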