Flag for partial match between 2 columns in R

I have a data frame and need to create a flag that indicates rows where there is a partial match between two columns. Here is some dummy data:

doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","veggies")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate veggies") 
mydata <- data.frame(doc_id,word,text,stringsAsFactors = FALSE)

The expected outcome is the data frame with an additional column, partial_match, that shows whether the match between word and text is only a partial match (note that the last row of this example differs from the data above):

doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","soup")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate soup") 
partial_match <- c("1","0","0","1","0","1","0")
mydata2 <- data.frame(doc_id,word,text,partial_match,stringsAsFactors = FALSE)

I tried

str_detect(mydata$word, mydata$text)

and similar things using functions such as charmatch, pmatch, grep and grepl with no success.

The real data contains several thousand records, so the solution should scale.

Thanks.

Best answer

After a long time of trying, I learned a bit more about string manipulation and got it. It is probably not the most efficient way, but it worked.

Note: I marked comments with "¹", "²", and "³" so that I can explain them below.

parcial.m = numeric() # Create an empty vector

for(i in seq_len(nrow(mydata2))){
  pattern = paste("([^\n]*)(",mydata2$word[i],")([^\n]*)",sep="")
  # ¹

  split = unlist(strsplit(mydata2$text[i], "[ [:punct:]]"))
  # Split the text by punctuation and spaces, i.e. by words

  word = grep(mydata2$word[i], split, value=TRUE)
  # Keep only the token containing the searched word: the 'original' word.
  # The steps below assume at most one such token per row.

  if(length(grep(mydata2$word[i], word))==0) {
    parcial.m[i] = 0
    # ²
  } else {
    parcial.m[i] = !((gsub(pattern, "\\1", word)=="") & (gsub(pattern, "\\3", word)==""))
    # ³
  }
}

¹: The pattern is a group (marked by (...)) of zero or more (hence the *) characters other than a newline (hence the [^\n]: \n is a newline, and a leading ^ inside the brackets means "anything except"), followed by a group containing the searched word, and a third group of the same form as the first.
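
To see what the three groups in ¹ capture, here is a small sketch that builds the same pattern for the word "apple" and pulls the groups out with regexec() and regmatches() from base R. The token "pineapples" is a made-up example that is not in the data; "apples" is taken from the dummy text.

```r
# Same pattern as in the loop, built for the searched word "apple"
pattern <- paste("([^\n]*)(", "apple", ")([^\n]*)", sep = "")

# regexec() + regmatches() give the full match followed by groups 1, 2 and 3
groups_pine  <- regmatches("pineapples", regexec(pattern, "pineapples"))[[1]]
groups_apple <- regmatches("apples", regexec(pattern, "apples"))[[1]]

groups_pine   # full match, then "pine", "apple", "s"
groups_apple  # full match, then "" (empty group 1), "apple", "s"
```

Group 2 is always the searched word, so whatever ends up in groups 1 and 3 is the part of the token that surrounds it.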

²: If there is no match at all, there is no partial match either, so we want a value of 0. We detect those cases using the fact that grep(mydata2$word[i], word) returns an integer vector of length 0 when there is no match.

³: The "\\1" and "\\3" select the first and third groups of the pattern mentioned above. If it is a perfect match, there won't be any leftovers of word (what I called the 'original' word) after we take away the searched word (group 2), so groups 1 and 3 will both be empty (i.e. equal to ""). That line of code tests whether both groups are empty at the same time (a full match) and negates the result (hence the !). As we already marked non-matches as 0 with the if statement, what remains are the partial matches.
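
For what it's worth, the same idea can probably be written more compactly without the regex groups. The sketch below is not the code above but a variation on it: flag_partial is a hypothetical helper name, and it assumes that "partial match" means the word is contained in some token of the text without being equal to that token. It uses a literal (fixed = TRUE) substring test, so words containing regex metacharacters should not need escaping. I have only tried it on the dummy data, not on thousands of rows.

```r
# Hypothetical helper (not part of the loop above): returns 1 where `word`
# occurs inside a longer token of `text`, and 0 otherwise.
flag_partial <- function(word, text) {
  # One character vector of tokens per row, split on spaces and punctuation
  tokens <- strsplit(text, "[ [:punct:]]+")
  flag <- mapply(function(w, tk) {
    contains <- grepl(w, tk, fixed = TRUE)  # literal substring test, no regex
    any(contains & tk != w)                 # contained, but not the whole token
  }, word, tokens, USE.NAMES = FALSE)
  as.integer(flag)
}

# Dummy data from the question
word <- c("apple","apples","chicken","banana","bananas","veggie","soup")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken",
          "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies",
          "yesterday I ate soup")

result <- flag_partial(word, text)
result  # 1 0 0 1 0 1 0
```

On the real data this would be called as mydata2$partial_match <- flag_partial(mydata2$word, mydata2$text). Note that it returns integers, whereas the expected column in the question holds the strings "1" and "0".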