I have a data frame and need to create a flag that indicates rows where there is a partial match between two columns. Here is the code and some dummy data:
doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","veggies")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate veggies")
mydata <- data.frame(doc_id,word,text,stringsAsFactors = FALSE)
The expected outcome is the same data frame with an additional column that shows whether the match between word and text is a partial match:
doc_id <- c("doc1","doc1","doc2","doc3","doc3","doc4","doc4")
word <- c("apple","apples","chicken","banana","bananas","veggie","soup")
text <- c("yesterday I ate apples", "yesterday I ate apples", "yesterday I ate chicken", "yesterday I ate bananas", "yesterday I ate bananas", "yesterday I ate veggies", "yesterday I ate soup")
partial_match <- c("1","0","0","1","0","1","0")
mydata2 <- data.frame(doc_id,word,text,partial_match,stringsAsFactors = FALSE)
I tried
str_detect(mydata$word, mydata$text)
and similar things using functions such as charmatch, pmatch, grep and grepl with no success.
The real data contains several thousands of records so the solution should scale.
Thanks.
After a long time of trying, I learned a bit more about string manipulation and got it. It is probably not the most efficient way, but it worked.
Note: I marked the comments with "¹", "²", and "³" so that I can explain them below.
¹: The pattern is: a group (marked by (...)) of 0 or more (hence the *) characters other than a newline (hence [^\n]: \n is a newline, and a leading ^ inside the brackets means "anything except"), followed by a second group containing the searched word, and a third group that is identical to the first.
²: If there is no match at all, we did not get a partial match, so we want a value of 0. We select those cases by using the fact that grep(mydata2$word[i], word) returns an integer vector of length 0 when there is no match.
³: "\\1" and "\\3" select the first and third groups of the pattern. If it is a perfect match, there won't be any "leftovers" of word (what I called the 'original word') after we "take away" the searched word (group 2), so groups 1 and 3 will both be empty (i.e. equal to ""). That line of code tests whether both groups are empty at the same time (a full match) and negates the result (hence the !). As we already marked non-matches as 0 with the if statement, what remains is the partial matches.
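The three notes above can be sketched roughly as follows. This is a minimal reconstruction of the idea in base R, not the original code. The helper name partial_match_flag is hypothetical, and the sketch assumes that the text is split on whitespace, contains no punctuation, and that the searched words contain no regex metacharacters. It returns an integer flag rather than the character "1"/"0" used in the question.

```r
mydata <- data.frame(
  doc_id = c("doc1", "doc1", "doc2", "doc3", "doc3", "doc4", "doc4"),
  word   = c("apple", "apples", "chicken", "banana", "bananas", "veggie", "veggies"),
  text   = c("yesterday I ate apples", "yesterday I ate apples",
             "yesterday I ate chicken", "yesterday I ate bananas",
             "yesterday I ate bananas", "yesterday I ate veggies",
             "yesterday I ate veggies"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 1 if `word` occurs in `text` only as part of a
# longer word, 0 if it is absent or appears as a complete word.
partial_match_flag <- function(word, text) {
  # Split the text into its "original words" (assumes whitespace-separated).
  tokens <- strsplit(text, "[[:space:]]+")[[1]]

  # (2) No token contains the searched word: not a partial match.
  hits <- grep(word, tokens, value = TRUE)
  if (length(hits) == 0) return(0L)

  # (1) Three groups: leftovers before, the searched word, leftovers after.
  pattern <- paste0("^([^\n]*)(", word, ")([^\n]*)$")

  # (3) Extract groups 1 and 3; both empty means a full match.
  before <- sub(pattern, "\\1", hits)
  after  <- sub(pattern, "\\3", hits)
  as.integer(!any(before == "" & after == ""))
}

# Apply the helper row by row; mapply pairs each word with its text.
mydata$partial_match <- mapply(partial_match_flag, mydata$word, mydata$text,
                               USE.NAMES = FALSE)
print(mydata[, c("word", "partial_match")])
```

Row-wise mapply should be adequate for a few thousand records. If stringr is available, str_detect(text, pattern) is vectorised over both arguments and may be a faster alternative; note that it expects the text first and the pattern second.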