I have two vectors of words.
Corpus <- c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')
I need to find the best possible match in Lexicon for each word in Corpus. I have tried many methods; this is one of them.
library(stringr)
# I use the stemming method (Snowball), so the words in Lexicon are word stems
match <- paste(Lexicon, collapse = '|^')
test <- str_extract_all(Corpus, match, simplify = TRUE)
test
[,1]
[1,] "animal"
[2,] "fe"
[3,] "fe"
[4,] "ladr"
But the match should be:
[1,] "animalada"
[2,] "fe"
[3,] "fernandez"
[4,] "ladrillo"
Instead, each word is matched to the first entry, in alphabetical order, of my Lexicon that fits. By the way, these vectors are just a sample of a much bigger list that I have.
I didn't try regex() because I'm not sure how it works. Perhaps the solution lies in that direction.
Could you help me solve this problem? Thank you for your help.
You can order Lexicon by the number of characters the patterns have, in decreasing order, so the best match comes first:
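A minimal sketch of that idea, using the sample vectors from the question. It assumes the Lexicon entries contain no regex metacharacters. The names ordered, pattern and best are only illustrative. It also swaps in str_extract() for str_extract_all(), since an anchored pattern can match at most once per word.

```r
library(stringr)

Corpus  <- c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

# Sort the patterns longest-first: regex alternation takes the first
# alternative that matches, so 'animalada' must be tried before 'animal'.
ordered <- Lexicon[order(nchar(Lexicon), decreasing = TRUE)]

# Anchor the whole alternation once, instead of each alternative separately.
# Assumption: no entry contains characters that are special in a regex.
pattern <- paste0('^(', paste(ordered, collapse = '|'), ')')

# One match per word; NA would mean no Lexicon entry is a prefix of the word.
best <- str_extract(Corpus, pattern)
best
# [1] "animalada" "fe"        "fernandez" "ladrillo"
```

With the longest entries first, 'fernandez' is tried before 'fe' and 'ladrillo' before 'ladr', so each word gets its longest matching stem.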