String match with R: Finding the best possible match

I have two vectors of words.

Corpus<- c('animalada', 'fe', 'fernandez', 'ladrillo')

Lexicon<- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

I need to make the best possible match between the Lexicon and Corpus. I tried many methods. This is one of them.

library(stringr)

match<- paste(Lexicon,collapse= '|^') # I use the stemming method (snowball), so the words in Lexicon are roots of words

test<- str_extract_all(Corpus,match,simplify= T)

test

[,1]
[1,] "animal"
[2,] "fe"
[3,] "fe"
[4,] "ladr"

But, the match should be:

[1,] "animalada"
[2,] "fe"
[3,] "fernandez"
[4,] "ladrillo"

Instead, each word is matched with the first fitting entry in my alphabetically ordered Lexicon. By the way, these vectors are a sample of a bigger list that I have.

I didn't try regex() because I'm not sure how it works. Perhaps the solution lies in that direction.

Could you help me to solve this problem? Thank you for your help.

There are 3 best solutions below

You can order Lexicon by the number of characters in each entry, in decreasing order, so that the longest candidate is tried first and the best match wins:

match<- paste0('^', Lexicon[order(-nchar(Lexicon))], collapse = '|') # anchor every entry, including the first

test<- str_extract_all(Corpus, match, simplify= T)

test
#     [,1]       
#[1,] "animalada"
#[2,] "fe"       
#[3,] "fernandez"
#[4,] "ladrillo" 
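
A self-contained sketch of the same idea, for anyone who wants to run it as is. It assumes stringr is installed and that the Lexicon entries contain no regex metacharacters. The names `ordered`, `pattern` and `best` are only illustrative. The pattern is built with paste0() so that every alternative, including the first one, carries the '^' anchor; stringr tries the alternatives from left to right, which is why the longest entries have to come first.

```r
library(stringr)

Corpus  <- c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

# Longest entries first: the first alternative that fits is the one returned
ordered <- Lexicon[order(-nchar(Lexicon))]

# Prefix each entry with '^' so it can only match at the start of the word
pattern <- paste0('^', ordered, collapse = '|')
pattern
# "^animalada|^fernandez|^ladrillo|^animal|^ladr|^fe"

# One match per word is enough here, so str_extract() replaces str_extract_all()
best <- str_extract(Corpus, pattern)
best
# "animalada" "fe" "fernandez" "ladrillo"
```

str_extract() returns a plain character vector with NA for words that match no entry, which may be easier to handle on a bigger list than the matrix from str_extract_all().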

You can just use the match() function.

Index <- match(Corpus, Lexicon)

Index
[1] 2 3 4 6

Lexicon[Index]
[1] "animalada" "fe"        "fernandez" "ladrillo"
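
A small check of how far this goes: base R's match() compares whole strings, so it only helps when a word appears in Lexicon exactly as written. In the sketch below, 'ladrillos' is a hypothetical inflected form added for illustration; it is not part of the original data.

```r
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

# 'ladrillos' is a made-up word that is not in Lexicon
words <- c('ladrillo', 'ladrillos')

idx <- match(words, Lexicon)
idx
# 6 NA

Lexicon[idx]
# "ladrillo" NA
```

If the words in the corpus can be longer than the roots in the lexicon, a prefix match with a regular expression is probably the better fit.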

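I tried both methods, and the one that worked was the one suggested by @Psidorm. The match() function does not do what I need: it compares whole strings, so it only finds a Corpus word that is identical to a Lexicon entry and cannot do partial matching. For instance:

Corpus<- c('tambien')
Lexicon<- c('bien')
match(Corpus,Lexicon)

The result is NA. Matching in any part of the word, not necessarily at the beginning, is what a pattern without the '^' anchor does, and that is not correct for my case either.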
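
To see what the '^' anchor changes for this pair of words, here is a minimal sketch, assuming stringr is installed. The names `word`, `root`, `unanchored` and `anchored` are only illustrative.

```r
library(stringr)

word <- 'tambien'
root <- 'bien'

# Without an anchor the root is found anywhere inside the word
unanchored <- str_extract(word, root)
unanchored
# "bien"

# With '^' the root must sit at the start of the word, so nothing matches
anchored <- str_extract(word, paste0('^', root))
anchored
# NA
```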

Again, thank you both for your help!!