String match with R: Finding the best possible match

856 Views Asked by pch919 At 22 June 2025 at 22:53

I have two vectors of words.

Corpus<- c('animalada', 'fe', 'fernandez', 'ladrillo')

Lexicon<- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

I need to make the best possible match between the Lexicon and Corpus. I tried many methods. This is one of them.

library(stringr)

match <- paste(Lexicon, collapse = '|^') # I use a stemming method (Snowball), so the words in Lexicon are word roots

test <- str_extract_all(Corpus, match, simplify = TRUE)

test

[,1]
[1,] "animal"
[2,] "fe"
[3,] "fe"
[4,] "ladr"

But, the match should be:

[1,] "animalada"
[2,] "fe"
[3,] "fernandez"
[4,] "ladrillo"

Instead, each word is matched to the first Lexicon entry, in alphabetical order, that fits. By the way, these vectors are a sample of a bigger list that I have.

I didn't try regex() because I'm not sure how it works. Perhaps the solution lies in that direction.

Could you help me to solve this problem? Thank you for your help.


There are 3 best solutions below

Psidom On 23 September 2017 at 01:54

You can order Lexicon by the number of characters the patterns have, in decreasing order, so the best match comes first:

match<- paste(Lexicon[order(-nchar(Lexicon))], collapse = '|^')

test<- str_extract_all(Corpus, match, simplify= T)

test
#     [,1]       
#[1,] "animalada"
#[2,] "fe"       
#[3,] "fernandez"
#[4,] "ladrillo"
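
One detail worth noting: paste(..., collapse = '|^') puts the ^ anchor in front of every alternative except the first one. A minimal sketch of a variant that anchors all of them through a single group, assuming the Lexicon entries contain no regex metacharacters; the names by_length, pattern and best are only illustrative and are not part of the original answer:

```r
library(stringr)

Corpus  <- c("animalada", "fe", "fernandez", "ladrillo")
Lexicon <- c("animal", "animalada", "fe", "fernandez", "ladr", "ladrillo")

# Longest entries first, so "animalada" is tried before "animal".
by_length <- Lexicon[order(-nchar(Lexicon))]

# A single ^ in front of a non-capturing group anchors every alternative,
# including the first one.
pattern <- paste0("^(?:", paste(by_length, collapse = "|"), ")")

# str_extract() returns one value per Corpus element (NA when nothing matches).
best <- str_extract(Corpus, pattern)
best
```

Because str_extract() returns a plain character vector rather than a matrix, the result lines up one-to-one with Corpus.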

Santosh M. On 23 September 2017 at 01:59

You can just use the match() function.

Index <- match(Corpus, Lexicon)

Index
[1] 2 3 4 6

Lexicon[Index]
[1] "animalada" "fe"        "fernandez" "ladrillo"

pch919 On 27 September 2017 at 03:16

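I tried both methods, and the one that worked was the one suggested by @Psidom. match() compares whole strings, so it only finds Corpus words that appear verbatim in the Lexicon; it cannot match a root against the beginning of a longer word. Searching without the ^ anchor has the opposite problem: it finds the match in any part of the word, not necessarily at the beginning. For instance:

Corpus<- c('tambien')
Lexicon<- c('bien')
match(Corpus,Lexicon)

This returns NA, whereas an unanchored search for 'bien' would hit 'tambien', which is not correct.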
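
A minimal sketch of the distinction in this example, using only base R: exact comparison with match(), a search anywhere in the word with grepl(), and a prefix test with startsWith(). The grepl() and startsWith() calls are not from the thread; they are added here only for illustration, and the variable names are hypothetical.

```r
Corpus  <- c("tambien")
Lexicon <- c("bien")

# match() compares whole strings: "tambien" is not an element of Lexicon,
# so the lookup yields NA.
exact <- match(Corpus, Lexicon)

# Fixed-string search anywhere in the word: "bien" occurs inside "tambien".
anywhere <- grepl(Lexicon, Corpus, fixed = TRUE)

# Prefix test only: "tambien" does not start with "bien".
at_start <- startsWith(Corpus, Lexicon)

exact
anywhere
at_start
```

Only the prefix test corresponds to what the ^ anchor does in the regex approach.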

Again, thank you both for your help!!
