I am trying to match two string columns containing food descriptions (foods1 and foods2). I applied an algorithm that weights words by frequency, so that less frequent words carry more weight, but it fails because it does not recognise which object is being described.
For instance, the foods1 item "bagel with raisins" gets matched to the foods2 item "salad with raisins" rather than to "bagel", because "raisins" is a less frequent word. However, as an actual object, a "bagel with raisins" is closer to a "bagel" than to a "salad with raisins".
Example in R:
library(fedmatch)

# Two columns of food descriptions, each with its own unique key
foods1 <- c('bagel plain', 'bagel with raisins and olives', 'hamburger', 'bagel with olives', 'bagel with raisins')
foods1_id <- seq_along(foods1)
foods2 <- c('bagel', 'pizza', 'salad with raisins', 'tuna and olives')
foods2_id <- letters[seq_along(foods2)]

# Fuzzy merge using the frequency-weighted Jaccard distance
fuzzy_results <- merge_plus(
  data1 = data.frame(foods1_id, foods1, stringsAsFactors = FALSE),
  data2 = data.frame(foods2_id, foods2, stringsAsFactors = FALSE),
  by.x = "foods1",
  by.y = "foods2",
  match_type = "fuzzy",
  fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard", nthread = 2, maxDist = .75),
  unique_key_1 = "foods1_id",
  unique_key_2 = "foods2_id"
)
Results: row 3 matches foods1 "bagel with raisins" to foods2 "salad with raisins". Likewise, the last row matches foods1 "bagel with raisins and olives" to foods2 "tuna and olives":
fuzzy_results
$matches
   foods2_id foods1_id                        foods1             foods2
1:         a         1                   bagel plain              bagel
2:         a         4             bagel with olives              bagel
3:         c         5            bagel with raisins salad with raisins
4:         d         2 bagel with raisins and olives    tuna and olives
Is there any fuzzy matching algorithm in R or Python that can understand which objects are being matched (so that "bagel" is recognised as closer to "bagel with raisins" than "salad with raisins" is)?
To expand on my comment: you can try word embeddings, an NLP concept in which a word or sentence is represented as a numeric vector. Put simply, embeddings are generated in a way that captures semantic relationships between words, so similar words end up close together, in the same cluster.
For a small dataset like yours this is probably overkill, but once you have generated the embeddings you can use cosine similarity to find, for each food item in one list, the closest item in the other.
There are many pre-trained models you can use, though you may have to do a little research to find the one most suitable for your use case (you can also fine-tune one if you have your own data, but that's another story).
See an unoptimized Python implementation below:
Output:
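A minimal sketch of the embeddings-plus-cosine-similarity idea, under stated assumptions: it relies on the third-party sentence-transformers package and the pre-trained model all-MiniLM-L6-v2, neither of which is named above, and the helper names (cosine_similarity, best_matches, st_embed) are made up for illustration. Any other embedding model could be swapped in by passing a different embed function.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Embedder = Callable[[Sequence[str]], Sequence[Vector]]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matches(
    queries: Sequence[str],
    candidates: Sequence[str],
    embed: Embedder,
) -> List[Tuple[str, str, float]]:
    """For each query, return (query, most similar candidate, cosine score)."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    query_vecs = embed(queries)
    cand_vecs = embed(candidates)
    results = []
    for query, q_vec in zip(queries, query_vecs):
        scores = [cosine_similarity(q_vec, c_vec) for c_vec in cand_vecs]
        best = max(range(len(candidates)), key=scores.__getitem__)
        results.append((query, candidates[best], float(scores[best])))
    return results


def st_embed(texts: Sequence[str], model_name: str = "all-MiniLM-L6-v2"):
    """Hypothetical helper: embed texts with sentence-transformers.

    Assumption: the package is installed and the model can be downloaded.
    The import is done lazily so the rest of the module works without it.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(texts))


foods1 = ["bagel plain", "bagel with raisins and olives", "hamburger",
          "bagel with olives", "bagel with raisins"]
foods2 = ["bagel", "pizza", "salad with raisins", "tuna and olives"]

# To run against a real model (downloads it on first use):
# for food, match, score in best_matches(foods1, foods2, st_embed):
#     print(f"{food!r} -> {match!r} (cosine {score:.2f})")
```

Calling best_matches(foods1, foods2, st_embed) returns one (item, best match, score) tuple per foods1 entry. Whether "bagel with raisins" actually lands on "bagel" depends on the model chosen, so it is worth inspecting the scores before trusting the matches; a minimum-score threshold could also be added to leave items such as "hamburger" unmatched.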