The following data surprisingly does not match. I was expecting the distance to be 5, but even at 7 I get no match:
library(fuzzyjoin)
one <- as.data.frame("Other field crops (non-organic)")
names(one) <- "A"
two <- as.data.frame("other_field_crops_non_organic")
names(two) <- "A"
stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 7, ignore_case=TRUE)
A.x A.y
1 Other field crops (non-organic) <NA>
Only at 10 do I get a match:
stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 10, ignore_case=TRUE)
A.x A.y
1 Other field crops (non-organic) other_field_crops_non_organic
Could someone explain to me why this distance is larger than 9? Does it have to do with the brackets? And if so, how can I circumvent this issue without removing the brackets?
EDIT
library(fuzzyjoin)
one <- as.data.frame("Other field crops non-organic")
names(one) <- "A"
two <- as.data.frame("other_field_crops_non_organic")
names(two) <- "A"
stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 5, ignore_case=TRUE)
A.x A.y
1 Other field crops non-organic <NA>
Even without the brackets I cannot get the distance within 5.
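The problem comes down to the method you are using to calculate the string distance. You are using the lcs (longest common substring) method, which in effect only allows deletions and insertions rather than substitutions. The stringdist documentation describes the lcs distance as equivalent to an edit distance allowing only deletions and insertions, each with weight one. So when we convert spaces to underscores, we incur a weighting of 2 per substitution (one deletion plus one insertion).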
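As a minimal sketch of that cost, assuming the stringdist package is installed, a single space-to-underscore change can be scored directly. The short strings and the name lcs_cost are made up for this illustration and are not from the question's data.

```r
library(stringdist)

# One space replaced by one underscore. The lcs method cannot substitute,
# so it has to delete one character and insert another.
lcs_cost <- stringdist("field crops", "field_crops", method = "lcs")
lcs_cost
```

The two strings share the 10-character sequence "fieldcrops" and each has one unpaired character, so this should print 2.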
This is in contrast to the default "osa" method, which, like the Levenshtein distance and the R function adist, allows direct substitutions, with only a 1-point weighting.

You can compare how the different stringdist methods score your two strings. To simplify further, make both lowercase, since you are already specifying ignore_case in your left join.

On these strings the Hamming distance is infinite, since your strings are of different lengths, and osa (the default method) is only 6 (4 substitutions and two additions of parentheses), but lcs requires 10 (4 removals of underscores, 3 additions of spaces, one addition of a hyphen, and two additions of parentheses). If this string pair is representative of your data, you might want to switch to "osa".

Created on 2022-04-14 by the reprex package (v2.0.1)
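The method comparison described above can be reproduced with a sketch along these lines, assuming the stringdist package is installed. The variable names a, b, methods and dists are chosen for this illustration only, and only a handful of the available methods are included.

```r
library(stringdist)

# Lowercase the first string to mirror ignore_case = TRUE in the join.
a <- tolower("Other field crops (non-organic)")
b <- "other_field_crops_non_organic"

# Score the same pair under several edit-based methods.
methods <- c("osa", "lv", "dl", "hamming", "lcs")
dists <- sapply(methods, function(m) stringdist(a, b, method = m))
dists
```

If the distances come out as described (6 for osa, 10 for lcs, Inf for hamming), then changing the original call to method = "osa" with max_dist = 6 should be enough to match this pair while keeping the brackets in the data.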