This seems really simple but for some reason, I don't understand the behavior of agrep
fuzzy matching involving substitutions. Two substitutions produce a match as expected when all=2
is specified, but not when substitutions=2
. Why is this?
# Finds a match as expected
agrep("abcdeX", "abcdef", value = T,
max.distance = list(sub=1, ins=0, del=0))
#> [1] "abcdef"
# Doesn't find a match as expected
agrep("abcdXX", "abcdef", value = T,
max.distance = list(sub=1, ins=0, del=0))
#> character(0)
# Finds a match as expected
agrep("abcdXX", "abcdef", value = T,
max.distance = list(all=2))
#> [1] "abcdef"
# Doesn't find a match UNEXPECTEDLY
agrep("abcdXX", "abcdef", value = T,
max.distance = list(sub=2, ins=0, del=0))
#> character(0)
Created on 2021-06-03 by the reprex package (v2.0.0)
all
is an upper limit which always applies, regardless of othermax.distance
controls (other thancost
). It defaults to 10%.There's a bit of a gotcha that the fractional mode of setting
all
switches to the integer mode at 1.When you suppress
all
by setting it to just less than 1, the limits on the distance mode apply.The purpose of setting the cost is to allow you to move around the mutation-space at different rates in different directions. This is going to depend on your use case. For example some language dialects might be more likely to add letters. You might chose to let a deletion cost two insertions. By default, all are equally weighted when
costs = NULL
, i.e.costs = c(ins = 1, del = 1, sub = 1)
.EDIT: regarding your comment about why some patterns match and others don't, the 10% refers to the number of characters in the pattern, rounding up.
The second pattern is only 10 characters, so only one substitution is allowed.