I am using fuzzy matching to clean up medication data entered by users, and I am using the Jaro-Winkler distance. While testing which package's Jaro-Winkler implementation was faster, I noticed that the default settings do not give identical values. Can anyone help me understand where the difference comes from? Example:
library(RecordLinkage)
library(stringdist)
jarowinkler("advil", c("advi", "advill", "advil", "dvil", "sdvil"))
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), method = "jw")
# [1] 0.9333333 0.9444444 1.0000000 0.9333333 0.8666667
I am assuming it has to do with the weights, and I know I am using the defaults on both. However, if someone with more experience could shed light on what's going on, I would really appreciate it. Thanks!
Documentation:
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf
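Tucked away in the documentation for stringdist is a description of the Jaro-Winkler penalty factor p: with method = "jw" and p > 0, a correction term based on the length of the common prefix (at most four characters) is applied to the Jaro distance, and the documentation notes that Winkler often chose p = 0.1.

However, in stringdist::stringdist, p = 0 by default, so method = "jw" returns the plain Jaro distance unless you set p yourself. Hence the two outputs above differ only for "advi" and "advill", the candidates that share a prefix with "advil", and calling stringdist with method = "jw", p = 0.1 reproduces the RecordLinkage values.

In fact, that value (0.1) is hard-coded in the source of RecordLinkage::jarowinkler.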
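
To make the arithmetic concrete, here is a minimal base-R sketch of the textbook Jaro similarity plus the Winkler prefix adjustment. The helper names jaro_sim and jw_sim are hypothetical (they are not part of either package), and this is not the code that stringdist or RecordLinkage actually run; it only assumes the standard definition, similarity + l * p * (1 - similarity), with l the common-prefix length capped at 4.

```r
# Textbook Jaro similarity (hypothetical helper, not from either package).
jaro_sim <- function(a, b) {
  s <- strsplit(a, "")[[1]]
  t <- strsplit(b, "")[[1]]
  ns <- length(s)
  nt <- length(t)
  if (ns == 0 && nt == 0) return(1)
  if (ns == 0 || nt == 0) return(0)
  # Characters match only if they lie within this window of each other.
  win <- max(0, floor(max(ns, nt) / 2) - 1)
  s_hit <- logical(ns)
  t_hit <- logical(nt)
  for (i in seq_len(ns)) {
    lo <- max(1, i - win)
    hi <- min(nt, i + win)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!t_hit[j] && s[i] == t[j]) {
        s_hit[i] <- TRUE
        t_hit[j] <- TRUE
        break
      }
    }
  }
  m <- sum(s_hit)
  if (m == 0) return(0)
  # Matched characters that appear in a different order, counted in pairs.
  transpositions <- sum(s[s_hit] != t[t_hit]) / 2
  (m / ns + m / nt + (m - transpositions) / m) / 3
}

# Winkler adjustment: boost the similarity by the shared prefix (max 4 chars).
jw_sim <- function(a, b, p = 0.1) {
  s <- strsplit(a, "")[[1]]
  t <- strsplit(b, "")[[1]]
  n <- min(length(s), length(t), 4)
  same <- s[seq_len(n)] == t[seq_len(n)]
  l <- if (all(same)) n else which(!same)[1] - 1
  sim <- jaro_sim(a, b)
  sim + l * p * (1 - sim)
}

cands <- c("advi", "advill", "advil", "dvil", "sdvil")
p_zero  <- unname(sapply(cands, function(x) jw_sim("advil", x, p = 0)))
p_tenth <- unname(sapply(cands, function(x) jw_sim("advil", x, p = 0.1)))

print(p_zero)   # [1] 0.9333333 0.9444444 1.0000000 0.9333333 0.8666667
print(p_tenth)  # [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
```

With p = 0 the sketch matches the stringdist output from the question, and with p = 0.1 it matches the RecordLinkage output. "dvil" and "sdvil" are unchanged in both cases because they share no prefix with "advil", so l = 0.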