I am trying to compute a similarity metric between strings with the Jaro-Winkler algorithm in Python. I am using an Anaconda environment deployed on an Alibaba Cloud ECS instance.

The sample code I am using to find similarity:

from pyjarowinkler import distance
print ("Average Score ---->", distance.get_jaro_distance("hello", "haloa"))

Average Score ---->0.76

When I process 600k records it takes more than 20 minutes, which is too slow for a large number of records. Is there any other way to compute a similarity metric between the records with low overhead and high accuracy?

There are 1 best solutions below

On BEST ANSWER

The Jaro-Winkler distance indicates the similarity score between two strings. The Jaro measure is the weighted sum of the percentage of matched characters from each string and the percentage of transposed characters. Winkler increased this measure for strings whose initial characters match.
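To make that description concrete, here is a minimal pure-Python sketch of the textbook Jaro and Jaro-Winkler formulas. It is not the pyjarowinkler source: the function names `jaro` and `jaro_winkler`, the `max_prefix` parameter, and the matching-window rule are assumptions taken from the common definition of the algorithm, and the code assumes Python 3 division.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1], following the common textbook definition."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matches if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Walk the matched characters in order; each out-of-order pair
    # contributes half a transposition.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, scaling=0.1, max_prefix=4):
    """Boost the Jaro score for a shared prefix of up to max_prefix chars."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * scaling * (1.0 - score)
```

For "hello" and "haloa" this sketch finds three matched characters (h, l, o) and no transpositions, so the Jaro score is (3/5 + 3/5 + 3/3) / 3, roughly 0.7333. The shared prefix "h" has length 1, so the Winkler adjustment adds 1 × 0.1 × (1 − 0.7333), giving about 0.76. Both figures agree with the values the library reports for the same pair.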

The original implementation is based on the Jaro Winkler Similarity Algorithm article that can be found on Wikipedia. This Python version of the original implementation is based on the Apache StringUtils library.

Unit tests similar to those in the StringUtils library were used to validate the implementation.

>>> from pyjarowinkler import distance
>>> # Scaling is 0.1 by default
>>> print distance.get_jaro_distance("hello", "haloa", winkler=True, scaling=0.1)
0.76
>>> print distance.get_jaro_distance("hello", "haloa", winkler=False, scaling=0.1)
0.733333333333

Get more detailed information from this link

I hope this will help you regarding your query.