I have a huge dataset of 2 million records, and I want to match the records based on fuzzy logic. My original dataframe looks like this:
+---------+---------------+
| name| address|
+---------+---------------+
| Arvind| Kathmandu|
| Arvind| Kathmands|
| Arbind| Kathmandu|
| Arvinds| Kathmandu|
| Arveen| Kathmandu|
| Arvins| Kathmandu|
| Arvind|Kathmandu Nepal|
| Abhishek| Pokhara|
|Abhisheks| Pokhara|
|Abhishek1| Pokhara|
|Abhishek2| Pokhara|
|Abhishek3| Pokhara|
+---------+---------------+
I tried using a PySpark window function, but window functions partition based on exact matches, and I want my records matched based on fuzzy logic. I want my output as a dataframe like this:
+---------+---------------+---------------------+
|     name|        address|uuid_for_match_record|
+---------+---------------+---------------------+
|   Arvind|      Kathmandu|               uuid_1|
|   Arvind|      Kathmands|               uuid_1|
|   Arbind|      Kathmandu|               uuid_1|
|  Arvinds|      Kathmandu|               uuid_1|
|   Arveen|      Kathmandu|               uuid_1|
|   Arvins|      Kathmandu|               uuid_1|
|   Arvind|Kathmandu Nepal|               uuid_1|
| Abhishek|        Pokhara|               uuid_2|
|Abhisheks|        Pokhara|               uuid_2|
|Abhishek1|        Pokhara|               uuid_2|
|Abhishek2|        Pokhara|               uuid_2|
|Abhishek3|        Pokhara|               uuid_2|
+---------+---------------+---------------------+
How can this be achieved on the huge 2-million-record dataset?
I have taken inspiration from this blog post to write the following code:
https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
The `cluster_names` function clusters the strings within the list based on the `cluster_threshold` value. You can tweak this value to get good results. You can also play around with `shingling_width` in `name_to_features`: you can create features of width 2, 3, 4, 5 and so on and concatenate them together. Once you are satisfied with your clusters, you can then do fuzzywuzzy matching (this library has been renamed to thefuzz) to find more exact matches: https://github.com/seatgeek/thefuzz
First install the `simhash` Python library (`pip install simhash`), then run the following code.