I have a dataframe of names and addresses that I need to dedupe. The catch is that some of these fields might have typos, even though they are still duplicates. For example, suppose I had this dataframe:
index  name        zipcode
-----  ----------  -------
0      john doe    12345
1      jane smith  54321
2      john dooe   12345
3      jane smtih  54321
The typos could occur in either name or zipcode, but let's just worry about the name one for this question. Obviously 0 and 2 are duplicates as are 1 and 3. But what is the computationally most efficient way to figure this out?
I have been using the Levenshtein-based matching from the fuzzywuzzy package to calculate the distance between two strings, which works great when the dataframe is small and I can iterate through it via:
```python
from fuzzywuzzy import fuzz

duplicates = set()
for index, row in df.iterrows():
    for index2, row2 in df.iterrows():
        if index2 <= index:
            continue  # skip self-pairs and pairs already compared
        ratio = fuzz.partial_ratio(row['name'], row2['name'])
        if ratio > 90:  # a good threshold for single-character typos on names
            duplicates.add(index2)  # declare a match and mark the later row
df = df.drop(duplicates)  # throw out the duplicates
```
Obviously this is not an approach that will scale well, and unfortunately I need to dedupe a dataframe that is about 7M rows long; this only gets worse if I also need to handle potential typos in the zipcode. Yes, I could switch to `.itertuples()`, which would give me a factor of ~100 speed improvement, but am I missing something more obvious than this clunky O(n^2) solution?
Are there more efficient ways I could go about deduping this noisy data? I have looked into the dedupe package, but that requires labeled data for supervised learning and I don't have any nor am I under the impression that this package will handle unsupervised learning. I could roll my own unsupervised text clustering algorithm, but I would rather not have to go that far if there is an existing, better approach.
The pandas-dedupe package can help you with your task.
pandas-dedupe works as follows: first it asks you to label a handful of the records it is most confused about; afterwards, it uses this knowledge to resolve duplicate entities. And that is it :)
You can try the following:
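A minimal sketch of that workflow, using the package's `dedupe_dataframe` entry point on the example data from the question (the call itself launches an interactive labeling session, so it is left for you to invoke in a console):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john doe', 'jane smith', 'john dooe', 'jane smtih'],
    'zipcode': ['12345', '54321', '12345', '54321'],
})

def run_dedupe(frame):
    # Requires `pip install pandas-dedupe`. The first run prompts you to
    # label candidate pairs in the console; the learned settings are cached
    # to disk, so subsequent runs are fully automatic.
    import pandas_dedupe
    return pandas_dedupe.dedupe_dataframe(frame, ['name', 'zipcode'])

# In an interactive session:
# deduped = run_dedupe(df)
```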
The console will then ask you to label some examples. If the pair shown is a duplicate, press 'y'; otherwise press 'n'; once done, press 'f' for finished. It will then perform deduplication on the entire dataframe.
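The returned dataframe is annotated with `cluster id` and `confidence` columns, which is how pandas-dedupe labels its output; one way to collapse each cluster to a single row from there is a plain `drop_duplicates`. A sketch with a hand-built stand-in for the dedupe output (mimicking the example from the question, where rows 0/2 and 1/3 match):

```python
import pandas as pd

# Hypothetical output of pandas_dedupe.dedupe_dataframe: matched rows
# share a cluster id, with a confidence score for the assignment.
deduped = pd.DataFrame({
    'name': ['john doe', 'jane smith', 'john dooe', 'jane smtih'],
    'zipcode': ['12345', '54321', '12345', '54321'],
    'cluster id': [0, 1, 0, 1],
    'confidence': [0.95, 0.93, 0.95, 0.93],
})

# Keep the first row of each cluster and drop the rest.
result = deduped.drop_duplicates(subset='cluster id')
```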