I have a dataframe of names and addresses that I need to dedupe. The catch is that some of these fields contain typos, so rows can still be duplicates even when the text doesn't match exactly. For example, suppose I had this dataframe:
index  name        zipcode
-----  ----------  -------
0      john doe    12345
1      jane smith  54321
2      john dooe   12345
3      jane smtih  54321
The typos could occur in either name or zipcode, but let's just worry about the name field for this question. Obviously 0 and 2 are duplicates, as are 1 and 3. But what is the most computationally efficient way to figure this out?
I have been using the fuzzywuzzy package, which computes Levenshtein-based distances between two strings. That works great when the dataframe is small enough that I can iterate through it via:
from fuzzywuzzy import fuzz

duplicates = set()
for index, row in df.iterrows():
    for index2, row2 in df.iterrows():
        if index2 <= index:
            continue  # skip self-comparisons and pairs already checked
        ratio = fuzz.partial_ratio(row['name'], row2['name'])
        if ratio > 90:  # a good threshold for single-character typos in names
            # Declare a match and mark the later row as a duplicate to throw out
            duplicates.add(index2)
Obviously this is not an approach that will scale well, and unfortunately I need to dedupe a dataframe that is about 7M rows long. It only gets worse if I also need to dedupe potential typos in the zipcode. Yes, I could do this with .itertuples(), which would give me a factor of ~100 speed improvement, but am I missing something more obvious than this clunky O(n^2) solution?
Are there more efficient ways I could go about deduping this noisy data? I have looked into the dedupe package, but that requires labeled data for supervised learning and I don't have any, nor am I under the impression that the package will handle unsupervised learning. I could roll my own unsupervised text clustering algorithm, but I would rather not have to go that far if there is an existing, better approach.
For zipcodes, I can fairly confidently state that you can't detect typos without some mechanism for field validation: two zipcodes could look very close and both be valid zipcodes.
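For illustration only, here is a minimal sketch of what that validation might look like, assuming you have some reference set of valid zipcodes to check against (VALID_ZIPCODES below is a made-up placeholder, not anything from your data):

from fuzzywuzzy import fuzz

# Hypothetical reference set of known-good zipcodes; in practice this would
# come from an authoritative source (a USPS lookup table or similar).
VALID_ZIPCODES = {"12345", "54321", "12346"}

def suggest_zipcode(z):
    """Return None if the zipcode validates, else the closest valid zipcode as a guess."""
    if z in VALID_ZIPCODES:
        return None
    return max(VALID_ZIPCODES, key=lambda v: fuzz.ratio(z, v))

df['zip_suggestion'] = df['zipcode'].astype(str).map(suggest_zipcode)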
If the data is sorted, and with some assumptions about where the typo is made (the first letter is highly unlikely to be wrong except in cases of common substitutions), you might be able to take advantage of that and search the names in distinct per-letter chunks. If you assume the same for the last name, you can divide the data into 26^2 distinct subgroups and only search for matches within each subgroup, as in the sketch below.
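Here is a rough sketch of that blocking idea, assuming names look like "first last" and reusing the 90 threshold from your question; the column names are just illustrative:

from itertools import combinations
from fuzzywuzzy import fuzz

# Block key: first letter of the first name + first letter of the last name,
# giving up to 26^2 subgroups. Only rows sharing a block get compared.
parts = df['name'].str.split()
df['block'] = parts.str[0].str[0] + parts.str[-1].str[0]

duplicate_pairs = []
for _, group in df.groupby('block'):
    # Pairwise comparison is now limited to each (usually small) block.
    for (i, a), (j, b) in combinations(group['name'].items(), 2):
        if fuzz.partial_ratio(a, b) > 90:
            duplicate_pairs.append((i, j))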
You could also try an approach that looks only at the set of ORIGINAL first names and last names. If you're searching 7 million items and you have 60 thousand "Johns", you only need to compare against "Jhon" once to find the error, then search for "Jhon" and remove or fix it. But this assumes, once again, that you break the frame up into first-name and last-name series (using pandas' str.extract(), with "([\w]+) ([\w]+)" or some such as your regex, as the data demands). A sketch follows.
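A sketch of that second idea: compare the unique first names against each other (so the 60k "john"s collapse to a single entry), then map the rare variants back onto the frame. The regex and the 90 threshold come straight from your question; treating the rarer spelling as the typo is my own assumption:

from itertools import combinations
from fuzzywuzzy import fuzz

# Split "john doe" into first/last name columns.
names = df['name'].str.extract(r'([\w]+) ([\w]+)')
names.columns = ['first', 'last']

# Compare unique first names only; the pairwise work is over the (much smaller)
# set of distinct values rather than all 7M rows.
counts = names['first'].value_counts()
corrections = {}
for a, b in combinations(counts.index, 2):
    if fuzz.ratio(a, b) > 90:
        rare, common = (a, b) if counts[a] < counts[b] else (b, a)
        corrections[rare] = common  # assume the rarer spelling is the typo

names['first'] = names['first'].replace(corrections)
# Repeat the same procedure for names['last'], then dedupe on the cleaned columns.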