What is the most efficient way to dedupe a Pandas dataframe that has typos?


I have a dataframe of names and addresses that I need to dedupe. The catch is that some of these fields might have typos, even though they are still duplicates. For example, suppose I had this dataframe:

  index  name          zipcode
-------  ----------  ---------
      0  john doe        12345
      1  jane smith      54321
      2  john dooe       12345
      3  jane smtih      54321

The typos could occur in either name or zipcode, but let's just worry about the name field for this question. Obviously rows 0 and 2 are duplicates, as are rows 1 and 3. But what is the most computationally efficient way to figure this out?

I have been using the fuzzywuzzy package, which calculates a similarity score based on the Levenshtein distance between two strings. That works great when the dataframe is small and I can iterate through it via:

from fuzzywuzzy import fuzz

for index, row in df.iterrows():
    for index2, row2 in df.iterrows():
        ratio = fuzz.partial_ratio(row['name'], row2['name'])

        if ratio > 90:  # A good threshold for single character typos on names
            pass  # Do something to declare a match and throw out the duplicate

Obviously this is not an approach that will scale well, and unfortunately I need to dedupe a dataframe that is about 7M rows long. This only gets worse if I also need to dedupe potential typos in the zipcode. Yes, I could do this with .itertuples(), which would give me a factor of ~100 speed improvement, but am I missing something more obvious than this clunky O(n^2) solution?

Are there more efficient ways I could go about deduping this noisy data? I have looked into the dedupe package, but that requires labeled data for supervised learning and I don't have any nor am I under the impression that this package will handle unsupervised learning. I could roll my own unsupervised text clustering algorithm, but I would rather not have to go that far if there is an existing, better approach.

There are 3 best solutions below


The pandas-dedupe package can help you with your task.

pandas-dedupe works as follows: first it asks you to label a handful of records it is most uncertain about. Afterwards, it uses this knowledge to resolve duplicate entities. And that is it :)

You can try the following:

import pandas as pd
from pandas_dedupe import dedupe_dataframe

df = pd.DataFrame.from_dict({
    'name': ['john', 'mark', 'frank', 'jon', 'john'],
    'zip': ['11', '22', '33', '11', '11'],
})

# Dedupe on both columns; canonicalize=True asks pandas-dedupe to also
# produce a canonical version of each matched field.
dd = dedupe_dataframe(df, ['name', 'zip'], canonicalize=True, sample_size=1)

The console will then ask you to label some examples. If a pair are duplicates, type 'y'; otherwise type 'n'. Once you are done, type 'f' for finished. It will then perform deduplication on the entire dataframe.


For zipcodes, I can fairly confidently state that you can't detect typos without some mechanism for field validation (two zipcodes could look very close and both be valid zipcodes).

If the data is sorted, and with some assumptions about where the typo is made (the first letter is highly unlikely to be wrong except in cases of common substitutions), you might be able to take advantage of that and search within distinct per-letter chunks. If you assume the same for the last name, you can divide the data into 26^2 distinct subgroups and only search for matches within each subgroup.
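Roughly, that blocking idea could look like the sketch below (untested; it assumes the 'name' column from the question holds single-space "first last" strings and reuses the 90 threshold):

from fuzzywuzzy import fuzz

# Block on the first letters of the first and last name, so the expensive
# pairwise comparison only happens inside each of the (at most) 26^2 subgroups.
parts = df['name'].str.extract(r'(\w+)\s+(\w+)')
blocks = parts[0].str[0] + parts[1].str[0]

matches = []
for _, group in df.groupby(blocks):
    names = group['name'].tolist()
    idx = group.index.tolist()
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if fuzz.partial_ratio(names[i], names[j]) > 90:
                matches.append((idx[i], idx[j]))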

You could also try an approach that looks only at the set of ORIGINAL first names and last names. If you're searching 7 million items and you have 60 thousand "Johns", you only need to compare against "Jhon" once to find the error, then search for "Jhon" and remove or fix it. But this assumes, once again, that you break the name up into first-name and last-name series within the frame (using pandas' str.extract() with "([\w]+) ([\w]+)" or some such regex, as the data demands).
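A minimal sketch of that unique-value idea (again untested; it assumes single-space "first last" names, and resolving each rare spelling to the most similar, more frequent spelling is just one way you might fix the errors):

from fuzzywuzzy import fuzz

# Split the full name into separate first/last columns, as suggested above.
df[['first', 'last']] = df['name'].str.extract(r'([\w]+) ([\w]+)')

# Compare the unique first names against each other instead of all 7M rows,
# mapping rare spellings to a similar, more frequent spelling.
counts = df['first'].value_counts()          # sorted most frequent first
uniques = counts.index.tolist()

corrections = {}
for i, rare in enumerate(uniques):
    for common in uniques[:i]:               # only consider more frequent names
        if fuzz.ratio(rare, common) > 90:
            corrections[rare] = common
            break

df['first'] = df['first'].replace(corrections)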


The string-grouper package is perfect for this. It uses TF-IDF with n-grams under the hood and is much faster than Levenshtein distance.

from typing import Dict, List

import pandas as pd
from string_grouper import group_similar_strings

def group_strings(strings: List[str]) -> Dict[str, str]:
    # group_similar_strings returns, for each input string, the representative
    # string of the group it was assigned to.
    series = group_similar_strings(pd.Series(strings))

    name_to_canonical = {}
    for i, s in enumerate(strings):
        deduped = series.iloc[i]
        if s != deduped:
            name_to_canonical[s] = deduped

    return name_to_canonical
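One possible way to apply the resulting mapping back to the dataframe from the question (the 'name' and 'zipcode' column names come from the question; keeping the first occurrence of each cleaned pair is just one choice):

mapping = group_strings(df['name'].tolist())
df['name'] = df['name'].replace(mapping)   # rewrite misspellings to their canonical form
df = df.drop_duplicates(subset=['name', 'zipcode'])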