Let's assume we have a DataFrame looking like this:

Index  FirstName  Surname  Adress  Source
1      Paul       Baggins  same    good
2      Paaul      Baggins  same    bad
3      Mary       Baggins  same    good
4      Mary       Baggins  same    bad
5      Lucy       Smith    other   bad

We want to clean up the DataFrame. First we filter people living at the same address; we can be sure that addresses are unique for each household. Then we want to delete potential duplicates, because we used different data sources and unfortunately there might be some typing errors in the column "FirstName".

How can we delete the duplicates (in our case index rows 2 and 4)?

I found out that we could delete "exact" duplicates by using

df.drop_duplicates(subset=['FirstName','Surname', 'Adress'], keep='first')

This way index 4 will be deleted, but index 2 survives because of the typo. This is not what I am looking for.
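For reference, this behaviour can be reproduced with a short sketch that rebuilds the sample DataFrame from the table above:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame(
    {"FirstName": ["Paul", "Paaul", "Mary", "Mary", "Lucy"],
     "Surname": ["Baggins", "Baggins", "Baggins", "Baggins", "Smith"],
     "Adress": ["same", "same", "same", "same", "other"],
     "Source": ["good", "bad", "good", "bad", "bad"]},
    index=pd.Index([1, 2, 3, 4, 5], name="Index"),
)

# only the exact duplicate (index 4) is dropped; the typo row (index 2) survives
deduped = df.drop_duplicates(subset=["FirstName", "Surname", "Adress"], keep="first")
print(deduped.index.tolist())  # [1, 2, 3, 5]
```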

To delete index row 2, I want to compare the "FirstName" values of indexes 1 and 2, and I tried the following function:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

similar(text_a, text_b)

Calling similar('Paul', 'Paaul') returns roughly 0.888 (SequenceMatcher's ratio is 2·M/T, here 2·4/9).

But I don't see how to put all this together.

The "manipulated" DataFrame should look like this:

Index  FirstName  Surname  Adress  Source  Similar_to_Index  Ratio
1      Paul       Baggins  same    good    2                 0.8888
2      Paaul      Baggins  same    bad     1                 0.8888
3      Mary       Baggins  same    good    4                 1.0
4      Mary       Baggins  same    bad     3                 1.0
5      Lucy       Smith    other   bad     NaN               NaN

Then index rows 2 and 4 should be deleted by the rule that the relevant ratio is > 0.8 and the column "Source" is labeled "bad". I guess the main problem is how to create the column "Similar_to_Index".

The final result should be like this:

Cleaned Dataframe:

Index  FirstName  Surname  Adress  Source  Similar_to_Index  Ratio
1      Paul       Baggins  same    good    2                 0.8888
3      Mary       Baggins  same    good    4                 1.0
5      Lucy       Smith    other   bad     NaN               NaN

Deleted_Entries_Dataframe:

Index  FirstName  Surname  Adress  Source  Similar_to_Index  Ratio
2      Paaul      Baggins  same    bad     1                 0.8888
4      Mary       Baggins  same    bad     3                 1.0

Thank you very much for any suggestions and help.

Answer from mozway:

You can use itertools.combinations to check all pairs of names, then select the top match:

from itertools import combinations
from difflib import SequenceMatcher

import pandas as pd

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# compare all unique combinations of names
def compare(s):
    return [similar(a, b) for a, b in combinations(s, 2)]

# index pairs in the same order as the ratios produced by compare()
idx1, idx2 = zip(*combinations(df.index, 2))

# compute unique combinations, assign the reverse combinations (a->b ; b->a)
# to make the matrix symmetric, then find the best match per row
tmp = pd.Series(compare(df['FirstName']),
                index=pd.MultiIndex.from_arrays([idx1, idx2])).unstack()
tmp = tmp.add(tmp.T, fill_value=0)

MAX = tmp.max(axis=1)
m = MAX > 0.8

out = df.assign(Similar_to_Index=tmp.idxmax(axis=1).where(m),
                Ratio=MAX.where(m)
                )

Output:

      FirstName  Surname Adress Source  Similar_to_Index     Ratio
Index                                                             
1          Paul  Baggins   same   good               2.0  0.888889
2         Paaul  Baggins   same    bad               1.0  0.888889
3          Mary  Baggins   same   good               4.0  1.000000
4          Mary  Baggins   same    bad               3.0  1.000000
5          Lucy    Smith  other    bad               NaN       NaN

Splitting the data in two:

import numpy as np

# build a symmetric pair key (max of row index and matched index),
# then flag the second occurrence of each pair
dup = np.maximum(out.index, out['Similar_to_Index']).duplicated()

Cleaned_Dataframe = out[~dup]
#       FirstName  Surname Adress Source  Similar_to_Index     Ratio
# Index                                                             
# 1          Paul  Baggins   same   good               2.0  0.888889
# 3          Mary  Baggins   same   good               4.0  1.000000
# 5          Lucy    Smith  other    bad               NaN       NaN

Deleted_Entries_Dataframe = out[dup]
#       FirstName  Surname Adress Source  Similar_to_Index     Ratio
# Index                                                             
# 2         Paaul  Baggins   same    bad               1.0  0.888889
# 4          Mary  Baggins   same    bad               3.0  1.000000
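The np.maximum line works because each matched pair collapses to the same key: rows 1 and 2 both map to 2, rows 3 and 4 both map to 4, and the unmatched row keeps its NaN, so .duplicated() flags exactly the second member of each pair. A standalone illustration with toy arrays mirroring that structure:

```python
import numpy as np
import pandas as pd

idx = np.array([1, 2, 3, 4, 5])                    # row indices
partner = np.array([2.0, 1.0, 4.0, 3.0, np.nan])   # Similar_to_Index values

# each matched pair collapses to one shared key; the unmatched row stays NaN
key = pd.Series(np.maximum(idx, partner), index=idx)
print(key.duplicated().tolist())  # [False, True, False, True, False]
```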

NB. If you sort the rows so that the "good" source is on top, those rows will be kept preferentially as non-duplicates.
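Putting the answer together with that tip, a self-contained sketch (sample DataFrame rebuilt from the question; the idx1/idx2 pair indices derived explicitly from the index):

```python
from difflib import SequenceMatcher
from itertools import combinations

import numpy as np
import pandas as pd

# sample data rebuilt from the question
df = pd.DataFrame(
    {"FirstName": ["Paul", "Paaul", "Mary", "Mary", "Lucy"],
     "Surname": ["Baggins", "Baggins", "Baggins", "Baggins", "Smith"],
     "Adress": ["same", "same", "same", "same", "other"],
     "Source": ["good", "bad", "good", "bad", "bad"]},
    index=pd.Index([1, 2, 3, 4, 5], name="Index"),
)

# put "good" rows first so they are kept preferentially (stable sort
# preserves the original order within each group)
df = df.sort_values("Source", ascending=False, kind="mergesort")

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# pairwise ratios for all unique index pairs
pairs = list(combinations(df.index, 2))
idx1, idx2 = zip(*pairs)
ratios = [similar(df.loc[a, "FirstName"], df.loc[b, "FirstName"])
          for a, b in pairs]

# symmetric similarity matrix, then best match per row
tmp = pd.Series(ratios, index=pd.MultiIndex.from_arrays([idx1, idx2])).unstack()
tmp = tmp.add(tmp.T, fill_value=0)
best = tmp.max(axis=1)
m = best > 0.8
out = df.assign(Similar_to_Index=tmp.idxmax(axis=1).where(m),
                Ratio=best.where(m))

# collapse each matched pair to one key and drop the second occurrence
dup = pd.Series(np.maximum(out.index.to_numpy(),
                           out["Similar_to_Index"].to_numpy()),
                index=out.index).duplicated().to_numpy()

Cleaned_Dataframe = out[~dup]
Deleted_Entries_Dataframe = out[dup]
print(sorted(Cleaned_Dataframe.index.tolist()))          # [1, 3, 5]
print(sorted(Deleted_Entries_Dataframe.index.tolist()))  # [2, 4]
```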