Let's assume we have a DataFrame looking like this:

Index  FirstName  Surname  Adress  Source
1      Paul       Baggins  same    good
2      Paaul      Baggins  same    bad
3      Mary       Baggins  same    good
4      Mary       Baggins  same    bad
5      Lucy       Smith    other   bad

We want to clean up the DataFrame. First we filter people living at the same address; we can be sure that addresses are unique for each household. Then we want to delete potential duplicates, because we used different data sources and unfortunately there might be some typing errors in the column "FirstName".

How can we delete the duplicates (in our case index rows 2 and 4)?

I found out that we could delete "exact" duplicates by using

df.drop_duplicates(subset=['FirstName','Surname', 'Adress'], keep='first')

This way index 4 will be deleted, but index 2 survives because of the typo. This is not what I am looking for.
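For reference, this behaviour can be reproduced with a short sketch that rebuilds the sample DataFrame from the table above:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame(
    {"FirstName": ["Paul", "Paaul", "Mary", "Mary", "Lucy"],
     "Surname": ["Baggins", "Baggins", "Baggins", "Baggins", "Smith"],
     "Adress": ["same", "same", "same", "same", "other"],
     "Source": ["good", "bad", "good", "bad", "bad"]},
    index=pd.Index([1, 2, 3, 4, 5], name="Index"),
)

# only the exact duplicate (index 4) is dropped; the typo row (index 2) survives
deduped = df.drop_duplicates(subset=["FirstName", "Surname", "Adress"], keep="first")
print(deduped.index.tolist())  # [1, 2, 3, 5]
```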

To delete index row 2, I want to compare the "FirstName" values of indexes 1 and 2, and I tried the following function:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

similar(text_a, text_b)

Calling similar('Paul', 'Paaul') returns roughly 0.888 (SequenceMatcher's ratio is 2·M/T, here 2·4/9).

But I don't see how to put all this together.

The "manipulated" DataFrame should look like this:

Index  FirstName  Surname  Adress  Source  Similar_to_Index  Ratio
1      Paul       Baggins  same    good    2                 0.8888
2      Paaul      Baggins  same    bad     1                 0.8888
3      Mary       Baggins  same    good    4                 1.0
4      Mary       Baggins  same    bad     3                 1.0
5      Lucy       Smith    other   bad     NaN               NaN

Then index rows 2 and 4 should be deleted by the rule that the relevant ratio is > 0.8 and the column "Source" is labeled "bad". I guess the main problem is how to create the column "Similar_to_Index".

The final result should be like this:

Cleaned Dataframe:

Index  FirstName  Surname  Adress  Source  Similar_to_Index  Ratio
1      Paul       Baggins  same    good    2                 0.8888
3      Mary       Baggins  same    good    4                 1.0
5      Lucy       Smith    other   bad     NaN               NaN

Deleted_Entries_Dataframe:

Index  FirstName  Surname  Adress  Source  Similar_to_Index  Ratio
2      Paaul      Baggins  same    bad     1                 0.8888
4      Mary       Baggins  same    bad     3                 1.0

Thank you very much for any suggestions and help.

Answer from mozway:

You can use itertools.combinations to check all pairs of names, then select the top match:

from itertools import combinations
from difflib import SequenceMatcher

import pandas as pd

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# compare all unique combinations of names
def compare(s):
    return [similar(a, b) for a, b in combinations(s, 2)]

# index pairs in the same order as the ratios produced by compare()
idx1, idx2 = zip(*combinations(df.index, 2))

# compute unique combinations, assign the reverse combinations (a->b ; b->a)
# to make the matrix symmetric, then find the best match per row
tmp = pd.Series(compare(df['FirstName']),
                index=pd.MultiIndex.from_arrays([idx1, idx2])).unstack()
tmp = tmp.add(tmp.T, fill_value=0)

MAX = tmp.max(axis=1)
m = MAX > 0.8

out = df.assign(Similar_to_Index=tmp.idxmax(axis=1).where(m),
                Ratio=MAX.where(m)
                )

Output:

      FirstName  Surname Adress Source  Similar_to_Index     Ratio
Index                                                             
1          Paul  Baggins   same   good               2.0  0.888889
2         Paaul  Baggins   same    bad               1.0  0.888889
3          Mary  Baggins   same   good               4.0  1.000000
4          Mary  Baggins   same    bad               3.0  1.000000
5          Lucy    Smith  other    bad               NaN       NaN

Splitting the data in two:

import numpy as np

# build a symmetric pair key (max of row index and matched index),
# then flag the second occurrence of each pair
dup = np.maximum(out.index, out['Similar_to_Index']).duplicated()

Cleaned_Dataframe = out[~dup]
#       FirstName  Surname Adress Source  Similar_to_Index     Ratio
# Index                                                             
# 1          Paul  Baggins   same   good               2.0  0.888889
# 3          Mary  Baggins   same   good               4.0  1.000000
# 5          Lucy    Smith  other    bad               NaN       NaN

Deleted_Entries_Dataframe = out[dup]
#       FirstName  Surname Adress Source  Similar_to_Index     Ratio
# Index                                                             
# 2         Paaul  Baggins   same    bad               1.0  0.888889
# 4          Mary  Baggins   same    bad               3.0  1.000000
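The np.maximum line works because each matched pair collapses to the same key: rows 1 and 2 both map to 2, rows 3 and 4 both map to 4, and the unmatched row keeps its NaN, so .duplicated() flags exactly the second member of each pair. A standalone illustration with toy arrays mirroring that structure:

```python
import numpy as np
import pandas as pd

idx = np.array([1, 2, 3, 4, 5])                    # row indices
partner = np.array([2.0, 1.0, 4.0, 3.0, np.nan])   # Similar_to_Index values

# each matched pair collapses to one shared key; the unmatched row stays NaN
key = pd.Series(np.maximum(idx, partner), index=idx)
print(key.duplicated().tolist())  # [False, True, False, True, False]
```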

NB. If you sort the rows so that the "good" source is on top, those rows will be kept preferentially as non-duplicates.
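Putting the answer together with that tip, a self-contained sketch (sample DataFrame rebuilt from the question; the idx1/idx2 pair indices derived explicitly from the index):

```python
from difflib import SequenceMatcher
from itertools import combinations

import numpy as np
import pandas as pd

# sample data rebuilt from the question
df = pd.DataFrame(
    {"FirstName": ["Paul", "Paaul", "Mary", "Mary", "Lucy"],
     "Surname": ["Baggins", "Baggins", "Baggins", "Baggins", "Smith"],
     "Adress": ["same", "same", "same", "same", "other"],
     "Source": ["good", "bad", "good", "bad", "bad"]},
    index=pd.Index([1, 2, 3, 4, 5], name="Index"),
)

# put "good" rows first so they are kept preferentially (stable sort
# preserves the original order within each group)
df = df.sort_values("Source", ascending=False, kind="mergesort")

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# pairwise ratios for all unique index pairs
pairs = list(combinations(df.index, 2))
idx1, idx2 = zip(*pairs)
ratios = [similar(df.loc[a, "FirstName"], df.loc[b, "FirstName"])
          for a, b in pairs]

# symmetric similarity matrix, then best match per row
tmp = pd.Series(ratios, index=pd.MultiIndex.from_arrays([idx1, idx2])).unstack()
tmp = tmp.add(tmp.T, fill_value=0)
best = tmp.max(axis=1)
m = best > 0.8
out = df.assign(Similar_to_Index=tmp.idxmax(axis=1).where(m),
                Ratio=best.where(m))

# collapse each matched pair to one key and drop the second occurrence
dup = pd.Series(np.maximum(out.index.to_numpy(),
                           out["Similar_to_Index"].to_numpy()),
                index=out.index).duplicated().to_numpy()

Cleaned_Dataframe = out[~dup]
Deleted_Entries_Dataframe = out[dup]
print(sorted(Cleaned_Dataframe.index.tolist()))          # [1, 3, 5]
print(sorted(Deleted_Entries_Dataframe.index.tolist()))  # [2, 4]
```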