Duplicated lines in a dataframe, using multiple fields to check for duplicate


I'm trying to identify duplicates in a dataframe based on four fields matching: 'dhid_y', 'from_y', 'to_y' and 'bound_y'. The code below uses .duplicated on the dataframe with subset pointed at the four fields under consideration. The result should be that duplicates are flagged as True and the first occurrence remains False; I'll use this info later in the script. However, not all of the duplicates are being spotted. It seems to work when just using 'dhid_y', but when I add the additional fields it misbehaves, although it does run!

import pandas as pd

df_merged = pd.read_csv('merged_example_matched.csv')

# .duplicated returns a boolean Series; the list wrapper and '== True' are redundant
conditions_2 = df_merged.duplicated(subset=['dhid_y', 'from_y', 'to_y', 'bound_y'], keep='first')
print(conditions_2)

Is there something obvious I'm missing here in how I'm using this duplicated option?

[Screenshot: duplicates expected to be identified by the code]

[Screenshot: rows the code identifies as duplicated]

[Screenshot: highlighted entries which should have been identified as duplicates but were not by my code]


1 Answer


The code was correct, but I needed to round my numeric fields to 3 d.p. before looking for duplicates. There was a difference of something like 0.00000001 between my 'matching' fields, so they were not being treated as duplicates.
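A minimal sketch of that fix. The column names match the question, but the values are invented to reproduce the floating-point mismatch:

```python
import pandas as pd

# Hypothetical data: 'from_y' values differ by a tiny floating-point
# amount, so the two rows are not exact duplicates.
df = pd.DataFrame({
    'dhid_y':  ['DH001', 'DH001'],
    'from_y':  [10.0, 10.00000001],
    'to_y':    [12.5, 12.5],
    'bound_y': ['A', 'A'],
})

# Without rounding, no duplicates are found.
print(df.duplicated(subset=['dhid_y', 'from_y', 'to_y', 'bound_y']).tolist())
# [False, False]

# Round the numeric fields to 3 d.p. first, then check for duplicates.
rounded = df.copy()
rounded[['from_y', 'to_y']] = rounded[['from_y', 'to_y']].round(3)
print(rounded.duplicated(subset=['dhid_y', 'from_y', 'to_y', 'bound_y']).tolist())
# [False, True]
```

Rounding normalises the near-equal values so `.duplicated` sees them as identical; the precision (3 d.p. here) should match the meaningful resolution of the data.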