I'm trying to identify duplicate rows in a DataFrame based on four fields matching: 'dhid_y', 'from_y', 'to_y' and 'bound_y'. The code below uses .duplicated on the DataFrame with subset set to those four fields. The expected result is that duplicates are flagged as True and the first occurrence remains False; I'll use this mask later in the script. However, not all of the duplicates are being spotted. It seems to work when I use just 'dhid_y', but when I add the additional fields it misbehaves, although it does run!
import pandas as pd
df_merged = pd.read_csv('merged_example_matched.csv')
# duplicated() already returns a boolean Series, so comparing to True is redundant
conditions_2 = [df_merged.duplicated(subset=['dhid_y', 'from_y', 'to_y', 'bound_y'], keep='first')]
print(conditions_2)
Is there something obvious I'm missing in how I'm using duplicated?
[Image: duplicates expected to be identified by the code]
[Image: rows the code identifies as duplicated]
[Image: highlighted entries that should have been identified as duplicates by the code, but were not]
The code was correct, but I needed to round the numeric fields to 3 d.p. before looking for duplicates. There was a difference of around 0.00000001 between some of my 'matching' values, so those rows were not being treated as duplicates.
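A minimal sketch of the fix, using toy data in place of the real CSV (the column names match the question; 'from_y' and 'to_y' are assumed to be the numeric fields):

```python
import pandas as pd

# Toy data standing in for merged_example_matched.csv: row 1 differs
# from row 0 only by ~1e-8 of float noise in 'from_y'.
df_merged = pd.DataFrame({
    'dhid_y':  ['DH01', 'DH01', 'DH02'],
    'from_y':  [10.0, 10.00000001, 5.0],
    'to_y':    [12.0, 12.0, 6.0],
    'bound_y': ['ox', 'ox', 'ox'],
})

key_cols = ['dhid_y', 'from_y', 'to_y', 'bound_y']

# Without rounding, the float noise hides the duplicate.
raw_mask = df_merged.duplicated(subset=key_cols, keep='first')

# Round the numeric keys to 3 d.p. first, then flag duplicates;
# the first occurrence stays False, later ones become True.
df_rounded = df_merged.copy()
df_rounded[['from_y', 'to_y']] = df_rounded[['from_y', 'to_y']].round(3)
dup_mask = df_rounded.duplicated(subset=key_cols, keep='first')

print(raw_mask.tolist())  # [False, False, False]
print(dup_mask.tolist())  # [False, True, False]
```

Rounding is the simplest approach when a fixed precision is acceptable; if the tolerance needs to be relative rather than decimal-place based, something like numpy.isclose on the numeric columns would be an alternative.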