Duplicated lines in a dataframe, using multiple fields to check for duplicate


I'm trying to identify duplicates in a dataframe based on four fields matching: 'dhid_y', 'from_y', 'to_y' and 'bound_y'. The code below uses .duplicated on the dataframe with subset pointed at the four fields under consideration. The result should be that duplicates are flagged as True and the first occurrence remains False; I'll use this info later in the script. However, not all of the duplicates are being spotted. It seems to work when just using 'dhid_y', but when I add the additional fields it misbehaves, although it does run!

import pandas as pd

df_merged = pd.read_csv('merged_example_matched.csv')

# .duplicated returns a boolean Series; the list wrapper and '== True' are redundant
conditions_2 = df_merged.duplicated(subset=['dhid_y', 'from_y', 'to_y', 'bound_y'], keep='first')
print(conditions_2)

Is there something obvious I'm missing here in how I'm using this duplicated option?

[Screenshot: duplicates expected to be identified by the code]

[Screenshot: rows the code identifies as duplicated]

[Screenshot: highlighted entries which should have been identified as duplicates but were not by my code]


1 Answer


The code was correct, but I needed to round my numeric fields to 3 d.p. before looking for duplicates. There was a difference of something like 0.00000001 between my 'matching' fields, so they were not being treated as duplicates.
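A minimal sketch of that fix. The column names match the question, but the values are invented to reproduce the floating-point mismatch:

```python
import pandas as pd

# Hypothetical data: 'from_y' values differ by a tiny floating-point
# amount, so the two rows are not exact duplicates.
df = pd.DataFrame({
    'dhid_y':  ['DH001', 'DH001'],
    'from_y':  [10.0, 10.00000001],
    'to_y':    [12.5, 12.5],
    'bound_y': ['A', 'A'],
})

# Without rounding, no duplicates are found.
print(df.duplicated(subset=['dhid_y', 'from_y', 'to_y', 'bound_y']).tolist())
# [False, False]

# Round the numeric fields to 3 d.p. first, then check for duplicates.
rounded = df.copy()
rounded[['from_y', 'to_y']] = rounded[['from_y', 'to_y']].round(3)
print(rounded.duplicated(subset=['dhid_y', 'from_y', 'to_y', 'bound_y']).tolist())
# [False, True]
```

Rounding normalises the near-equal values so `.duplicated` sees them as identical; the precision (3 d.p. here) should match the meaningful resolution of the data.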