Pandas Drop Duplicates And Store Duplicates

465 Views Asked by corsin sauber At 31 July 2025 at 03:01

I use pandas.DataFrame.drop_duplicates to find duplicates in a DataFrame. It removes the duplicates from the DataFrame, and that works great. However, I would like to know which rows were removed.

Is there a way to save those rows in a new list before removing them?

Unfortunately, I have found no information on this in the pandas documentation.

Thanks for the answer.


There are 2 best solutions below

XYZ On 16 December 2022 at 08:06

Use the duplicated function to flag the rows that are duplicates. By default (keep='first'), the first occurrence is marked False and every later occurrence is marked True. Filtering the original data with this mask tells you which rows are kept and which are dropped.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
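A minimal sketch of that idea follows. The sample frame and the names mask, removed and kept are illustrative assumptions, not taken from the question:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"name": ["a", "b", "a", "c", "b"],
                   "score": [1, 2, 1, 3, 2]})

# True for each row that repeats an earlier row (keep="first" is the default)
mask = df.duplicated()

removed = df[mask]   # the rows drop_duplicates() would discard
kept = df[~mask]     # the rows drop_duplicates() would return

print(removed)
print(kept)
```

Because the mask is computed before anything is dropped, the removed rows stay available for logging or inspection. Passing subset=... or keep=... to duplicated should mirror the same arguments of drop_duplicates.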

mozway On 16 December 2022 at 08:12

You can use duplicated and boolean indexing with groupby.agg to keep the list of duplicates:

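# Boolean mask: True for every row whose 'group' value already appeared earlier
m = df.duplicated('group')

# Collect the values of the rows about to be dropped, one list per group
dropped = df[m].groupby('group')['value'].agg(list)
print(dropped)

# Keep only the first occurrence of each group
df = df[~m]
print(df)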

Output:

# print(dropped)
group
A       [2]
B    [4, 5]
C       [7]
Name: value, dtype: object

# print(df)
  group  value
0     A      1
2     B      3
5     C      6

Used input:

  group  value
0     A      1
1     A      2
2     B      3
3     B      4
4     B      5
5     C      6
6     C      7
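To reproduce this end to end, the input can be rebuilt by hand. This is only a sketch: the constructor call is an assumption about how the frame was created, and the name deduped is introduced here so that the original df is not overwritten:

```python
import pandas as pd

# Rebuild the input frame shown above
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "C", "C"],
    "value": [1, 2, 3, 4, 5, 6, 7],
})

# Flag every row whose 'group' value was already seen
m = df.duplicated("group")

# Values of the dropped rows, collected per group
dropped = df[m].groupby("group")["value"].agg(list)

# First occurrence of each group
deduped = df[~m]

print(dropped)
print(deduped)
```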
