Pandas Drop Duplicates And Store Duplicates


I use pandas.DataFrame.drop_duplicates to find duplicates in a DataFrame. It removes the duplicates from the DataFrame, and that works well. However, I would like to know which rows were removed.

Is there a way to save the data in a new list before removing it?

Unfortunately, I could not find anything about this in the pandas documentation.

Thanks for the answer.


There are 2 best solutions below


Use the duplicated function to find the rows that are duplicates. By default (keep='first') the first occurrence is marked False and every later occurrence is marked True. Filtering the original data with this boolean mask tells you which rows are kept and which are dropped.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
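As a rough illustration of the filtering described above, here is a minimal sketch. The sample DataFrame and the names mask, dropped and kept are made up for this example and do not come from the question.

```python
import pandas as pd

# Hypothetical sample data, chosen so that some full rows repeat
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B"],
    "value": [1, 1, 3, 3, 5],
})

# True for every row that repeats an earlier row (keep="first" is the default)
mask = df.duplicated()

dropped = df[mask]   # the rows drop_duplicates() would remove
kept = df[~mask]     # the rows drop_duplicates() would keep

print(dropped)
print(kept)
```

If you need every member of a duplicated set, including the first occurrence, df.duplicated(keep=False) marks all of them as True.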


You can use duplicated and boolean indexing with groupby.agg to keep the list of duplicates:

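# True for every row whose 'group' value already appeared in an earlier row
m = df.duplicated('group')

# Collect the values of the rows about to be dropped, one list per group
dropped = df[m].groupby('group')['value'].agg(list)
print(dropped)

# Keep only the first row of each group
df = df[~m]
print(df)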

Output:

# print(dropped)
group
A       [2]
B    [4, 5]
C       [7]
Name: value, dtype: object

# print(df)
  group  value
0     A      1
2     B      3
5     C      6

Used input:

  group  value
0     A      1
1     A      2
2     B      3
3     B      4
4     B      5
5     C      6
6     C      7
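For anyone who wants to run this end to end, the following sketch rebuilds the input shown above and also stores the removed rows as a plain Python list, which is what the question asked for. The names dropped_rows and deduped are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Rebuild the input shown above
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "C", "C"],
    "value": [1, 2, 3, 4, 5, 6, 7],
})

# Mark every row whose 'group' value was already seen in an earlier row
m = df.duplicated("group")

# Removed values, one list per group (as in the answer above)
dropped = df[m].groupby("group")["value"].agg(list)

# Removed rows as a plain list of dicts, one dict per row
dropped_rows = df[m].to_dict("records")

# The de-duplicated frame; equivalent to df.drop_duplicates("group")
deduped = df[~m]

print(dropped_rows)
print(deduped)
```

Building the mask before filtering is what makes this work: the same mask selects the removed rows and, negated, the kept rows, so nothing is lost between the two steps.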