Remove duplicate rows in a disk.frame object


I have a disk.frame object with many duplicate rows. How can I remove them? (The original data frame is 10 GB in size.)



Archeologist On

For a data frame df held in memory, you can do it in base R:

# Remove rows that are duplicated across all columns (the first occurrence is kept):
df[!duplicated(df), ]

# Or remove rows that are duplicated in specific column(s), keeping the first occurrence:
df[!duplicated(df[c('ColumnX')]), ]
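
As a small illustration of how duplicated() behaves in the two calls above, here is a sketch on a toy data frame. The values and the second column, ColumnY, are made up for the example; only ColumnX comes from the answer.

```r
# Toy data (hypothetical values): row 2 repeats row 1 exactly,
# and row 4 repeats an earlier ColumnX value but differs in ColumnY.
df <- data.frame(
  ColumnX = c("a", "a", "b", "a"),
  ColumnY = c(1, 1, 2, 3),
  stringsAsFactors = FALSE
)

# duplicated() marks each row that repeats an earlier row.
dup_all <- duplicated(df)

# Drop rows that are duplicated across all columns.
dedup_all <- df[!dup_all, ]

# Keep only the first row for each ColumnX value.
dedup_x <- df[!duplicated(df["ColumnX"]), ]
```

Because duplicated() only flags the second and later occurrences, the first copy of each row (or of each ColumnX value) is the one that survives.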

With dplyr, you can likewise deduplicate either across the entire data frame or on a specific column:

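library(dplyr)

# Across all columns:
df %>% distinct(.keep_all = TRUE)

# Or on a specific column, keeping the first row for each ColumnX value:
df %>% distinct(ColumnX, .keep_all = TRUE)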
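
To show what .keep_all changes when a column is named, here is a minimal sketch on a toy data frame. It assumes dplyr is installed; the values and ColumnY are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values).
df <- data.frame(
  ColumnX = c("a", "a", "b", "a"),
  ColumnY = c(1, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Without .keep_all, only the columns named in distinct() are returned.
keys_only <- df %>% distinct(ColumnX)

# With .keep_all = TRUE, the first full row for each ColumnX value is kept.
first_rows <- df %>% distinct(ColumnX, .keep_all = TRUE)
```

Here keys_only has a single column, while first_rows keeps ColumnY as well, taken from the first row seen for each ColumnX value.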