How to remove repeated rows based on the last date in R?


I have a large data frame with many columns and rows, in which lot identifies my sampling units (the same lot can have more than one row of information). New field visits are made frequently, and each visit generates a new table whose information may be identical apart from the sampling date, or may contain small changes and/or a different sample size. The two tables are then joined, which can duplicate the same sampling unit; in that case I need to drop the older row, based on the sampling date, and I need to do this per group. I have already made some attempts, but the code does not work very well: sometimes it leaves old rows or repeated rows.

  • Reproducible example

    df1 <- data.frame(
      block  = c(1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 6),
      units  = c(1, 2, 1, 2, 3, 1, 1, 1, 1, 2, 3, 4),
      Sector = c(1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3),
      lot    = c(25, 25, 18, 18, 18, 9, 50, 110, 200, 200, 200, 200),
      date   = c("2019", "2019", "2023/10/24", "2023/10/24", "2023/10/24",
                 "2023/09/24", "2023/08/24", "2023/10/24", "2019",
                 "2019", "2019", "2019")
    )
    
    
    df2 <- data.frame(
      block  = c(2, 2, 2, 4, 6, 6, 6, 6),
      units  = c(1, 2, 3, 1, 1, 2, 3, 3),
      Sector = c(1, 1, 1, 3, 3, 3, 3, 3),
      lot    = c(18, 18, 18, 50, 200, 200, 200, 200),
      date   = c("2023/12/24", "2023/12/24", "2023/12/24", "2023/12/24",
                 "2023/11/24", "2023/11/24", "2023/11/24", "2023/11/24")
    )
    
    df_join <- rbind(df1, df2)

    library(tidyverse)

    df_clean <- df_join %>%
      group_by(Sector, block, lot, units) %>%
      slice_max(date)
    



There are 2 answers below

Jinjin (Best Answer)

The existing answer still leaves a duplicate row at the end.

The biggest problem here is that your date column is not in a consistent format, which is bad practice. We can first correct it by treating 2019 as 2019/01/01 (or any other consistent convention, since you don't have the exact dates) and then converting the column to a Date.

library(magrittr)  # provides the %<>% assignment pipe; str_detect()/mutate() come from tidyverse

## Option 1: apply() over the date column
df_join['date'] %<>%
  apply(1, FUN = function(x) {
    # if there is no "/", it is a bare year, so append "/01/01"
    if (!x %>% str_detect('\\/')) x %>% str_glue("/01/01") else x
  }) %>%
  as.Date()

## Option 2, equivalent: mutate() with if_else()
df_join$date <- as.character(df_join$date)
df_join %<>%
  mutate(date = if_else(!date %>% str_detect('\\/'),
                        date %>% paste0("/01/01"),
                        date) %>% as.Date())
## check the change: both options should result in
> df_join$date

 [1] "2019-01-01" "2019-01-01" "2023-10-24" "2023-10-24" "2023-10-24" "2023-09-24" "2023-08-24" "2023-10-24" "2019-01-01"
[10] "2019-01-01" "2019-01-01" "2019-01-01" "2023-12-24" "2023-12-24" "2023-12-24" "2023-12-24" "2023-11-24" "2023-11-24"
[19] "2023-11-24" "2023-11-24"
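As a side note, the same normalization could also be done with lubridate (part of the tidyverse ecosystem). This is a minimal sketch, assuming it is acceptable to complete a bare year to January 1st:

library(lubridate)

# parse_date_time() tries each candidate format in order:
# full year/month/day first, then a bare year, which it
# completes to the first day of that year
df_join$date <- as.Date(parse_date_time(df_join$date, orders = c("Ymd", "Y")))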

Now, take the last chunk of code you have and add %>% distinct() at the end; that gives you the most recent record for each group.

df_clean <- df_join %>%
  group_by(Sector, block, lot, units) %>%
  slice_max(date) %>%
  distinct()

> df_clean
# A tibble: 12 x 5
# Groups:   Sector, block, lot, units [12]
   block units Sector   lot date      
   <dbl> <dbl>  <dbl> <dbl> <date>    
 1     1     1      1    25 2019-01-01
 2     1     2      1    25 2019-01-01
 3     2     1      1    18 2023-12-24
 4     2     2      1    18 2023-12-24
 5     2     3      1    18 2023-12-24
 6     3     1      3     9 2023-09-24
 7     4     1      3    50 2023-12-24
 8     5     1      3   110 2023-10-24
 9     6     1      3   200 2023-11-24
10     6     2      3   200 2023-11-24
11     6     3      3   200 2023-11-24
12     6     4      3   200 2019-01-01
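As a side note, slice_max() also has a with_ties argument; with_ties = FALSE keeps exactly one row per group, so the duplicate disappears without a separate distinct() call. A minimal sketch, with the caveat that it silently keeps a single row even when tied dates are not true duplicates:

# keep one most-recent row per group; rows tied on date are
# collapsed to a single arbitrary row, so only use this when
# within-group duplicates are always exact copies
df_clean <- df_join %>%
  group_by(Sector, block, lot, units) %>%
  slice_max(date, with_ties = FALSE) %>%
  ungroup()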
Gregor Thomas

You are putting too many columns in your group_by. As far as I can tell, you want to keep the row(s) with the maximum date within each Sector and block. If the units change, you want to drop the old ones, so units cannot be in group_by(). If the lot changes, you want to drop the old ones, so lot cannot be in group_by().

df_join %>%
  group_by(block, Sector) %>%
  slice_max(date) %>%
  ungroup() %>%
  distinct()
#    block units Sector lot       date
# 1      1     1      1  25       2019
# 2      1     2      1  25       2019
# 3      2     1      1  18 2023/12/24
# 4      2     2      1  18 2023/12/24
# 5      2     3      1  18 2023/12/24
# 6      3     1      3   9 2023/09/24
# 7      4     1      3  50 2023/12/24
# 8      5     1      3 110 2023/10/24
# 9      6     1      3 200 2023/11/24
# 10     6     2      3 200 2023/11/24
# 11     6     3      3 200 2023/11/24
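Putting the two answers together, a full cleanup could normalize the dates first and then group by the stable identifiers only. A sketch combining both suggestions, assuming block and Sector are enough to identify a sampling unit across visits:

library(tidyverse)

df_clean <- df_join %>%
  # treat a bare year such as "2019" as January 1st of that year
  mutate(date = as.Date(if_else(str_detect(date, "/"),
                                date,
                                paste0(date, "/01/01")))) %>%
  group_by(block, Sector) %>%
  slice_max(date) %>%   # keep only the most recent visit per unit
  ungroup() %>%
  distinct()            # drop rows that are exact duplicates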