How to deduplicate date sequences across non-consecutive rows in R?

Question

How to deduplicate date sequences across non-consecutive rows in R?

119 Views Asked by Ezra At 11 September 2020 at 12:27

I want to flag the first date in every window of at least 31 days for each id in my data.
Data:

library(tidyverse)
library(lubridate)
library(tibbletime)

D1 <- tibble(id = c(12,12,12,12,12,12,10,10,10,10),
             index_date=c("2019-01-01","2019-01-07","2019-01-21","2019-02-02",
                    "2019-02-09","2019-03-06","2019-01-05","2019-02-01","2019-02-02","2019-02-08"))
D1

# A tibble: 10 x 2
      id index_date
   <dbl> <chr>     
 1    12 2019-01-01
 2    12 2019-01-07
 3    12 2019-01-21
 4    12 2019-02-02
 5    12 2019-02-09
 6    12 2019-03-06
 7    10 2019-01-05
 8    10 2019-02-01
 9    10 2019-02-02
10    10 2019-02-08

The desired rows to flag are rows 1, 4, 6, 7, and 10; these rows represent either the first index_date for a given id or the first index_date after a 31-day skip period from the previously flagged index_date for that given id.
Code:

temp <- D1 %>%
  mutate(index_date = ymd(index_date)) %>%  
  arrange(id, index_date) %>%
  as_tbl_time(index_date) %>% 
  group_by(id) %>%
  mutate(keyed_to_index_date = 
           collapse_index(index_date, period = '31 d', side = "start"),
           keep = index_date == keyed_to_index_date)
temp %>% arrange(desc(id), index_date)

Result:

      id index_date keyed_to_index_date keep       
   <dbl> <date>     <date>              <lgl>
 1    12 2019-01-01 2019-01-01          TRUE 
 2    12 2019-01-07 2019-01-01          FALSE
 3    12 2019-01-21 2019-01-01          FALSE
 4    12 2019-02-02 2019-02-02          TRUE 
 5    12 2019-02-09 2019-02-02          FALSE
 6    12 2019-03-06 2019-03-06          TRUE 
 7    10 2019-01-05 2019-01-05          TRUE 
 8    10 2019-02-01 2019-02-01          TRUE 
 9    10 2019-02-02 2019-02-01          FALSE
10    10 2019-02-08 2019-02-01          FALSE

Why does this code flag row 8 (which has an index_date less than 31 days after the previously flagged index_date for that id) and not row 10, and how do I fix this problem?

UPDATE:
Adding the option start_date = first(index_date) to collapse_index(), as suggested by @mnaR99, successfully flagged the correct rows in the original example. However, when I applied the same principle to new data, I ran into a problem:
Data:

D2 <- tibble(id = c("A","A","A","B","B","B","B","B","C","C","C"),
             index_date = c("2019-03-04","2019-03-05","2019-03-06",
                            "2019-03-01","2019-03-02","2019-03-04","2019-03-05","2019-03-06",
                            "2019-03-03","2019-03-04","2019-03-05"))

D2
   id    index_date
   <chr> <chr>     
 1 A     2019-03-04
 2 A     2019-03-05
 3 A     2019-03-06
 4 B     2019-03-01
 5 B     2019-03-02
 6 B     2019-03-04
 7 B     2019-03-05
 8 B     2019-03-06
 9 C     2019-03-03
10 C     2019-03-04
11 C     2019-03-05

I now want to apply a 2-day window in the same manner as I previously applied a 31-day window (that is, consecutive calendar days should not both be flagged). The desired rows to flag are Rows 1, 3, 4, 6, 8, 9, and 11, because these rows are either the first `index_date` for a particular `id` or the first after a two-day skip.
Code:

t3 <- D2 %>%
  mutate(index_date = ymd(index_date)) %>%  
  arrange(id, index_date) %>%
  as_tbl_time(index_date) %>% 
  group_by(id) %>%
  mutate(keyed_to_index_date = 
           collapse_index(index_date, 
                          period = '2 d', 
                          side = "start", 
                          start_date = first(index_date)),
         keep = index_date == keyed_to_index_date) %>%
  arrange(id, index_date)

Result:

> t3
# A time tibble: 11 x 4
# Index:  index_date
# Groups: id [3]
   id    index_date keyed_to_index_date keep 
   <chr> <date>     <date>              <lgl>
 1 A     2019-03-04 2019-03-04          TRUE 
 2 A     2019-03-05 2019-03-04          FALSE
 3 A     2019-03-06 2019-03-06          TRUE 
 4 B     2019-03-01 2019-03-01          TRUE 
 5 B     2019-03-02 2019-03-01          FALSE
 6 B     2019-03-04 2019-03-04          TRUE 
 7 B     2019-03-05 2019-03-05          TRUE 
 8 B     2019-03-06 2019-03-05          FALSE
 9 C     2019-03-03 2019-03-03          TRUE 
10 C     2019-03-04 2019-03-03          FALSE
11 C     2019-03-05 2019-03-05          TRUE

Row 7 is incorrectly flagged as TRUE, and Row 8 is incorrectly flagged as FALSE.
When I apply the purrr solution suggested by @tmfmnk, I get the correct result.
Code:

t4 <-
  D2 %>%
  group_by(id) %>%
  mutate(index_date = ymd(index_date),
         keep =  row_number() == 1 | 
           accumulate(c(0, diff(index_date)), ~ if_else(.x >= 2, 
                                                        .y, 
                                                        .x + .y)
           ) >= 2
  )

Result:

> t4
# A tibble: 11 x 3
# Groups:   id [3]
   id    index_date keep 
   <chr> <date>     <lgl>
 1 A     2019-03-04 TRUE 
 2 A     2019-03-05 FALSE
 3 A     2019-03-06 TRUE 
 4 B     2019-03-01 TRUE 
 5 B     2019-03-02 FALSE
 6 B     2019-03-04 TRUE 
 7 B     2019-03-05 FALSE
 8 B     2019-03-06 TRUE 
 9 C     2019-03-03 TRUE 
10 C     2019-03-04 FALSE
11 C     2019-03-05 TRUE

What is wrong with the tibbletime approach in this example?

Original Q&A

There are 3 best solutions below

**tmfmnk** · Answer 1 · 2020-09-11T12:39:09.783000

One option utilizing dplyr, lubridate and purrr could be:

D1 %>%
 group_by(id) %>%
 mutate(index_date = ymd(index_date),
        keep =  row_number() == 1 | accumulate(c(0, diff(index_date)), ~ if_else(.x >= 31, .y, .x + .y)) >= 31)

     id index_date keep 
   <dbl> <date>     <lgl>
 1    12 2019-01-01 TRUE 
 2    12 2019-01-07 FALSE
 3    12 2019-01-21 FALSE
 4    12 2019-02-02 TRUE 
 5    12 2019-02-09 FALSE
 6    12 2019-03-06 TRUE 
 7    10 2019-01-05 TRUE 
 8    10 2019-02-01 FALSE
 9    10 2019-02-02 FALSE
10    10 2019-02-08 TRUE

**mnaR99** · Answer 2 · 2020-09-11T12:49:24.743000

You just need to add the start_date argument to collapse_index:

D1 %>%
  mutate(index_date = ymd(index_date)) %>%  
  arrange(id, index_date) %>%
  as_tbl_time(index_date) %>% 
  group_by(id) %>%
  mutate(keyed_to_index_date = 
           collapse_index(index_date, period = '31 d', side = "start", start_date = first(index_date)),
           keep = index_date == keyed_to_index_date) %>% 
  arrange(desc(id), index_date)
#> # A time tibble: 10 x 4
#> # Index:  index_date
#> # Groups: id [2]
#>       id index_date keyed_to_index_date keep 
#>    <dbl> <date>     <date>              <lgl>
#>  1    12 2019-01-01 2019-01-01          TRUE 
#>  2    12 2019-01-07 2019-01-01          FALSE
#>  3    12 2019-01-21 2019-01-01          FALSE
#>  4    12 2019-02-02 2019-02-02          TRUE 
#>  5    12 2019-02-09 2019-02-02          FALSE
#>  6    12 2019-03-06 2019-03-06          TRUE 
#>  7    10 2019-01-05 2019-01-05          TRUE 
#>  8    10 2019-02-01 2019-01-05          FALSE
#>  9    10 2019-02-02 2019-01-05          FALSE
#> 10    10 2019-02-08 2019-02-08          TRUE

^{Created on 2020-09-11 by the reprex package (v0.3.0)}

**Darren Tsai** · Answer 3 · 2020-09-11T13:01:45.333000

You can use accumulate() from purrr.

D1 %>%
  group_by(id) %>% 
  mutate(index_date = ymd(index_date),
         keep = index_date == accumulate(index_date, ~ if(.y - .x >= 31) .y else .x))

#       id index_date keep 
#    <dbl> <date>     <lgl>
#  1    12 2019-01-01 TRUE 
#  2    12 2019-01-07 FALSE
#  3    12 2019-01-21 FALSE
#  4    12 2019-02-02 TRUE 
#  5    12 2019-02-09 FALSE
#  6    12 2019-03-06 TRUE 
#  7    10 2019-01-05 TRUE 
#  8    10 2019-02-01 FALSE
#  9    10 2019-02-02 FALSE
# 10    10 2019-02-08 TRUE

The iteration rule is following:

1. 2019-01-07 -  2019-01-01   = 6  <  31 then return  2019-01-01
2. 2019-01-21 -  2019-01-01   = 20 <  31 then return  2019-01-01
3. 2019-02-02 -  2019-01-01   = 32 >= 31 then return (2019-02-02)*
4. 2019-02-09 - (2019-02-02)* = 7  <  31 then return  2019-02-02
5. etc.

How to deduplicate date sequences across non-consecutive rows in R?

There are 3 best solutions below

Related Questions in R

Related Questions in TIDYR

Related Questions in LUBRIDATE

Related Questions in DATE-MANIPULATION

Related Questions in TIBBLETIME

Trending Questions

Popular # Hahtags

Popular Questions