How to deduplicate date sequences across non-consecutive rows in R?

102 Views Asked by At

I want to flag the first date in every window of at least 31 days for each id in my data.
Data:

library(tidyverse)
library(lubridate)
library(tibbletime)

D1 <- tibble(id = c(12,12,12,12,12,12,10,10,10,10),
             index_date=c("2019-01-01","2019-01-07","2019-01-21","2019-02-02",
                    "2019-02-09","2019-03-06","2019-01-05","2019-02-01","2019-02-02","2019-02-08"))
D1

# A tibble: 10 x 2
      id index_date
   <dbl> <chr>     
 1    12 2019-01-01
 2    12 2019-01-07
 3    12 2019-01-21
 4    12 2019-02-02
 5    12 2019-02-09
 6    12 2019-03-06
 7    10 2019-01-05
 8    10 2019-02-01
 9    10 2019-02-02
10    10 2019-02-08

The desired rows to flag are rows 1, 4, 6, 7, and 10; these rows represent either the first index_date for a given id or the first index_date after a 31-day skip period from the previously flagged index_date for that given id.
Code:

temp <- D1 %>%
  mutate(index_date = ymd(index_date)) %>%  
  arrange(id, index_date) %>%
  as_tbl_time(index_date) %>% 
  group_by(id) %>%
  mutate(keyed_to_index_date = 
           collapse_index(index_date, period = '31 d', side = "start"),
           keep = index_date == keyed_to_index_date)
temp %>% arrange(desc(id), index_date)

Result:

      id index_date keyed_to_index_date keep       
   <dbl> <date>     <date>              <lgl>
 1    12 2019-01-01 2019-01-01          TRUE 
 2    12 2019-01-07 2019-01-01          FALSE
 3    12 2019-01-21 2019-01-01          FALSE
 4    12 2019-02-02 2019-02-02          TRUE 
 5    12 2019-02-09 2019-02-02          FALSE
 6    12 2019-03-06 2019-03-06          TRUE 
 7    10 2019-01-05 2019-01-05          TRUE 
 8    10 2019-02-01 2019-02-01          TRUE 
 9    10 2019-02-02 2019-02-01          FALSE
10    10 2019-02-08 2019-02-01          FALSE

Why does this code flag row 8 (which has an index_date less than 31 days after the previously flagged index_date for that id) and not row 10, and how do I fix this problem?

UPDATE:
Adding the option start_date = first(index_date) to collapse_index(), as suggested by @mnaR99, successfully flagged the correct rows in the original example. However, when I applied the same principle to new data, I ran into a problem:
Data:

D2 <- tibble(id = c("A","A","A","B","B","B","B","B","C","C","C"),
             index_date = c("2019-03-04","2019-03-05","2019-03-06",
                            "2019-03-01","2019-03-02","2019-03-04","2019-03-05","2019-03-06",
                            "2019-03-03","2019-03-04","2019-03-05"))

D2
   id    index_date
   <chr> <chr>     
 1 A     2019-03-04
 2 A     2019-03-05
 3 A     2019-03-06
 4 B     2019-03-01
 5 B     2019-03-02
 6 B     2019-03-04
 7 B     2019-03-05
 8 B     2019-03-06
 9 C     2019-03-03
10 C     2019-03-04
11 C     2019-03-05

I now want to apply a 2-day window in the same manner as I previously applied a 31-day window (that is, consecutive calendar days should not both be flagged). The desired rows to flag are Rows 1, 3, 4, 6, 8, 9, and 11, because these rows are either the first `index_date` for a particular `id` or the first after a two-day skip.
Code:
t3 <- D2 %>%
  mutate(index_date = ymd(index_date)) %>%  
  arrange(id, index_date) %>%
  as_tbl_time(index_date) %>% 
  group_by(id) %>%
  mutate(keyed_to_index_date = 
           collapse_index(index_date, 
                          period = '2 d', 
                          side = "start", 
                          start_date = first(index_date)),
         keep = index_date == keyed_to_index_date) %>%
  arrange(id, index_date)

Result:

> t3
# A time tibble: 11 x 4
# Index:  index_date
# Groups: id [3]
   id    index_date keyed_to_index_date keep 
   <chr> <date>     <date>              <lgl>
 1 A     2019-03-04 2019-03-04          TRUE 
 2 A     2019-03-05 2019-03-04          FALSE
 3 A     2019-03-06 2019-03-06          TRUE 
 4 B     2019-03-01 2019-03-01          TRUE 
 5 B     2019-03-02 2019-03-01          FALSE
 6 B     2019-03-04 2019-03-04          TRUE 
 7 B     2019-03-05 2019-03-05          TRUE 
 8 B     2019-03-06 2019-03-05          FALSE
 9 C     2019-03-03 2019-03-03          TRUE 
10 C     2019-03-04 2019-03-03          FALSE
11 C     2019-03-05 2019-03-05          TRUE 

Row 7 is incorrectly flagged as TRUE, and Row 8 is incorrectly flagged as FALSE.
When I apply the purrr solution suggested by @tmfmnk, I get the correct result.
Code:

t4 <-
  D2 %>%
  group_by(id) %>%
  mutate(index_date = ymd(index_date),
         keep =  row_number() == 1 | 
           accumulate(c(0, diff(index_date)), ~ if_else(.x >= 2, 
                                                        .y, 
                                                        .x + .y)
           ) >= 2
  )

Result:

> t4
# A tibble: 11 x 3
# Groups:   id [3]
   id    index_date keep 
   <chr> <date>     <lgl>
 1 A     2019-03-04 TRUE 
 2 A     2019-03-05 FALSE
 3 A     2019-03-06 TRUE 
 4 B     2019-03-01 TRUE 
 5 B     2019-03-02 FALSE
 6 B     2019-03-04 TRUE 
 7 B     2019-03-05 FALSE
 8 B     2019-03-06 TRUE 
 9 C     2019-03-03 TRUE 
10 C     2019-03-04 FALSE
11 C     2019-03-05 TRUE

What is wrong with the tibbletime approach in this example?

3

There are 3 best solutions below

1
On

One option utilizing dplyr, lubridate and purrr could be:

D1 %>%
 group_by(id) %>%
 mutate(index_date = ymd(index_date),
        keep =  row_number() == 1 | accumulate(c(0, diff(index_date)), ~ if_else(.x >= 31, .y, .x + .y)) >= 31)

     id index_date keep 
   <dbl> <date>     <lgl>
 1    12 2019-01-01 TRUE 
 2    12 2019-01-07 FALSE
 3    12 2019-01-21 FALSE
 4    12 2019-02-02 TRUE 
 5    12 2019-02-09 FALSE
 6    12 2019-03-06 TRUE 
 7    10 2019-01-05 TRUE 
 8    10 2019-02-01 FALSE
 9    10 2019-02-02 FALSE
10    10 2019-02-08 TRUE 
1
On

You just need to add the start_date argument to collapse_index:

D1 %>%
  mutate(index_date = ymd(index_date)) %>%  
  arrange(id, index_date) %>%
  as_tbl_time(index_date) %>% 
  group_by(id) %>%
  mutate(keyed_to_index_date = 
           collapse_index(index_date, period = '31 d', side = "start", start_date = first(index_date)),
           keep = index_date == keyed_to_index_date) %>% 
  arrange(desc(id), index_date)
#> # A time tibble: 10 x 4
#> # Index:  index_date
#> # Groups: id [2]
#>       id index_date keyed_to_index_date keep 
#>    <dbl> <date>     <date>              <lgl>
#>  1    12 2019-01-01 2019-01-01          TRUE 
#>  2    12 2019-01-07 2019-01-01          FALSE
#>  3    12 2019-01-21 2019-01-01          FALSE
#>  4    12 2019-02-02 2019-02-02          TRUE 
#>  5    12 2019-02-09 2019-02-02          FALSE
#>  6    12 2019-03-06 2019-03-06          TRUE 
#>  7    10 2019-01-05 2019-01-05          TRUE 
#>  8    10 2019-02-01 2019-01-05          FALSE
#>  9    10 2019-02-02 2019-01-05          FALSE
#> 10    10 2019-02-08 2019-02-08          TRUE

Created on 2020-09-11 by the reprex package (v0.3.0)

0
On

You can use accumulate() from purrr.

D1 %>%
  group_by(id) %>% 
  mutate(index_date = ymd(index_date),
         keep = index_date == accumulate(index_date, ~ if(.y - .x >= 31) .y else .x))

#       id index_date keep 
#    <dbl> <date>     <lgl>
#  1    12 2019-01-01 TRUE 
#  2    12 2019-01-07 FALSE
#  3    12 2019-01-21 FALSE
#  4    12 2019-02-02 TRUE 
#  5    12 2019-02-09 FALSE
#  6    12 2019-03-06 TRUE 
#  7    10 2019-01-05 TRUE 
#  8    10 2019-02-01 FALSE
#  9    10 2019-02-02 FALSE
# 10    10 2019-02-08 TRUE

The iteration rule is following:

1. 2019-01-07 -  2019-01-01   = 6  <  31 then return  2019-01-01
2. 2019-01-21 -  2019-01-01   = 20 <  31 then return  2019-01-01
3. 2019-02-02 -  2019-01-01   = 32 >= 31 then return (2019-02-02)*
4. 2019-02-09 - (2019-02-02)* = 7  <  31 then return  2019-02-02
5. etc.