Grouped tibbletime and using collapse_index, getting weird results

139 Views Asked by At

I have a file (appx 9K records) that I want to aggregate based on the group first, and then on dates that are within seven days of each other. However, I'm not understanding why the results look the way they do. I realize there are other ways I could achieve the same results with this particular example, but it's going to be much more complicated and there are other reasons I'm interested in using tibbletime. Here's a reproducible example:

library(tidyverse)
library(lubridate)  
library(tibbletime) #devtools::install_github("business-science/tibbletime")

TEST_ROLL <- as_tibble(list(
CITY_ID = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", 
"2", "2", "2"),
CAFE_ID = c("1001", "1001", "1001", "1001", "2002", "2002", "2002", "2002", 
"3003", "3003", "3003", "3003", "4004", "4004", "4005", "4005"),
HEALTH_REPORT_ID = c("1425", "1532", "1666", "1989", "2166", "2457", "2776", 
"2999", "3409", "3497", "3595", "3786", "4105", "4511", "4567", "4580"),
REPORT_SUBMIT_DATE = ymd( c( "2013-05-26", "2013-05-27", "2013-05-31", 
"2013-05-31", "2016-05-26", "2016-05-27", "2016-05-31", "2016-05-31", "2018- 
05-26", "2018-05-27", "2018-05-31", "2018-05-31", "2017-01-01", "2017-01- 
05", "2017-02-04", "2017-02-10"))))

What I want to do is, starting at the first report for each cafe, count the health report submissions that occur within seven days of each other as only one report so they're not being over counted.

First I tried using "7 day" as the period:

TEST_ROLL %>% 
  group_by(CAFE_ID) %>% 
  as_tbl_time(REPORT_SUBMIT_DATE) %>% 
  mutate(ROLL_DATE = collapse_index(REPORT_SUBMIT_DATE, "7 day"))

# A time tibble: 16 x 5
# Index:  REPORT_SUBMIT_DATE
# Groups: CAFE_ID [5]
CITY_ID CAFE_ID HEALTH_REPORT_ID REPORT_SUBMIT_DATE ROLL_DATE 
<chr>   <chr>   <chr>            <date>             <date>    
 1 1       1001    1425             2013-05-26         2013-05-27
 2 1       1001    1532             2013-05-27         2013-05-27
 3 1       1001    1666             2013-05-31         2013-05-31
 4 1       1001    1989             2013-05-31         2013-05-31
 5 1       2002    2166             2016-05-26         2016-05-27
 6 1       2002    2457             2016-05-27         2016-05-27
 7 1       2002    2776             2016-05-31         2016-05-31
 8 1       2002    2999             2016-05-31         2016-05-31
 9 1       3003    3409             2018-05-26         2018-05-27
10 1       3003    3497             2018-05-27         2018-05-27
11 1       3003    3595             2018-05-31         2018-05-31
12 1       3003    3786             2018-05-31         2018-05-31
13 2       4004    4105             2017-01-01         2017-01-05
14 2       4004    4511             2017-01-05         2017-01-05
15 2       4005    4567             2017-02-04         2017-02-04
16 2       4005    4580             2017-02-10         2017-02-10

Which is not what I want. If it worked, all four of the reports for cafe 1001 would have the same roll date, because they are all within 7 days. So why the split into two dates in the result column?

Just playing around with it I tried using "weekly" instead of "7 days" and then I get this result:

TEST_ROLL %>% 
  group_by(CAFE_ID) %>% 
  as_tbl_time(REPORT_SUBMIT_DATE) %>%
  mutate(ROLL_DATE = collapse_index(REPORT_SUBMIT_DATE, "weekly"))

# A time tibble: 16 x 5
# Index:  REPORT_SUBMIT_DATE
# Groups: CAFE_ID [5]
   CITY_ID CAFE_ID HEALTH_REPORT_ID REPORT_SUBMIT_DATE ROLL_DATE 
   <chr>   <chr>   <chr>            <date>             <date>    
 1 1       1001    1425             2013-05-26         2013-05-31
 2 1       1001    1532             2013-05-27         2013-05-31
 3 1       1001    1666             2013-05-31         2013-05-31
 4 1       1001    1989             2013-05-31         2013-05-31
 5 1       2002    2166             2016-05-26         2016-05-27
 6 1       2002    2457             2016-05-27         2016-05-27
 7 1       2002    2776             2016-05-31         2016-05-31
 8 1       2002    2999             2016-05-31         2016-05-31
 9 1       3003    3409             2018-05-26         2018-05-26
10 1       3003    3497             2018-05-27         2018-05-31
11 1       3003    3595             2018-05-31         2018-05-31
12 1       3003    3786             2018-05-31         2018-05-31
13 2       4004    4105             2017-01-01         2017-01-05
14 2       4004    4511             2017-01-05         2017-01-05
15 2       4005    4567             2017-02-04         2017-02-04
16 2       4005    4580             2017-02-10         2017-02-10

Cafe 1001 is exactly what I wanted, but cafe 2002 and 3003 have the same dates (different year), yet the result is different.

And cafe 4004 is combined how I want, but there are only 6 days in between for cafe 4005, so those should have been combined as well. (I will summarise/count them later)

Any ideas why this might be happening? Thanks!!

1

There are 1 best solutions below

2
On

I don't know if what you are trying to accomplish is a wise choice. I think that going with the default behavior of collapse_index(REPORT_SUBMIT_DATE, "weekly") is the sane thing to do.

Here is one approach if you want to still go with what you are trying to do, though. I think you need to have a very good overview of which days are within 7 days of each other first.

Date <- TEST_ROLL$REPORT_SUBMIT_DATE
truth_mat <- abs(sapply(Date, 'difftime', Date, unit = 'day')) < 7
indices <- which(truth_mat, arr.ind = TRUE)
as_tibble(indices) %>% group_by(row) %>%
  summarise_at(vars(col), paste, collapse = ', ')

# # A tibble: 16 x 2
#      row col          
#    <int> <chr>        
#  1     1 1, 2, 3, 4   
#  2     2 1, 2, 3, 4   
#  3     3 1, 2, 3, 4   
#  4     4 1, 2, 3, 4   
#  5     5 5, 6, 7, 8   
#  6     6 5, 6, 7, 8   
#  7     7 5, 6, 7, 8   
#  8     8 5, 6, 7, 8   
#  9     9 9, 10, 11, 12
# 10    10 9, 10, 11, 12
# 11    11 9, 10, 11, 12
# 12    12 9, 10, 11, 12
# 13    13 13, 14       
# 14    14 13, 14       
# 15    15 15, 16       
# 16    16 15, 16 

We can see that {1,2,3,4}, {5,6,7,8}, {9,10,11,12}, {13,14} and {15,16} are forming clusters. Let us if hclust can detect these clusters.

hc <- hclust(dist(Date))
plot(hc)

enter image description here

Here, we can see that we can cut the tree into five branches and we get the desired grouping. We see that the dendogram suggests what we have observed so far. The good thing about going the hclust route is that we can easily specify these groupings.

TEST_ROLL$Group <- cutree(hc, 5)

TEST_ROLL
# # A tibble: 16 x 5
#    CITY_ID CAFE_ID HEALTH_REPORT_ID REPORT_SUBMIT_DATE  Date
#    <chr>   <chr>   <chr>            <date>             <int>
#  1 1       1001    1425             2013-05-26             1
#  2 1       1001    1532             2013-05-27             1
#  3 1       1001    1666             2013-05-31             1
#  4 1       1001    1989             2013-05-31             1
#  5 1       2002    2166             2016-05-26             2
#  6 1       2002    2457             2016-05-27             2
#  7 1       2002    2776             2016-05-31             2
#  8 1       2002    2999             2016-05-31             2
#  9 1       3003    3409             2018-05-26             3
# 10 1       3003    3497             2018-05-27             3
# 11 1       3003    3595             2018-05-31             3
# 12 1       3003    3786             2018-05-31             3
# 13 2       4004    4105             2017-01-01             4
# 14 2       4004    4511             2017-01-05             4
# 15 2       4005    4567             2017-02-04             5
# 16 2       4005    4580             2017-02-10             5

Note that hclust uses method = 'complete' as default Euclidean distance. You can experiment with other methods as you wish. Check ?hclust for details.

Edit

I just realized that you can also directly use the groupings found in truth_mat and indices this way.

groups <- as_tibble(indices) %>% group_by(row) %>%
  summarise_at(vars(col), paste, collapse = ', ') 
TEST_ROLL$group <- groups$col

Then you can group_by the group column without the need for hclust.