I have a file (appx 9K records) that I want to aggregate based on the group first, and then on dates that are within seven days of each other. However, I'm not understanding why the results look the way they do. I realize there are other ways I could achieve the same results with this particular example, but it's going to be much more complicated and there are other reasons I'm interested in using tibbletime. Here's a reproducible example:
library(tidyverse)
library(lubridate)
library(tibbletime) #devtools::install_github("business-science/tibbletime")
TEST_ROLL <- as_tibble(list(
CITY_ID = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2",
"2", "2", "2"),
CAFE_ID = c("1001", "1001", "1001", "1001", "2002", "2002", "2002", "2002",
"3003", "3003", "3003", "3003", "4004", "4004", "4005", "4005"),
HEALTH_REPORT_ID = c("1425", "1532", "1666", "1989", "2166", "2457", "2776",
"2999", "3409", "3497", "3595", "3786", "4105", "4511", "4567", "4580"),
REPORT_SUBMIT_DATE = ymd( c( "2013-05-26", "2013-05-27", "2013-05-31",
"2013-05-31", "2016-05-26", "2016-05-27", "2016-05-31", "2016-05-31", "2018-
05-26", "2018-05-27", "2018-05-31", "2018-05-31", "2017-01-01", "2017-01-
05", "2017-02-04", "2017-02-10"))))
What I want to do is, starting at the first report for each cafe, count the health report submissions that occur within seven days of each other as only one report so they're not being over counted.
First I tried using "7 day" as the period:
TEST_ROLL %>%
group_by(CAFE_ID) %>%
as_tbl_time(REPORT_SUBMIT_DATE) %>%
mutate(ROLL_DATE = collapse_index(REPORT_SUBMIT_DATE, "7 day"))
# A time tibble: 16 x 5
# Index: REPORT_SUBMIT_DATE
# Groups: CAFE_ID [5]
CITY_ID CAFE_ID HEALTH_REPORT_ID REPORT_SUBMIT_DATE ROLL_DATE
<chr> <chr> <chr> <date> <date>
1 1 1001 1425 2013-05-26 2013-05-27
2 1 1001 1532 2013-05-27 2013-05-27
3 1 1001 1666 2013-05-31 2013-05-31
4 1 1001 1989 2013-05-31 2013-05-31
5 1 2002 2166 2016-05-26 2016-05-27
6 1 2002 2457 2016-05-27 2016-05-27
7 1 2002 2776 2016-05-31 2016-05-31
8 1 2002 2999 2016-05-31 2016-05-31
9 1 3003 3409 2018-05-26 2018-05-27
10 1 3003 3497 2018-05-27 2018-05-27
11 1 3003 3595 2018-05-31 2018-05-31
12 1 3003 3786 2018-05-31 2018-05-31
13 2 4004 4105 2017-01-01 2017-01-05
14 2 4004 4511 2017-01-05 2017-01-05
15 2 4005 4567 2017-02-04 2017-02-04
16 2 4005 4580 2017-02-10 2017-02-10
Which is not what I want. If it worked, all four of the reports for cafe 1001 would have the same roll date, because they are all within 7 days. So why the split into two dates in the result column?
Just playing around with it I tried using "weekly" instead of "7 days" and then I get this result:
TEST_ROLL %>%
group_by(CAFE_ID) %>%
as_tbl_time(REPORT_SUBMIT_DATE) %>%
mutate(ROLL_DATE = collapse_index(REPORT_SUBMIT_DATE, "weekly"))
# A time tibble: 16 x 5
# Index: REPORT_SUBMIT_DATE
# Groups: CAFE_ID [5]
CITY_ID CAFE_ID HEALTH_REPORT_ID REPORT_SUBMIT_DATE ROLL_DATE
<chr> <chr> <chr> <date> <date>
1 1 1001 1425 2013-05-26 2013-05-31
2 1 1001 1532 2013-05-27 2013-05-31
3 1 1001 1666 2013-05-31 2013-05-31
4 1 1001 1989 2013-05-31 2013-05-31
5 1 2002 2166 2016-05-26 2016-05-27
6 1 2002 2457 2016-05-27 2016-05-27
7 1 2002 2776 2016-05-31 2016-05-31
8 1 2002 2999 2016-05-31 2016-05-31
9 1 3003 3409 2018-05-26 2018-05-26
10 1 3003 3497 2018-05-27 2018-05-31
11 1 3003 3595 2018-05-31 2018-05-31
12 1 3003 3786 2018-05-31 2018-05-31
13 2 4004 4105 2017-01-01 2017-01-05
14 2 4004 4511 2017-01-05 2017-01-05
15 2 4005 4567 2017-02-04 2017-02-04
16 2 4005 4580 2017-02-10 2017-02-10
Cafe 1001 is exactly what I wanted, but cafe 2002 and 3003 have the same dates (different year), yet the result is different.
And cafe 4004 is combined how I want, but there are only 6 days in between for cafe 4005, so those should have been combined as well. (I will summarise/count them later)
Any ideas why this might be happening? Thanks!!
I don't know if what you are trying to accomplish is a wise choice. I think that going with the default behavior of
collapse_index(REPORT_SUBMIT_DATE, "weekly")
is the sane thing to do.Here is one approach if you want to still go with what you are trying to do, though. I think you need to have a very good overview of which days are within 7 days of each other first.
We can see that
{1,2,3,4}
,{5,6,7,8}
,{9,10,11,12}
,{13,14}
and{15,16}
are forming clusters. Let us ifhclust
can detect these clusters.Here, we can see that we can cut the tree into five branches and we get the desired grouping. We see that the dendogram suggests what we have observed so far. The good thing about going the
hclust
route is that we can easily specify these groupings.Note that
hclust
usesmethod = 'complete'
as default Euclidean distance. You can experiment with other methods as you wish. Check?hclust
for details.Edit
I just realized that you can also directly use the groupings found in
truth_mat
andindices
this way.Then you can
group_by
thegroup
column without the need forhclust
.