I have data that I've downloaded from a server. It has product sales per month and the products have various categories. I would like to summarise the data by category and by quarter.
I can't figure out how to keep the group_by groups when I use the timetk summarise_by_time.
I have tried reading the documentation for summarise_by_time, but I can't see anything about other groups that are not time groupings. This is what I have so far:
plot_dat<-df_m %>%
group_by(date,Area,Main_Category) %>%
summarise(sum_products=sum(Products_sold)) %>%
summarise_by_time(
.by = "quarter",
value = last(sum_products),
.type = "ceiling"
)
This gives the correct output
plot_dat<-df_m %>%
group_by(date,Area,Main_Category) %>%
summarise(sum_products=sum(Products_sold))
output
`summarise()` has grouped output by 'date', 'Area'. You can override using the `.groups` argument.
# A tibble: 900 × 4
# Groups: date, Area [240]
date Area Main_Category sum_products
<date> <chr> <chr> <int>
1 2018-11-27 Analysis "NGS" 4
2 2018-11-27 Analysis "PCR" 0
3 2018-11-27 Collection/Preservation "" 0
4 2018-11-27 Collection/Preservation "Preservation Device" 172
5 2018-11-27 Collection/Preservation "Preservation Reagent" 10
6 2018-11-27 Isolation_and_Purification "DNA kits" 96
But when I add the summarise_by_time they are only grouped by Area and I lose most of the data (ie the sum of the products sold is now only ~800 as opposed to 74,000)
output
# A tibble: 84 × 3
# Groups: date [21]
date Area value
<date> <chr> <int>
1 2019-01-01 Analysis 0
2 2019-01-01 Collection/Preservation 9
3 2019-01-01 Isolation_and_Purification 4
4 2019-01-01 Lab_Supplies 2
5 2019-04-01 Analysis 0
6 2019-04-01 Collection/Preservation 0
7 2019-04-01 Isolation_and_Purification 2
8 2019-04-01 Lab_Supplies 1
9 2019-07-01 Analysis 0
10 2019-07-01 Collection/Preservation 3
My educated guess based on your question is that you're wanting to get one row for the sum of Products, for each combination of (Main_Category, Area, Quarter) that exists in your dataset?
The issue with your code can be seen by looking at the code in your question:
You say it's giving the correct answer, but in fact, you can see that the dataframe is now only grouped by date and area, not date, area, and category, so it's really not the correct answer. Why is it doing this? To quote the documentation for
summarise()
:So in this case, it is dropping the last level of grouping, which in your code is Main_Category, hence your result.
You could fix it in a few different ways, but maybe the simplest is to simply make the date column the last one, and then use that as the
.by
argument insummarise_by_time()
:Output:
Another option would be to give
summarise()
the argument.groups = "keep"
, which would achieve the same result.Note, also, if you would like there to be a row for every combination that could possibly exist, not just those that do exist in your dataset, you can use
complete()
to add rows for every possible combination of Area, Main_Category, and Quarter. E.g. like this:Data: