How to add sub-counts in an R dataframe


I have the below-mentioned dataframe in R:

dput(df)

structure(list(CustID = c("C-1", "C-2", 
"C-2", "C-2", "C-3", "C-3", 
"C-3", "C-4", "C-5"), DATE = c("2021-01-02 14:13:10", "2021-01-02 13:17:07", "2021-01-02 14:15:10", "2021-01-02 16:14:08", "2021-01-02 17:11:03", "2021-01-02 12:14:24", "2021-01-02 12:33:34", "2021-01-02 10:43:55", "2021-01-03 20:23:35"), TYPE = c("Demo", 
"Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro", "Pro"
), LogCat = c(NA, "SPR,DET,RTD", "SPR,DET,RTD", "SPR,DET,RTD", 
"DET", "DET", "DET", NA, " SPR, RTD "), PriceCode = c(NA,"KR", "SR", "DE", "KL", "ZT", "KR", "KR", "KR")), class = "data.frame", row.names = c(NA, 
-9L))

CustID    DATE                   TYPE        LogCat             PriceCode
C-1       2021-01-02 14:13:10    Demo                           
C-2       2021-01-02 13:17:07    Pro         SPR, DET, RTD      KR
C-2       2021-01-02 14:15:10    Pro         SPR, DET, RTD      SR
C-2       2021-01-02 16:14:08    Pro         SPR, DET, RTD      DE
C-3       2021-01-02 17:11:03    Pro         DET                KL
C-3       2021-01-02 12:14:24    Pro         DET                ZT
C-3       2021-01-02 12:33:34    Pro         DET                KR
C-4       2021-01-02 10:43:55    Pro                            KR
C-5       2021-01-03 20:23:35    Pro         SPR, RTD           KR

I'm using the following code, which gives me the below-mentioned output:

library(tidyverse)
library(janitor)
df %>% mutate(DATE = as.Date(DATE)) %>% select(1:3) %>%
  unique() %>%
  tabyl(TYPE, DATE) %>%
  adorn_totals("row") %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(2) %>%
  adorn_ns("front")

  TYPE  2021-01-02  2021-01-03
  Demo 1  (25.00%) 0   (0.00%)
   Pro 3  (75.00%) 1 (100.00%)
 Total 4 (100.00%) 1 (100.00%)

In the above-mentioned output, I need distinct sub-counts based on the LogCat and PriceCode columns. For example, within the total distinct count for a particular TYPE, I need the number of distinct CustIDs that have values in LogCat and PriceCode, and also how many distinct CustIDs do not.

For each TYPE, we need to check the following conditions (see the sketch after this list):

  • LogCat - where a unique CustID has a value only in the LogCat column and PriceCode is either NA or blank.
  • PriceCode - where a unique CustID has a value only in the PriceCode column and LogCat is either NA or blank.
  • Both True - where a unique CustID has a value in both the LogCat and PriceCode columns.
  • Both False - where a unique CustID has a value in neither the LogCat nor the PriceCode column.
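
A minimal dplyr sketch of this classification, assuming NA and blank/whitespace-only strings both count as "no value"; has_val() and cust_cat are illustrative names introduced here, not taken from the code above:

library(dplyr)

# treat NA and blank/whitespace-only strings as "no value"
has_val <- function(x) !is.na(x) & trimws(x) != ""

cust_cat <- df %>%
  mutate(DATE = as.Date(DATE)) %>%
  group_by(DATE, TYPE, CustID) %>%
  summarise(has_log   = any(has_val(LogCat)),
            has_price = any(has_val(PriceCode)),
            .groups = "drop") %>%
  mutate(category = case_when(
    has_log  & !has_price ~ "LogCat",      # value only in LogCat
    !has_log & has_price  ~ "PriceCode",   # value only in PriceCode
    has_log  & has_price  ~ "Both True",   # value in both columns
    TRUE                  ~ "Both False"   # value in neither column
  ))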

The percentages for these values should be calculated against the distinct count of the particular TYPE for which we are calculating them. For example, if we are calculating these values for the Demo type, the denominator should be 1 for the first date and 0 for the second date.
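
Continuing that sketch, the counts and percentages can then be computed with the number of distinct customers of each TYPE on each date as the denominator (cat_counts is another illustrative name):

# count customers per category, then divide by the number of distinct
# customers of that TYPE on that DATE (the denominator described above)
cat_counts <- cust_cat %>%
  count(DATE, TYPE, category, name = "n") %>%
  group_by(DATE, TYPE) %>%
  mutate(pct = sprintf("%d (%.2f%%)", n, 100 * n / sum(n))) %>%
  ungroup()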

Required dataframe:

  TYPE           2021-01-02   2021-01-03
  Demo           1 (25.00%)   0 (0.00%)
  LogCat         0 (0.00%)    0 (0.00%)
  PriceCode      0 (0.00%)    0 (0.00%)
  Both True      0 (0.00%)    0 (0.00%)
  Both False     1 (100.00%)  0 (0.00%) 
  Pro            3 (75.00%)   1 (100.00%)
  LogCat         0 (0.00%)    0 (0.00%)
  PriceCode      1 (33.33%)   0 (0.00%)
  Both True      2 (66.66%)   1 (100.00%)
  Both False     0 (0.00%)    0 (0.00%)
  Total          4 (100.00%)  1 (100.00%)

There are 2 answers below.

Sinh Nguyen (Best Answer)

Here is what I understand from your OP - quite complicated logic ;)

library(tidyverse)
library(scales)

# grain level data-frame
grain_df <- df %>%
  # convert DATE strings to Date class
  mutate(DATE = as.Date(DATE)) %>%
  separate_rows(LogCat) %>%
  # categorize each customer according to the logic you defined
  group_by(DATE, TYPE, CustID) %>%
  summarize(category = case_when(
    # all LogCat is missing & Price is present
    all(is.na(LogCat)) & any(!is.na(PriceCode)) ~ "Missing_LogCat",
    # LogCat is present & Price is missing
    any(!is.na(LogCat)) & all(is.na(PriceCode)) ~ "Missing_PriceCode",
    # Both present
    any(!is.na(LogCat)) & any(!is.na(PriceCode)) ~ "Both_True",
    # Both missing
    all(is.na(LogCat)) & all(is.na(PriceCode)) ~ "Both_False",
    TRUE ~ "No category yet"
  ), .groups = "drop") %>%
  # count the customer in each category by type & date
  group_by(DATE, TYPE, category) %>%
  summarize(count = n_distinct(CustID), .groups = "drop") %>%
  # complete the data set with all combinations of DATE/TYPE/category
  complete(DATE, TYPE, category, fill = list(count = 0))
# summarize and apply some formatting
summary_df <- grain_df %>%
  # calculate the percent
  group_by(DATE, TYPE) %>%
  mutate(percent = paste0(count, " (", percent(count / sum(count)), ")")) %>%
  ungroup() %>%
  select(-count) %>%
  # Pivot wider to have the format you want
  pivot_wider(names_from = DATE, values_from = percent, values_fill = "")

# calculate total row & formatting
total_rows <- grain_df %>%
  group_by(DATE) %>%
  summarize(percent = paste0(sum(count), " (100%)")) %>%
  mutate(TYPE = "Total", category = "") %>%
  pivot_wider(names_from = DATE, values_from = percent)
# combine the two to get a final df with the total row
bind_rows(summary_df,
  total_rows)
#> # A tibble: 7 x 4
#>   TYPE  category         `2021-01-02` `2021-01-03`
#>   <chr> <chr>            <chr>        <chr>       
#> 1 Demo  "Both_False"     1 (100%)     0 (NA)      
#> 2 Demo  "Both_True"      0 (0%)       0 (NA)      
#> 3 Demo  "Missing_LogCat" 0 (0%)       0 (NA)      
#> 4 Pro   "Both_False"     0 (0%)       0 (0%)      
#> 5 Pro   "Both_True"      2 (67%)      1 (100%)    
#> 6 Pro   "Missing_LogCat" 1 (33%)      0 (0%)      
#> 7 Total ""               4 (100%)     1 (100%)

Created on 2021-04-18 by the reprex package (v2.0.0)

And here is a way to generate a pivot-style table in RStudio:

library(kableExtra)
kbl(summary_df, align = "c") %>%
  kable_paper(full_width = F) %>%
  column_spec(1:2, bold = T) %>%
  collapse_rows(columns = 1:2, target = 1, valign = "top")

This renders a pivot-style table (with the TYPE column collapsed) in the RStudio Viewer pane.

AnilGoyal

The pivottabler package may help here:

df %>% mutate(DATE = as.Date(DATE)) %>%
  mutate(across(.cols = c(LogCat, PriceCode), ~ +!(is.na(.) | . == ""))) %>%
  mutate(dummy = case_when(LogCat + PriceCode == 0 ~ "Both False",
                           LogCat + PriceCode == 2 ~ "Both True",
                           LogCat == 1 ~ "Logcat",
                           TRUE ~ "PriceCode")) %>%
  mutate(dummy = factor(dummy, levels = c("Both False", "Both True", "Logcat", "PriceCode")),
         d2 = 1) %>%
  group_by(CustID, DATE, TYPE) %>%
  complete(dummy) %>% ungroup() %>% unique() %>%
  select(c(2:4,7)) -> df1

library(pivottabler)  

qpvt(df1, c("TYPE", "dummy"), "DATE", "sum(d2, na.rm = T)")

                   2021-01-02  2021-01-03  Total  
Demo   Both False           1                  1  
       Both True            0                  0  
       Logcat               0                  0  
       PriceCode            0                  0  
       Total                1                  1  
Pro    Both False           0           0      0  
       Both True            2           1      3  
       Logcat               0           0      0  
       PriceCode            1           0      1  
       Total                3           1      4  
Total                       4           1      5 

or

qhpvt(df1, c("TYPE", "dummy"), "DATE", "sum(d2, na.rm = T)")

(screenshot: the same pivot rendered as an HTML table)