Is there a dplyr function to determine the most commonly encountered categorical value within a group?

Question

Is there a dplyr function to determine the most commonly encountered categorical value within a group?

488 Views Asked by Brisbane Pom At 22 January 2019 at 06:38

I am looking to summarize a customer transactional dataframe to a single row per customer using dplyr. For continuous variables this is simple - use sum / mean etc. For categorical variables I would like to choose the "Mode" - i.e. the most commonly encountered value within the group and do this across multiple columns e.g.:

For example to take the table Cus1

Cus <- data.frame(Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03"),
             Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH"),
              Store = c("NYC", "NYC", "Chicago", "Chicago", "Detroit", "Detroit", "LA", "San Fran")
              )

And generate the table Cus_Summary:

Cus_Summary <- data.frame(Customer = c("C-01", "C-02", "C-03"),
              Product = c("COKE", "BURGER", "CHICKEN"),
              Store = c("NYC", "Chicago", "LA")
              )

Are there any packages that can provide this function? Or has anyone a function that can be applied across multiple columns within a dplyr step?

I am not worried about smart ways to handle ties - any output for a tie will suffice (although any suggestions as to how to best handle ties would be interesting and appreciated).

Original Q&A

There are 4 best solutions below

Ronak Shah On 22 January 2019 at 06:52

If you have many columns and want to find out maximum occurrence in all the columns you could use gather to convert the data in long format, count the occurrence for each column, group_by Customer and column and keep only the rows with maximum count and then spread it back to wide format.

library(tidyverse)

Cus %>%
  gather(key, value, -Customer) %>%
  count(Customer, key, value) %>%
  group_by(Customer, key) %>%
  slice(which.max(n)) %>%
  ungroup() %>%
  spread(key, value) %>%
  select(-n)

# Customer Product Store  
#  <fct>    <chr>   <chr>  
#1 C-01     COKE    NYC    
#2 C-02     BURGER  Chicago
#3 C-03     CHICKEN LA

EDIT

In case of ties if we want to randomly select ties we can filter all the max values and then use sample_n function to select random rows.

Cus %>%
  gather(key, value, -Customer) %>%
  count(Customer, key, value) %>%
  group_by(Customer, key) %>%
  filter(n == max(n)) %>%
  sample_n(1) %>%
  ungroup() %>%
  spread(key, value) %>%
  select(-n)


# Customer Product Store   
#  <fct>    <chr>   <chr>   
#1 C-01     COKE    NYC     
#2 C-02     BURGER  Chicago 
#3 C-03     FISH    San Fran

Paweł Chabros On 22 January 2019 at 09:25

In my solution, if there are more than one most frequent value, all are presented:

library(tidyverse)

Cus %>%
  gather('type', 'value', -Customer) %>%
  group_by(Customer, type, value) %>%
  count() %>%
  group_by(Customer) %>%
  filter(n == max(n)) %>%
  nest() %>%
  mutate(
    Product = map_chr(data, ~str_c(filter(.x, type == 'Product') %>% pull(value), collapse = ', ')),
    Store = map_chr(data, ~str_c(filter(.x, type == 'Store') %>% pull(value), collapse = ', '))
  ) %>%
  select(-data)

Result is:

# A tibble: 3 x 3
  Customer Product       Store           
  <fct>    <chr>         <chr>           
1 C-01     COKE          NYC             
2 C-02     BURGER        Chicago, Detroit
3 C-03     CHICKEN, FISH LA, San Fran

moodymudskipper On 22 January 2019 at 15:18

Using SO's favourite Mode function (though you could use any):

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

In base R

aggregate(. ~ Customer, lapply(Cus,as.character),  Mode)
#   Customer Product   Store
# 1     C-01    COKE     NYC
# 2     C-02  BURGER Chicago
# 3     C-03 CHICKEN      LA

using dplyr

library(dplyr)
Cus %>%
  group_by(Customer) %>%
  summarise_all(Mode)

# # A tibble: 3 x 3
# Customer Product   Store
# <fctr>  <fctr>  <fctr>
# 1     C-01    COKE     NYC
# 2     C-02  BURGER Chicago
# 3     C-03 CHICKEN      LA

**Maurits Evers** · Accepted Answer · 2019-01-22T06:47:25.123000

How about this?

Cus %>%
    group_by(Customer) %>%
    summarise(
        Product = first(names(sort(table(Product), decreasing = TRUE))),
        Store = first(names(sort(table(Store), decreasing = TRUE))))
## A tibble: 3 x 3
#  Customer Product Store
#  <fct>    <chr>   <chr>
#1 C-01     COKE    NYC
#2 C-02     BURGER  Chicago
#3 C-03     CHICKEN LA

Note that in the case of ties this selects the first entry in alphabetical order.

Update

To randomly select an entry from tied top frequency entries we could define a custom function

top_random <- function(x) {
    tbl <- sort(table(x), decreasing = T)
    top <- tbl[tbl == max(tbl)]
    return(sample(names(top), 1))
}

Then the following randomly selects one of the tied top entries:

Cus %>%
    group_by(Customer) %>%
    summarise(
        Product = top_random(Product),
        Store = top_random(Store))

Is there a dplyr function to determine the most commonly encountered categorical value within a group?

There are 4 best solutions below

Update

Related Questions in R

Related Questions in DPLYR

Related Questions in GROUP-SUMMARIES

Trending Questions

Popular # Hahtags

Popular Questions