What is the easiest way to group/recode many categories into a few categories?

I have a column with almost 100 string categories that I would like to group/recode into fewer categories. I am trying to figure out the easiest way to do so. I thought about turning the column into a factor or a numeric vector to make it easier to operate on. The categories are not in any particular order, and I can't seem to find the best way to recode them. Here is an example:

Suppose I have 15 string categories:

cat1 <- LETTERS[seq(1,15)]
df <- as.data.frame(cat1)

I turned it into numeric:

df$cat2 <- as.numeric(as.factor(df$cat1))

This is what I tried to do:

df <- df %>% mutate(cat3 = case_when(cat2 == c(1:5,7,9) ~ 1,
                                     cat2 == c(6,8,10,13) ~ 2,
                                     cat2 == (11:12,14:15) ~ 3))

Or I even tried:

df$cat3[df$cat2 == c(1:5, 7,9)] <- 1

I tried other approaches, but none of them seem to work. Suppose I want to group the values into the following new categories:

(1:5, 7,9) (6,8,10,13) (11:12,14:15)

What is the best way to do it?

Maël On BEST ANSWER

Your case_when call needs one tweak to work: use %in% instead of ==, because == compares element by element rather than testing membership in a set:

df %>% mutate(cat3 = case_when(cat2 %in% c(1:5, 7, 9) ~ 1,
                               cat2 %in% c(6,8,10,13) ~ 2,
                               cat2 %in% c(11:12,14:15) ~ 3))

You can also use the single-vector version, case_match (available in dplyr 1.1.0 and later):

df %>% mutate(cat3 = case_match(cat2, 
                                c(1:5, 7, 9) ~ 1,
                                c(6,8,10,13) ~ 2,
                                c(11:12,14:15) ~ 3))
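
To see why the original `==` attempt fails, here is a minimal base R sketch. The names `x`, `grp`, `eq` and `isin` are placeholders introduced for illustration; they are not from the question.

```r
x   <- 1:15
grp <- c(1:5, 7, 9)

# `==` compares element by element and recycles the shorter vector,
# so x[6] is compared with grp[6] (7), x[7] with grp[7] (9), and so on.
# R also warns here because 15 is not a multiple of 7.
eq <- suppressWarnings(x == grp)

# `%in%` asks, for each element of x, whether it occurs anywhere in grp.
isin <- x %in% grp

which(eq)    # 1 2 3 4 5
which(isin)  # 1 2 3 4 5 7 9
```

Only `%in%` picks up 7 and 9, which is why the `case_when` version above uses it.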
jeffreyohene On

You were almost there with the case_when statement. You can modify it like this, with a fallback to NA for unmatched values in case you want to apply it to a larger dataset:

df <- df %>% 
  mutate(cat3 = case_when(
    cat2 %in% c(1:5, 7, 9)   ~ 1,
    cat2 %in% c(6, 8, 10, 13) ~ 2,
    cat2 %in% c(11:12, 14:15) ~ 3,
    TRUE                      ~ NA_real_  # double NA matches the numeric results; dplyr < 1.1.0 rejects NA_integer_ here
  ))
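
Since the real column holds strings, the numeric conversion may not be needed at all. As a rough sketch, assuming the letters A to O stand in for the real category labels (they simply mirror positions 1 to 15 from the question), the same pattern can be applied to `cat1` directly:

```r
library(dplyr)

df <- data.frame(cat1 = LETTERS[1:15])

# The letter groups below are illustrative and correspond to
# (1:5, 7, 9), (6, 8, 10, 13) and (11:12, 14:15) in the question.
df <- df %>%
  mutate(cat3 = case_when(
    cat1 %in% c("A", "B", "C", "D", "E", "G", "I") ~ 1,
    cat1 %in% c("F", "H", "J", "M")                ~ 2,
    cat1 %in% c("K", "L", "N", "O")                ~ 3,
    TRUE                                           ~ NA_real_
  ))
```

With roughly 100 real categories, the vectors on the left-hand side would list the actual labels belonging to each new group.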
Onyambu On

Though recode is superseded, this is one of the use cases it handles in a fairly straightforward way (deframe comes from the tibble package):

cats <- list('1' = c(1:5, 7,9),
             '2' = c(6,8,10,13),
             '3' = c (11:12,14:15))

mutate(df, cat3 = recode(cat2, !!!deframe(stack(cats))))

  cat1 cat2 cat3
1     A    1    1
2     B    2    1
3     C    3    1
4     D    4    1
5     E    5    1
6     F    6    2
7     G    7    1
8     H    8    2
9     I    9    1
10    J   10    2
11    K   11    3
12    L   12    3
13    M   13    2
14    N   14    3
15    O   15    3

In base R, you can do:

df$cat3<- do.call(setNames, unname(stack(cats))[2:1])[as.character(df$cat2)]
df
   cat1 cat2 cat3
1     A    1    1
2     B    2    1
3     C    3    1
4     D    4    1
5     E    5    1
6     F    6    2
7     G    7    1
8     H    8    2
9     I    9    1
10    J   10    2
11    K   11    3
12    L   12    3
13    M   13    2
14    N   14    3
15    O   15    3
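
The base R one-liner above is compact but dense. A more explicit variant of the same lookup idea might look like the sketch below. `lookup` is a name introduced here for illustration, and this version returns an integer column rather than a factor.

```r
df <- data.frame(cat1 = LETTERS[1:15])
df$cat2 <- as.numeric(as.factor(df$cat1))

cats <- list(`1` = c(1:5, 7, 9),
             `2` = c(6, 8, 10, 13),
             `3` = c(11:12, 14:15))

# Build a named vector: names are the old codes, values are the new groups.
# rep() repeats each group label once per member of that group.
lookup <- setNames(rep(names(cats), lengths(cats)),
                   unlist(cats, use.names = FALSE))

# Index the lookup by the old code (as a string) to get the new group.
df$cat3 <- as.integer(lookup[as.character(df$cat2)])
```

Codes that are missing from `cats` come back as NA, so unmatched categories are easy to spot afterwards.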