Using mutate to paste a column name if a condition is matched

Question

Using mutate to paste a column name if a condition is matched

105 Views Asked by Luis At 13 February 2024 at 04:09

Let's say I work in a psychological context and I'm wondering how many risk factors a patient has. After that, I would like to list all the risks and then discover the most prevalent risk (mode). I'm thinking on use mutate and then paste0 and get the colname if the value of the row is "risk". However, I'm having a hard time with that. any help is appreaciated.

Code is below:

library(tidyverse)
df = data.frame(
  patient = seq(1:60),
  cancer = c("risk","ok"), 
  blood_pres = c("risk", "ok"),
  low_education = c("risk","ok")
) 

df = df %>% mutate(how_many_risks =
                     rowSums(. == "risk"))

Original Q&A

There are 5 best solutions below

SGE On 13 February 2024 at 04:23

The c_across() function is what you are missing. Using your example data:


risk_factors <- c('cancer', 'blood_pres', 'low_education')

df <- df %>%
  rowwise() %>% 
  mutate(how_many_risks = sum(c_across(all_of(risk_factors)) == "risk"),
         what_risks = paste0(risk_factors[which(c_across(all_of(risk_factors)) == "risk")], collapse = ";")) %>% 
  ungroup()

You could add an extra line of logic to report the empty cases as 'none' (as in your example) with:

df2 <- df %>% 
  mutate(what_risks = if_else(what_risks == "", "none", what_risks))

based on the OP's comments, assuming the variables all begin with "risk" and no longer requiring the name vector:

df <- df %>%
  rowwise() %>% 
  mutate(how_many_risks = sum(c_across(starts_with("risk")) == "risk"),
         what_risks = paste0(colnames(.)[which(c_across(starts_with("risk")) == "risk")], collapse = ";")
         ) %>% 
  ungroup()

Anika Islam On 13 February 2024 at 04:26

risk_factors <- c('cancer', 'blood_pres', 'low_education')

benson23 On 13 February 2024 at 04:31

I think a single mutate call is enough for this operation (data taken from @r2evans).

Here I didn't use rowwise, but used sapply to iterate through the rows to find values matching "risk".

library(dplyr)

set.seed(43)
df <- data.frame(patient = 1:10, cancer = sample(c("risk","ok"), size=10, replace=TRUE), blood_pres = sample(c("risk","ok"), size=10, replace=TRUE), low_education = sample(c("risk","ok"), size=10, replace=TRUE))

df %>% 
  mutate(how_many_risks = rowSums(. == "risk"),
         which_risks = ifelse(how_many_risks == 0, "no risk", paste0(sapply(1:nrow(df), \(x) paste(colnames(df[x, -1])[df[x, -1] == "risk"], collapse = ", ")))))

   patient cancer blood_pres low_education how_many_risks                       which_risks
1        1     ok       risk          risk              2         blood_pres, low_education
2        2     ok       risk          risk              2         blood_pres, low_education
3        3     ok         ok            ok              0                           no risk
4        4   risk       risk          risk              3 cancer, blood_pres, low_education
5        5     ok         ok          risk              1                     low_education
6        6   risk       risk            ok              2                cancer, blood_pres
7        7     ok         ok            ok              0                           no risk
8        8     ok       risk            ok              1                        blood_pres
9        9     ok         ok            ok              0                           no risk
10      10   risk       risk          risk              3 cancer, blood_pres, low_education

AnilGoyal On 13 February 2024 at 07:21

One More solution using dplyr c_across and across both. In this strategy you do not have to create explicit vector of risk_names.

library(tidyverse)


df %>% 
  mutate(across(cancer:low_education, ~ifelse(. == 'risk', cur_column(), "No Risk"), .names = "{.col}_N")) %>% 
  rowwise() %>% 
  mutate(which_risks = paste(unique(c_across(ends_with("_N"))), collapse = ", ")) %>% 
  select(-ends_with("_N"))
#> # A tibble: 60 × 6
#> # Rowwise: 
#>    patient cancer blood_pres low_education how_many_risks which_risks           
#>      <int> <chr>  <chr>      <chr>                  <dbl> <chr>                 
#>  1       1 risk   risk       risk                       3 cancer, blood_pres, l…
#>  2       2 ok     ok         ok                         0 No Risk               
#>  3       3 risk   risk       risk                       3 cancer, blood_pres, l…
#>  4       4 ok     ok         ok                         0 No Risk               
#>  5       5 risk   risk       risk                       3 cancer, blood_pres, l…
#>  6       6 ok     ok         ok                         0 No Risk               
#>  7       7 risk   risk       risk                       3 cancer, blood_pres, l…
#>  8       8 ok     ok         ok                         0 No Risk               
#>  9       9 risk   risk       risk                       3 cancer, blood_pres, l…
#> 10      10 ok     ok         ok                         0 No Risk               
#> # ℹ 50 more rows

^{Created on 2024-02-13 with reprex v2.0.2}

**r2evans** · Accepted Answer · 2024-02-13T04:24:14.080000

Let's come up with some more interesting data.

set.seed(43)
df <- data.frame(patient = 1:10, cancer = sample(c("risk","ok"), size=10, replace=TRUE), blood_pres = sample(c("risk","ok"), size=10, replace=TRUE), low_education = sample(c("risk","ok"), size=10, replace=TRUE))
df
#    patient cancer blood_pres low_education
# 1        1     ok       risk          risk
# 2        2     ok       risk          risk
# 3        3     ok         ok            ok
# 4        4   risk       risk          risk
# 5        5     ok         ok          risk
# 6        6   risk       risk            ok
# 7        7     ok         ok            ok
# 8        8     ok       risk            ok
# 9        9     ok         ok            ok
# 10      10   risk       risk          risk

From here, we'll pivot, summarize, then join back onto the original data.

library(dplyr)
library(tidyr) # pivot_*
df %>%
  pivot_longer(cols = -patient, values_to = "risk") %>%
  filter(risk == "risk") %>%
  summarize(howmany = n(), risks = toString(name), .by = patient) %>%
  left_join(df, ., by = "patient") %>%
  mutate(howmany = coalesce(howmany, 0))
#    patient cancer blood_pres low_education howmany                             risks
# 1        1     ok       risk          risk       2         blood_pres, low_education
# 2        2     ok       risk          risk       2         blood_pres, low_education
# 3        3     ok         ok            ok       0                              <NA>
# 4        4   risk       risk          risk       3 cancer, blood_pres, low_education
# 5        5     ok         ok          risk       1                     low_education
# 6        6   risk       risk            ok       2                cancer, blood_pres
# 7        7     ok         ok            ok       0                              <NA>
# 8        8     ok       risk            ok       1                        blood_pres
# 9        9     ok         ok            ok       0                              <NA>
# 10      10   risk       risk          risk       3 cancer, blood_pres, low_education

(Note that dplyr_1.1.0 or newer is required to use .by=. If you have an older dplyr and will not update, shift to using group_by(patient) instead of .by=patient.)

Something you may want to consider: unless this is solely for presentation tables, it is occasionally advantageous to have risks as a list-column instead of a comma-delimited string. To do this, just replace toString with list, and while it may render the same on the console, it will allow things like set-ops on it (though normal column/vector operations may not work as you expect):

out <- df %>%
  pivot_longer(cols = -patient, values_to = "risk") %>%
  filter(risk == "risk") %>%
  summarize(howmany = n(), risks = list(name), .by = patient) %>%
  left_join(df, ., by = "patient") %>%
  mutate(howmany = coalesce(howmany, 0))
out
#    patient cancer blood_pres low_education howmany                             risks
# 1        1     ok       risk          risk       2         blood_pres, low_education
# 2        2     ok       risk          risk       2         blood_pres, low_education
# 3        3     ok         ok            ok       0                              NULL
# 4        4   risk       risk          risk       3 cancer, blood_pres, low_education
# 5        5     ok         ok          risk       1                     low_education
# 6        6   risk       risk            ok       2                cancer, blood_pres
# 7        7     ok         ok            ok       0                              NULL
# 8        8     ok       risk            ok       1                        blood_pres
# 9        9     ok         ok            ok       0                              NULL
# 10      10   risk       risk          risk       3 cancer, blood_pres, low_education

If this data were a tibble (tbl_df) instead, the same data would present as

tibble(out)
# # A tibble: 10 × 6
#    patient cancer blood_pres low_education howmany risks    
#      <int> <chr>  <chr>      <chr>           <dbl> <list>   
#  1       1 ok     risk       risk                2 <chr [2]>
#  2       2 ok     risk       risk                2 <chr [2]>
#  3       3 ok     ok         ok                  0 <NULL>   
#  4       4 risk   risk       risk                3 <chr [3]>
#  5       5 ok     ok         risk                1 <chr [1]>
#  6       6 risk   risk       ok                  2 <chr [2]>
#  7       7 ok     ok         ok                  0 <NULL>   
#  8       8 ok     risk       ok                  1 <chr [1]>
#  9       9 ok     ok         ok                  0 <NULL>   
# 10      10 risk   risk       risk                3 <chr [3]>

We can do things directly such as check the lengths of each row in that column; or check quickly for exact set-membership:

lengths(out$risks)
#  [1] 2 2 0 3 1 2 0 1 0 3

sapply(out$risks, `%in%`, x = "cancer")
#  [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE

Granted, both of those can be done with regex, but .. if the names have any ambiguity, regex carries a little overhead.

Using mutate to paste a column name if a condition is matched

There are 5 best solutions below

Related Questions in R

Related Questions in DPLYR

Related Questions in TIDYR

Related Questions in ROWWISE

Trending Questions

Popular # Hahtags

Popular Questions