Dplyr standard evaluation using a vector of multiple strings with mutate function

I am trying to supply a vector that contains multiple column names to a mutate() call using the dplyr package. Reproducible example below:

library(dplyr)

stackdf <- data.frame(jack = c(1,NA,2,NA,3,NA,4,NA,5,NA),
                      jill = c(1,2,NA,3,4,NA,5,6,NA,7),
                      jane = c(1,2,3,4,5,6,NA,NA,NA,NA))
two_names <- c('jack','jill')
one_name <- c('jack')

#   jack jill jane
#    1    1    1
#   NA    2    2
#    2   NA    3
#   NA    3    4
#    3    4    5
#   NA   NA    6
#    4    5   NA
#   NA    6   NA
#    5   NA   NA
#   NA    7   NA

I can work out how to do this for a single variable, but I do not know how to extend it to multiple variables.

# the below works as expected, and is an example of the output I desire
stackdf %>% rowwise %>% mutate(test = anyNA(c(jack,jill)))

# A tibble: 10 x 4
    jack  jill  jane  test
   <dbl> <dbl> <dbl> <lgl>
 1     1     1     1 FALSE
 2    NA     2     2  TRUE
 3     2    NA     3  TRUE
 4    NA     3     4  TRUE
 5     3     4     5 FALSE
 6    NA    NA     6  TRUE
 7     4     5    NA FALSE
 8    NA     6    NA  TRUE
 9     5    NA    NA  TRUE
10    NA     7    NA  TRUE


# using the one_name variable works if I evaluate it and then convert to 
# a name before unquoting it
stackdf %>% rowwise %>% mutate(test = anyNA(!!as.name(eval(one_name))))

# A tibble: 10 x 4
    jack  jill  jane  test
   <dbl> <dbl> <dbl> <lgl>
 1     1     1     1 FALSE
 2    NA     2     2  TRUE
 3     2    NA     3 FALSE
 4    NA     3     4  TRUE
 5     3     4     5 FALSE
 6    NA    NA     6  TRUE
 7     4     5    NA FALSE
 8    NA     6    NA  TRUE
 9     5    NA    NA FALSE
10    NA     7    NA  TRUE

How can I extend the above approach so that I can use the two_names vector? as.name accepts only a single string, so it does not work here.

This question here is similar: Pass a vector of variable names to arrange() in dplyr. That solution "works" in that I can use the below code:

two_names2 <- quos(c(jack, jill))
stackdf %>% rowwise %>% mutate(test = anyNA(!!!two_names2))

But it defeats the purpose if I have to type c(jack, jill) directly rather than using the two_names variable. Is there a similar procedure where I can use two_names directly? The answer to How to pass a named vector to dplyr::select using quosures? uses rlang::syms, but although this works for selecting variables (i.e. stackdf %>% select(!!! rlang::syms(two_names))), it does not seem to work for supplying arguments when mutating (i.e. stackdf %>% rowwise %>% mutate(test = anyNA(!!! rlang::syms(two_names)))). This answer is similar but does not work either: How to evaluate a constructed string with non-standard evaluation using dplyr?

There are 2 best solutions below

Best answer

There are several keys to solving this question:

  • Accessing the strings within a character vector and using these with dplyr
  • The formatting of arguments provided to the function used with mutate, here anyNA

The goal here is to replicate this call, but using the named variable two_names instead of manually typing out c(jack,jill).

stackdf %>% rowwise %>% mutate(test = anyNA(c(jack,jill)))

# A tibble: 10 x 4
    jack  jill  jane  test
   <dbl> <dbl> <dbl> <lgl>
 1     1     1     1 FALSE
 2    NA     2     2  TRUE
 3     2    NA     3  TRUE
 4    NA     3     4  TRUE
 5     3     4     5 FALSE
 6    NA    NA     6  TRUE
 7     4     5    NA FALSE
 8    NA     6    NA  TRUE
 9     5    NA    NA  TRUE
10    NA     7    NA  TRUE

1. Using dynamic variables with dplyr

  1. Using quo/quos: Does not accept strings as input. The solution using this method would be:

    two_names2 <- quos(c(jack, jill))
    stackdf %>% rowwise %>% mutate(test = anyNA(!!! two_names2))
    

    Note that quo takes a single argument, and thus is unquoted using !!, and for multiple arguments you can use quos and !!! respectively. This is not desirable because I do not use two_names and instead have to type out the columns I wish to use.

  2. Using as.name or rlang::sym/rlang::syms: as.name and sym take only a single string as input, whereas syms takes several and returns a list of symbols as output.

    > two_names
    [1] "jack" "jill"
    > as.name(two_names)
    jack
    > syms(two_names)
    [[1]]
    jack
    
    [[2]]
    jill
    

    Note that as.name ignores everything after the first element. However, syms appears to work appropriately here, so now we need to use this within the mutate call.
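
The difference between as.name and syms can also be checked programmatically. The following is a minimal sketch, assuming the rlang package is installed; the object names first_only and all_syms are made up for illustration.

```r
library(rlang)

two_names <- c("jack", "jill")

# as.name() keeps only the first string and drops the rest
first_only <- as.name(two_names)

# syms() builds one symbol per string and returns them in a list
all_syms <- syms(two_names)

print(first_only)
print(length(all_syms))
print(vapply(all_syms, is.symbol, logical(1)))
```

In this sketch, syms(two_names) behaves like lapply(two_names, as.name), which is why it can be spliced with !!! as a set of column references.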

2. Using dynamic variables within mutate using anyNA or other functions

  1. Using syms and anyNA directly does not actually produce the correct result.

    > stackdf %>% rowwise %>% mutate(test = anyNA(!!! syms(two_names)))
        jack  jill  jane  test
       <dbl> <dbl> <dbl> <lgl>
     1     1     1     1 FALSE
     2    NA     2     2  TRUE
     3     2    NA     3 FALSE
     4    NA     3     4  TRUE
     5     3     4     5 FALSE
     6    NA    NA     6  TRUE
     7     4     5    NA FALSE
     8    NA     6    NA  TRUE
     9     5    NA    NA FALSE
    10    NA     7    NA  TRUE
    

    Inspection of the test column shows that this takes only the first element into account and ignores the second. However, if I use a different function, e.g. sum or paste0, it is clear that both elements are being used:

    > stackdf %>% rowwise %>% mutate(test = sum(!!! syms(two_names), 
                                                na.rm = TRUE))
        jack  jill  jane  test
       <dbl> <dbl> <dbl> <dbl>
     1     1     1     1     2
     2    NA     2     2     2
     3     2    NA     3     2
     4    NA     3     4     3
     5     3     4     5     7
     6    NA    NA     6     0
     7     4     5    NA     9
     8    NA     6    NA     6
     9     5    NA    NA     5
    10    NA     7    NA     7
    

    The reason for this becomes clear when you look at the arguments for anyNA vs sum.

    function (x, recursive = FALSE) .Primitive("anyNA")

    function (..., na.rm = FALSE) .Primitive("sum")

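    anyNA expects a single object x, whereas sum can take a variable list of objects (...). When both columns are spliced into anyNA, only jack is matched to x; jill lands in the second argument, recursive, so it is never checked for NA.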
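
The argument lists can also be inspected directly in base R. This is a small sketch with no dplyr involved; x and y are stand-ins for the jack and jill values, and anyNA_args and sum_args are illustrative names.

```r
# args() returns a closure carrying the same formals as the primitive
anyNA_args <- names(formals(args(anyNA)))
sum_args   <- names(formals(args(sum)))

print(anyNA_args)  # first argument is a single object
print(sum_args)    # first argument is ...

# Stand-ins for one pair of column values
x <- c(1, 2)
y <- c(NA, 3)

# sum() accepts both objects through ...
total <- sum(x, y, na.rm = TRUE)

# anyNA() needs them combined into one object first
has_na <- anyNA(c(x, y))

print(total)
print(has_na)
```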

  2. Simply supplying c() fixes this problem (see answer from alistaire).

    > stackdf %>% rowwise %>% mutate(test = anyNA(c(!!! syms(two_names))))
        jack  jill  jane  test
       <dbl> <dbl> <dbl> <lgl>
     1     1     1     1 FALSE
     2    NA     2     2  TRUE
     3     2    NA     3  TRUE
     4    NA     3     4  TRUE
     5     3     4     5 FALSE
     6    NA    NA     6  TRUE
     7     4     5    NA FALSE
     8    NA     6    NA  TRUE
     9     5    NA    NA  TRUE
    10    NA     7    NA  TRUE
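
To see why wrapping in c() matters, it can help to look at the expression that !!! builds before mutate evaluates it. The sketch below assumes rlang is installed and uses rlang::expr, which supports the same splicing syntax; spliced_bare and spliced_in_c are illustrative names.

```r
library(rlang)

two_names <- c("jack", "jill")

# Splicing straight into anyNA() creates a two-argument call
spliced_bare <- expr(anyNA(!!!syms(two_names)))

# Splicing inside c() creates a one-argument call on a combined vector
spliced_in_c <- expr(anyNA(c(!!!syms(two_names))))

print(spliced_bare)
print(spliced_in_c)
```

The second expression has the same shape as the hand-written call the question set out to replicate.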
    
  3. Alternatively, for educational purposes, one could combine sapply, any, and anyNA to produce the correct result. Here list is used so that the columns are passed to sapply as a single list object.

    # this produces an error because the elements spliced by !!!
    # are being passed to the arguments of sapply (X =, FUN = )
    > stackdf %>% rowwise %>% 
        mutate(test = any(sapply(!!! syms(two_names), anyNA)))
    Error in mutate_impl(.data, dots) : 
      Evaluation error: object 'jill' of mode 'function' was not found.
    

    Supplying list fixes this problem because it binds all the spliced columns into a single object; as.list, shown first for comparison, does not.

    # the below table is the familiar incorrect result that uses only the `jack` column
    > stackdf %>% rowwise %>% 
        mutate(test = any(sapply(X=as.list(!!! syms(two_names)), 
                                 FUN=anyNA)))
    
        jack  jill  jane  test
       <dbl> <dbl> <dbl> <lgl>
     1     1     1     1 FALSE
     2    NA     2     2  TRUE
     3     2    NA     3 FALSE
     4    NA     3     4  TRUE
     5     3     4     5 FALSE
     6    NA    NA     6  TRUE
     7     4     5    NA FALSE
     8    NA     6    NA  TRUE
     9     5    NA    NA FALSE
    10    NA     7    NA  TRUE
    
    # this produces the correct answer
    > stackdf %>% rowwise %>% 
        mutate(test = any(sapply(X = list(!!! syms(two_names)), 
                                 FUN = anyNA)))
    
        jack  jill  jane  test
       <dbl> <dbl> <dbl> <lgl>
     1     1     1     1 FALSE
     2    NA     2     2  TRUE
     3     2    NA     3  TRUE
     4    NA     3     4  TRUE
     5     3     4     5 FALSE
     6    NA    NA     6  TRUE
     7     4     5    NA FALSE
     8    NA     6    NA  TRUE
     9     5    NA    NA  TRUE
    10    NA     7    NA  TRUE
    

    Why these two perform differently makes sense when the behavior of as.list and list is compared:

    > as.list(two_names)
    [[1]]
    [1] "jack"
    
    [[2]]
    [1] "jill"
    
    > list(two_names)
    [[1]]
    [1] "jack" "jill"
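
Inside mutate the two functions receive the spliced columns as separate arguments rather than as one character vector, so the comparison can also be made that way. A base R sketch, where jack and jill are single made-up values standing in for one row:

```r
# One row's worth of values: jack is present, jill is missing
jack <- 2
jill <- NA_real_

# as.list() converts only its first argument; the extra one is ignored
from_as_list <- as.list(jack, jill)

# list() collects every argument into its own element
from_list <- list(jack, jill)

print(length(from_as_list))
print(length(from_list))

# Only the list() version lets anyNA see the missing jill value
print(any(sapply(from_as_list, anyNA)))
print(any(sapply(from_list, anyNA)))
```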
    
Answer

You can use rlang::syms (which is re-exported by dplyr; alternatively, call it directly from rlang) to coerce the strings to symbols:

library(dplyr)

stackdf <- data.frame(jack = c(1,NA,2,NA,3,NA,4,NA,5,NA),
                      jill = c(1,2,NA,3,4,NA,5,6,NA,7),
                      jane = c(1,2,3,4,5,6,NA,NA,NA,NA))
two_names <- c('jack','jill')

stackdf %>% rowwise %>% mutate(test = anyNA(c(!!!syms(two_names))))
#> Source: local data frame [10 x 4]
#> Groups: <by row>
#> 
#> # A tibble: 10 x 4
#>     jack  jill  jane test 
#>    <dbl> <dbl> <dbl> <lgl>
#>  1    1.    1.    1. FALSE
#>  2   NA     2.    2. TRUE 
#>  3    2.   NA     3. TRUE 
#>  4   NA     3.    4. TRUE 
#>  5    3.    4.    5. FALSE
#>  6   NA    NA     6. TRUE 
#>  7    4.    5.   NA  FALSE
#>  8   NA     6.   NA  TRUE 
#>  9    5.   NA    NA  TRUE 
#> 10   NA     7.   NA  TRUE

Alternatively, using a little base R instead of tidy eval:

stackdf %>% mutate(test = rowSums(is.na(.[two_names])) > 0)
#>    jack jill jane  test
#> 1     1    1    1 FALSE
#> 2    NA    2    2  TRUE
#> 3     2   NA    3  TRUE
#> 4    NA    3    4  TRUE
#> 5     3    4    5 FALSE
#> 6    NA   NA    6  TRUE
#> 7     4    5   NA FALSE
#> 8    NA    6   NA  TRUE
#> 9     5   NA   NA  TRUE
#> 10   NA    7   NA  TRUE

This will likely be a lot faster, because iterating rowwise makes n calls (one per row) instead of a single vectorized one.
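
For readers on a newer dplyr, the same vectorized check can probably be written without rowwise or tidy eval. The sketch below is not part of the original answers and assumes dplyr 1.0.4 or later, where if_any and all_of are available; base_flag and result are illustrative names.

```r
library(dplyr)

stackdf <- data.frame(jack = c(1,NA,2,NA,3,NA,4,NA,5,NA),
                      jill = c(1,2,NA,3,4,NA,5,6,NA,7),
                      jane = c(1,2,3,4,5,6,NA,NA,NA,NA))
two_names <- c('jack','jill')

# Base R version: count NAs per row across the named columns
base_flag <- rowSums(is.na(stackdf[two_names])) > 0

# dplyr version (assumes dplyr >= 1.0.4): TRUE where any named column is NA
result <- stackdf %>%
  mutate(test = if_any(all_of(two_names), is.na))

print(result)
```

Both approaches take the character vector two_names as-is, so no symbols or unquoting are needed.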