Passing column name as parameter to a function using dplyr


I have a data frame like the one below:

transid<-c(1,2,3,4,5,6,7,8)
accountid<-c(a,a,b,a,b,b,a,b)
month<-c(1,1,1,2,2,3,3,3)
amount<-c(10,20,30,40,50,60,70,80)
transactions<-data.frame(transid,accountid,month,amount)

I am trying to write a function that computes the total monthly amount for each accountid using dplyr verbs.

my_sum<-function(df,col1,col2,col3){
df %>% group_by_(col1,col2) %>%summarise_(total_sum = sum(col3))
}

my_sum(transactions, "accountid","month","amount")

To get the result like below:

accountid   month  total_sum
a            1       30
a            2       40
a            3       70
b            1       30
b            2       50
b            3       140

Instead I get this error: Error in sum(col3) : invalid 'type' (character) of argument. How can I pass a column name as a parameter, without quotes, to the summarise function?


There are 2 best solutions below


You can pass quosure objects as arguments using quo() and then unquote them inside the function, so that they are evaluated in the context of the data frame. In this example I use !! to unquote:

library(tidyverse)
my_sum<-function(df,col1,col2,col3){
df %>% group_by(!!col1,!!col2) %>%summarise(total_sum = sum(!!col3))
}

my_sum(transactions, quo(accountid),quo(month),quo(amount))
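
If the column names arrive as strings, as in the call from the question, one option is to turn them into symbols before unquoting. The following is only a sketch, assuming dplyr 0.7 or later with rlang installed; the function name my_sum_chr is hypothetical, and the data frame is rebuilt here with quoted account ids so the block runs on its own:

```r
library(dplyr)

# Example data from the question, with the account ids quoted
transactions <- data.frame(
  transid   = 1:8,
  accountid = c("a", "a", "b", "a", "b", "b", "a", "b"),
  month     = c(1, 1, 1, 2, 2, 3, 3, 3),
  amount    = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# my_sum_chr is a hypothetical name; every column is given as a string
my_sum_chr <- function(df, col1, col2, col3) {
  group_cols <- rlang::syms(c(col1, col2))  # strings -> list of symbols
  sum_col    <- rlang::sym(col3)            # string  -> single symbol

  df %>%
    group_by(!!!group_cols) %>%             # splice the grouping symbols
    summarise(total_sum = sum(!!sum_col)) %>%
    ungroup()
}

result_chr <- my_sum_chr(transactions, "accountid", "month", "amount")
```

Here !!! splices the list of grouping symbols into group_by(), while !! unquotes the single symbol inside sum().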

I would suggest the following solution:

my_sum <- function(df, col_to_sum,...) {

    col_to_sum <- enquo(col_to_sum)
    group_by <- quos(...)

    df %>%
        group_by(!!!group_by) %>%
        summarise(total_sum = sum(!!col_to_sum)) %>% 
        ungroup()
}

transactions %>% my_sum(amount, accountid, month)

Results

>> transactions %>% my_sum(amount, accountid, month)
# A tibble: 6 x 3
  accountid month total_sum
     <fctr> <dbl>     <dbl>
1         a     1        30
2         a     2        40
3         a     3        70
4         b     1        30
5         b     2        50
6         b     3       140
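
For readers on newer package versions, the same idea can probably be written more compactly with the embrace operator {{ }}, which combines enquo() and !! in one step. This is a sketch under the assumption that rlang 0.4.0 or later is installed; my_sum_embrace is a hypothetical name, and the data frame is recreated with quoted account ids so the block is self-contained:

```r
library(dplyr)

# Same example data, with the account ids quoted
transactions <- data.frame(
  transid   = 1:8,
  accountid = c("a", "a", "b", "a", "b", "b", "a", "b"),
  month     = c(1, 1, 1, 2, 2, 3, 3, 3),
  amount    = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# my_sum_embrace is a hypothetical name for this variant
my_sum_embrace <- function(df, col_to_sum, ...) {
  df %>%
    group_by(...) %>%                                # dots are forwarded as they are
    summarise(total_sum = sum({{ col_to_sum }})) %>% # embrace = enquo() + !!
    ungroup()
}

result_embrace <- transactions %>% my_sum_embrace(amount, accountid, month)
```

The call looks the same as before; only the function body changes.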

Data

In your original question you passed unquoted strings. I worked around that with the Hmisc::Cs function but, on principle, you should surround your strings with ""; unless, of course, you are referring to objects named a, b and so forth. It wasn't clear from the original question.

Used data:

transid <- c(1, 2, 3, 4, 5, 6, 7, 8)
accountid <- Hmisc::Cs(a, a, b, a, b, b, a, b)
month <- c(1, 1, 1, 2, 2, 3, 3, 3)
amount <- c(10, 20, 30, 40, 50, 60, 70, 80)
transactions <- data.frame(transid, accountid, month, amount)

Notes

  • If you look at the "Capturing multiple variables" section of the Programming with dplyr article, you will see that a very similar problem is solved with the quos() function. In effect, your task is a perfect example of how quos() should be used.

  • The ellipsis ... should come at the end, on the assumption that the function will be used to group data by multiple columns. Naturally, if desired, you could pass the columns one by one and enquo() every single column, but using ... is more natural and consistent with the recommended solution discussed in the article mentioned above. Please note that this approach changes the order of arguments in your function call, as ... should come at the end.

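  • If you are using summarise() with a single grouping variable, you don't have to ungroup() your data as I do in my example, because summarise() drops the last grouping level. With several grouping variables the result stays grouped by the remaining ones. For instance, the code: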

    mtcars %>% group_by(am) %>% summarise(mean_disp = mean(disp)) %>% mutate(am = am + 1) 
    

    will work; whereas the code:

    mtcars %>% group_by(am)  %>% mutate(am = am + 1)
    

    will return the expected error:

    Error in mutate_impl(.data, dots) : Column am can't be modified because it's a grouping variable

    You should use ungroup() if you are going to mutate() your original data or do other operations that keep your grouping variable intact. Passing a grouped tibble on may later prove problematic, but I would say it's mostly a matter of taste and of how you order your dplyr workflow. If you and other users of the function remember that the tibble may still carry a grouping variable, then there is no issue; personally, I tend to forget about that, so my preference is to ungroup() the data when I'm not interested in keeping the grouping.
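
To check what grouping a result still carries, group_vars() can be used. The sketch below uses the built-in mtcars data and assumes a dplyr version that provides group_vars() (0.7 or later); the object names are only for illustration:

```r
library(dplyr)

# One grouping variable: summarise() drops it, so nothing is left grouped
one_level <- mtcars %>%
  group_by(am) %>%
  summarise(mean_disp = mean(disp))
group_vars(one_level)            # character(0)

# Two grouping variables: only the last one (cyl) is dropped
two_levels <- mtcars %>%
  group_by(am, cyl) %>%
  summarise(mean_disp = mean(disp))
group_vars(two_levels)           # "am"

# ungroup() removes whatever grouping is left
group_vars(ungroup(two_levels))  # character(0)
```

This is why a function that groups by two columns, such as my_sum() above, returns a tibble still grouped by the first column unless it ends with ungroup().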