applying a function with multiple arguments over multiple paired variables in R


I have a function like this, which I'm using to clean data and which works correctly.

my_fun <- function (x, y){
    y <- ifelse(str_detect(x, "-*\\d+\\.*\\d*"), 
        as.numeric(str_extract(x, "-*\\d+\\.*\\d*")),
        as.numeric(y))
}

It takes numbers that have been entered in the wrong column and reassigns them to the correct column. It is used as follows to clean the y variable:

df$y <- my_fun(x, y)
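
As a rough illustration of that behaviour, here is a minimal sketch on made-up values. Note that `clean_pair` is a hypothetical base-R stand-in for `my_fun` (same regular expression, but using `regexpr`/`regmatches` so it runs without stringr), not the exact function above.

```r
# Hypothetical base-R stand-in for my_fun (assumption: same regex, no stringr).
clean_pair <- function(x, y) {
  x <- as.character(x)
  hit <- regexpr("-*\\d+\\.*\\d*", x)           # start of first match, -1 if none
  found <- !is.na(hit) & hit > 0                # TRUE where x contains a number
  out <- as.numeric(y)                          # default: keep the y value
  out[found] <- as.numeric(regmatches(x, hit))  # a number found in x replaces y
  out
}

# Made-up sample values: elements 1 and 3 of x hold a misplaced number
x <- c("something 10.5", "abc", "this -4.5")
y <- c("3", "7", "2")
cleaned <- clean_pair(x, y)
print(cleaned)
```

Elements 1 and 3 take the number found in x, while element 2 keeps its original y value.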

I have many columns/variables (more than 10) that are paired in the same format, something like this:

x_vars <- c("x_1", "x_2", "x_3", "x_4", "x_5", "x_6")
y_vars <- c("y_1", "y_2", "y_3", "y_4", "y_5", "y_6")

My question is: is there a way to apply this function across all the variables in my data set that need to be cleaned in the same way? I can easily do this with lapply when my data cleaning function has only one argument, but I am struggling in this case.

I have tried mapply but could not get it to work. This might be because I'm still quite a novice in R. Any advice would be much appreciated.


Best answer

We can use mapply/Map. Pass 'x_vars' and 'y_vars' as arguments to Map, extract the corresponding columns by name, apply my_fun to each pair of extracted vectors, and assign the result back to the 'y_vars' columns of the original dataset

df[y_vars] <- Map(function(x,y) my_fun(df[,x], df[,y]), x_vars, y_vars)

Or this can be also written as

df[y_vars] <- Map(my_fun, df[x_vars], df[y_vars])
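
To see the pairing end to end, here is a small self-contained sketch of the same Map pattern. The toy data frame and the helper `clean_pair` are assumptions for illustration: `clean_pair` is a hypothetical base-R stand-in for my_fun, so the example runs without stringr.

```r
# Hypothetical base-R stand-in for my_fun (assumption: same regex, no stringr).
clean_pair <- function(x, y) {
  x <- as.character(x)
  hit <- regexpr("-*\\d+\\.*\\d*", x)           # start of first match, -1 if none
  found <- !is.na(hit) & hit > 0
  out <- as.numeric(y)                          # default: keep the y value
  out[found] <- as.numeric(regmatches(x, hit))  # a number found in x replaces y
  out
}

# Made-up data with two x/y pairs
df <- data.frame(
  x_1 = c("a 1.5", "b", "c"),
  x_2 = c("d", "e -2", "f"),
  y_1 = c("9", "8", "7"),
  y_2 = c("6", "5", "4"),
  stringsAsFactors = FALSE
)
x_vars <- c("x_1", "x_2")
y_vars <- c("y_1", "y_2")

# Map walks both column lists in parallel: (x_1, y_1), then (x_2, y_2)
df[y_vars] <- Map(clean_pair, df[x_vars], df[y_vars])
print(df)
```

The column names on the left-hand side are kept, so the cleaned values land in y_1 and y_2 even though Map names its result after the first list.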
 

NOTE: Here, we are assuming that all the elements in 'x_vars' and 'y_vars' are columns in the original dataset. Using Map should also be faster and more efficient than reshaping the data to long format and then converting it back.
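
That assumption can be checked before running Map. The following is only a sketch: `check_pairs` is a hypothetical helper name, and the toy data frame is made up.

```r
# Hypothetical helper: stop early if the name vectors do not line up with df.
check_pairs <- function(df, x_vars, y_vars) {
  if (length(x_vars) != length(y_vars)) {
    stop("x_vars and y_vars must have the same length")
  }
  missing_cols <- setdiff(c(x_vars, y_vars), names(df))
  if (length(missing_cols) > 0) {
    stop("columns not found in df: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Made-up data with a single x/y pair
df <- data.frame(x_1 = c("a 1.5", "b"), y_1 = c("9", "8"),
                 stringsAsFactors = FALSE)

ok <- check_pairs(df, "x_1", "y_1")   # passes silently

# Asking for a pair that does not exist raises an error; capture its message
bad <- tryCatch(
  check_pairs(df, c("x_1", "x_2"), c("y_1", "y_2")),
  error = function(e) conditionMessage(e)
)
print(bad)
```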


To provide a different approach, we can use melt from data.table

library(data.table)
dM <- melt(setDT(df), measure = list(x_vars, y_vars))[,
               value3 := my_fun(value1, value2), variable]

Then we need to dcast it back to 'wide' format, so this approach requires more steps and is not as easy

setnames(dcast(dM, rowid(variable)~variable, 
  value.var = c("value1", "value3"))[,variable := NULL][], c(x_vars, y_vars))[]

data

set.seed(24)
df <- as.data.frame(matrix(sample(c(1:5, "something 10.5",
   "this -4.5", "what -5.2 value?"),
          12*10, replace=TRUE), ncol=12, dimnames = 
     list(NULL, c(x_vars, y_vars))), stringsAsFactors=FALSE)

Because I always think it's good to know how to do this stuff in base R, here are examples of how to use mapply() and lapply().

## first define the cleaning function
my_fun <- function (x, y){
    ifelse(stringr::str_detect(x, "-*\\d+\\.*\\d*"),
        as.numeric(stringr::str_extract(x, "-*\\d+\\.*\\d*")),
        as.numeric(y))
}
## then generate some data: 12 numeric columns, 3 rows
df <- data.frame(replicate(12, rnorm(3)))
## overwrite three of the x columns with letters
df[, sample(1:6, 3)] <- letters[1:3]
## not function of interest, but good mapply() example
names(df) <- c(
               mapply(paste0, rep("x_", 6), 1:6),
               mapply(paste0, rep("y_", 6), 1:6))

## print data with problem variables (cols with letters)
#df
#         x_1 x_2 x_3 x_4        x_5        x_6       y_1
#1 -0.2184993   a   a   a -0.1587070 0.37795630 0.6162796
#2  0.8511775   b   b   b  0.5743287 0.15291219 1.0594502
#3  0.8183208   c   c   c  1.8923812 0.07156925 0.8613535
#         y_2        y_3        y_4       y_5        y_6
#1  0.3240393 -1.1084067  0.5233168 0.3712705 -0.3911407
#2  0.3044824 -0.2286032 -1.0019870 1.2156441  0.4010163
#3 -1.0920677  1.3408504  1.3339865 0.3270800 -0.8416253



## if you wrote a for loop, it'd look like this maybe
out <- vector("list", 6)
for (i in seq_len(6)) {
    out[[i]] <- my_fun(df[, i], df[, i + 6])
}

## same construction can be used with lapply
dfy <- lapply(seq_len(6), function(i)
    my_fun(df[, 1:6][[i]],
           df[, 7:12][[i]]))
## bind the six result vectors into a matrix with one row per data row
matrix(unlist(dfy), nrow(df), 6)
#           [,1]       [,2]       [,3]       [,4]       [,5]
#[1,] -0.2184993  0.3240393 -1.1084067  0.5233168 -0.1587070
#[2,]  0.8511775  0.3044824 -0.2286032 -1.0019870  0.5743287
#[3,]  0.8183208 -1.0920677  1.3408504  1.3339865  1.8923812
#           [,6]
#[1,] 0.37795630
#[2,] 0.15291219
#[3,] 0.07156925

## and mapply makes this even easier
mapply(my_fun, df[, 1:6], df[, 7:12])
#            x_1        x_2        x_3        x_4        x_5
#[1,] -0.2184993  0.3240393 -1.1084067  0.5233168 -0.1587070
#[2,]  0.8511775  0.3044824 -0.2286032 -1.0019870  0.5743287
#[3,]  0.8183208 -1.0920677  1.3408504  1.3339865  1.8923812
#            x_6
#[1,] 0.37795630
#[2,] 0.15291219
#[3,] 0.07156925
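
Note that mapply() simplifies its result to a matrix by default. If the goal is to write the cleaned values back into the y columns, one option is to pass SIMPLIFY = FALSE, which returns a list (this is what Map() does). A minimal sketch on made-up data, where `clean_pair` is a hypothetical base-R stand-in for my_fun so the example runs without stringr:

```r
# Hypothetical base-R stand-in for my_fun (assumption: same regex, no stringr).
clean_pair <- function(x, y) {
  x <- as.character(x)
  hit <- regexpr("-*\\d+\\.*\\d*", x)           # start of first match, -1 if none
  found <- !is.na(hit) & hit > 0
  out <- as.numeric(y)                          # default: keep the y value
  out[found] <- as.numeric(regmatches(x, hit))  # a number found in x replaces y
  out
}

# Made-up data with two x/y pairs
df <- data.frame(
  x_1 = c("a 1.5", "b", "c"),
  x_2 = c("d", "e -2", "f"),
  y_1 = c("9", "8", "7"),
  y_2 = c("6", "5", "4"),
  stringsAsFactors = FALSE
)
x_vars <- c("x_1", "x_2")
y_vars <- c("y_1", "y_2")

# Default: the per-pair vectors are simplified into a 3 x 2 matrix
res_matrix <- mapply(clean_pair, df[x_vars], df[y_vars])

# SIMPLIFY = FALSE keeps one list element per pair, ready for assignment
res_list <- mapply(clean_pair, df[x_vars], df[y_vars], SIMPLIFY = FALSE)
df[y_vars] <- res_list
print(df)
```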