Slow function, how can I remove the for loop from it in R


I have a function in R that compares a smaller vector against a larger one, finds where there are matches, and uses those positions to extract rows from a larger data frame.

compare_masses <- function(mass_lst){
  for (i in seq_along(mass_lst)) {
    positions <- which(abs(AB_massLst_numeric - mass_lst[i]) < 0.02)
    rows <- AB_lst[positions, ]
    match_df <- rbind(match_df, rows)
  }
}

where mass_lst is a list of compound masses:

ex: mass_lst <- c(315, 243, 484, 121)

AB_massLst_numeric is the larger list of masses:

ex: AB_massLst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232)

AB_lst is a larger data frame from which I extract data using the positions vector.

match_df is an empty data frame that I rbind the data to.
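For reference, the setup described above can be reproduced with something like the following; the columns of AB_lst are invented here purely for illustration:

```r
# Example data from the question; AB_lst's columns are made up for illustration
mass_lst <- c(315, 243, 484, 121)
AB_massLst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232)

AB_lst <- data.frame(id   = seq_along(AB_massLst_numeric),
                     mass = AB_massLst_numeric)

# Empty data frame with the same columns, to rbind matches onto
match_df <- AB_lst[0, ]
```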

The problem is that this function has a for loop in it and takes a long time, even when I use

test <- sapply(mass_lst, compare_masses)

So my question is: how can I make this function faster, and can I remove the for loop? My real data is much bigger than the examples provided here. I can't think of a way to make this function work without iterating.


There are 4 best solutions below

On BEST ANSWER

Use the vector recycling feature of R. First construct your positions vector of length N*m, where N is the number of rows in AB_lst and m is length(mass_lst). Then select rows from your data frame using this vector.

See complete runnable example below.

compare_masses <- function(mass_lst){
  positions <- c()
  for (i in seq_along(mass_lst)) {
    positions <- c(positions, which(abs(AB_massLst_numeric - mass_lst[i]) < 0.02))
  }
  return(AB_lst[positions, ])
}

mass_lst <- c(375, 243, 676, 121)
AB_massLst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232, 676)

AB_lst <- data.frame(x=1,y=AB_massLst_numeric)
match_df <- compare_masses(mass_lst)
match_df
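As an aside, the same pairwise comparison can be written without any explicit loop using outer(); this is a sketch of an alternative, not part of the answer above:

```r
mass_lst <- c(375, 243, 676, 121)
AB_massLst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232, 676)
AB_lst <- data.frame(x = 1, y = AB_massLst_numeric)

# length(AB_massLst_numeric) x length(mass_lst) logical matrix of pairwise
# "close enough" tests; which(..., arr.ind = TRUE) walks it column by column,
# so the row indices come out grouped by mass, matching the loop's ordering
hits <- which(outer(AB_massLst_numeric, mass_lst,
                    function(a, b) abs(a - b) < 0.02),
              arr.ind = TRUE)
match_df <- AB_lst[hits[, "row"], ]
```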
On

Try wrapping it all in one call and using do.call, so it performs all of the rbind calls together rather than one at a time.

match_df <- do.call(rbind.data.frame, lapply(
    mass_lst, function(x)
        AB_lst[abs(AB_massLst_numeric - x) < 0.02, ]))

In response to comments about the speed of do.call compared to dplyr::bind_rows, I created an AB_lst_numeric with 1,000 values between 0 and 1000, a corresponding AB_lst data frame, and a mass_lst vector with 100 elements. Here are the results of this test using rbenchmark; as you can see, the do.call and bind_rows calls are quite comparable (bind_rows is 36% more efficient, compared to a 110% efficiency gain over the original solution).

benchmark(
  match_df <- compare_masses(mass_lst),
  match_df <- do.call(rbind.data.frame, lapply(
    mass_lst, function(x)
    AB_lst[abs(AB_lst_numeric - x) < 0.02,])),
  match_df <- bind_rows(lapply(
    mass_lst, function(x)
    AB_lst[abs(AB_lst_numeric - x) < 0.02,])))

    ##                                                                                                             test replications elapsed relative user.self sys.self user.child sys.child
    ## 3                match_df <- bind_rows(lapply(mass_lst, function(x) AB_lst[abs(AB_lst_numeric - x) < 0.02, ]))          100   1.453    1.000     1.387    0.059          0         0
    ## 1                                                                            match_df <- compare_masses(mass_lst)          100   3.050    2.099     2.983    0.051          0         0
    ## 2 match_df <- do.call(rbind.data.frame, lapply(mass_lst, function(x) AB_lst[abs(AB_lst_numeric - x) < 0.02, ]))          100   1.974    1.359     1.905    0.060          0         0
On

This should be a vectorized solution.

Write an anonymous function to vectorize; it does the same comparison you do in your loop.

pos = Vectorize(FUN = function(y) {abs(AB_massLst_numeric-y) < 0.02}, vectorize.args = "y")

Find the indices you want to subset with; this step replaces the do.call(rbind, ...) or bind_rows. It should be quick, since it merely does a logical comparison on a matrix of size length(AB_massLst_numeric) x length(mass_lst). This step is needed because I wasn't able to get the vectorized function to work nicely with which.

i = unlist(apply(X = pos(mass_lst), MARGIN = 2, FUN = which))

Subset and store

AB_lst[i,]
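Putting the three steps together, a runnable sketch on the question's example data (with mass values substituted so that matches actually exist, and AB_lst columns invented for illustration) might look like:

```r
AB_massLst_numeric <- c(323, 474, 812, 375, 999, 271, 676, 232)
AB_lst <- data.frame(id = seq_along(AB_massLst_numeric),
                     mass = AB_massLst_numeric)
mass_lst <- c(375, 271, 676)   # chosen so matches exist

# Step 1: vectorized pairwise comparison (one logical column per mass)
pos <- Vectorize(FUN = function(y) abs(AB_massLst_numeric - y) < 0.02,
                 vectorize.args = "y")

# Step 2: column-wise which() turns the logical matrix into row indices
i <- unlist(apply(X = pos(mass_lst), MARGIN = 2, FUN = which))

# Step 3: subset and store
match_df <- AB_lst[i, ]
```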

Edit: use the compare_masses function posted in the accepted answer; it is significantly faster than the other solutions here.

Unit: microseconds
           expr      min       lq      mean   median       uq      max neval  cld
      Vectorize  318.595  327.280  358.9813  355.112  386.892  413.739    10  b  
        do.call 1418.473 1510.853 1569.7161 1578.954 1635.606 1744.173    10    d
      bind_rows  744.570  801.420  813.9346  815.435  836.161  871.297    10   c 
 compare_masses  135.808  138.176  158.0344  158.508  169.365  197.395    10 a  

With an even larger test data set:

Unit: nanoseconds
           expr      min       lq         mean   median       uq       max neval cld
      Vectorize   239242   292341   342314.079   324714   359455   3480844  1000 a  
 compare_masses      395     1975     3674.669     3554     4738     19346  1000 a  
        do.call 16570424 18223007 21092022.254 20921183 22194176 159718470  1000   c
      bind_rows 13423572 14869680 17027330.356 17008639 18061341 116983885  1000  b 
On

You can loop to find the row indices you want, and then select the rows based on that data:

set.seed(1)
DF <- data.frame(x=runif(1e2), y=sample(letters, 1e2, rep=T))
LIST <- list(0, 0.2, 0.4, 0.5)
DF[unlist(lapply(LIST, function(y) which(abs(DF$x - y) < .02))), ]

For our dummy data this produces:

            x y
24 0.01017122 b
70 0.01065314 d
5  0.19193779 e
40 0.21181133 l
65 0.21488963 q
80 0.20122201 q
16 0.39572663 e
23 0.41434742 x
30 0.41330587 t
67 0.40899105 p
73 0.40808877 x
78 0.49894035 o
79 0.49745918 o

Notice how the values we pick are indeed within 0.02 of the target.
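That check can also be done programmatically; a quick sketch on the same dummy data:

```r
set.seed(1)
DF <- data.frame(x = runif(1e2), y = sample(letters, 1e2, replace = TRUE))
LIST <- list(0, 0.2, 0.4, 0.5)

res <- DF[unlist(lapply(LIST, function(y) which(abs(DF$x - y) < .02))), ]

# Every selected x must be within 0.02 of at least one target value
stopifnot(all(sapply(res$x, function(v) min(abs(v - unlist(LIST))) < 0.02)))
```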