How to improve the speed of rowbinds when the size of the final dataframe is unknown


I want to rowbind multiple dataframes, each with a varying number of rows. I understand that using a for loop in which something like a final_df is overwritten in each iteration is very slow, as R has to copy the growing dataframe on every alteration. Usually this problem can be solved by pre-allocating a dataframe with the correct number of rows and columns and then modifying it in place in each iteration of the loop. However, in my situation this is a bit trickier, as each individual dataframe may have a different number of rows than the previous one. (In my actual code I am dealing with a long list of XML files, from which I am extracting certain bits of information. Depending on the file, I can end up with more rows or fewer rows.)

My attempt so far has been to use dplyr::bind_rows() or data.table::rbindlist(), which seem to perform similarly well and both outclass do.call("rbind", ...) by a lot. However, I notice that even with these approaches, the computation time still grows nonlinearly as I increase the number of dataframes.

Do you have suggestions on how I can further improve the speed of my code? Thanks a lot in advance!

create_df <- function() {
  nrow <- sample(12:15, 1)
  ncol <- 10
  randomdf <- matrix(rnorm(nrow*ncol), nrow=nrow, ncol=ncol) |> data.frame()
  return(randomdf)
}

approach1 <- function(n) {
  final_df <<- matrix(ncol = 10, nrow = 0)  # ncol hard-coded to match create_df(), which always returns 10 columns
  for(i in 1:n) {
    current_df <- create_df()
    final_df <<- rbind(final_df, current_df)
  }
}

approach2 <- function(n) {
  df_list <<- vector("list", n)
  for(i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- do.call("rbind", df_list)
}

approach3 <- function(n) {
  df_list <<- vector("list", n)
  for(i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- dplyr::bind_rows(df_list)
}

approach4 <- function(n) {
  df_list <<- vector("list", n)
  for(i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- data.table::rbindlist(df_list)
}


microbenchmark::microbenchmark(
  approach1(5),
  approach2(5),
  approach3(5),
  approach4(5),
  approach1(50),
  approach2(50),
  approach3(50),
  approach4(50),
  approach1(500),
  approach2(500),
  approach3(500),
  approach4(500),
  times = 10
  )
Unit: microseconds
           expr      min       lq      mean    median       uq      max neval
   approach1(5)   1173.5   1201.1   1317.12   1285.30   1402.2   1557.0    10
   approach2(5)    771.6    781.8   1121.18    829.15    944.6   3573.1    10
   approach3(5)    543.7    613.4    966.10    672.15    952.4   3131.8    10
   approach4(5)    520.8    586.5    641.18    621.65    663.8    818.8    10
  approach1(50)  12186.9  12381.4  13932.40  12760.10  14518.8  18537.4    10
  approach2(50)   6497.6   6766.0   7160.26   6967.55   7230.3   8390.6    10
  approach3(50)   3681.3   4143.1   4258.44   4233.10   4347.8   5022.8    10
  approach4(50)   3806.7   3821.8   4166.71   3962.95   4190.6   5900.4    10
 approach1(500) 275530.0 285779.1 326732.16 294302.30 304461.0 622130.3    10
 approach2(500)  65243.8  67456.7  72789.76  74422.30  77063.0  79485.0    10
 approach3(500)  38600.0  39328.4  41372.67  41215.80  42345.2  47488.8    10
 approach4(500)  32496.5  36788.1  41160.35  39940.10  46043.2  49752.9    10
2 Answers

Accepted answer by jblood94

approach3 and approach4 are spending most of their time in create_df, so you're not getting a good idea of the speed of the binding operation. Better to time just the binding:

library(dplyr)
library(data.table)

create_df <- function(n) {  # n just absorbs the index passed by lapply below
  nrow <- sample(12:15, 1)
  ncol <- 10
  randomdf <- matrix(rnorm(nrow*ncol), nrow=nrow, ncol=ncol) |> data.frame()
  return(randomdf)
}

df_list <- lapply(c(5, 50, 500), \(n) lapply(1:n, create_df))

approach2 <- function(i) do.call("rbind", df_list[[i]])
approach3 <- function(i) bind_rows(df_list[[i]])
approach4 <- function(i) rbindlist(df_list[[i]])
approach5 <- function(i) rbindlist(df_list[[i]], use.names = FALSE)

microbenchmark::microbenchmark(
  approach2(1),
  approach3(1),
  approach4(1),
  approach5(1),
  approach2(2),
  approach3(2),
  approach4(2),
  approach5(2),
  approach2(3),
  approach3(3),
  approach4(3),
  approach5(3)
)
#> Unit: microseconds
#>          expr     min       lq      mean   median       uq     max neval
#>  approach2(1)   321.1   360.40   389.968   377.25   406.65   601.5   100
#>  approach3(1)    89.9   118.85   157.806   135.80   191.45   690.2   100
#>  approach4(1)    77.2    89.05   176.894   103.05   161.15  4250.6   100
#>  approach5(1)    61.8    70.10   100.532    94.15   120.60   223.7   100
#>  approach2(2)  3070.4  3228.40  3735.250  3352.30  3574.90  8796.5   100
#>  approach3(2)   348.3   408.35   470.308   440.50   514.70   931.6   100
#>  approach4(2)   136.7   169.65   204.703   189.25   222.40   362.6   100
#>  approach5(2)   111.5   133.85   194.793   150.10   199.50  2957.8   100
#>  approach2(3) 31565.1 34130.30 36182.204 35523.60 36503.40 89033.4   100
#>  approach3(3)  3008.7  3268.30  3785.467  3440.65  3714.85  7923.1   100
#>  approach4(3)   794.4   913.45  1009.823   966.20  1054.20  1692.0   100
#>  approach5(3)   655.8   767.35   870.240   822.45   894.95  2124.1   100

Now it is clear that rbindlist is the fastest for larger lists of tables. If your process is taking a long time, the binding operation probably isn't the first place I would look.
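For example, a quick way to check where the time actually goes is base R's sampling profiler. In this sketch, process_xml_files() is only a placeholder for whatever builds the list of tables in the real code:

Rprof("profile.out")                        # start sampling the call stack
df_list <- process_xml_files()              # placeholder for the expensive extraction step
final_df <- data.table::rbindlist(df_list)  # the binding step itself
Rprof(NULL)                                 # stop profiling
summaryRprof("profile.out")$by.self         # self time per function, largest first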

If you know the table columns all line up, you can squeeze a little more performance out of rbindlist by setting the use.names argument to FALSE.

Answer by sconfluentus

If you have many, many data frames (more than 100-150), rbind_listdf() from the ecospace package is a good option; it is very efficient at that scale.
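A minimal usage sketch, assuming ecospace is installed and that rbind_listdf() accepts a plain list of data frames (check ?rbind_listdf for the exact interface):

# install.packages("ecospace")                       # if not already installed
df_list <- lapply(1:500, function(i) create_df())    # create_df() from the question
final_df <- ecospace::rbind_listdf(df_list)          # bind the whole list at once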

If you are using a loop, then managing the memory within the for loop can also help.

Calling gc() as the final step in each iteration will clear out any large, no-longer-referenced objects before heading back to the next iteration.

If you end up with an intermediate dataframe and can assign it a variable name, instead of creating an ephemeral dataframe that lingers in memory after binding, you can use rm(temp_df) followed by gc() to clear everything unnecessary out.
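A minimal sketch of that pattern, reusing create_df() from the question as a stand-in for the real per-file extraction (the object names are just examples):

final_df <- data.frame()
for (i in 1:500) {
  temp_df <- create_df()               # intermediate result for this iteration
  final_df <- rbind(final_df, temp_df)
  rm(temp_df)                          # remove the named intermediate object
  gc()                                 # free its memory before the next iteration
}

Note that calling gc() on every iteration adds overhead of its own, so this trades some speed for a smaller memory footprint.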

What becomes hard is when the cumulative dataframe gets large enough to soak up most of your operational RAM; then there is not much memory left to perform your binds.

In that case, something to consider is creating a .csv file on disk and then using the loop to go through your dataframes and append their rows to that file: an appending action, as opposed to opening, reading and re-saving the whole thing each time (see the sketch at the end of this answer). I would still use rm() and gc() to clean out memory before the next iteration. This may or may not be faster, because writing is slower than binding, but it does mean you are not carrying the mass of the main dataframe in memory, as long as you garbage collect often.

Once you are done creating the csv, you can clear out your whole workspace and then read the csv back in (or load it into a database) to work on the data.
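A minimal sketch of that append-to-disk pattern, assuming data.table is available for fast appends and reads (base write.table(..., append = TRUE) works similarly); create_df() from the question stands in for the real per-file extraction, and the file name is just an example:

library(data.table)

out_file <- "combined.csv"                 # example output path
for (i in 1:500) {
  temp_df <- create_df()                   # one extracted table per file
  # write the header only on the first pass, then append rows without it
  fwrite(temp_df, out_file, append = (i > 1), col.names = (i == 1))
  rm(temp_df)                              # drop the intermediate table
  gc()                                     # free its memory before the next file
}

final_df <- fread(out_file)                # read the combined result back in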