I have a list of ffdf objects; it would take up about 76 GB of RAM if it were loaded into RAM instead of using the `ff` package. The following are their respective `dim()`:
```r
> ffdfs |> sapply(dim)
         [,1]     [,2]     [,3]      [,4]      [,5]      [,6]      [,7]
[1,] 11478746 12854627 10398332 404567958 490530023 540375993 913792256
[2,]        3        3        3         3         3         3         3
         [,8]     [,9]     [,10]     [,11]    [,12]     [,13]     [,14]
[1,] 15296863 11588739 547337574 306972654 11544523 255644408 556900805
[2,]        3        3         3         3        3         3         3
        [,15]     [,16]    [,17]
[1,] 13409223 900436690 15184264
[2,]        3         3        3
```
I want to check the number of duplicated rows in each ffdf, so I did the following:
```r
check_duplication <- ffdfs |> sapply(function(df) {
    df[c("chr", "pos")] |> duplicated() |> sum()
})
```
It works, but it is extremely slow. I am on an HPC with about 110 GB of RAM and 18 CPUs. Are there any other options or settings I could adjust to speed up the process? Thank you.
Parallelization is a natural way to speed this up. It can be done at C level via `data.table`:
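For example, a benchmark along these lines (a sketch on simulated `chr` and `pos` columns, much smaller than yours, assuming the `bench` package is available):

```r
library(data.table)
setDTthreads(8L) # let data.table use several of your 18 CPUs

set.seed(1L)
n  <- 1e6L
df <- data.frame(chr = sample(22L, n, replace = TRUE),
                 pos = sample(1e8L, n, replace = TRUE))
dt <- as.data.table(df)

bench::mark(
    data.frame = duplicated(df),
    data.table = duplicated(dt)
)
```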
The benchmark here shows that `duplicated` is much faster when applied to a `data.table` instead of an equivalent data frame. Of course, how much faster depends on the number of CPUs that you make available to `data.table` (see `?setDTthreads`).

If you go the `data.table` route, then you would process your 17 data frames like so:
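A sketch of that loop (I'm assuming, as your `sapply` call does, that the relevant columns of each ffdf can be pulled into RAM, here with `as.data.frame`; adapt that step to however you actually load the data):

```r
library(data.table)
setDTthreads(8L)

check_duplication <- sapply(ffdfs, function(ffd) {
    x <- as.data.frame(ffd[c("chr", "pos")]) # read just these two columns into RAM
    setDT(x)                                 # in-place coercion: data.frame -> data.table
    res <- sum(duplicated(x))
    rm(x)                                    # drop the reference to x ...
    gc(verbose = FALSE)                      # ... and free its memory before the next ffdf
    res
})
```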
Here, we are using `setDT` rather than `as.data.table` to perform an in-place coercion from data frame to `data.table`, and we are using `rm` and `gc` to free the memory occupied by `x` before reading another data frame into memory.

If, for whatever reason,
`data.table` is not an option, then you can stick to using the `duplicated` method for data frames, namely `duplicated.data.frame`. It is not parallelized at C level, so you would need to parallelize at R level, using, e.g., `mclapply` to assign your 17 data frames to batches and process those batches in parallel:
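A sketch, with the same caveat about the loading step; `mc.cores` is the knob discussed below:

```r
library(parallel)

## with the default mc.preschedule = TRUE, the 17 jobs are divided into
## mc.cores batches, and each batch is handled by one forked process
check_duplication <- mclapply(ffdfs, function(ffd) {
    x <- as.data.frame(ffd[c("chr", "pos")])
    sum(duplicated(x))
}, mc.cores = 6L) |> unlist()
```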
This option is slower and consumes more memory than you might expect. Fortunately, there is room for optimization. The rest of this answer highlights some of the main issues and ways to get around them. Feel free to stop reading if you've already settled on `data.table`.

Since you have 18 CPUs, you can try to process all 17 data frames simultaneously, but you might encounter out-of-memory issues as a result of reading all 17 data frames into memory at once. Increasing the batch size (i.e., distributing the 17 jobs across fewer than 17 CPUs) should help.
Since your 17 data frames vary widely in length (number of rows), randomly assigning them to roughly equally sized batches is probably not a good strategy. You could decrease the overall run time by batching shorter data frames together and not batching longer data frames together.
`mclapply` has an `affinity.list` argument giving you this control. Ideally, each batch should require the same amount of processing time.

The amount of memory that each job uses is actually at least two times greater than the amount needed to store the data frame `x`, because `duplicated.data.frame` copies its argument:
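One way to see this on simulated data (assuming the `bench` package; the exact numbers will differ for your data):

```r
x <- data.frame(chr = sample(22L, 1e6L, replace = TRUE),
                pos = sample(1e8L, 1e6L, replace = TRUE))
print(object.size(x), units = "MB")       # memory held by x itself
bench::mark(duplicated(x), iterations = 1L)[c("expression", "mem_alloc")]
```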
The copy happens inside of the `vapply` call in the body of the method (print `duplicated.data.frame` to see the body).
That `vapply` call is completely avoidable: you should already know whether `chr` and `pos` are factors. I would suggest defining a replacement for `duplicated.data.frame` that does only what is necessary given your use case. For example, if you know that `chr` and `pos` are not factors, then you might assign
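For example, something along these lines, which mirrors the row-as-list construction that `duplicated.data.frame` itself uses, minus the factor handling:

```r
duped <- function(x) {
    ## one element per row, each row represented as a list: list(x[[1L]][i], ..., x[[n]][i])
    duplicated(do.call(Map, `names<-`(c(list, x), NULL)))
}
```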
and compute `sum(duped(x))` instead of `sum(duplicated(x))`. In fact, you could do slightly better by replacing `list` with `c`:
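That is, something like:

```r
fastduped <- function(x) {
    ## one element per row, each row represented as an atomic vector: c(x[[1L]][i], ..., x[[n]][i])
    duplicated(do.call(Map, `names<-`(c(c, x), NULL)))
}
```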
Using `c` here causes rows of the data frame `x` to be stored and compared as atomic vectors rather than as lists. In other words, `fastduped(x)` calls `duplicated` on a list of `m` atomic vectors of length `n`, whereas `duped(x)` calls it on a list of `m` lists of length `n`, where `m = nrow(x)` and `n = length(x)`. The latter is slower and consumes more memory, and there is a warning in `?duplicated` saying as much.
Computing `sum(fastduped(x))` instead of `sum(duplicated(x))` should increase the number of data frames that you can process simultaneously without running out of memory. FWIW, here is a benchmark comparing the run times of `duplicated`, `duped`, and `fastduped` (saying nothing about memory usage):
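A sketch on simulated data, assuming the `bench` package and the definitions of `duped` and `fastduped` above; the point is the relative timings, not the absolute numbers:

```r
set.seed(1L)
n <- 1e6L
x <- data.frame(chr = sample(22L, n, replace = TRUE),
                pos = sample(1e8L, n, replace = TRUE))

bench::mark(
    duplicated = duplicated(x),
    duped      = duped(x),
    fastduped  = fastduped(x),
    iterations = 1L
)
```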