Using disk.frame, but still reaching memory limit issue


Problem:

I am trying to run a correlation test on a large dataset. The data.table itself fits in memory, but operating on it with Hmisc::rcorr() or corrr::correlate() eventually hits the memory limit:

> Error: cannot allocate vector of size 1.1 Gb

So I switched to the file-backed disk.frame package to get around this, but I am still hitting the memory limit.

Any advice on how to use disk.frame or another package dealing with big memory to achieve this is much appreciated.

Both rcorr() and correlate() take and operate on the whole dataset. The dataset contains NA values, hence my need to use these functions as they allow handling of missing values with "pairwise.complete.obs".
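To make the "pairwise.complete.obs" requirement concrete, here is a minimal sketch in base R on a made-up toy data frame (the data and variable names are assumptions for illustration only). With pairwise deletion each pair of columns uses every row where both values are present, whereas "complete.obs" first drops every row that has an NA anywhere.

```r
# Toy data (hypothetical): one NA in column a, one NA in column b
x <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, NA, 8, 10),
  c = c(5, 3, 4, 1, 2)
)

# Pairwise deletion: each pair of columns keeps its own set of usable rows
cor_pairwise <- cor(x, use = "pairwise.complete.obs", method = "pearson")

# Listwise deletion: only rows without any NA are used for every pair
cor_complete <- cor(x, use = "complete.obs", method = "pearson")

# Number of rows actually available for each pair of columns
observed <- (!is.na(as.matrix(x))) + 0
n_pairwise <- crossprod(observed)
```

Here the pair (a, c) is computed from four rows under pairwise deletion but from only three rows under listwise deletion, so the two coefficients differ.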

Attempts:

# Packages ----
library(corrr)
library(Hmisc)
library(disk.frame)
library(data.table)


# Initialise parallel processing backend
setup_disk.frame()

# Enable large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)


# test_DT is a data.table of ~18000 columns and ~800 rows
# of type `num` (`double`) 


# Create filebacked disk.frame ----
test_DT_df <- as.disk.frame(
  test_DT, 
  outdir = file.path(tempdir(), "test_tmp.df"),
  nchunks = recommend_nchunks(test_DT, conservatism = 4),
  overwrite = TRUE
)


# `Hmisc` correlation test by chunks ----
# DOES NOT WORK (memory limit issue)
test_cor <- write_disk.frame(
  cmap(
    .x = test_DT_df,
    .f = function(.x) {
      Hmisc::rcorr(
        x = as.matrix(.x),
        type = "pearson"
      )
    }
  ),
  overwrite = TRUE
)
# Bring into R (above code fails before this line is reached)
test_cor_collect <- disk.frame::collect(test_cor)


# `corrr` correlation test by chunks ----
# DOES NOT WORK (memory limit issue)
test_cor <- write_disk.frame(
  cmap(
    .x = test_DT_df,
    .f = function(.x) {
      corrr::correlate(
        x = .x,
        use = "pairwise.complete.obs",
        method = "pearson"
      )
    }
  ),
  overwrite = TRUE
)
# Bring into R (above code fails before this line is reached)
test_cor_collect <- disk.frame::collect(test_cor)


# Cleanup ----
delete(test_DT_df)
delete(test_cor)
rm(test_DT_df, test_cor, test_cor_collect)
gc()
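
A rough size estimate may explain why chunking does not help here. It assumes the dimensions given in the comments above (~18000 columns, ~800 rows) and 8 bytes per double. As far as I understand, disk.frame chunks split the rows, so every chunk still carries all ~18000 columns and each call inside cmap() still has to build a full 18000 x 18000 result.

```r
# Back-of-the-envelope memory estimate (dimensions are approximate)
n_rows <- 800
n_cols <- 18000
bytes_per_double <- 8

# The input data is small ...
input_gib <- n_rows * n_cols * bytes_per_double / 1024^3

# ... but a single dense n_cols x n_cols matrix of doubles is not
matrix_gib <- n_cols^2 * bytes_per_double / 1024^3

# Number of distinct variable pairs (upper triangle, no diagonal)
n_pairs <- n_cols * (n_cols - 1) / 2

round(c(input_gib = input_gib, matrix_gib = matrix_gib), 3)
n_pairs
```

So the input is roughly 0.1 GiB, while one dense correlation matrix alone is about 2.4 GiB, before counting any additional matrices or intermediate copies the functions create.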
Answer by F. Privé:

An answer to explain my comment "Then you can loop over all the pairwise variables and store the result in the on-disk matrix.":

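# On-disk (file-backed) matrix to hold the 4 x 4 correlation matrix
res <- bigstatsr::FBM(4, 4)
for (j in seq_len(4)) {
  # Fill each pair (i, j) with i < j once, mirroring it across the diagonal
  for (i in seq_len(j - 1)) {
    corr <- Hmisc::rcorr(iris[[j]], iris[[i]])
    res[i, j] <- res[j, i] <- corr$r[1, 2]
  }
  res[j, j] <- 1  # a variable correlates perfectly with itself
}
res[]  # read the whole matrix back into memory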
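
With ~18000 columns the pair-by-pair loop above would need roughly 162 million calls, so a possible variation is to work on blocks of columns instead. The following is only a sketch under assumptions: `block_cor` and `block_size` are hypothetical names, it uses base `stats::cor()` (which returns the coefficients only, not the p-values and counts that `Hmisc::rcorr()` also provides), and a plain in-memory matrix stands in for the on-disk matrix so that it runs without extra packages. An object that supports `res[i, j] <- value`, such as the FBM above, should in principle be usable as `res` instead.

```r
# Block-wise Pearson correlation with pairwise deletion (base R only).
# `block_cor` and `block_size` are hypothetical names for this sketch.
block_cor <- function(X, block_size = 500L, res = NULL) {
  X <- as.matrix(X)
  p <- ncol(X)
  # Plain matrix as a stand-in for an on-disk result matrix
  if (is.null(res)) res <- matrix(NA_real_, p, p)
  starts <- seq(1L, p, by = block_size)
  for (a in seq_along(starts)) {
    ia <- starts[a]:min(starts[a] + block_size - 1L, p)
    # Only blocks on or above the diagonal; the rest is filled by symmetry
    for (b in a:length(starts)) {
      ib <- starts[b]:min(starts[b] + block_size - 1L, p)
      blk <- stats::cor(
        X[, ia, drop = FALSE], X[, ib, drop = FALSE],
        use = "pairwise.complete.obs", method = "pearson"
      )
      res[ia, ib] <- blk
      res[ib, ia] <- t(blk)
    }
  }
  res
}

# Small usage example on simulated data with missing values
set.seed(1)
X <- matrix(rnorm(50 * 5), nrow = 50)
X[sample(length(X), 20)] <- NA
res_block <- block_cor(X, block_size = 2L)
```

Each call only ever materialises one `block_size` x `block_size` piece of the result, so the block size is the knob that trades memory per step against the number of calls.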