How to use and understand rstats parallel processing with HPC Acenet Siku cluster, SLURM, futures, rslurm, and lapply() equivalents

Question

How to use and understand rstats parallel processing with HPC Acenet Siku cluster, SLURM, futures, rslurm, and lapply() equivalents

131 Views Asked by Mark Thompson At 13 June 2023 at 18:51

Background: I am using the adehabitatHR package to create utilize distribution estimates (UDEs) of wildlife populations.

Problem: How to manage memory with parallel processing. A huge amount of memory is used when I parallel home range estimations using 'adehabitatHR' on multiple nodes or cluster.

Question: How is memory managed using parallel methods on a high performance cluster to parallel process the analysis in the example provided?

I have a list of UDEs that can be converted into shapefiles using package 'terra' vect(). See lines 657-705 for current workflow.

If I were to do this on my local machine:

spatvec_UDEs <- lapply(UDEs, function(x){vect(x)})

I have thousands of UDEs to calculate and have used the parallel package to calculate the UDEs for some animals (e.g., winter range in one study area from 2002-2021), but have been unable to get that result saved and exported using the Acenet Siku cluster.

Notes from my work, progress, and minimal reproduceable example are included below. My question is about clarification on how the memory is mapping in my SLURM requests. I am seeking a workable example of parLapply(), mclapply(), or future::plan(multicore) that can help me to understand how the memory is being used and how to get the parallel function operating for either a single or multiple core. I believe that multiple cores are needed in my case to gain the necessary memory requirements to run the adehabitatHR functions on my data.

My understanding is as follows, I call a function to process elements in a list of adehabitatHR estUD objects. The memory requirements for processing get as large as 2.5G and I am running about 200 tasks in my foreach loop. I am not sure I have the correct usage of task here, but I think of this a each iteration through the loop that does a single calculation is a task (e.g., 1+i would be one task if i=1, two tasks if i=2...). I believe that one node means that I will have access to ~720G, which is the SIKU storage. If each task of the estUD object creation is ~4G, then I should have sufficient room to run 180 tasks (720G / 4 = 180).

For example, does R have to be initialized on each node in interactive mode or does this result in shared memory (e.g., 720G x 2 nodes = 1440G)? I am confused on the distributed versus shared memory and how it works in the SLURM, interactive, and bash job submissions in the Siku HPC.

I was using interactive job submissions to learn and here are my notes:

#BRB1: Results in weird infinite "Selection" repeating screen for
#mclapply

    salloc --time=3:00:0 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=23875M

#BRB2: Elapsed 4619 (1h16), not valid cluster for parLapply()

    salloc --time=3:00:0 --ntasks=1 --cpus-per-task=16 --mem-per-cpu=11937M

#BRB3: Elapsed 4700 (slower than b4). In this case N=16,  N is the num
#of (OpenMP) parallel threads #on --cpus-per-task. Add cl=cl in
#parLapply(), failed.

    salloc --time=3:00:0 --nodes=1 --ntasks=1 --cpus-per-task=16 --mem-per-cpu=11937M

#BRB4: Elapsed 4262.420 Fastest. Failed parLapply()

    salloc --time=3:00:0 --nodes=1 --cpus-per-task=20 --mem-per-cpu=11937M

#BRB5: Elapsed 4565.692, so extra mem-per-cpu did not make it faster exactly.
#Attempted:
BRBs_BM_WIv <- slurm_map(BRBs_BM_WI,
    function (x) {vect(x)},
                  nodes=1,
                  cpus_per_node = 1)
#Result:
"sbatch: error: Batch job submission failed: Requested partition
configuration not available now."

    salloc --time=3:00:0 --nodes=1 --cpus-per-task=20 --mem-per-cpu=18500M

This is a minimal reproduceable example where I attempt to save as an Rds:

#BRB6: 

salloc --time=3:00:0 --nodes=2 --cpus-per-task=20 --mem-per-cpu=18500M

#For BRB6 using min code example. Example dataset is available
#in the adehabitatHR package and presented here:

library('adehabitatHR')
library('foreach')

data(puechcirc)
pc1 <- puechcirc[1]
pc2 <- puechcirc[2]
pc3 <- puechcirc[3]
Traj_li <- vector("list", 3)
Traj_li[[1]] <- pc1
Traj_li[[2]] <- pc2
Traj_li[[3]] <- pc3
DLik <- c(2.1, 2.2, 4)

system.time({
BRBs_PC <- foreach(i = 1:length(Traj_li),
                      .combine = c,
                      .packages = c("adehabitatHR","adehabitatLT", "terra")) %dopar% {
                          BRB(Traj_li[[i]][1],
                          D = DLik[i],
                          Tmax = 1500*60,
                          Lmin = 2,
                          hmin = 20,
                          type = "UD",
                          grid = 4000)
                      }
    })

thenames <- c("pn1", "pn2", "pn3")
WRds <- function (x) {saveRDS(x,
                        paste0(here("BRB_UDs"), "/",
                        thenames,".Rds"))
                        }
BRBs_BM_WIv <- slurm_map(BRBs_BM_WI, f=WRds, nodes=2, cpus_per_node = 1)

The minimal reproduceable BRB6 result (2 nodes):

sbatch: error: Batch job submission failed: Requested partition configuration not available now

Error in strsplit(sys_out, " ")[[1]] : subscript out of bounds
In addition: Warning message:
In system("sbatch submit.sh", intern = TRUE) :
  running command 'sbatch submit.sh' had status 1

These are some of the types of things I have tried as I read and learn about parallel processing on a cluster:

mclapply(X = BRBs_BM_WIv, FUN=WV, mc.cores = n.cores)

my.cluster <- parallel::makeCluster(
  n.cores, 
  type = "FORK"
  )
doParallel::registerDoParallel(cl = my.cluster)
WVec <- function (x) {vect(x)}
BRBs_BM_WIv <- parLapply(cl=my.cluster,
                         BRBs_BM_WI,
                         FUN = MVec)
stopCluster(my.cluster)


my.cluster <- parallel::makeCluster(
  n.cores, 
  type = "FORK"
  )
doParallel::registerDoParallel(cl = my.cluster)
WV <- function (x) {writeVector(x,
                            paste0(here("BRB_UDs"), "/",
                            thenames,".shp"),
                            overwrite=TRUE)}
parLapply(cl=my.cluster, X = BRBs_BM_WIv, FUN = MV)
stopCluster(my.cluster)

I have been reading extensively on this topic (for example), but have been unable to find specific examples that can help me with the issues I am facing.

Original Q&A

There are 2 best solutions below

**Mark Thompson** · Answer 1 · 2023-06-18T16:08:11.723000

I received some help and have a partial answer. One part of the confusion is that CPUs in SLURM terminology refers to cores, so --cpus-per-task = N is actually N = cores. Ensuring that I worked on a single node produced a result with the following SLURM request:

salloc --time=3:00:0 --nodes=1 --cpus-per-task=40 --mem=0

Using the minimal example code in my question, the following was able to produce the first result I was after:

n.cores <- future::availableCores() - 1
my.cluster <- parallel::makeCluster(
  n.cores, 
  type = "PSOCK"
  )

#register it to be used by %dopar%
doParallel::registerDoParallel(cl = my.cluster)

system.time({
BRBs_BM_WI <- foreach(i = 1:length(Traj_li),
                      .combine = c,
                      .packages = c("adehabitatHR","adehabitatLT", "here")) %dopar% {
                            saveRDS(BRB(Traj_li[[i]][1],
                            D = DLik[i],
                            Tmax = 1500*60,
                            Lmin = 2,
                            hmin = 20,
                            type = "UD",
                            grid = 4000),paste0(here("BRB_UDs"),"/",thenames[i],"toy.Rds"))
                        }})
stopCluster(my.cluster)

Note that I have a vector called thenames that was created do address a naming struggle I have had with the adehabitatHR UDest or BRBest objects (see here). Also note that I used the future function future::availableCores() to get a measure of the cores on the network, which is not the same as parallel::detectCores(). The next step in my process takes the calculated object to estimate the home range size and convert these into other types of spatial objects. I was able to use the foreach approach above, but found that the future package approach (see here and here) was faster:

### future approach, works fastest 311 sec
plan(multicore)
BRBs_BM_WI <- lapply(BRBs_BM_WIf, function(x){readRDS(x)})

system.time(BRBs_BM_WIv <- future_lapply(BRBs_BM_WI, FUN = function(x) {getverticeshr.estUD(x)}))
name_burst <- list()
system.time(for(i in 1:length(thenames)){
    vect(BRBs_BM_WIv[i],paste0(here("BRB_UDs"),"/",thenames[i],"toy_hr.Rds"))
        homerangedf <- as.data.frame(BRBs_BM_WIv[i])
        name_burst[[i]] <- adehabitatLT::burst(Traj_li[[i]][1])
        BRB_area[i,c(1:4)] <- rbind(data.frame(id = name_burst[[i]],
                                                    year = substr(names(Traj_li[i]),7,11),
                                                    area = homerangedf[,2],
                                                    nb.reloc = nrow(Traj_li[[i]][[1]])))
     })


## parallel approach,370.487 Sec
n.cores <- as.vector(future::availableCores()) - 1
my.cluster <- parallel::makeCluster(
  n.cores, 
  type = "PSOCK"
  )

#register it to be used by %dopar%
doParallel::registerDoParallel(cl = my.cluster)

BRBs_BM_WI <- lapply(BRBs_BM_WIf, function(x){readRDS(x)})

system.time({
homerange <- foreach(i = 1:length(BRBs_BM_WI),
                      .combine = c,
                      .packages = c("adehabitatHR","here")) %dopar% {
                      saveRDS(getverticeshr.estUD(BRBs_BM_WI[[i]]),
                              paste0(here("BRB_UDs"),"/",thenames[i],"toy_hr.Rds"))
                        }    
    })
# Stop the parallel backend
stopCluster(my.cluster)

The remaining issue is that this works on my minimal example code, but not my actual data. I run into memory issues to calculate getverticeshr.estUD() or getvolumeUD(), which either produces a killed output or:

Error in mcfork(detached) :
  unable to fork, possible reason: Cannot allocate memory

My understanding from --mem=0 is that this reserves all available memory (see here). I do not know how to solve the memory issues and think that I might need to put this onto multiple nodes to achieve the functionality I require. It would be helpful to have additional guidance on how to more dynamically allocate memory and processing tasks as needed to run these types of analyses for wildlife studies.

**Robert Hijmans** · Answer 2 · 2023-06-18T17:04:18.557000

I am unsure if I understand what you are asking (the question is too long and involved); but it seems that your main question is how to run R in parallel using SLURM. You seem to focus on using multiple CPUs whereas I would first make sure to use multiple nodes.

I do these things like this (there may be better ways):

Create a file bashR.sh with these contents (or something similar that works on your system)

#!/bin/bash -l

module load R
Rscript --vanilla ${1} ${2} ${3}

Write an R script (here test.R) with this general form

fun <- function(x, parameter) {
    print(x)
    print(x * parameter)
    print("done")
}

arg <- as.integer(commandArgs(trailingOnly=TRUE)[1])
i <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))

n <- 10 
if (i <= n) {
    fun(i, arg)
} else {
    print("all done")
}

Where i represents a case that can be processed in parallel (independently from all other cases). You probably do not need argument(s) arg

Now run the R script using sbatch:

sbatch --array=1-10 --time=60 --mem=32192 bashR.sh test.R 42
#Submitted batch job 66585267

You may want to add parameters such as -p (partition) and change the time and mem needs. The array argument is like a for loop iterator (in this example, in R speak, for (i in 1:10)

The output will look something like this; your filename (job number) will of course be different.

cat slurm-66585267_5.out
#==========================================
#SLURM_JOB_ID = 66585272
#SLURM_NODELIST = cpu-11-97
#==========================================
#Unloading openmpi/4.1.5
#Unloading slurm/22.05.8
#Loading slurm/22.05.8
#Loading openmpi/4.1.5
#Loading conda/base/latest <aL>
#
#Loading R/4.2.3
#  Loading requirement: conda/base/latest
#[1] 5
#[1] 210
#[1] "done"

And, at some point, remove the slurm log files

rm slurm-*

If this is what you are looking for, I would implement this with baby steps. First change the function fun to make sure that it identifies the right case, finds the input data, can write some output data.

How to use and understand rstats parallel processing with HPC Acenet Siku cluster, SLURM, futures, rslurm, and lapply() equivalents

There are 2 best solutions below

Related Questions in R

Related Questions in PARALLEL-PROCESSING

Related Questions in TERRA

Related Questions in ADEHABITATHR

Trending Questions

Popular # Hahtags

Popular Questions