Background: I am using the adehabitatHR package to create utilization distribution estimates (UDEs) of wildlife populations.
Problem: How to manage memory with parallel processing. A huge amount of memory is used when I parallelize home range estimation with adehabitatHR across multiple nodes or cores on a cluster.
Question: How is memory managed when using parallel methods on a high-performance cluster to process the analysis in the example provided?
I have a list of UDEs that can be converted into shapefiles using vect() from the terra package. See lines 657-705 of my script for the current workflow.
If I were to do this on my local machine:
spatvec_UDEs <- lapply(UDEs, function(x){vect(x)})
I have thousands of UDEs to calculate and have used the parallel package to calculate the UDEs for some animals (e.g., winter range in one study area from 2002-2021), but I have been unable to get those results saved and exported on the ACENET Siku cluster.
Notes from my work, progress, and a minimal reproducible example are included below. My question is about how memory maps onto my SLURM requests. I am seeking a workable example of parLapply(), mclapply(), or future::plan(multicore) that helps me understand how the memory is being used and gets the parallel function operating on either a single core or multiple cores. I believe that multiple cores are needed in my case to meet the memory requirements of running the adehabitatHR functions on my data.
My understanding is as follows: I call a function to process elements in a list of adehabitatHR estUD objects. The memory requirement for processing gets as large as 2.5G and I am running about 200 tasks in my foreach loop. I am not sure I am using the term "task" correctly, but I think of each iteration through the loop that does a single calculation as a task (e.g., 1 + i would be one task if i = 1, two tasks if i = 2, ...). I believe that one node means I have access to ~720G, which is the memory on a Siku node. If each estUD-creation task needs ~4G, then I should have sufficient room to run 180 tasks (720G / 4G = 180).
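To make that arithmetic concrete, this is roughly the bookkeeping I have in mind (the 4G-per-task and 720G figures are my own estimates, and SLURM_CPUS_PER_TASK is the environment variable SLURM sets from --cpus-per-task):

# Rough worker/memory arithmetic (my estimates, not measured limits)
mem_per_task_gb <- 4      # peak memory of one estUD/BRB calculation
node_mem_gb     <- 720    # memory I believe is available on one Siku node
max_by_memory   <- floor(node_mem_gb / mem_per_task_gb)   # ~180 simultaneous tasks

# Cores SLURM actually granted this job (from --cpus-per-task)
cores_granted <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))

# A worker count that respects both limits
n_workers <- min(max_by_memory, cores_granted)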
For example, does R have to be initialized on each node in interactive mode, or does requesting two nodes give me shared memory (e.g., 720G x 2 nodes = 1440G)? I am confused about distributed versus shared memory and how each behaves with SLURM for interactive and bash (sbatch) job submissions on the Siku HPC.
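For reference, this is how I currently picture the two memory models in R (node names and worker counts below are placeholders, not a tested setup):

library(parallel)

# Shared memory: forked workers (mclapply, or a FORK cluster) live on ONE node
# and share that node's RAM via copy-on-write; two nodes never pool into 1440G.
cl_fork <- makeCluster(n_workers, type = "FORK")
stopCluster(cl_fork)

# Distributed memory: to use the RAM of a second node, workers must be separate
# R processes started on each node (e.g., a PSOCK or MPI cluster). Each worker
# then holds its own copy of whatever data it is sent; memory is not pooled.
# cl_psock <- makeCluster(c(rep("node001", 10), rep("node002", 10)), type = "PSOCK")
# stopCluster(cl_psock)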
I was using interactive job submissions to learn, and here are my notes:
#BRB1: Results in a weird, infinitely repeating "Selection" screen for
#mclapply()
salloc --time=3:00:0 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=23875M
#BRB2: Elapsed 4619 s (1h16); not a valid cluster for parLapply()
salloc --time=3:00:0 --ntasks=1 --cpus-per-task=16 --mem-per-cpu=11937M
#BRB3: Elapsed 4700 s (slower than before). In this case N = 16, where N is the
#number of (OpenMP) parallel threads set by --cpus-per-task. Adding cl = cl in
#parLapply() failed.
salloc --time=3:00:0 --nodes=1 --ntasks=1 --cpus-per-task=16 --mem-per-cpu=11937M
#BRB4: Elapsed 4262.420 s; fastest so far. parLapply() failed.
salloc --time=3:00:0 --nodes=1 --cpus-per-task=20 --mem-per-cpu=11937M
#BRB5: Elapsed 4565.692 s, so the extra mem-per-cpu did not make it faster.
#Attempted:
BRBs_BM_WIv <- slurm_map(BRBs_BM_WI,            # rslurm::slurm_map()
                         function(x) {vect(x)},
                         nodes = 1,
                         cpus_per_node = 1)
#Result:
"sbatch: error: Batch job submission failed: Requested partition
configuration not available now."
salloc --time=3:00:0 --nodes=1 --cpus-per-task=20 --mem-per-cpu=18500M
This is a minimal reproducible example where I attempt to save the results as .Rds files:
#BRB6:
salloc --time=3:00:0 --nodes=2 --cpus-per-task=20 --mem-per-cpu=18500M
#For BRB6, using the minimal code example. The example dataset ships with
#the adehabitat packages and is used here:
library('adehabitatHR')   # provides BRB() and attaches its adehabitat dependencies
library('foreach')
library('doParallel')     # %dopar% needs a registered backend, or it runs sequentially
registerDoParallel(cores = as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "1")))

data(puechcirc)
# One single-burst ltraj object per list element
pc1 <- puechcirc[1]
pc2 <- puechcirc[2]
pc3 <- puechcirc[3]
Traj_li <- list(pc1, pc2, pc3)
DLik <- c(2.1, 2.2, 4)    # diffusion coefficient D for each trajectory

system.time({
  BRBs_PC <- foreach(i = seq_along(Traj_li),
                     .combine = c,
                     .packages = c("adehabitatHR", "adehabitatLT", "terra")) %dopar% {
    BRB(Traj_li[[i]][1],
        D = DLik[i],
        Tmax = 1500 * 60,
        Lmin = 2,
        hmin = 20,
        type = "UD",
        grid = 4000)
  }
})
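For comparison, this is the kind of future-based version of the same loop that I am trying to get working (a sketch assuming the future and future.apply packages are installed; plan(multicore) forks the session, so it only uses the cores and memory of the single node the job lands on):

library(future)
library(future.apply)

# Fork-based workers on the current node; do not exceed --cpus-per-task
plan(multicore, workers = as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "1")))

BRBs_PC_future <- future_lapply(seq_along(Traj_li), function(i) {
  BRB(Traj_li[[i]][1],
      D = DLik[i],
      Tmax = 1500 * 60,
      Lmin = 2,
      hmin = 20,
      type = "UD",
      grid = 4000)
})

plan(sequential)   # shut the workers down when finished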
library(rslurm)    # slurm_map() submits the map as a SLURM job
library(here)

thenames <- c("pn1", "pn2", "pn3")
# saveRDS() takes a single file path, so pair each estUD with its own name
WRds <- function(p) {saveRDS(p$ud,
                             paste0(here("BRB_UDs"), "/", p$name, ".Rds"))
}
UD_pairs <- Map(function(ud, nm) list(ud = ud, name = nm), BRBs_PC, thenames)
BRBs_PC_saved <- slurm_map(UD_pairs, f = WRds, nodes = 2, cpus_per_node = 1)
The minimal reproducible BRB6 result (2 nodes):
sbatch: error: Batch job submission failed: Requested partition configuration not available now
Error in strsplit(sys_out, " ")[[1]] : subscript out of bounds
In addition: Warning message:
In system("sbatch submit.sh", intern = TRUE) :
running command 'sbatch submit.sh' had status 1
These are some of the things I have tried as I read and learn about parallel processing on a cluster:
# Attempt 1: fork-based mclapply() (WV is the shapefile-writing function defined below)
mclapply(X = BRBs_BM_WIv, FUN = WV, mc.cores = n.cores)

# Attempt 2: FORK cluster + parLapply() to convert each estUD to a SpatVector
my.cluster <- parallel::makeCluster(n.cores, type = "FORK")
doParallel::registerDoParallel(cl = my.cluster)   # only needed for foreach/%dopar%
WVec <- function(x) {vect(x)}
BRBs_BM_WIv <- parLapply(cl = my.cluster,
                         BRBs_BM_WI,
                         fun = WVec)    # note: parLapply's argument is fun, not FUN
stopCluster(my.cluster)
# Attempt 3: FORK cluster + writeVector() to export the SpatVectors as shapefiles
my.cluster <- parallel::makeCluster(n.cores, type = "FORK")
doParallel::registerDoParallel(cl = my.cluster)   # only needed for foreach/%dopar%
# writeVector() expects a single filename per call; thenames is a whole vector
# here, which is part of the naming struggle described below
WV <- function(x) {writeVector(x,
                               paste0(here("BRB_UDs"), "/",
                                      thenames, ".shp"),
                               overwrite = TRUE)}
parLapply(cl = my.cluster, X = BRBs_BM_WIv, fun = WV)
stopCluster(my.cluster)
I have been reading extensively on this topic (for example), but have been unable to find specific examples that can help me with the issues I am facing.
I received some help and have a partial answer. One part of the confusion is that "CPUs" in SLURM terminology refers to cores, so --cpus-per-task = N actually means N cores. Ensuring that I worked on a single node in my SLURM request, and using the minimal example code in my question, produced the first result I was after.
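A sketch of the kind of single-node setup I mean (the request values are illustrative, not my exact numbers, and the FORK cluster keeps everything on that one node):

# Illustrative single-node request (not my exact values):
# salloc --time=3:00:0 --nodes=1 --ntasks=1 --cpus-per-task=20 --mem-per-cpu=18500M

library(parallel)
library(here)

n.cores <- future::availableCores()       # respects the SLURM allocation
cl      <- makeCluster(n.cores, type = "FORK")

# Save each estUD from the minimal example under its own name
parLapply(cl, seq_along(BRBs_PC), function(i) {
  saveRDS(BRBs_PC[[i]], paste0(here("BRB_UDs"), "/", thenames[i], ".Rds"))
})
stopCluster(cl)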
Note that I have a vector called thenames that was created to address a naming struggle I have had with the adehabitatHR UDest or BRBest objects (see here). Also note that I used future::availableCores() to get a measure of the cores available to the job, which is not the same as parallel::detectCores().

The next step in my process takes the calculated object, estimates the home range size, and converts the result into other types of spatial objects. I was able to use the foreach approach above, but found that the future package approach (see here and here) was faster.

The remaining issue is that this works on my minimal example code, but not on my actual data. I run into memory issues when calculating getverticeshr.estUD() or getvolumeUD(), which either produces a killed output or a memory-related error. My understanding of --mem=0 is that it reserves all available memory (see here). I do not know how to solve the memory issues and think that I might need to put this onto multiple nodes to achieve the functionality I require. It would be helpful to have additional guidance on how to more dynamically allocate memory and processing tasks as needed to run these types of analyses for wildlife studies.
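I have not solved this yet, but the direction I am exploring, in case it helps frame an answer, is to cap the number of simultaneous workers using a per-task memory estimate and to write each result to disk inside the worker instead of returning everything to the master R session. A sketch (the 10G-per-task figure is a guess for my real data, and it assumes thenames has one entry per element of the UDEs list):

library(future)
library(future.apply)
library(adehabitatHR)
library(terra)
library(here)

# Cap workers so that (workers x per-task memory) stays under the node's RAM
mem_per_task_gb <- 10     # guessed peak for getvolumeUD()/getverticeshr()
node_mem_gb     <- 720
workers <- min(future::availableCores(),
               floor(node_mem_gb / mem_per_task_gb))
plan(multicore, workers = workers)

# Write each home range to disk inside the worker; return nothing large
invisible(future_lapply(seq_along(UDEs), function(i) {
  hr <- getverticeshr(UDEs[[i]], percent = 95)   # SpatialPolygonsDataFrame
  writeVector(vect(hr),
              paste0(here("BRB_UDs"), "/", thenames[i], ".shp"),
              overwrite = TRUE)
  rm(hr); gc()
  NULL
}))
plan(sequential)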