How can I parallelize scaling a matrix (using the Seurat package) across multiple computing nodes?


I am working on scaling and clustering a matrix of single-nucleus RNA sequencing data (genes x cells) using the R package Seurat. My data is large, containing 11,500 genes and ~1.5 million cells. Due to the size of the data, the fastest way to scale the matrix would be to parallelize over multiple nodes (each containing 40 cores). I am computing on the Niagara cluster and can request as many cores as needed. My problem is that I can't figure out a way to effectively parallelize my code. I tried using the future package (which is recommended by Seurat), but that confines my data to one node, which is not enough. I also tried Rmpi; however, that seemed to assign the same task to all the spawned workers, namely scaling the whole matrix, which took too long. I have read about future.batchtools, but I haven't been able to figure out the syntax.

I'll include the code I used for Rmpi and future.batchtools. I would appreciate any troubleshooting/alternative strategies to try.

Rmpi:

library(Rmpi)
library(Seurat)

Seuratdata<-readRDS("/path/seuratobject.RDS")
mpi.universe.size()
mpi.spawn.Rslaves(nslaves=60)
mpi.bcast.cmd( id <- mpi.comm.rank() )
mpi.bcast.cmd( np <- mpi.comm.size() )
mpi.bcast.cmd( host <- mpi.get.processor.name() )

# Scale all genes of the object passed in
myfunc <- function(data){
  all.genes <- rownames(x = data)
  ScaleData(data, features = all.genes)
}
Seuratdata<-mpi.remote.exec(cmd=myfunc, data=Seuratdata)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")
mpi.close.Rslaves()
mpi.exit()
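
For reference, my understanding is that mpi.remote.exec() runs the same command on every slave, which would explain each worker scaling the whole matrix. Below is an untested sketch of what I think a split-by-genes version would look like, since scaling is per-gene, so gene chunks can be scaled independently and row-bound back together (the "RNA" assay name and the use of mpi.parLapply() to hand one chunk per slave are my assumptions):

library(Rmpi)
library(Seurat)

mpi.spawn.Rslaves(nslaves=60)

Seuratdata<-readRDS("/path/seuratobject.RDS")
# Pull the normalized matrix (genes x cells) out of the Seurat object
mat<-GetAssayData(Seuratdata, assay="RNA", slot="data")

# One row-chunk of genes per slave; each chunk can be scaled on its own
gene.chunks<-split(rownames(mat), cut(seq_len(nrow(mat)), mpi.comm.size()-1, labels=FALSE))
mat.chunks<-lapply(gene.chunks, function(g) mat[g, , drop=FALSE])

# Distribute one chunk per slave and collect the scaled pieces back on the master
scaled.chunks<-mpi.parLapply(mat.chunks, function(m) Seurat::ScaleData(m, features=rownames(m)))
scaled<-do.call(rbind, scaled.chunks)
saveRDS(scaled, file="scaled_matrix.rds")

mpi.close.Rslaves()
mpi.exit()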

future.batchtools:

library(future)
library(future.batchtools)
library(Seurat)

plan(tweak(batchtools_slurm, workers = 80,
           resources = list(ncpus = 1, memory = 10*1024^3,
                            walltime = 10*60*60, partition = 'batch'),
           template = "./slurm.tmpl"))
Seuratdata <- readRDS("/path/seuratobject.RDS")
all.genes <- rownames(x = Seuratdata)
Seuratdata <- ScaleData(Seuratdata, features = all.genes)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")

1 Answer

If you've got SSH permission between compute nodes, then you can submit a main job to the scheduler:

$ sbatch --partition=batch --ntasks=100 --time=10:00:00 --mem=10G script.sh

which then runs your script.R, e.g. Rscript script.R, which looks like:

library(future)
plan(cluster)
...

This will spin up 100 PSOCK cluster workers on whatever compute nodes Slurm has allocated to the job. This works because plan(cluster) defaults to plan(cluster, workers = availableWorkers()), and availableWorkers() picks up the information in SLURM_JOB_NODELIST set by Slurm. You can add:

print(parallelly::availableWorkers())

at the top to log which compute nodes the workers are launched on.
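
For reference, a fuller sketch of script.R, reusing the object path and Seurat calls from your question, could look like the following (the future.globals.maxSize value is a guess that you will likely need to tune for a ~1.5M-cell object):

library(future)
library(Seurat)

# One PSOCK worker per Slurm task, spread over the allocated nodes
plan(cluster)
print(parallelly::availableWorkers())

# Seurat sends ScaleData() and friends through the future framework;
# the export limit below is an assumption -- raise it until the object fits
options(future.globals.maxSize = 50 * 1024^3)

Seuratdata <- readRDS("/path/seuratobject.RDS")
all.genes <- rownames(x = Seuratdata)
Seuratdata <- ScaleData(Seuratdata, features = all.genes)
saveRDS(Seuratdata, file = "scaled_Seuratdata.rds")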

However, there are two limitations:

  1. plan(cluster) requires SSH access to the hosts in order to spin up the parallel workers on those hosts.
  2. R has a maximum of 125 workers this way, cf. https://github.com/HenrikBengtsson/Wishlist-for-R/issues/28 (see the nested-plan sketch after this list for one way to keep the connection count down).
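
On that second point, a hedged way to keep the connection count down is a nested plan: one cluster worker per node at the outer level and forked multicore workers inside each node. Note that a single-level call such as Seurat's ScaleData() will only use the outer level, so this mainly helps if you split the work yourself with nested futures:

library(future)

# Outer level: one PSOCK worker per allocated node (connection count == number of nodes)
# Inner level: fork within each node to use its cores
nodes <- unique(parallelly::availableWorkers())
plan(list(
  tweak(cluster, workers = nodes),
  multicore
))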