- Using RHEL7.3
- Using R 3.3.2
- Installed Rmpi_0.6-6.tar.gz and doMPI_0.2.1.tar.gz (Rmpi install step sketched just below this list)
- Installed mpich-3.0-3.0.4-10.el7 RPM for x86_64
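For reference, Rmpi has to be built against the MPI installation that mpirun comes from. The install step looked roughly like this; the MPICH include/lib paths below are illustrative, not necessarily the exact ones on my system:

R CMD INSTALL Rmpi_0.6-6.tar.gz \
  --configure-args="--with-Rmpi-type=MPICH2 \
                    --with-Rmpi-include=/usr/include/mpich-x86_64 \
                    --with-Rmpi-libpath=/usr/lib64/mpich/lib"
R CMD INSTALL doMPI_0.2.1.tar.gz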
I created a cluster of three machines (aml1, aml2, aml3). I can run the /examples/cpi example from the mpich installation, and the processes run without issue on all three machines.
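That check was roughly along these lines (the path to the cpi binary is illustrative; it depends on where mpich was built):

mpirun -np 3 --hostfile ~/projects/hosts ./examples/cpi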
I can also run an R script that needs to be run multiple times, as discussed in the doMPI documentation, so that script runs on all of the cluster nodes.
My problem is when my R script has code prior to the %dopar% loop that needs to run once on the master (aml1), with the %dopar% portion running on the cluster nodes (aml2, aml3). In that case everything runs only on the master, doMPI reports Size of MPI universe: 0, and aml2 and aml3 are never recognized.
For example, I run:
mpirun -np 1 --hostfile ~/projects/hosts R --no-save -q < example6.R
(my ~/projects/hosts file is defined to use 8 cores)
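A hostfile along these lines (the per-host slot counts here are just illustrative; mpich's hydra expects host:slots lines, while an OpenMPI hostfile would use e.g. aml1 slots=2):

aml1:2
aml2:3
aml3:3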
example6.R:
library(doMPI)                       # load doMPI (pulls in Rmpi and foreach)
cl <- startMPIcluster(verbose=TRUE)
registerDoMPI(cl)                    # register the cluster as the %dopar% backend
# load data
# clean data
# perform some functions
# let's say I want the above done once in the script and only parallelize this:
x <- foreach(seed=c(7, 11, 13), .combine="cbind") %dopar% {
  set.seed(seed)
  rnorm(3)
}
x
closeCluster(cl)
Output of example6.R:
Master processor name: aml1; nodename: aml1
Size of MPI universe: 0
Spawning 2 workers using the command:
/usr/lib64/R/bin/Rscript /usr/lib64/R/library/doMPI/RMPIworker.R WORKDIR=/home/spark LOGDIR=/home/spark MAXCORES=1 COMM=3 INTERCOMM=4 MTAG=10 WTAG=11 INCLUDEMASTER=TRUE BCAST=TRUE VERBOSE=TRUE
2 slaves are spawned successfully. 0 failed.
If I define cl <- startMPIcluster(count=34, verbose=TRUE)
I still get the same Size of MPI universe: 0, but at least I can run 34 slaves:
Master processor name: aml1; nodename: aml1
Size of MPI universe: 0
34 slaves are spawned successfully. 0 failed.
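As an extra data point, I assume a minimal check of what Rmpi itself sees would look something like this (a sketch, launched the same way as example6.R; the Size of MPI universe line presumably comes from mpi.universe.size()):

library(Rmpi)
print(mpi.universe.size())  # value doMPI appears to report as "Size of MPI universe"
print(mpi.comm.size(0))     # number of ranks mpirun actually started
mpi.quit()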
How can I troubleshoot this? I would like the R script to run the first portion once on the master and then run the %dopar% portion on the cluster.
Thanks!!
Update 1
Since my original post, I tried running with an older version of OpenMPI (1.8.8) instead of MPICH:
[spark@aml1 ~]$ which mpirun
/opt/openmpi-1.8.8/bin/mpirun
Per @SteveWeston's suggestion, I created the following script and ran it:
[spark@aml1 ~]$ cat sanity_check.R
library(Rmpi)            # low-level MPI bindings used by doMPI
print(mpi.comm.rank(0))  # rank of this process in comm 0 (MPI_COMM_WORLD)
mpi.quit()               # finalize MPI and quit R
With the following output:
[spark@aml1 ~]$ mpirun -np 3 --hostfile ~/projects/hosts R --slave -f sanity_check.R
FIPS mode initialized
master (rank 0, comm 1) of size 3 is running on: aml1
slave1 (rank 1, comm 1) of size 3 is running on: aml1
slave2 (rank 2, comm 1) of size 3 is running on: aml1
[1] 0
Here it just hangs and nothing further happens.
I've already accepted @SteveWeston's answer, as it helped me better understand my original question.
I commented on his answer that I was still having issues with my R scripts hanging; the scripts would run, but they would never finish on their own or close their own clusters, and I would have to kill them with ctrl-C.
I ultimately set up an NFS environment, built and installed openmpi-1.10.5 there, and installed my R libraries there as well. R is installed separately on both machines, but they share the same library in my NFS directory. Previously I had installed and managed everything under root, including the R libraries (I know). I'm not sure if that is what caused the complications, but my issues seem to be resolved.
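For anyone replicating this, the shared-library piece comes down to pointing every node's R at the same NFS path, for example via .Renviron (the /nfs/Rlibs path here is a placeholder, not my actual mount):

# ~/.Renviron on every node, so both R installs resolve the same package library
R_LIBS=/nfs/Rlibs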