doMPI not recognizing other nodes in cluster for R script

  • Using RHEL 7.3
  • Using R 3.3.2
  • Installed Rmpi_0.6-6.tar.gz and doMPI_0.2.1.tar.gz
  • Installed mpich-3.0-3.0.4-10.el7 RPM for x86_64

I created a cluster of three machines (aml1, aml2, aml3). I can run the /examples/cpi example from the mpich installation, and the processes run without issue on all three machines.

I can also run an R script that needs to be run multiple times, as discussed in the doMPI documentation, so the whole script runs on all of the nodes.
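For context, that case launches every R process directly under mpirun, roughly along these lines (the script name and process count here are only placeholders):

mpirun -np 8 --hostfile ~/projects/hosts R --slave -f some_script.R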

My problem is when my R script has code prior to the %dopar% that needs to run once on the master (aml1), while the %dopar% itself should run on the cluster nodes (aml2, aml3). Everything runs only on the master, doMPI reports Size of MPI universe: 0, and it never recognizes aml2 or aml3.

For example:

Run: mpirun -np 1 --hostfile ~/projects/hosts R --no-save -q < example6.R

(and my ~/projects/hosts file is defined to use 8 cores)
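For reference, a hostfile of that sort just lists the nodes and their slot counts; the exact split of the 8 slots below is only illustrative (Open MPI uses the slots= form, while an MPICH hostfile would use host:count):

aml1 slots=2
aml2 slots=3
aml3 slots=3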

example6.R:

library(doMPI)                       #load doMPI library
cl <- startMPIcluster(verbose=TRUE)  #spawn the workers
registerDoMPI(cl)                    #register the cluster as the %dopar% backend
#load data
#clean data
#perform some functions

#let's say I want to have this done in the script and only parallelize this
x <- foreach(seed=c(7, 11, 13), .combine="cbind") %dopar% {
  set.seed(seed)
  rnorm(3)
}
x
closeCluster(cl)
mpi.quit()                           #shut down MPI cleanly

Output of example6.R:

Master processor name: aml1; nodename: aml1
Size of MPI universe: 0
Spawning 2 workers using the command:
  /usr/lib64/R/bin/Rscript /usr/lib64/R/library/doMPI/RMPIworker.R WORKDIR=/home/spark LOGDIR=/home/spark MAXCORES=1 COMM=3 INTERCOMM=4 MTAG=10 WTAG=11 INCLUDEMASTER=TRUE BCAST=TRUE VERBOSE=TRUE
 2 slaves are spawned successfully. 0 failed.

If I define cl <- startMPIcluster(count=34, verbose=TRUE), I still get the following, but at least I can run 34 slaves:

Master processor name: aml1; nodename: aml1
Size of MPI universe: 0
34 slaves are spawned successfully. 0 failed.

How can I troubleshoot this? I would like to run the R script so that the first portion runs once on the master, and the %dopar% then runs on the cluster.

Thanks!!

Update 1

I have since tried running an older version of OpenMPI:

[spark@aml1 ~]$ which mpirun
/opt/openmpi-1.8.8/bin/mpirun

Per @SteveWeston's suggestion, I created the following script and ran it:

[spark@aml1 ~]$ cat sanity_check.R
library(Rmpi)
print(mpi.comm.rank(0))
mpi.quit()

With the following output:

[spark@aml1 ~]$ mpirun -np 3 --hostfile ~/projects/hosts R --slave -f sanity_check.R
FIPS mode initialized
master (rank 0, comm 1) of size 3 is running on: aml1
slave1 (rank 1, comm 1) of size 3 is running on: aml1
slave2 (rank 2, comm 1) of size 3 is running on: aml1
[1] 0

Here it just hangs -- and nothing happens.

1 Answer (accepted)

I've already accepted @SteveWeston's answer, as it helped me better understand my original question.

I commented on his answer that I was still having issues with my R script hanging; the script would run, but it would never finish on its own or close its own cluster, and I had to kill it with Ctrl-C.

I ultimately set up an NFS environment, built and installed openmpi-1.10.5 there, and installed my R libraries there as well. R is installed separately on both machines, but they share the same library in my NFS directory. Previously I had installed and managed everything under root, including the R libraries (I know). I'm not sure if that is what caused the complications, but my issues seem to be resolved.
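For anyone reproducing this, the shared library directory can be picked up by both R installations by pointing R_LIBS (or .libPaths()) at it; the path below is only a placeholder for my actual nfsshare location:

# in ~/.Renviron on each node (placeholder path)
R_LIBS=/home/master/nfsshare/Rlib

# or equivalently at the top of each script
.libPaths("/home/master/nfsshare/Rlib")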

[master@aml1 nfsshare]$ cat sanity_check.R
library(Rmpi)
print(mpi.comm.rank(0))
mpi.quit(save = "no")

[master@aml1 nfsshare]$ mpirun -np 3 --hostfile hosts R --slave -f sanity_check.R
FIPS mode initialized
[1] 1
[1] 0
[1] 2
# no need to ctrl-C here. It no longer hangs