I've been trying to run Rmpi and snowfall on my university's clusters, but for some reason, no matter how many compute nodes I get allocated, my snowfall initialization keeps running on only one node.
Here's how I'm initializing it:
sfInit(parallel=TRUE, cpus=10, type="MPI")
Any ideas? I'll provide clarification as needed.
To run an Rmpi-based program on a cluster, you need to request multiple nodes using your batch queueing system, and then execute your R script from the job script via a utility such as mpirun/mpiexec. Ideally, the mpirun utility has been built to automatically detect what nodes have been allocated by the batch queueing system; otherwise you will need to use an mpirun argument such as --hostfile to tell it what nodes to use.
In your case, it sounds like you requested multiple nodes, so the problem is probably with the way that the R script is executed. Some people don't realize that they need to use mpirun/mpiexec, and the result is that the script runs on a single node. If you are using mpirun, it may be that your installation of Open MPI wasn't built with support for your batch queueing system. In that case, you would have to create an appropriate hostfile from information provided by your batch queueing system, usually via an environment variable and/or a file.
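If a hostfile does have to be built by hand, the following is a minimal sketch for a Torque/PBS job. It assumes the node file named by PBS_NODEFILE lists one hostname per allocated slot; the function name make_hostfile and the output file name hosts.txt are made up for illustration.

```shell
# Sketch: turn a Torque node file (one hostname per allocated slot)
# into an Open MPI hostfile with "hostname slots=N" entries.
# make_hostfile is a hypothetical helper name, not part of any tool.
make_hostfile() {
    nodefile=$1
    # sort groups identical hostnames, uniq -c counts each group,
    # awk swaps the columns into Open MPI's hostfile syntax.
    sort "$nodefile" | uniq -c | awk '{ print $2 " slots=" $1 }'
}

# Possible use inside a job script (hosts.txt is an arbitrary name):
#   make_hostfile "$PBS_NODEFILE" > hosts.txt
#   mpirun --hostfile hosts.txt -np 1 Rscript par.R
```

Other batch systems expose the allocation differently, so the input side would need adjusting there.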
Here is a typical mpirun command that I use to execute my parallel R scripts from the job script:
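A command of roughly this shape fits the description. This is only a sketch: the script name par.R, the use of Rscript, and the run_parallel_r wrapper with its MPIRUN override are assumptions added for illustration, not something the setup requires.

```shell
# Sketch: start exactly one R process under MPI; the R script then
# spawns its own workers through snow/snowfall.
# MPIRUN is a hypothetical override, e.g. MPIRUN=echo for a dry run.
run_parallel_r() {
    script=$1
    "${MPIRUN:-mpirun}" -np 1 Rscript "$script"
}

# In the job script (par.R is a placeholder script name):
#   run_parallel_r par.R
```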
Since we build Open MPI with support for Torque, I don't use the --hostfile option: mpirun figures out what nodes to use from the PBS_NODEFILE environment variable automatically. The use of -np 1 may seem strange, but is needed if your program is going to spawn workers, which is typically done when using the snow package. I've never used snowfall, but after looking over the source code, it appears to me that sfInit always calls makeMPIcluster with a "count" argument which will cause snow to spawn workers, so I think that -np 1 is required for MPI clusters with snowfall. Otherwise, mpirun will start your R script on multiple nodes, and each one will spawn 10 workers on its own node, which is not what you want. The trick is to set the sfInit "cpus" argument to a value that is consistent with the number of nodes allocated to your job by the batch queueing system. You may find the Rmpi mpi.universe.size function useful for that.
If you think that all of this is done correctly, the problem may be with the way that the MPI cluster object is being created in your R script, but I suspect that it has to do with the use (or lack of use) of mpirun.
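As an alternative to calling mpi.universe.size from R, the "cpus" value could be worked out in the job script. The sketch below assumes a Torque node file with one line per slot and reserves one slot for the master R process, which is a common convention rather than a requirement. The names worker_count and SNOWFALL_CPUS are hypothetical.

```shell
# Sketch: derive a worker count from the Torque node file.
# Assumption: one line per allocated slot, and one slot is left
# for the master R process started by "mpirun -np 1".
worker_count() {
    nodefile=$1
    # grep -c . counts the non-empty lines, i.e. the allocated slots.
    total=$(grep -c . "$nodefile")
    if [ "$total" -gt 1 ]; then
        echo $((total - 1))
    else
        echo 1
    fi
}

# In the job script (SNOWFALL_CPUS is a made-up variable name):
#   export SNOWFALL_CPUS=$(worker_count "$PBS_NODEFILE")
# and in the R script:
#   sfInit(parallel=TRUE, cpus=as.integer(Sys.getenv("SNOWFALL_CPUS")), type="MPI")
```

This relies on the exported variable reaching the master R process, which should hold when that process starts on the node where mpirun runs, but it is worth checking on your cluster.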