I have an outer foreach/dopar parallel loop containing an inner loop. Every instance of the inner loop should work on the same set of random numbers. The rest, i.e. the remaining parts of the outer body and the parallel instances should work as usual, i.e. with independent random numbers.
I can achieve this in a non-parallel implementation by saving the state of the RNG before the start of the inner loop and restoring this state after execution of each instance of the inner loop. See the following example:
library(doSNOW)
seed = 4711
cl = makeCluster(2)
registerDoSNOW(cl)
clusterSetupRNGstream (cl, seed=rep(seed,6))
erg = foreach(irun = 1:3,.combine = rbind) %dopar% {
#do some random stuff in outer loop
smp = runif(1)
# save current state of RNG
s = .Random.seed
# inner loop, does some more random stuff
idx = numeric(5)
for(ii in seq.int(5)) {
idx[ii] = sample.int(10, 1)
# reset RNG for next loop iteration
set.seed(s)
}
c(smp,idx)
}
> print(erg)
[,1] [,2] [,3] [,4] [,5] [,6]
result.1 0.5749162 7 6 2 3 7
result.2 0.1208910 4 3 6 8 9
result.3 0.3491315 7 2 7 6 10
My desired output were constant integers along each row, different from row to row. So this does not work in parallel. The reason is quite clear: snow uses a different random generator (rlecuyer) and has to deal with parallel streams.
The question is: How can I achieve this reset of seed(s) in snow with the random number generator rlecuyer? One of the the challenges for me is to identify the proper substream one is in when the seed should be reset.
My current work-around is to pre-calculate all random stuff (in the example the idx vector) once for the inner loop and then use this constant data in all inner instances. This is not optimal since the random data in total becomes very large and it is much better to (re)generate it on the fly in smaller chunks.
Edit
It is of utmost importance that any setting/resetting keeps the parallel streams independent. As I currently understand the mechanism this requires:
For and within each parallel process:
- Detect the relevant substream the process is working on.
- At convenient times: Find and store the random seed of THIS substream.
- At convenient times: Reset substream from 1) with seed from 2) and leave all other streams undisturbed.
There are a few things worth mentioning here. First of all, you're saving
.Random.seedand then passing it directly toset.seed, which wouldn't give you the desired result even if you weren't having problems with the RNG. The reason for this is that, as documented,set.seed'sseedargument only cares about a single integer, so if you pass a vector, it will still use only the 1st value, and the documentation of.Random.seedstates:So all your parallel streams would end up with the same integers because the first value of
.Random.seedis always the same for the same RNG kind.Secondly, the documentation also states:
Which means you should definitely know what RNG you're using if you're going to be manipulating more internal values. If you call
clusterEvalQ(cl, RNGkind())after your call toclusterSetupRNGstream, you'll see that it returns:So you can't assume that
.Random.seedwill be enough to save the RNG's state. In fact, I'm not even sure it plays nicely withdoSNOW, see this discrepancy:Finally, it's clear from your example that
set.seedis not really working in this scenario. The documentation ofclusterSetupRNGstreamstates that therlecuyerpackage is used, and I don't know enough details about that package to say whether it supportsset.seedor not, but I suppose it doesn't.The only alternative I can give you is one that I've used before that is a bit more verbose:
AFAIK, the state of R's own L'Ecuyer-CMRG is indeed held in
.Random.seed.