Multiprocessing and rpy2 (with ape)

I ran into this today and can't figure out why. I have several functions chained together that perform some time consuming operations as part of a larger pipeline. I've included these here, pared down to a test example, as best as I could. The issue is that when I call a function directly, I get the expected output (e.g., 5 different trees). However, when I call the same function in a multiprocessing pool with apply_async (or apply, doesn't matter), I get 5 trees, but they are all the same.

I've documented this in an IPython notebook, which can be viewed here: http://nbviewer.ipython.org/gist/cfriedline/0e275d528ff1a8d674c6

In cell 91, I create 5 trees (each with 10 tips) and return two lists. The first contains the trees created without multiprocessing, and the second contains the trees created with apply_async.

In cell 92, you can see the results of creating trees without multiprocessing, and in 93, with multiprocessing.

What I expect is that there would be a total of 10 different trees between the two tests, but instead all of the multiprocessing trees are identical. Makes little sense to me.

Relevant versions of things:

  • Linux 2.6.18-238.12.1.el5 x86_64 GNU/Linux
  • Python 2.7.6 :: Anaconda 1.9.2 (64-bit)
  • IPython 2.0.0
  • Rpy2 2.3.9

Thanks! Chris

There are 2 best solutions below

I'm not 100% familiar with these libraries. However, on Linux, (IIRC) multiprocessing uses os.fork. This means that the state of the random module (which you're using) will also be forked, so each of your processes will generate the same sequence of random numbers, resulting in a not-so-random _get_random_string function.

If I'm right, and you make the pool smaller than the number of trees that you want, you should see that you get groups of N identical trees (where N is the number of worker processes in the pool).
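The grouping effect can be illustrated without starting any real processes. The sketch below is only a simulation, not the question's pipeline: it copies one generator state into several stand-in "workers" and assumes tasks are handed out round-robin, which a real Pool does not guarantee. The name simulate_forked_workers is hypothetical.

```python
import random


def simulate_forked_workers(num_workers, num_tasks, seed=12345):
    """Mimic workers that all inherit the same RNG state from a parent."""
    parent = random.Random(seed)
    inherited_state = parent.getstate()

    # Each simulated worker starts from an identical copy of the parent's state.
    workers = []
    for _ in range(num_workers):
        worker_rng = random.Random()
        worker_rng.setstate(inherited_state)
        workers.append(worker_rng)

    # Assumption: tasks are dealt out round-robin across the workers.
    return [workers[i % num_workers].randint(0, 10**9) for i in range(num_tasks)]


results = simulate_forked_workers(num_workers=2, num_tasks=6)
```

With two workers and six tasks, results comes out in consecutive pairs of equal values, because both workers walk through the same random sequence.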

I think that probably the ideal solution is to re-seed the random number generator inside of each of the processes. It's unlikely that they'll run at exactly the same time, so you should get differing results.
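A minimal sketch of re-seeding inside each process, using a Pool initializer so it runs once per worker. The names reseed_worker, random_label and labels_from_pool are hypothetical, and random_label is only a stand-in for the question's _get_random_string. Some Python versions already re-seed the random module in forked children, so treat this as the general pattern rather than a required fix; it does nothing for generators that live outside Python's random module.

```python
import multiprocessing
import os
import random
import time


def reseed_worker():
    """Pool initializer: give the current process its own seed."""
    # Mixing in the PID keeps workers that start in the same millisecond apart.
    random.seed(os.getpid() * 1000003 + int(time.time() * 1000))


def random_label(length):
    """Hypothetical stand-in for the question's _get_random_string helper."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(random.choice(letters) for _ in range(length))


def labels_from_pool(count, workers=2):
    """Build `count` labels in a pool whose workers are re-seeded on startup."""
    pool = multiprocessing.Pool(processes=workers, initializer=reseed_worker)
    try:
        return pool.map(random_label, [20] * count)
    finally:
        pool.close()
        pool.join()
```

Calling labels_from_pool(5) from a script (under an if __name__ == "__main__": guard) should then return five labels that differ from one another.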

I solved this one, with a pointer in the right direction from @mgilson. It was indeed a random number problem, just not in Python: it was in R (sigh). R's state is copied into each worker when the Pool is created, and its random seed is copied along with it. The fix is a little rpy2, shown below, that calls R's set.seed function (with some process-specific input for good measure):

import os
import time

import rpy2.robjects


def create_tree(num_tips, type):
    """
    creates the taxa tree in R
    @param num_tips: number of taxa to create
    @param type: type for naming (e.g., 'taxa')
    @return: a dendropy Tree
    @rtype: dendropy.Tree
    """
    r = rpy2.robjects.r
    # Re-seed R in this process; forked workers otherwise share the parent's seed.
    set_seed = r('set.seed')
    # Mix the clock with the PID and keep the result inside R's 32-bit integer range.
    set_seed(int(time.time() + os.getpid() * 1000) % (2**31 - 1))
    rpy2.robjects.globalenv['numtips'] = num_tips
    rpy2.robjects.globalenv['treetype'] = type
    # _get_random_string and ape_to_dendropy are helpers defined elsewhere in the pipeline.
    name = _get_random_string(20)
    # Only type "T" produces a rooted tree.
    rooted = "T" if type == "T" else "F"
    r("%s = rtree(numtips, rooted=%s, tip.label=paste(treetype, seq(1:(numtips)), sep=''))" % (name, rooted))
    tree = r[name]
    return ape_to_dendropy(tree)
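As a variation on the same idea, R could be seeded once per worker when the pool starts, instead of on every call. This is a sketch under assumptions: make_r_seed and seed_r_in_worker are hypothetical names, the modulo is there only to keep the value inside R's 32-bit integer range, and rpy2 is imported lazily so the seed helper can be tried without R installed.

```python
import os
import time

R_MAX_INT = 2**31 - 1  # largest value of R's integer type


def make_r_seed(pid=None, now=None):
    """Combine the clock and the PID into a seed that fits an R integer."""
    pid = os.getpid() if pid is None else pid
    now = time.time() if now is None else now
    return (int(now) + pid * 1000) % R_MAX_INT


def seed_r_in_worker():
    """Pool initializer: call R's set.seed once in the current process."""
    import rpy2.robjects as robjects  # lazy import; requires rpy2 and R

    robjects.r('set.seed')(make_r_seed())
```

It would be wired in with multiprocessing.Pool(initializer=seed_r_in_worker), so every worker seeds its own copy of R before it builds any trees.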