Celery for Map-Reduce, or other alternatives in Python?

I have expensive jobs that are well suited to the map-reduce model (long story short: aggregating a few hundred rankings that were previously calculated by a time-consuming algorithm).

I want to parallelize the jobs across a cluster (not merely multiprocessing), and have focused on two implementations: Celery and Disco. Celery does not support map-reduce out of the box; the "map" part is easily done using TaskSets, but how do you implement the "reduce" part efficiently?

(My problem with Disco is that it does not run on Windows, and I have already set up Celery for another part of the program, so running a second framework just for map-reduce seems rather inelegant.)
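For what it's worth, Celery later grew a primitive that covers exactly this pattern: chord runs a group of tasks (the map) and passes the list of their results to a callback task (the reduce). A minimal sketch, assuming a Redis broker/result backend and hypothetical task names:

    # tasks.py -- map-reduce via Celery's chord primitive
    from celery import Celery, chord

    app = Celery("rankings",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/0")

    @app.task
    def compute_ranking(item_id):
        # Placeholder for the expensive per-item ranking computation.
        return item_id * 0.5

    @app.task
    def aggregate(scores):
        # The "reduce" step: receives the list of all map results at once.
        return sum(scores)

    if __name__ == "__main__":
        # Fan out a few hundred map tasks, then reduce their results.
        result = chord(compute_ranking.s(i) for i in range(300))(aggregate.s())
        print(result.get())

Note that chord requires a result backend to be configured, since the map results must be stored until the reduce callback fires.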

There are 2 answers below.

Basically you need to take the output of one task and feed it in as input to another task, and Celery is not handy at this.

The Celery way: have a periodic task scheduler that executes the jobs (the map part) asynchronously and keeps the task references, either in the process itself on a single machine, or in a database backend (Redis/Mongo/etc.) otherwise. You would then need a scheduled task to collect those results and apply the reduce function(s); see the sketch below.
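A rough sketch of that scheme, assuming a Redis result backend and hypothetical task names: map tasks store their results in the backend, and a collector task polls the stored task ids, rescheduling itself until everything is ready, then applies the reduce.

    from celery import Celery
    from celery.result import AsyncResult

    app = Celery("rankings",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/0")

    @app.task
    def map_job(chunk):
        # Placeholder for the expensive per-chunk ranking computation.
        return sum(chunk)

    @app.task
    def collect(task_ids):
        results = [AsyncResult(tid, app=app) for tid in task_ids]
        if not all(r.ready() for r in results):
            # Not all map tasks are done; re-poll in 5s rather than block a worker.
            collect.apply_async(args=(task_ids,), countdown=5)
            return None
        # All map results are in the backend: apply the reduce function.
        return sum(r.result for r in results)

    def run(chunks):
        # Fire the map tasks, then hand their ids to the collector.
        ids = [map_job.delay(c).id for c in chunks]
        return collect.delay(ids)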

I would say: run your own Python processes for map and reduce on all the cluster nodes, store the results in an in-memory store like Redis, and use Celery to execute the map and reduce tasks. Your main process then collects and combines the results, roughly as sketched below.
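A minimal sketch of that design, assuming a local Redis and hypothetical names: each Celery map task pushes its partial result onto a Redis list, and the main process block-pops the partials and combines them.

    import redis
    from celery import Celery

    app = Celery("rankings", broker="redis://localhost:6379/0")
    store = redis.Redis(host="localhost", port=6379, db=1)

    @app.task
    def map_task(job_id, chunk):
        partial = sum(chunk)                        # placeholder map computation
        store.rpush(f"results:{job_id}", partial)   # stash the partial result

    def run(job_id, chunks):
        for chunk in chunks:
            map_task.delay(job_id, chunk)
        # Reduce in the main process: block-pop one partial per chunk and combine.
        total = 0
        for _ in chunks:
            _key, value = store.blpop(f"results:{job_id}")
            total += int(value)
        return total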