Disco/MapReduce: Using chain_reader on split data

490 Views Asked by At

My algorithm currently uses nr_reduces 1 because I need to ensure that the data for a given key is aggregated.

To pass input to the next iteration, one should use "chain_reader". However, the results from a mapper are as a single result list, and it appears this means that the next map iteration takes place as a single mapper! Is there a way to split the results to trigger multiple mappers?

1

There are 1 best solutions below

0
On

I could give a long answer but since this question is 3 years old: check out this page: http://discoproject.org/doc/disco/howto/dataflow.html#single-partition-map

In short: When there is N input for the mapper function, the output will be N and by setting merge_partitions=False your reduce will output N blobs. Now if you want to generate more outputs than inputs you can pass partions=N. But when your disco job consists of just a mapper function and you want to generate partitioned output, then add the simplest reduce fase combined with the params stated above to get that partitioned output.

@staticmethod
def reduce(iter, out, params):
    for (key, value) in iter:
        out.add(key, value)