How is input data distributed across nodes for EMR [using MRJob]?


I'm looking into using Yelp's MRJob to compute on Amazon's Elastic MapReduce (EMR). I will need to read and write a large amount of data during this computationally intensive job. Each node should only get a part of the data, and I'm confused about how this is done. Currently, my data is in MongoDB, stored on a persistent EBS volume.

When using EMR, how is the data distributed across the nodes? How should one tell MRJob which key to partition the data on? The MRJob EMR documentation leaves this splitting step implicit: if you open a file or a connection to an S3 key-value store, how does it divide the keys? Does it assume that the input is a sequence and partition it automatically on that basis?

Perhaps someone can explain how input data is propagated to nodes using the MRJob wordcount example. In that example, the input is a text file -- is it copied to all nodes, or read serially by one node and distributed in pieces?
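For reference, the wordcount example is roughly the following (a minimal sketch based on the MRJob documentation). The relevant point is that each call to the mapper receives a single line of the input, and Hadoop decides which split of lines goes to which node:

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Hadoop splits the input file into chunks and hands each mapper
        # one line at a time from its chunk; the whole file is not copied
        # to every node.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # All values emitted under the same key are routed to the same
        # reducer, regardless of which mapper/node produced them.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
```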


1 Answer


That example assumes you are working with text files. I'm not sure whether you can pass in a parameter to use the MongoDB Hadoop driver instead.
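(Not part of the original answer, just an illustrative sketch: one workaround, if the driver isn't an option, is to export the collection to line-delimited JSON and feed that to the job as plain text input. The database, collection, and file names below are hypothetical.)

```python
# Sketch: dump a MongoDB collection to line-delimited JSON so it can be
# used as ordinary text input for an MRJob/EMR job. Names are made up.
import json
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client['mydb']['mycollection']

with open('input.jsonl', 'w') as out:
    for doc in collection.find():
        doc['_id'] = str(doc['_id'])  # ObjectId isn't JSON-serializable
        out.write(json.dumps(doc) + '\n')

# Upload input.jsonl to S3 (e.g. with `aws s3 cp`) and pass the s3:// path
# to the job; EMR will then split it line by line like any other text file.
```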

What are you trying to do here? I'm working on the MongoDB Hadoop driver and am looking for examples and test cases.