How to select top rows in Hadoop?


I am reading a 138 MB file from Hadoop and trying to assign a sequence number to each record. This is the approach I followed.

I read the entire file using Cascading and tagged each record with the current slice number and the current record counter. (A slice in Cascading is the same as an input split in MapReduce.) I expected this to run in parallel, one mapper per block, so that each record gets a unique sequence number plus a slice number determined by the block it sits in: block 0 of the file should go to mapper 0 and get slice number '0', and block 1 should go to mapper 1 and get slice number '1'. I also expected far more records with slice number '0' than with slice number '1', since block 0 would be 128 MB and block 1 only 10 MB.

But when I look at the output, both sets contain almost the same number of input records, i.e. the records are distributed evenly between the 2 mappers.

I can also see that the first record of the file was read by mapper 1 instead of mapper 0.

Could you please help me understand why the records are being distributed evenly between the mappers?
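To make the expectation above easier to check, here is a minimal sketch of the split-size arithmetic, written from memory of how the older `org.apache.hadoop.mapred.FileInputFormat` sizes its splits. Several things here are assumptions and not taken from my job: that Cascading goes through that older API, that the requested number of map tasks is 2, that the block size is 128 MB, and that the minimum split size is 1 byte. The class `SplitSketch` and the method `splitLengths` are hypothetical names used only for illustration; this is not Hadoop's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Assumed slack factor: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Returns the byte length of each split for a single file (hypothetical helper).
    static List<Long> splitLengths(long fileLength, long blockSize,
                                   int requestedSplits, long minSize) {
        // Target size per split if the file were divided by the requested map count.
        long goalSize = fileLength / Math.max(requestedSplits, 1);
        // A split is never larger than a block and never smaller than minSize.
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));

        List<Long> lengths = new ArrayList<>();
        long remaining = fileLength;
        // Cut full-size splits while clearly more than one split's worth remains.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        // Whatever is left becomes the final split.
        if (remaining != 0) {
            lengths.add(remaining);
        }
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 138 MB file, 128 MB blocks, 2 requested map tasks.
        System.out.println(splitLengths(138 * mb, 128 * mb, 2, 1));
        // Same file with only 1 requested map task.
        System.out.println(splitLengths(138 * mb, 128 * mb, 1, 1));
    }
}
```

Under these assumptions, two requested map tasks give a goal size of 69 MB, so the file is cut into two 69 MB splits instead of 128 MB + 10 MB, which would match the even record counts I am seeing. With one requested map task the whole 138 MB file ends up in a single split, because the 10 MB tail falls within the 10% slack. If the sketch is right, split boundaries follow the computed split size, not the HDFS block boundaries.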
