Does the combiner run conditionally?


min.num.spills.for.combine (default 3)

What does it signify?

a) The minimum number of map spills required for the combiner to run? So even though we have specified a combiner, it's not guaranteed to run?

b) The minimum number of spills required before the combiner runs on the single merged and sorted file created according to io.sort.factor? That is, each time a new file is created by merging, the combiner runs on it, provided there are at least 3 spills?

I feel the correct answer is a), but can anyone confirm that?


There are 2 best solutions below


When the map function generates intermediate results, they are first written to an in-memory buffer. Partitioning and sorting start there and, if a combiner is specified, it is invoked at this point. This process runs in parallel with the map function. When the map function finishes, all the spills on disk are merged, and the combiner is invoked at this time too.

Spills are created whenever the buffer fills to the threshold set by io.sort.spill.percent. If the number of spills is at least min.num.spills.for.combine, the combiner is invoked on the spills before the merged output is written to disk.
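To make the threshold rule concrete, here is a toy sketch of the decision described above. It is not Hadoop's code: the function name combiner_runs_on_merge and its parameters are made up for illustration, and it only models the "at least min.num.spills.for.combine spills" check that applies when the spills are merged.

```python
def combiner_runs_on_merge(has_combiner, num_spills, min_spills_for_combine=3):
    """Toy model (hypothetical helper, not a Hadoop API).

    Returns True when a combiner would be applied while the spill
    files are merged: a combiner must be configured AND the number
    of spills must be at least the configured minimum.
    """
    return has_combiner and num_spills >= min_spills_for_combine


# A job with a combiner but only 2 spills: merge step skips the combiner.
few_spills = combiner_runs_on_merge(True, 2)

# Same job with 3 spills (the default minimum): combiner is applied.
enough_spills = combiner_runs_on_merge(True, 3)

# No combiner configured: the number of spills is irrelevant.
no_combiner = combiner_runs_on_merge(False, 10)
```

With the default of 3, a map task that spills only once or twice would merge its output without this extra combine pass, which is why specifying a combiner does not by itself guarantee that it runs at this stage.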

So to answer your question: you are right, it is choice a).

Ref : This mail thread.


I feel the same :)

min.num.spills.for.combine (default 3) signifies that if your job has a combiner and the number of spills is at least 3, the combiner will be called before the map output is written to the local disk.

See this paragraph from the Definitive Guide:

If a combiner function has been specified, and the number of spills is at least three (the value of the min.num.spills.for.combine property), then the combiner is run before the output file is written. Recall that combiners may be run repeatedly over the input without affecting the final result. The point is that running combiners makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
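The remark that combiners may be run repeatedly without affecting the final result can be checked with a small sketch. This is plain Python, not Hadoop code; the combine and reduce_counts functions and the sample map_output are invented for illustration, and the example assumes a sum-style combiner (associative and commutative), as in word count.

```python
from collections import defaultdict


def combine(pairs):
    """Hypothetical sum combiner: collapse (key, count) pairs per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_counts(pairs):
    """Hypothetical reducer: the final per-key sum, returned as a dict."""
    return dict(combine(pairs))


# Sample map output for a word-count style job (made-up data).
map_output = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]

# Reduce with the combiner applied zero, one, or two times beforehand.
no_combine = reduce_counts(map_output)
combined_once = reduce_counts(combine(map_output))
combined_twice = reduce_counts(combine(combine(map_output)))

# The combined output is smaller than the raw map output.
records_before = len(map_output)
records_after = len(combine(map_output))
```

All three reduce results are identical, while the combined output has fewer records than the raw map output. That is the "more compact map output" the quoted paragraph refers to, and it is also why the framework is free to skip the combiner when there are too few spills.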