min.num.spills.for.combine (default 3)
What does it signify?
a) The minimum number of map spills required for a combiner to run? So even though we have specified a combiner, it's not guaranteed to run?
b) The minimum number of spills required before the combiner runs on the single merged, sorted file created via io.sort.factor? So each time a new file is created by merging, the combiner runs on it, provided there are at least 3 spills?
I feel the correct answer is a), but can anyone confirm that?
When the map function generates intermediate results, they are first sent to a buffer, where partitioning and sorting start; if a combiner is specified, it is invoked at this time. This process runs in parallel with the map function. When the map function finishes, all the spills on disk are merged, and the combiner can be invoked again at this point.
The buffer threshold is set by io.sort.spill.percent; once it is reached, a spill is created. If the number of spills is at least min.num.spills.for.combine, the combiner gets invoked on the spills created, before the merged output is written to disk. So to answer your question: you are right, it is choice a).
Ref : This mail thread.
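To make the rule concrete, here is a minimal Java sketch of the decision as I understand it. It is not Hadoop's actual code: the class and method names are made up for illustration, and it only models the threshold check, on the assumption that the combiner runs on the final merge when the spill count is at least min.num.spills.for.combine.

```java
// Illustrative model only; SpillCombineSketch and its methods are hypothetical
// names, not part of the Hadoop API.
class SpillCombineSketch {
    // Default value of min.num.spills.for.combine, as given in the question.
    static final int DEFAULT_MIN_SPILLS_FOR_COMBINE = 3;

    // While the buffer is being spilled, the combiner runs whenever one is configured.
    static boolean combinerRunsOnSpill(boolean combinerConfigured) {
        return combinerConfigured;
    }

    // On the final merge of the spill files, the combiner runs again only if
    // enough spills exist (assumed here to mean "at least" the threshold).
    static boolean combinerRunsOnFinalMerge(boolean combinerConfigured,
                                            int numSpills,
                                            int minSpillsForCombine) {
        return combinerConfigured && numSpills >= minSpillsForCombine;
    }

    public static void main(String[] args) {
        for (int spills = 1; spills <= 4; spills++) {
            boolean runs = combinerRunsOnFinalMerge(
                    true, spills, DEFAULT_MIN_SPILLS_FOR_COMBINE);
            System.out.println(spills + " spill(s): combiner on final merge = " + runs);
        }
    }
}
```

With the default of 3, a map task that produces only one or two spills would skip the combiner during the final merge in this model, which is why a specified combiner is not guaranteed to run at that stage.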