From the Apache doc on the Hadoop MapReduce InputFormat Interface:
"[L]ogical splits based on input-size is insufficient for many applications since record boundaries are to be respected. In such cases, the application has to also implement a RecordReader on whom lies the responsibilty to respect record-boundaries and present a record-oriented view of the logical InputSplit to the individual task."
Is the WordCount example application one in which logical splits based on input size are insufficient? If so, where in the source code is the implementation of a RecordReader found?
Input splits are logical references to data. If you look at the API, you can see that it doesn't know anything about record boundaries. A mapper is launched for every input split, and a mapper's map() is run for every record (in a WordCount program, every line in a file). But how does a mapper know where the record boundaries are?
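To make that concrete, here is a minimal sketch of such a mapper (essentially the standard WordCount mapper from the Hadoop tutorial); with the default input format, each call to map() receives exactly one line:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // map() is invoked once per record; with the default TextInputFormat,
    // a record is one line (key = byte offset of the line, value = its text).
    public class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(line.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) for every token
            }
        }
    }

Notice that the mapper itself never inspects split or record boundaries; by the time map() is called, the framework has already handed it one complete record.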
This is where your quote from the Hadoop MapReduce InputFormat Interface comes in:
Every mapper is associated with an InputFormat. That InputFormat has information on which RecordReader to use. Look at the API and you will find that it knows about the input splits and which RecordReader to use. (If you want to know more about input splits and record readers, you should read this answer.) A RecordReader defines what the record boundaries are; the InputFormat defines which RecordReader is used.
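To make that wiring visible, here is a sketch of a custom InputFormat (the class name MyLineInputFormat is made up for illustration, it is not something in Hadoop): createRecordReader() is exactly where an InputFormat declares which RecordReader turns its splits into records.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // A hypothetical InputFormat: FileInputFormat computes the size-based
    // splits, and createRecordReader() names the RecordReader that imposes
    // record boundaries on each split.
    public class MyLineInputFormat extends FileInputFormat<LongWritable, Text> {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new LineRecordReader(); // record boundary = newline
        }
    }

A job would opt in with job.setInputFormatClass(MyLineInputFormat.class).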
The WordCount program does not specify any InputFormat, so it defaults to TextInputFormat, which uses a LineRecordReader and gives out every line as a separate record; the LineRecordReader source is where you will find the RecordReader implementation you asked about.

What this means is that, for an example file such as:

    a b c d e
    f g h i j

where we want every line to be a record, logical splits based on input size could produce two splits such as:

    a b c d e
    f g

and

    h i j

If it wasn't for the RecordReader, it would have considered f g and h i j to be different records; clearly, this isn't what most applications want.

Answering your question: in the WordCount program it does not really matter what the record boundaries are, but there is a possibility that the same word is split across different logical splits. Therefore, logical splits based on size are not sufficient for the WordCount program.
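To see how the reader repairs a size-based cut like the one above, here is a toy, self-contained simulation of the convention LineRecordReader follows (an illustration only, not Hadoop's actual code, which also handles compression, buffering, and custom delimiters): a reader whose split does not start at byte 0 skips its first, possibly partial, line, and every reader runs past the end of its split to finish the line it is in, so each line comes out whole, exactly once.

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class SplitDemo {
        static List<String> readSplit(byte[] file, int start, int length) {
            List<String> records = new ArrayList<>();
            int pos = start;
            if (pos != 0) {
                // Not at the start of the file: skip the (possibly partial)
                // first line; the previous split's reader owns it.
                while (pos < file.length && file[pos - 1] != '\n') {
                    pos++;
                }
            }
            // Emit every line that starts inside this split, reading past
            // the split's end if needed to finish the last one.
            while (pos < file.length && pos <= start + length) {
                int eol = pos;
                while (eol < file.length && file[eol] != '\n') {
                    eol++;
                }
                records.add(new String(file, pos, eol - pos, StandardCharsets.UTF_8));
                pos = eol + 1;
            }
            return records;
        }

        public static void main(String[] args) {
            byte[] file = "a b c d e\nf g h i j\n".getBytes(StandardCharsets.UTF_8);
            // A size-based cut at byte 13 lands in the middle of "f g h i j".
            System.out.println(readSplit(file, 0, 13));  // [a b c d e, f g h i j]
            System.out.println(readSplit(file, 13, 7));  // []
        }
    }

The first split's reader finishes f g h i j even though the line crosses the split's end, and the second split's reader emits nothing, because the only data in its split is the tail of a line the first reader already consumed.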
Every MapReduce program 'respects' record boundaries; otherwise it would not be of much use.