When processing a text file, how does Hadoop identify records? Is it based on newline characters or full stops?
If I have a text file containing a list of 5000 words, all on a single line and separated by spaces, with no newline characters, commas or full stops, how will RecordReader behave?
e.g. abc pqr xyz lmn qwe rew poio kjkh ascd lkyg ......
You can set the record delimiter in the job configuration with the textinputformat.record.delimiter property. If it isn't supplied, the reader falls back to splitting on one of the following line terminators: '\n' (LF), '\r' (CR), or '\r\n' (CR+LF). So your example line, which contains none of these, will be read as a single record. You can read through the code of LineReader, TextInputFormat and LineRecordReader for more details.
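
To make the difference concrete, here is a minimal stand-alone sketch in plain Java that imitates the splitting rules described above. It is not Hadoop's LineReader: the class RecordSplitSketch and the method splitRecords are hypothetical names made up for illustration, and the sketch ignores input splits, byte-level decoding and buffering. In a real job you would instead set the property on the job's Configuration, for example conf.set("textinputformat.record.delimiter", " ").

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a simplified imitation of the record-splitting rules,
// not the actual Hadoop implementation.
class RecordSplitSketch {

    // Splits text into records. A null or empty delimiter stands for the
    // default behaviour: a record ends at "\n", "\r", or "\r\n".
    static List<String> splitRecords(String text, String delimiter) {
        boolean custom = delimiter != null && !delimiter.isEmpty();
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (custom) {
                if (text.startsWith(delimiter, i)) {
                    // Custom delimiter found: close the current record.
                    records.add(current.toString());
                    current.setLength(0);
                    i += delimiter.length();
                    continue;
                }
            } else {
                char c = text.charAt(i);
                if (c == '\n' || c == '\r') {
                    records.add(current.toString());
                    current.setLength(0);
                    // Treat CR+LF as a single terminator.
                    if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                        i++;
                    }
                    i++;
                    continue;
                }
            }
            current.append(text.charAt(i));
            i++;
        }
        // Emit the trailing record if the text does not end with a delimiter.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String words = "abc pqr xyz lmn qwe";
        // No line terminators, default rules: the whole line is one record.
        System.out.println(splitRecords(words, null).size());   // prints 1
        // Space as the delimiter: one record per word.
        System.out.println(splitRecords(words, " ").size());    // prints 5
        // Mixed LF, CR+LF and CR terminators under the default rules.
        System.out.println(splitRecords("a\nb\r\nc\rd", null)); // prints [a, b, c, d]
    }
}
```

With the default rules the space-separated word list stays a single record, while using a space as the delimiter yields one record per word, which is probably what you want for a word list like yours.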