I use SparkContext to read a sequence file in two ways, as follows:
Method 1:
import org.apache.hadoop.io.BytesWritable

val rdd = sc.sequenceFile(path, classOf[BytesWritable],
  classOf[BytesWritable])
rdd.count()
Method 2:
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat

val rdd = sc.hadoopFile(path,
  classOf[SequenceFileAsBinaryInputFormat],
  classOf[BytesWritable],
  classOf[BytesWritable])
rdd.count()
Method 1 ends up with an EOFException, but Method 2 works. What is the difference between these two methods?
The difference starts with the call each method ends up making. Method 1 (sc.sequenceFile) immediately makes the call hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions), which uses SequenceFileInputFormat[BytesWritable, BytesWritable] as the input format. Method 2 makes the same call, except it of course uses SequenceFileAsBinaryInputFormat.
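To make that concrete, here is a minimal sketch (reusing sc and path from the question) of the hadoopFile call that Method 1 effectively boils down to; the only thing that differs from Method 2 is the input format class:

import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapred.SequenceFileInputFormat

// What Method 1 expands to: the same hadoopFile call as Method 2,
// but with SequenceFileInputFormat instead of SequenceFileAsBinaryInputFormat.
val rdd = sc.hadoopFile(path,
  classOf[SequenceFileInputFormat[BytesWritable, BytesWritable]],
  classOf[BytesWritable],
  classOf[BytesWritable])
rdd.count()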
Even though SequenceFileAsBinaryInputFormat extends SequenceFileInputFormat[BytesWritable, BytesWritable], SequenceFileAsBinaryInputFormat has its own inner class called SequenceFileAsBinaryRecordReader, and although it works similarly to SequenceFileRecordReader[BytesWritable, BytesWritable], there are differences. When you look at the code, the implementations differ; in particular, SequenceFileAsBinaryRecordReader handles compression better. So if your sequence file is record compressed or block compressed, it makes sense that SequenceFileInputFormat[BytesWritable, BytesWritable] does not iterate with the same dependability as SequenceFileAsBinaryInputFormat.

SequenceFileAsBinaryInputFormat, which uses SequenceFileAsBinaryRecordReader (lines 102-115):
https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileAsBinaryInputFormat.java

SequenceFileRecordReader (lines 79-91):
https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileRecordReader.java
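As a side note, a minimal usage sketch (assuming the rdd from Method 2 above): with the binary input format, both key and value arrive as BytesWritable wrapping the raw record bytes, and Hadoop Writable objects are generally reused between records, so copy the bytes out before collecting or caching them:

// Minimal sketch, assuming Method 2's rdd: copy the reused BytesWritable
// contents into fresh byte arrays before handing them around.
val bytesRdd = rdd.map { case (k, v) => (k.copyBytes(), v.copyBytes()) }
bytesRdd.take(1).foreach { case (kb, vb) =>
  println(s"key: ${kb.length} bytes, value: ${vb.length} bytes")
}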