I use SparkContext to read a sequence file in two ways, as follows:
Method 1:
import org.apache.hadoop.io.BytesWritable

val rdd = sc.sequenceFile(path, classOf[BytesWritable],
  classOf[BytesWritable])
rdd.count()
Method 2:
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat

val rdd = sc.hadoopFile(path,
  classOf[SequenceFileAsBinaryInputFormat],
  classOf[BytesWritable],
  classOf[BytesWritable])
rdd.count()
Method 1 ends up with an EOFException, but Method 2 works. What is the difference between these two methods?
The difference starts with the call each method ends up making. Method 1 (sc.sequenceFile) immediately makes the call hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions), which uses SequenceFileInputFormat[BytesWritable, BytesWritable] as the input format. Method 2 makes the same call, except it of course uses SequenceFileAsBinaryInputFormat.
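To make that concrete, here is a minimal sketch (reusing sc and path from the question) of the hadoopFile call that Method 1 effectively boils down to; the only thing that differs from Method 2 is the input format class:

import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapred.SequenceFileInputFormat

// What Method 1 expands to: the same hadoopFile call as Method 2,
// but with SequenceFileInputFormat instead of SequenceFileAsBinaryInputFormat.
val rdd = sc.hadoopFile(path,
  classOf[SequenceFileInputFormat[BytesWritable, BytesWritable]],
  classOf[BytesWritable],
  classOf[BytesWritable])
rdd.count()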
Even though SequenceFileAsBinaryInputFormat extends SequenceFileInputFormat[BytesWritable, BytesWritable], SequenceFileAsBinaryInputFormat has its own inner class called SequenceFileAsBinaryRecordReader, and although it works similarly to SequenceFileRecordReader[BytesWritable, BytesWritable], there are differences. When you look at the code, the implementations differ; in particular, SequenceFileAsBinaryRecordReader handles compression better. So if your sequence file is record compressed or block compressed, it makes sense that SequenceFileInputFormat[BytesWritable, BytesWritable] does not iterate with the same dependability as SequenceFileAsBinaryInputFormat.

SequenceFileAsBinaryInputFormat, which uses SequenceFileAsBinaryRecordReader (lines 102-115):
https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileAsBinaryInputFormat.java

SequenceFileRecordReader (lines 79-91):
https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileRecordReader.java
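As a side note, a minimal usage sketch (assuming the rdd from Method 2 above): with the binary input format, both key and value arrive as BytesWritable wrapping the raw record bytes, and Hadoop Writable objects are generally reused between records, so copy the bytes out before collecting or caching them:

// Minimal sketch, assuming Method 2's rdd: copy the reused BytesWritable
// contents into fresh byte arrays before handing them around.
val bytesRdd = rdd.map { case (k, v) => (k.copyBytes(), v.copyBytes()) }
bytesRdd.take(1).foreach { case (kb, vb) =>
  println(s"key: ${kb.length} bytes, value: ${vb.length} bytes")
}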