Can I create a SequenceFile using Spark DataFrames?


I have a requirement in which I need to create a SequenceFile. Right now we have a custom API written on top of the Hadoop API, but since we are moving to Spark, we have to achieve the same thing using Spark. Can this be done with Spark DataFrames?

Best answer:

AFAIK, there is no native API for this available directly on DataFrame, apart from the approach below.

Try something like the following (converting the DataFrame to an RDD, inspired by SequenceFileRDDFunctions.scala and its saveAsSequenceFile method), as in the examples below.

From the SequenceFileRDDFunctions scaladoc: "Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile, through an implicit conversion."
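Since the question asks about DataFrames specifically, here is a minimal sketch of that implicit-conversion route; the DataFrame df, its single string column, and the output path are assumptions for illustration:

import org.apache.hadoop.io.NullWritable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df to seqfile").getOrCreate()
import spark.implicits._

// Hypothetical single-column DataFrame standing in for your real data.
val df = Seq("a", "b", "c").toDF("value")

// DataFrame -> RDD[(NullWritable, String)]; the implicit conversion to
// SequenceFileRDDFunctions then supplies saveAsSequenceFile.
df.rdd
  .map(row => (NullWritable.get(), row.getString(0)))
  .saveAsSequenceFile("/tmp/df-seqfile") // example path

And a fuller standalone driver in the same spirit: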

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.NullWritable

// Stand-in for the custom generation logic referenced in the original code;
// replace with your own per-partition data source.
object Generator {
  def generate(ignored: Iterator[Any]): Iterator[String] =
    Iterator.fill(100)(scala.util.Random.alphanumeric.take(10).mkString)
}

object Driver extends App {

  val conf = new SparkConf().setAppName("HDFS writable test")
  val sc = new SparkContext(conf)

  // An empty RDD repartitioned to 10, just to fan the generator out.
  val empty = sc.emptyRDD[Any].repartition(10)

  // Pair every generated value with a NullWritable key; the implicit
  // conversion to SequenceFileRDDFunctions provides saveAsSequenceFile.
  val data = empty.mapPartitions(Generator.generate).map(value => (NullWritable.get(), value))

  // data.saveAsSequenceFile("/tmp/s1")
  data.saveAsSequenceFile(s"hdfs://localdomain/tmp/s1/${new scala.util.Random().nextInt()}")

  sc.stop()
}
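
To sanity-check the result, the file can be read back with sc.sequenceFile; this is a sketch, and the path must match what was actually written above:

import org.apache.hadoop.io.{NullWritable, Text}

// Values written from String are stored as Text; Hadoop reuses Writable
// instances, so convert to String before collecting.
val readBack = sc.sequenceFile("/tmp/s1", classOf[NullWritable], classOf[Text])
readBack.map { case (_, v) => v.toString }.take(5).foreach(println)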

For further information, please see the SequenceFileRDDFunctions scaladoc and the saveAsSequenceFile API documentation.