Most efficient way to incrementally store a Spark Streaming window in a table


I would like to use Spark Streaming to insert windows of events into a daily table, while keeping that table up to date to the last second.

Essentially I have this with Spark 1.4.1:

// Consume from Kafka and keep only the message payloads
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
lines.window(Seconds(1), Seconds(1))
  .foreachRDD { (rdd, time) =>
    if (rdd.count > 0) {
      // Parse the JSON events and append them as Parquet files on Tachyon
      hiveContext.read.json(rdd).write.mode(SaveMode.Append).save("tachyon://192.168.1.12:19998/persistedI")
      // Register the Parquet directory as a table (no-op after the first batch), then refresh it
      hiveContext.sql("CREATE TABLE IF NOT EXISTS persistedI USING org.apache.spark.sql.parquet OPTIONS (path 'tachyon://192.168.1.12:19998/persistedI')")
      hiveContext.sql("REFRESH TABLE persistedI")
    }
  }

However, this gets slower over time: as I can see in the logs, on every insert all the previously written Parquet parts are opened (to read the Parquet footers, I presume).

I tried the following settings, but that made the refresh slower.

parquet.enable.summary-metadata = false
spark.sql.hive.convertMetastoreParquet.mergeSchema = false
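
For reference, a minimal sketch of how these two settings can be applied (assuming sc is the SparkContext behind ssc, and hiveContext is the HiveContext from the snippet above):

// parquet.enable.summary-metadata is a Hadoop/Parquet property, so it goes on the Hadoop configuration
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
// the mergeSchema flag is a Spark SQL configuration property
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")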

What would be the best setup for such a case?

(I am pretty flexible about what is used, as long as I can meet the requirements.)
