JavaPairRDD to Dataset<Row> in SPARK


I have data in a JavaPairRDD of the following type:

JavaPairRDD<String, Tuple2<String,String>>

I tried the code below:

 Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
 Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
 Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");

But how do I generate a Dataset with 3 columns? The code above gives me data in only 2 columns. Any pointers or suggestions?


There is 1 solution below


Try running printSchema on the result: you will see that value2 is a complex type.

Knowing that, you can write:

Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3");

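value2._1 refers to the first element of the tuple stored in the current "value2" field. The value2 column is overwritten so that it holds only that one value, and the second element of the tuple becomes the new value3 column.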

Note that this will only work once the fix for https://issues.apache.org/jira/browse/SPARK-24548 is merged to the master branch. Currently there is a bug in Spark: the tuple is converted to a struct whose two fields are both named value.
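
On Spark versions affected by that bug, one possible workaround is to skip the tuple encoder entirely: map each pair to a flat Row and supply an explicit three-column schema. The following is only a sketch of that alternative, not the approach from the answer above. The class name PairRddToThreeColumns, the method toThreeColumns, the column name value3 and the sample data are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.Tuple2;

// Hypothetical helper class, named here only for the example.
public class PairRddToThreeColumns {

    // Flattens (key, (a, b)) pairs into rows with three string columns.
    static Dataset<Row> toThreeColumns(SparkSession spark,
                                       JavaPairRDD<String, Tuple2<String, String>> pairs) {
        // Unpack the nested tuple so each Row holds three top-level values.
        JavaRDD<Row> rows = pairs.map(
            t -> RowFactory.create(t._1(), t._2()._1(), t._2()._2()));

        // Explicit schema: column names do not depend on how tuples are encoded.
        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("value1", DataTypes.StringType, true),
            DataTypes.createStructField("value2", DataTypes.StringType, true),
            DataTypes.createStructField("value3", DataTypes.StringType, true)
        });

        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        // Local session for demonstration; a real job would configure this differently.
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("pair-rdd-to-dataset")
            .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Made-up sample data standing in for MY_RDD from the question.
        List<Tuple2<String, Tuple2<String, String>>> sample = Arrays.asList(
            new Tuple2<>("user1", new Tuple2<>("ruleA", "high")),
            new Tuple2<>("user2", new Tuple2<>("ruleB", "low")));
        JavaPairRDD<String, Tuple2<String, String>> pairs = jsc.parallelizePairs(sample);

        Dataset<Row> df = toThreeColumns(spark, pairs);
        df.printSchema();
        df.show();

        spark.stop();
    }
}
```

Because the schema is declared by hand, the column names do not rely on the struct field names Spark derives for nested tuples. The trade-off is that the result is an untyped Dataset<Row> built without an Encoder.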